Re: [DISCUSS] Proposal to buffer manifest files before updating manifest-list

Micah Kornfield Fri, 22 Nov 2024 09:56:11 -0800

Would adding the ability to have a list of manifest lists solve this
problem?  This might be an incremental step to getting to "everything is a
manifest"?


> For now I wanted to reuse the existing manifest-list and manifests fields.


Regardless of the outcome, please let's not re-use a field in a way that
will change the semantics of the field; this goes against good practices on
forward compatibility.

Cheers,
Micah



On Fri, Nov 22, 2024 at 9:31 AM Jan Kaul <[email protected]>
wrote:

> Thanks for your feedback.
>
> About your concerns Fokko:
>
> 1. Generally the number of manifest files in the manifests field
> shouldn't get too large. But I think you can already improve the write
> amplification and conflict resolution by using up to 10 manifest files.
> The fact that the manifests field only contains paths is not ideal and
> may be a reason to have a separate discussion on a new metadata field.
> However, the writer writing the manifest files could keep some kind of
> cache of the partition values and statistics so that it doesn't need to fetch
> the information when writing the manifest-list. This becomes an issue when
> multiple concurrent writers are at work, because they would still need to
> fetch the information from the files that they didn't write.
> As you mentioned, my approach would be to always include the manifest
> files from the manifests field in the query plan and only prune their
> manifest_entries. I would try to keep the number of manifest files in the
> manifests field small to reduce this effect, but this could definitely be
> a drawback.
>
> 2.  Regarding the sequence-number inheritance, every manifest file in the
> manifests field should inherit the sequence-number from the snapshot that
> contains it. This means that all manifest files in the manifests field
> have the same sequence-number, which limits the capabilities of deletes.
> One could either limit deletes to only reference data files that are
> already committed to the manifest-list, or flush the manifest files from
> the manifests field every time a delete file occurs, essentially
> disabling the proposed behavior. It would still yield benefits for
> append-only tables.
> The conflict resolution should be easier for most scenarios as the
> manifest-list does not need to be rewritten. For appends the new manifests
> field is the union of the manifest files of the conflicting manifests
> fields.
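
The union-based conflict resolution described above can be sketched as follows. This is only an illustration, not Iceberg code: the function name `merge_manifests_fields` and the example paths are hypothetical, and the manifests field is assumed to be a plain list of manifest file paths.

```python
def merge_manifests_fields(theirs: list[str], ours: list[str]) -> list[str]:
    """Resolve an append/append conflict on the buffered manifests field.

    Assumption: both writers only appended manifest paths, so the merged
    field is the order-preserving union of the two conflicting lists.
    """
    # dict.fromkeys keeps first-seen order and drops duplicate paths.
    return list(dict.fromkeys([*theirs, *ours]))


# Two writers started from the same buffer ["m1.avro"] and each added one file.
committed_first = ["m1.avro", "m2.avro"]
conflicting = ["m1.avro", "m3.avro"]
merged = merge_manifests_fields(committed_first, conflicting)
```

Because the merge only touches the small buffered list, no manifest-list has to be rewritten in this case.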
>
> About your concerns Russell:
>
> My motivation was to have a separation between a consolidated and a
> temporary list of manifest files. The contents of the temporary list
> regularly get moved to the consolidated list. But the fact that the
> temporary list is small reduces the impact of frequent rewrites and makes
> it easy to use set operations to resolve conflicts. These different lists
> could be stored as two different manifest files that contain other
> manifests or data files. For now I wanted to reuse the existing
> manifest-list and manifests fields.
>
> Thanks,
>
> Jan
>
> On 22.11.24 17:02, Russell Spitzer wrote:
>
> I would much rather we switch to the "everything is a manifest" approach.
> Instead of manifest lists we only ever have manifests. A Manifest can then
> link to data files or additional manifests. In the case of streaming then
> you only ever have to read and write a single manifest. If we couple this
> with delete vectors we can greatly reduce the number of writes. I am
> generally against anything that puts additional (unbounded) content into
> the metadata.json. I'm not sure if anyone has written this up as a full
> proposal yet but I know it's been discussed a bunch.
>
> On Fri, Nov 22, 2024 at 9:31 AM Fokko Driesprong <[email protected]> wrote:
>
>> Hi Jan,
>>
>> Thanks for sending out this proposal. While reading through it, two
>> questions pop up:
>>
>>    - You mentioned repurposing the manifests field. Currently, this
>>    field contains a list of paths that point to the manifest data. Would
>>    this also be your suggestion? This way, when committing the accumulated
>>    manifests into a manifest list, you would need to open up all the
>>    manifests to get information like partition information, statistics,
>>    etc. This way there is also no way to distinguish between data and
>>    delete manifests without having to open the files, effectively always
>>    including those files in the query plan.
>>    - It is unclear to me if appending a manifest to the manifests will
>>    create a new snapshot. I think it should. Either way, I think this
>>    conflicts with the concept of sequence number inheritance
>>    <https://github.com/apache/iceberg/blob/main/format/spec.md#sequence-numbers>.
>>    This is used to avoid having to rewrite a manifest when a conflict occurs;
>>    you only have to rewrite the manifest list. When there is a conflict, the
>>    client that sees the conflict will take the latest manifest-list and
>>    inherit the sequence number. When you can append to the manifest list,
>>    you won't be able to determine which snapshot has added the file. If you
>>    wouldn't use inheritance, then you would need to rewrite the manifest on a
>>    conflict (because the sequence ID has been used already).
>>
>> I have to think a bit more about it, but the above are my concerns so far.
>>
>> Kind regards,
>> Fokko
>>
>> On Fri, 22 Nov 2024 at 15:26, Jan Kaul <[email protected]> wrote:
>>
>>> Hi all,
>>>
>>> I'd like to propose an optimization for how we track manifest files in
>>> Iceberg tables, specifically focusing on reducing write amplification and
>>> simplifying conflict resolution during fast-append operations.
>>>
>>> Background: Replace vs. Change-Based Updates
>>>
>>> To frame this proposal, let's first consider two approaches to state
>>> management in table systems:
>>>
>>> 1. Replace-based updates: The entire state is replaced with each update.
>>> This is how Iceberg currently handles manifest files - when new manifests
>>> are added, we create an entirely new snapshot.
>>>
>>> 2. Change-based updates: Only incremental changes are tracked and
>>> replayed to derive the current state. This is similar to how Delta tables
>>> track data files.
>>>
>>> While Iceberg initially used purely replace-based updates, we've already
>>> successfully adopted change-based updates for the top-level table metadata
>>> with the REST catalog. Instead of uploading entire table metadata, we now
>>> only upload new snapshots during update-table operations.
>>>
>>> Proposed Enhancement
>>>
>>> I propose extending this change-based approach to manifest file
>>> tracking, specifically for fast-append operations. Here's how:
>>>
>>> 1. Repurpose the manifests field as a buffer to track new manifest file
>>> additions
>>> 2. Define the complete set of manifest files as the union of:
>>>    - Manifest files from the manifest-list
>>>    - Manifest files from the manifests field
>>>
>>> Implementation Details
>>>
>>> - When performing fast-append operations:
>>>   * New manifest files are added to the manifests field
>>>   * Changes are committed via update-table catalog operation
>>>   * The manifest-list remains unchanged, eliminating write amplification
>>>
>>> - After a configured number of fast-appends:
>>>   * Manifest files are removed from the manifests field
>>>   * Files are consolidated into a new manifest-list
>>>   * The manifest files are assigned the sequence-number of the snapshot
>>>     when they are written to the manifest-list
>>>
>>> Constraints and Considerations
>>>
>>> For this approach to work effectively, manifest files in the manifests
>>> field must:
>>>    * Contain only data files that are not referenced by other manifests
>>>    * Contain only delete files that reference data files already present
>>>      in the manifest-list
>>>
>>> If any of these assumptions is violated, the manifest files from the
>>> manifests field are flushed to the manifest-list and the standard
>>> commit procedure is applied.
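
The buffering and flushing behavior described above can be sketched roughly as follows. This is a simplified in-memory model, not Iceberg code: the class `BufferedTable`, the `buffer_safe` flag, and the `max_buffered` threshold are hypothetical names introduced for illustration, and sequence-number assignment is left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class BufferedTable:
    # Consolidated manifests, standing in for the contents of the manifest-list.
    manifest_list: list[str] = field(default_factory=list)
    # Buffered manifests, standing in for the repurposed manifests field.
    manifests: list[str] = field(default_factory=list)
    # Counts how often a new manifest-list would have been written.
    manifest_list_writes: int = 0

    def all_manifests(self) -> list[str]:
        # The complete set is the union of both locations.
        return self.manifest_list + self.manifests

    def flush(self, extra: tuple[str, ...] = ()) -> None:
        # Consolidate the buffer (plus any extra manifests) into a new manifest-list.
        self.manifest_list = self.manifest_list + self.manifests + list(extra)
        self.manifests = []
        self.manifest_list_writes += 1

    def commit(self, manifest: str, buffer_safe: bool = True, max_buffered: int = 10) -> None:
        if not buffer_safe:
            # A constraint is violated: flush and fall back to the standard commit.
            self.flush(extra=(manifest,))
            return
        # Fast-append: only the small buffer changes, the manifest-list is untouched.
        self.manifests.append(manifest)
        if len(self.manifests) >= max_buffered:
            self.flush()


table = BufferedTable()
for i in range(3):
    table.commit(f"m{i}.avro", max_buffered=3)  # third append triggers a flush
table.commit("m3.avro")                          # buffered again
table.commit("d0.avro", buffer_safe=False)       # forces the standard path
```

In this model three fast-appends cost one manifest-list write instead of three, which is the reduction in write amplification the proposal aims for.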
>>>
>>> Benefits
>>>
>>> - Significantly reduced write amplification for streaming inserts
>>> - Simplified conflict resolution by the catalog: if two concurrent
>>> writes occur, the entries in the manifests field can simply be merged
>>> together
>>> - Leverages existing Iceberg metadata constructs
>>> - Maintains compatibility with current catalog operations
>>>
>>> Note: While this proposal suggests repurposing the manifests field, we
>>> could alternatively implement this as a new metadata field if preferred.
>>>
>>> I'd appreciate your thoughts on this approach and welcome any feedback
>>> or concerns.
>>>
>>> Best regards,
>>>
>>> Jan
>>>
>>
