aokolnychyi opened a new pull request #2945: URL: https://github.com/apache/iceberg/pull/2945
This PR adds new writer interfaces in `core` and an example of how they can be consumed in Spark 3. This will allow us to write position deletes as well as deltas in Spark. One of the major design changes is the use of composition over inheritance.

### Writer

The first major proposed API is the `Writer` interface, which defines a contract for writing a number of files of a single type within one spec/partition. The existing `DataWriter`, `EqualityDeleteWriter`, and `PositionDeleteWriter` classes are the simplest implementations of that API.

Then we have `RollingWriter`, which implements `Writer` and wraps another writer to split the incoming records into multiple files within one spec/partition. The actual implementations are `RollingDataWriter`, `RollingEqualityDeleteWriter`, and `RollingPositionDeleteWriter`.

### PartitionAwareWriter

All `Writer` implementations are limited to writing to a single spec/partition. To support writes to multiple specs and partitions, we have `PartitionAwareWriter`.

In Iceberg, we support two types of writes: fanout and clustered. That's why I am proposing to add `ClusteredWriter` and `FanoutWriter`. `ClusteredWriter` writes to multiple specs and partitions and requires the incoming data to be properly clustered. `FanoutWriter`, by contrast, keeps a number of writers open and does not require a particular order of data.

`ClusteredWriter` is very similar to our existing `PartitionedWriter`, but it also detects changes in the spec, not only in partition values.

### V2TaskWriter

This PR also introduces new `TaskWriter` and `DeltaTaskWriter` interfaces (I call the former v2, but it would be better to replace the existing API). They will be used by query engine integrations to write data from a single task.

One notable difference from the existing code is that I use composition instead of inheritance and delegate to `TaskWriter` from the query engine sinks.
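To make the layering described above more concrete, here is a minimal Java sketch of how a single-partition writer, a rolling wrapper, and the fanout and clustered writers could compose. It is not the code from this PR: every name (`SketchWriter`, `SingleFileWriter`, `SketchRollingWriter`, `SketchFanoutWriter`, `SketchClusteredWriter`) and every signature is hypothetical. Files are modeled as in-memory lists of rows, a row count stands in for the target file size, and a plain string key stands in for a spec/partition pair.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical, simplified contract (not the PR's actual signatures): writes rows
// for one spec/partition and returns the finished "files" (lists of rows) on close.
interface SketchWriter<T> {
  void write(T row);
  long rowCount();
  List<List<T>> close();
}

// Simplest implementation: all rows end up in a single file.
final class SingleFileWriter<T> implements SketchWriter<T> {
  private final List<T> rows = new ArrayList<>();
  public void write(T row) { rows.add(row); }
  public long rowCount() { return rows.size(); }
  public List<List<T>> close() {
    List<List<T>> files = new ArrayList<>();
    files.add(new ArrayList<>(rows));
    return files;
  }
}

// Composition: wraps writers from a factory and rolls to a new file at the target size.
final class SketchRollingWriter<T> implements SketchWriter<T> {
  private final Supplier<SketchWriter<T>> factory;
  private final long targetRows;
  private final List<List<T>> completed = new ArrayList<>();
  private SketchWriter<T> current;

  SketchRollingWriter(Supplier<SketchWriter<T>> factory, long targetRows) {
    this.factory = factory;
    this.targetRows = targetRows;
  }
  public void write(T row) {
    if (current == null) { current = factory.get(); }
    current.write(row);
    if (current.rowCount() >= targetRows) { roll(); }
  }
  public long rowCount() { return current == null ? 0 : current.rowCount(); }
  public List<List<T>> close() {
    roll();
    return new ArrayList<>(completed);
  }
  private void roll() {
    if (current != null) {
      completed.addAll(current.close());
      current = null;
    }
  }
}

// Fanout: one open writer per partition key, so input order does not matter.
final class SketchFanoutWriter<T> {
  private final Supplier<SketchWriter<T>> factory;
  private final Map<String, SketchWriter<T>> open = new HashMap<>();

  SketchFanoutWriter(Supplier<SketchWriter<T>> factory) { this.factory = factory; }
  void write(String partition, T row) {
    open.computeIfAbsent(partition, p -> factory.get()).write(row);
  }
  Map<String, List<List<T>>> close() {
    Map<String, List<List<T>>> result = new HashMap<>();
    open.forEach((partition, writer) -> result.put(partition, writer.close()));
    open.clear();
    return result;
  }
}

// Clustered: a single open writer; rejects rows for a partition that was already closed.
final class SketchClusteredWriter<T> {
  private final Supplier<SketchWriter<T>> factory;
  private final Map<String, List<List<T>>> result = new HashMap<>();
  private String currentPartition;
  private SketchWriter<T> current;

  SketchClusteredWriter(Supplier<SketchWriter<T>> factory) { this.factory = factory; }
  void write(String partition, T row) {
    if (!partition.equals(currentPartition)) {
      if (result.containsKey(partition)) {
        throw new IllegalStateException("Input is not clustered: " + partition);
      }
      closeCurrent();
      currentPartition = partition;
      current = factory.get();
    }
    current.write(row);
  }
  Map<String, List<List<T>>> close() {
    closeCurrent();
    return result;
  }
  private void closeCurrent() {
    if (current != null) {
      result.put(currentPartition, current.close());
      current = null;
      currentPartition = null;
    }
  }
}
```

Because the partition-aware sketches only depend on a factory of `SketchWriter` instances, passing a factory that builds `SketchRollingWriter` gives rolling files per partition without any subclassing, which is the kind of composition the description argues for. The trade-off in this sketch mirrors the text: the fanout variant holds one open writer per partition seen so far, while the clustered variant holds one writer at a time but fails on unclustered input.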
