alamb commented on issue #6339: URL: https://github.com/apache/arrow-datafusion/issues/6339#issuecomment-1545953052
> I think you could do something where the last write_stream to complete flushes, but it is a bit odd for sure I think it would be a common pattern (where you either want to commit all the partitions or none, rather than have some of them possibly complete and some fail). 🤔 > This does lead to one question though, what actually is the meaning of partitioning in this context? It allows the Sink to request data be split according to one of the following options: https://docs.rs/datafusion/latest/datafusion/physical_plan/enum.Partitioning.html And let the DataFusion optimizer do its thing to optimize its calculation. > The formulation of Distribution I don't believe is sufficient to express the partitioning that something like a Iceberg or Deltalake would require. Can you please be more specific about what type of partitioning would be required? Are you thinking of "partition by value in a column (like the date)"? > What does exposing the partitioning yield over something simple like It makes it easier to implement `DataSink` as you don't have to worry about the details of ExecutionPlans and connecting things up. Also, if we use partitioning then the calculation can be exposed to the rest of datafusion over time (e.g. take advantage of partitioned grouping) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
