[GitHub] [arrow-datafusion] alamb commented on issue #6339: Simplified TableProvider::Insert API

via GitHub Fri, 12 May 2023 08:53:43 -0700


alamb commented on issue #6339:
URL: https://github.com/apache/arrow-datafusion/issues/6339#issuecomment-1545953052


   > I think you could do something where the last write_stream to complete 
flushes, but it is a bit odd for sure
   
   I think it would be a common pattern (where you either want to commit all 
the partitions or none, rather than have some of them possibly complete and 
some fail). 🤔 
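
As a rough illustration of the "last writer to complete flushes" idea, here is a minimal sketch using only the standard library. `CommitCoordinator`, `Outcome`, and `run_insert` are hypothetical names made up for this example and are not part of DataFusion; each thread stands in for one partition's `write_stream` call.

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical shared state for one insert, shared by all partition writers.
struct CommitCoordinator {
    remaining: AtomicUsize, // writers that have not finished yet
    failed: AtomicBool,     // set if any writer reported an error
}

#[derive(Debug, Clone, Copy, PartialEq)]
enum Outcome {
    Pending,    // other writers are still running
    Committed,  // last writer finished and every partition succeeded
    RolledBack, // last writer finished and at least one partition failed
}

impl CommitCoordinator {
    fn new(partitions: usize) -> Self {
        Self {
            remaining: AtomicUsize::new(partitions),
            failed: AtomicBool::new(false),
        }
    }

    /// Called exactly once by each partition writer when it is done.
    fn finish(&self, ok: bool) -> Outcome {
        if !ok {
            self.failed.store(true, Ordering::SeqCst);
        }
        // fetch_sub returns the previous value, so 1 means we are the last writer.
        if self.remaining.fetch_sub(1, Ordering::SeqCst) == 1 {
            if self.failed.load(Ordering::SeqCst) {
                Outcome::RolledBack
            } else {
                Outcome::Committed
            }
        } else {
            Outcome::Pending
        }
    }
}

/// Simulates one writer thread per partition; `results[i]` says whether
/// partition `i` succeeded. Returns the decision made by the last writer.
fn run_insert(results: Vec<bool>) -> Outcome {
    let coord = Arc::new(CommitCoordinator::new(results.len()));
    let handles: Vec<_> = results
        .into_iter()
        .map(|ok| {
            let coord = Arc::clone(&coord);
            thread::spawn(move || coord.finish(ok))
        })
        .collect();
    handles
        .into_iter()
        .map(|h| h.join().expect("writer thread panicked"))
        .find(|o| *o != Outcome::Pending)
        .unwrap_or(Outcome::Pending)
}
```

Under these assumptions, `run_insert(vec![true, true, true])` commits, while a single `false` anywhere in the input causes the last writer to roll back instead. A real sink would also have to decide what "roll back" means for data already written, which this sketch leaves out.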
   
   > This does lead to one question though, what actually is the meaning of 
partitioning in this context? 
   
   It allows the sink to request that its input data be split according to one 
of the following options:
   
   
https://docs.rs/datafusion/latest/datafusion/physical_plan/enum.Partitioning.html
   
   That in turn lets the DataFusion optimizer do its thing and optimize the 
computation that feeds the sink. 
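
To make the options concrete, here is a small self-contained sketch. `SinkPartitioning` is a simplified stand-in I made up for this example: its variants mirror the ones on the linked `Partitioning` enum, but the hash variant takes plain column names instead of physical expressions, and `route` is a hypothetical helper, not a DataFusion function.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// Simplified stand-in for the linked `Partitioning` enum (assumption:
/// column names replace the physical expressions used by the real type).
#[derive(Debug, Clone, PartialEq)]
enum SinkPartitioning {
    /// Spread batches evenly over `n` partitions.
    RoundRobinBatch(usize),
    /// Send equal key values to the same one of `n` partitions.
    Hash(Vec<String>, usize),
    /// `n` partitions with no known distribution scheme.
    UnknownPartitioning(usize),
}

impl SinkPartitioning {
    /// Number of output partitions the sink asked for.
    fn partition_count(&self) -> usize {
        match self {
            SinkPartitioning::RoundRobinBatch(n)
            | SinkPartitioning::Hash(_, n)
            | SinkPartitioning::UnknownPartitioning(n) => *n,
        }
    }
}

/// Hypothetical routing helper: picks the output partition for the `seq`-th
/// item whose hash-key values are `key`. Returns `None` when the scheme
/// does not determine a partition (unknown scheme or zero partitions).
fn route(p: &SinkPartitioning, seq: usize, key: &[&str]) -> Option<usize> {
    match p {
        SinkPartitioning::RoundRobinBatch(n) if *n > 0 => Some(seq % *n),
        SinkPartitioning::Hash(_, n) if *n > 0 => {
            let mut hasher = DefaultHasher::new();
            key.hash(&mut hasher);
            Some((hasher.finish() % *n as u64) as usize)
        }
        _ => None,
    }
}
```

With a hash scheme, every row carrying the same key value lands in the same partition regardless of arrival order, which is the property a sink would rely on when it wants one writer per key range.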
   
   
   > The formulation of Distribution I don't believe is sufficient to express 
the partitioning that something like a Iceberg or Deltalake would require.
   
   Can you please be more specific about what type of partitioning would be 
required? Are you thinking of "partition by value in a column (like the date)"?
   
   > What does exposing the partitioning yield over something simple like
   
   It makes it easier to implement `DataSink`, as you don't have to worry about 
the details of `ExecutionPlan`s and connecting things up. Also, if we use 
partitioning, then the calculation can be exposed to the rest of DataFusion 
over time (e.g. to take advantage of partitioned grouping).
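
To show the division of labor I have in mind, here is a sketch under stated assumptions. `DataSinkSketch`, `MemorySink`, and `drive` are hypothetical names for this example only, and the method signatures are not the proposed API: the sink states how many partitions it wants and how to write one, and the engine side (`drive`) does the splitting and wiring.

```rust
/// Hypothetical sink interface: no knowledge of plans or streams required.
trait DataSinkSketch {
    /// How many partitions the sink wants its input split into.
    fn requested_partitions(&self) -> usize;
    /// Writes the rows of one partition and returns how many were written.
    fn write_partition(&mut self, partition: usize, rows: Vec<String>) -> Result<usize, String>;
}

/// Toy sink that keeps each partition's rows in memory.
struct MemorySink {
    parts: Vec<Vec<String>>,
}

impl MemorySink {
    fn new(partitions: usize) -> Self {
        Self { parts: vec![Vec::new(); partitions] }
    }
}

impl DataSinkSketch for MemorySink {
    fn requested_partitions(&self) -> usize {
        self.parts.len()
    }

    fn write_partition(&mut self, partition: usize, rows: Vec<String>) -> Result<usize, String> {
        let slot = self
            .parts
            .get_mut(partition)
            .ok_or_else(|| format!("no such partition: {partition}"))?;
        let count = rows.len();
        slot.extend(rows);
        Ok(count)
    }
}

/// Stand-in for the engine side: splits rows round-robin into the number of
/// partitions the sink asked for, then hands each partition to the sink.
fn drive<S: DataSinkSketch>(sink: &mut S, rows: Vec<String>) -> Result<usize, String> {
    let n = sink.requested_partitions();
    if n == 0 {
        return Err("sink requested zero partitions".to_string());
    }
    let mut buckets: Vec<Vec<String>> = vec![Vec::new(); n];
    for (i, row) in rows.into_iter().enumerate() {
        buckets[i % n].push(row);
    }
    let mut written = 0;
    for (partition, bucket) in buckets.into_iter().enumerate() {
        written += sink.write_partition(partition, bucket)?;
    }
    Ok(written)
}
```

The point of the split is that `MemorySink` never sees how the rows were repartitioned, so the engine is free to change that strategy later without touching sink implementations.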


