alamb commented on issue #6339:
URL: 
https://github.com/apache/arrow-datafusion/issues/6339#issuecomment-1546872776

   > So we have the "one-exec-for-each-provider" pattern on the read side, but 
we have a "single-exec-across-all-providers" pattern on the write side? Am I 
misunderstanding this? If this is indeed the case, what is the motivation 
and/or justification behind this asymmetry?
   
   The asymmetry is a good question @ozankabak 
   
   My thinking is:
   1. The amount of common, replicated code between inserts is substantial 
(when prototyping https://github.com/apache/arrow-datafusion/pull/6313 I had to 
write an execution plan that was 90% the same as `MemoryWriteExec`).
   2. The semantics of COPY or INSERT are typically to return the number of 
rows written, which does not depend on the type of table being written 
into.
   3. I think that, long term, splitting out the physical plans and 
datasources (e.g. #1754) will make the codebase easier to work with and keep 
the separation cleaner, so coupling datasources more tightly to physical plans 
would work in the opposite direction.
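
   To make points 1 and 2 concrete, here is a minimal, std-only sketch of the "single exec across all providers" idea. All names in it (`Sink`, `MemorySink`, `CountingSink`, `insert_exec`, `Batch`) are hypothetical stand-ins invented for illustration and are not DataFusion APIs. The assumption is that each provider supplies only a small "write one batch" piece, while one shared driver consumes the input and returns the row count.

```rust
// Hypothetical, simplified stand-ins; none of these are DataFusion types.
type Batch = Vec<i64>; // stand-in for an Arrow RecordBatch

/// Provider-specific part: knows only how to write a single batch.
trait Sink {
    /// Writes `batch` and returns the number of rows written.
    fn write(&mut self, batch: &Batch) -> Result<usize, String>;
}

/// In-memory sink that appends every row to a vector.
#[derive(Default)]
struct MemorySink {
    rows: Vec<i64>,
}

impl Sink for MemorySink {
    fn write(&mut self, batch: &Batch) -> Result<usize, String> {
        self.rows.extend_from_slice(batch);
        Ok(batch.len())
    }
}

/// Sink that discards rows and only records how many batches it saw.
#[derive(Default)]
struct CountingSink {
    batches: usize,
}

impl Sink for CountingSink {
    fn write(&mut self, batch: &Batch) -> Result<usize, String> {
        self.batches += 1;
        Ok(batch.len())
    }
}

/// Shared "insert exec": drives any sink and returns the total row count,
/// so the result has the same shape regardless of the target table type.
fn insert_exec(
    sink: &mut dyn Sink,
    input: impl IntoIterator<Item = Batch>,
) -> Result<u64, String> {
    let mut total = 0u64;
    for batch in input {
        // Stop at the first failed write and propagate the error.
        total += sink.write(&batch)? as u64;
    }
    Ok(total)
}
```

   In this sketch the loop, the error propagation, and the row counting live in one place, and adding a new target only means implementing `Sink::write`.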
   
   I think we could achieve most of the above by keeping the same 
`insert_into` in `TableProvider` that returns an `ExecutionPlan` and 
refactoring the implementation.
   
   However, it seemed like most of the flexibility gained by using an 
`ExecutionPlan` would only serve to make things more confusing.
   
   > TableProvider and return a ParquetExec with the necessary files in the 
scan method. So I don't actually have to implement an Exec myself.
   
   IOx follows this strategy as well.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
