devinjdangelo commented on PR #7791:
URL: 
https://github.com/apache/arrow-datafusion/pull/7791#issuecomment-1759617437

   > I think we need to add some tests of this functionality -- specifically 
that setting soft_max_rows_per_output_file and then writing to tables (both 
empty and appending) that it actually makes the expected number of files
   
   Good idea. I adjusted some existing tests to set both desired_batch_size and 
soft_max_rows_per_output_file to 1, so we can check that the number of written 
files equals the number of inserted rows. Tests are passing for all 3 file 
types when inserting into both empty and preexisting tables.
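   
   As a rough illustration of the invariant these tests check (a toy model, 
not DataFusion's actual demuxer code; the function name is made up):
   
   ```python
   # Toy model of the file-splitting invariant: a new file is started each
   # time the current file reaches the soft max. Not DataFusion code.
   def split_into_files(num_rows, soft_max_rows):
       """Return the row count of each output file."""
       files = []
       current = 0
       for _ in range(num_rows):
           current += 1
           if current >= soft_max_rows:
               files.append(current)
               current = 0
       if current:
           files.append(current)
       return files

   # With a soft max of 1, every inserted row lands in its own file,
   # which is exactly what the adjusted tests assert.
   assert len(split_into_files(5, 1)) == 5
   ```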
   
   > One concern I have with the approach in this PR that now instead of 
writing the data in parallel, this PR would write the data serially (in order 
to preserve the order) resulting in fewer larger files. 
   
   This is a concern I shared while working on this, but I believe there are a 
few mitigating factors:
   
   1. I have invested significant effort into parallelizing the serialization 
of individual files, which reduces the performance benefit of writing multiple 
independent files in parallel. Right now only parquet writes are single 
threaded, until the current draft PR is merged.
   2. It is still possible for multiple files to be worked on in parallel in 
this PR if buffer sizes are set large enough. This is why I set the defaults so 
high: just keep buffering until the second, third, etc. file writer can kick in.
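   
   To make point 2 concrete, here is a back-of-the-envelope model (my own 
simplification, not DataFusion's actual math): if each file receives roughly 
`batches_per_file` batches and the channel feeding each file's writer can 
buffer `buffer_size` batches, the demuxer can run ahead of a slow writer by 
about `buffer_size` batches, so roughly this many files can have writers 
active at once:
   
   ```python
   # Rough capacity model: one file actively draining, plus however many
   # whole files fit in the slack the buffer provides. Illustrative only.
   def files_in_flight(buffer_size, batches_per_file):
       return 1 + buffer_size // batches_per_file
   ```
   
   With small buffers this degenerates to one file at a time, which is why 
high default buffer sizes matter here.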
   
   > It seems like it could be better to write max_parallel_output_files in 
parallel (even though they might end with fewer than 
soft_max_rows_per_output_file rows. 
   
   I agree there is a middle ground here, i.e. write at least N files, each 
with up to soft_max_rows_per_output_file rows. That way we are guaranteed at 
least N files are worked on in parallel regardless of buffer size. This would 
add a good deal of complexity to the demuxer logic, so I'd like to tackle it 
as an enhancement in a follow-up PR if that is OK with you.
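   
   A hypothetical sketch of that middle ground (names and structure are 
illustrative, not a committed design): keep at least N files open, round-robin 
incoming batches across them, and roll any file over to a fresh one once it 
reaches the soft max:
   
   ```python
   # Illustrative demuxer for the proposed "at least N files" enhancement.
   # Each "file" is modeled as a list of per-batch row counts.
   def demux(batch_row_counts, min_parallel_files, soft_max_rows):
       files = []
       open_slots = []
       for _ in range(min_parallel_files):
           f = []
           files.append(f)
           open_slots.append(f)
       idx = 0
       for rows in batch_row_counts:
           f = open_slots[idx]
           f.append(rows)
           if sum(f) >= soft_max_rows:
               # this slot's file hit the soft max; roll over to a new file
               fresh = []
               files.append(fresh)
               open_slots[idx] = fresh
           idx = (idx + 1) % min_parallel_files
       return [f for f in files if f]  # drop files that never got a batch
   ```
   
   This guarantees N concurrent write targets while preserving the soft cap, 
at the cost of the extra slot bookkeeping in the demuxer.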
   
   > Inserting into an empty directory didn't work for csv or JSON
   
   This is because the "append to existing file" mode (the default for these 
file types) does not know how to handle empty tables. We could update the 
logic so that "append to existing file" mode creates one file if and only if 
the table is completely empty. I don't expect this to be too complex. The 
current workaround is to create the table in "append new file" mode, write 1 
file, then drop the table and recreate it in "append to existing file" mode. 
Not great UX, so agreed that we should fix this.
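   
   The fix itself could be as simple as a special case when choosing the 
append target (a hedged sketch; the names here are made up, not DataFusion's 
API):
   
   ```python
   # Hypothetical target selection for "append to existing file" mode:
   # create a new file only when the table has no files at all.
   def pick_append_target(existing_files):
       if not existing_files:
           return "part-0", True   # (path, newly_created); path is illustrative
       return existing_files[0], False
   ```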
   

