devinjdangelo commented on PR #7791: URL: https://github.com/apache/arrow-datafusion/pull/7791#issuecomment-1759617437
> I think we need to add some tests of this functionality -- specifically that setting soft_max_rows_per_output_file and then writing to tables (both empty and appending) that it actually makes the expected number of files

Good idea. I adjusted some existing tests to set `desired_batch_size` and `soft_max_rows_per_output_file` both to 1, so we can check that the number of written files equals the number of inserted rows. Tests are passing for all 3 file types, inserting into both an empty and a preexisting table. (See the hedged sketch at the end of this comment for the shape of such a test.)

> One concern I have with the approach in this PR that now instead of writing the data in parallel, this PR would write the data serially (in order to preserve the order) resulting in fewer larger files.

This is a concern I shared while working on this, but I believe there are a few mitigating factors:

1. I have invested significant effort into parallelizing individual file serialization, which reduces the performance boost from writing multiple independent files in parallel. Right now only parquet writes are single threaded, until the current draft PR can be merged.
2. It is still possible for multiple files to be worked on in parallel in this PR if buffer sizes are set large enough. This is why I set the defaults so high (just keep buffering until the second, third, etc. file writers can kick in).

> It seems like it could be better to write max_parallel_output_files in parallel (even though they might end with fewer than soft_max_rows_per_output_file rows.

I agree there is a middle ground here, i.e. write at least N files, each with up to the soft maximum number of rows. That way we are guaranteed at least N files worked on in parallel regardless of buffer size. This will add a good deal of complexity to the demuxer logic, so I'd like to work on this as an enhancement in a follow-up PR if that is OK with you.

> Inserting into an empty directory didn't work for csv or JSON

This is because the "append to existing file" mode (the default for these file types) does not know how to handle empty tables. We could update the logic so that "append to existing file" creates one file if and only if the table is completely empty. I don't expect this to be too complex. The current workaround would be to create the table in "append new file" mode, write one file, then drop the table and recreate it in "append to existing file" mode. Not great UX, so agreed that we should fix it.
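For illustration, here is a minimal sketch of the kind of test described above: force one-row batches and a soft cap of one row per output file, insert a few rows into a directory-backed external table, and assert that the number of output files equals the number of inserted rows. This is not the exact code in this PR; the config key names (`datafusion.execution.batch_size`, `datafusion.execution.soft_max_rows_per_output_file`), the table DDL, and the use of the `tempfile` crate are assumptions for illustration and may differ from what the PR actually adds.

```rust
// Hypothetical sketch of a "one row per output file" test; option names and
// behavior are assumptions based on the discussion above, not the PR itself.
use datafusion::error::Result;
use datafusion::prelude::*;

#[tokio::main]
async fn main() -> Result<()> {
    let ctx = SessionContext::new();

    // Tiny batches plus a soft cap of 1 row per file: the demuxer should then
    // start a new file for (roughly) every inserted row. The cap is "soft"
    // because it is checked on batch boundaries.
    ctx.sql("SET datafusion.execution.batch_size = 1").await?;
    ctx.sql("SET datafusion.execution.soft_max_rows_per_output_file = 1")
        .await?;

    // External table backed by an initially empty directory of parquet files.
    let out_dir = tempfile::tempdir()?;
    let out_path = out_dir.path().to_str().unwrap();
    ctx.sql(&format!(
        "CREATE EXTERNAL TABLE t (a INT) STORED AS PARQUET LOCATION '{out_path}/'"
    ))
    .await?;

    // Insert 3 rows; with the settings above we expect 3 output files.
    ctx.sql("INSERT INTO t VALUES (1), (2), (3)")
        .await?
        .collect()
        .await?;

    let file_count = std::fs::read_dir(out_dir.path())?.count();
    assert_eq!(file_count, 3, "expected one file per inserted row");
    Ok(())
}
```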
