[PR] Fix handling of nested leaf columns in parallel parquet writer [arrow-datafusion]

via GitHub Sat, 20 Jan 2024 06:11:35 -0800


devinjdangelo opened a new pull request, #8923:
URL: https://github.com/apache/arrow-datafusion/pull/8923


   ## Which issue does this PR close?
   
   Closes #8851
   Closes #8853 
   
   ## Rationale for this change
   
   See the issues above. The parallel parquet writer raises various errors and 
panics when used with nested columns.
   
   ## What changes are included in this PR?
   
   I traced the issue to this function, which is supposed to send the 
appropriate arrow arrays to the correct column serialization workers:
   
   
https://github.com/apache/arrow-datafusion/blob/95e739cb605307d3337c54ef3f0ab8c72cca5717/datafusion/core/src/datasource/file_format/parquet.rs#L883-L902
   
   The **outer** loop iterates over the "col_array_channels". This works when 
there are no nested columns (i.e. the inner loop only ever iterates once), but 
it is incorrect when there are nested columns. 
   
   This bug explains the variety of errors reported, since several different 
things can go wrong here:
   
   - An array of the wrong type is sent to a column serializer
   - The same column serializer is sent too many rows
   - A given column serializer is sent zero rows
   
   This PR fixes this function so that it properly sends nested columns to the 
correct column serializer.
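
The routing idea can be sketched with the standard library alone. This is a minimal illustration, not the code in this PR: `Column`, `leaves`, `send_leaves`, and `route_example` are hypothetical names, plain `Vec<i64>` values stand in for arrow arrays, and `std::sync::mpsc` channels stand in for the column serializer channels. The assumption is that there is exactly one channel per leaf column, so the loop walks the top-level columns and advances a running channel index once per leaf.

```rust
use std::sync::mpsc::{channel, Receiver, Sender};

// Hypothetical stand-in for an arrow column: a leaf array or a struct of children.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Leaf(Vec<i64>),
    Struct(Vec<Column>),
}

// Collect the leaf arrays of one column depth-first, in schema order.
fn leaves(col: &Column, out: &mut Vec<Vec<i64>>) {
    match col {
        Column::Leaf(values) => out.push(values.clone()),
        Column::Struct(children) => {
            for child in children {
                leaves(child, out);
            }
        }
    }
}

// Send every leaf to its own channel. The outer loop runs over top-level
// columns (not over channels); the channel index advances once per leaf.
fn send_leaves(columns: &[Column], channels: &[Sender<Vec<i64>>]) -> Result<(), String> {
    let mut next_channel = 0;
    for col in columns {
        let mut col_leaves = Vec::new();
        leaves(col, &mut col_leaves);
        for leaf in col_leaves {
            let tx = channels
                .get(next_channel)
                .ok_or_else(|| format!("no channel for leaf {next_channel}"))?;
            tx.send(leaf).map_err(|e| e.to_string())?;
            next_channel += 1;
        }
    }
    if next_channel != channels.len() {
        return Err(format!(
            "{} channels but only {} leaves",
            channels.len(),
            next_channel
        ));
    }
    Ok(())
}

// Route one flat column plus one struct with two children over three channels
// and return what each receiver got, in channel order.
fn route_example() -> Vec<Vec<i64>> {
    let columns = vec![
        Column::Leaf(vec![1, 2]),
        Column::Struct(vec![
            Column::Leaf(vec![10, 20]),
            Column::Leaf(vec![100, 200]),
        ]),
    ];
    let (senders, receivers): (Vec<Sender<Vec<i64>>>, Vec<Receiver<Vec<i64>>>) =
        (0..3).map(|_| channel()).unzip();
    send_leaves(&columns, &senders).expect("leaf count matches channel count");
    receivers.iter().map(|rx| rx.recv().unwrap()).collect()
}
```

In this sketch each receiver gets exactly one leaf array per batch, which rules out the three failure modes listed above: a mismatched array, a serializer that receives too many rows, and a serializer that receives none.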
   
   ## Are these changes tested?
   
   Yes, `copy.slt` now includes tests covering various column types at various 
levels of nesting within structs and arrays.
   
   ## Are there any user-facing changes?
   
   No


