devinjdangelo opened a new pull request, #8923: URL: https://github.com/apache/arrow-datafusion/pull/8923
## Which issue does this PR close?

Closes #8851
Closes #8853

## Rationale for this change

See the issues above. The parallel parquet writer causes various errors/panics when used with nested columns.

## What changes are included in this PR?

I identified the issue in this function, which is supposed to send the appropriate arrow arrays to the correct column serialization workers:

https://github.com/apache/arrow-datafusion/blob/95e739cb605307d3337c54ef3f0ab8c72cca5717/datafusion/core/src/datasource/file_format/parquet.rs#L883-L902

The **outer** loop iterates over the "col_array_channels". This works when there are no nested columns (i.e. the inner loop only ever iterates once), but it is incorrect when there are nested columns. The varying errors reported are explained by this bug, since a few different things can go wrong here:

- The wrong array of the wrong type is sent to a column serializer
- The same column serializer is sent too many rows
- A given column serializer is sent zero rows

This PR fixes this function so that it properly sends nested columns to the correct column serializer.

## Are these changes tested?

Yes, copy.slt now includes tests with various column types at various levels of nesting with structs and arrays.

## Are there any user-facing changes?

No
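For illustration, the routing bug described under "What changes are included in this PR?" can be modeled with a small standard-library sketch. This is not the code from `parquet.rs`: `TopLevelColumn`, `route_buggy`, `route_fixed`, and `deliveries` are hypothetical names, leaf arrays are stood in for by string labels, and `std::sync::mpsc` channels stand in for the real column serializer channels. The only assumption carried over from the description is that there is one channel per leaf column, and that pairing channels with top-level columns in the outer loop misroutes leaves once a column is nested.

```rust
use std::sync::mpsc::{channel, Sender};

/// Hypothetical stand-in for a top-level column. A flat column has exactly
/// one leaf; a nested (struct) column has one leaf per flattened child.
struct TopLevelColumn {
    leaves: Vec<&'static str>,
}

/// Models the bug: the outer loop pairs channels with top-level columns,
/// so every leaf of a nested column goes to the same channel.
fn route_buggy(cols: &[TopLevelColumn], channels: &[Sender<&'static str>]) {
    for (tx, col) in channels.iter().zip(cols) {
        for leaf in &col.leaves {
            tx.send(*leaf).unwrap();
        }
    }
}

/// Models the fix: advance to the next channel for every leaf sent, so
/// each leaf column reaches its own serializer.
fn route_fixed(cols: &[TopLevelColumn], channels: &[Sender<&'static str>]) {
    let mut next_channel = 0;
    for col in cols {
        for leaf in &col.leaves {
            channels[next_channel].send(*leaf).unwrap();
            next_channel += 1;
        }
    }
}

/// Routes a small example schema (a, s{x, y}, b) and returns what each
/// per-leaf channel received, in channel order.
fn deliveries(
    route: fn(&[TopLevelColumn], &[Sender<&'static str>]),
) -> Vec<Vec<&'static str>> {
    let cols = vec![
        TopLevelColumn { leaves: vec!["a"] },
        TopLevelColumn { leaves: vec!["s.x", "s.y"] },
        TopLevelColumn { leaves: vec!["b"] },
    ];
    let n_leaves: usize = cols.iter().map(|c| c.leaves.len()).sum();

    // One channel per leaf column, mirroring one serializer per leaf.
    let mut senders = Vec::new();
    let mut receivers = Vec::new();
    for _ in 0..n_leaves {
        let (tx, rx) = channel::<&'static str>();
        senders.push(tx);
        receivers.push(rx);
    }

    route(&cols, &senders);
    drop(senders); // close the channels so draining the receivers terminates

    receivers.iter().map(|rx| rx.iter().collect()).collect()
}
```

In this model, `deliveries(route_buggy)` shows all three failure modes at once: the second channel receives two leaves, the third channel (which should serve `s.y`) receives `b`, and the fourth channel receives nothing. `deliveries(route_fixed)` gives each channel exactly one leaf, in schema order.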
