[GitHub] [arrow-datafusion] devinjdangelo opened a new pull request, #7632: Support All Statistics and Enable Backpressure in Parallel Parquet Writer

via GitHub Sat, 23 Sep 2023 05:13:18 -0700


devinjdangelo opened a new pull request, #7632:
URL: https://github.com/apache/arrow-datafusion/pull/7632


   This PR is Draft since it cannot compile without arrow-rs changes: #here 
   
   ## Which issue does this PR close?
   
   Closes #7591 
   Closes #7589
   Related to https://github.com/apache/arrow-rs/issues/1718
   
   ## Rationale for this change
   
   The parallel parquet serialization process implemented in #7562 did not 
support all parquet metadata (indexes/bloom filters) and applied no 
backpressure to serialization tasks. This PR aims to address both deficiencies. 
   
   ## What changes are included in this PR?
   
   - Parallel parquet serialization tasks now use `ArrowRowGroupWriter` 
directly rather than `ArrowWriter`
   - Upstream `arrow-rs` changes filed to make `ArrowRowGroupWriter` public and 
`Send` so it can be used across an `.await`
   - Parquet serialization tasks can be throttled via a bounded channel 
mechanism.
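
   As a rough illustration of the bounded channel idea, the sketch below throttles serialization tasks with a fixed-capacity channel of task handles. It is not the code in this PR: it uses `std::sync::mpsc::sync_channel` and OS threads so that it is self-contained, and `serialize_row_group` and `write_with_backpressure` are hypothetical names invented for the example.

```rust
use std::sync::mpsc::sync_channel;
use std::thread;

/// Hypothetical stand-in for serializing one row group into bytes.
fn serialize_row_group(rows: &[u64]) -> Vec<u8> {
    rows.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Serialize each batch on its own thread and concatenate the results in
/// input order. `channel_limit` bounds how many task handles may queue up
/// ahead of the writer (assumption: handles, not bytes, flow through the
/// channel).
fn write_with_backpressure(batches: Vec<Vec<u64>>, channel_limit: usize) -> Vec<u8> {
    let (tx, rx) = sync_channel::<thread::JoinHandle<Vec<u8>>>(channel_limit);

    // Producer: launches one serialization task per batch.
    let producer = thread::spawn(move || {
        for batch in batches {
            let handle = thread::spawn(move || serialize_row_group(&batch));
            // `send` blocks while the channel is full, which stops the
            // producer from launching further tasks until the writer
            // catches up.
            if tx.send(handle).is_err() {
                break; // receiver hung up
            }
        }
        // `tx` is dropped here, which ends the consumer loop below.
    });

    // Consumer: the single writer joins tasks in the order they were sent.
    let mut file = Vec::new();
    for handle in rx {
        let bytes = handle.join().expect("serialization task panicked");
        file.extend_from_slice(&bytes);
    }
    producer.join().expect("producer thread panicked");
    file
}
```

   In this sketch a task is spawned just before its handle is sent, so slightly more than `channel_limit` tasks can exist at once (the queued handles, the one waiting to be sent, and the one being joined).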
   
   ## Benchmarking Results
   
   The results show the parallel parquet process in this PR is ~10% faster than 
the previous implementation, in addition to supporting statistics/bloom 
filters. `channel_limit=N` means that at most N parallel parquet serialization 
tasks run at one time. Surprisingly, setting this number low can actually 
increase peak memory usage.
   
   See #7562 for benchmarking script used. 
   
   ### Test 1, All 16 Columns, ~3.6GB Parquet File (release build)
   
   #### Execution Time (s)
   Parallelism | Main (single_file_output=false) | Main (single_file_output=true) | This PR (single_file_output=true, channel_limit=100) | This PR (single_file_output=true, channel_limit=4)
   -- | -- | -- | -- | --
   1 | 22.48 | 22.53 | 21.04 | 22.17
   4 | 12.24 | 14.4 | 12.49 | 12.73
   8 | 10.79 | 12.37 | 10.7 | 11.03
   16 | 10.52 | 12.67 | 10.78 | 10.85
   32 | 10.91 | 12.07 | 10.31 | 10.25
   64 | 10.21 | 12.97 | 12.34 | 11.62
   
   #### Peak Memory Usage (MB)
   Parallelism | Main (single_file_output=false) | Main (single_file_output=true) | This PR (single_file_output=true, channel_limit=100) | This PR (single_file_output=true, channel_limit=4)
   -- | -- | -- | -- | --
   1 | 1753.3 | 1758 | 1757.1 | 1760
   4 | 2445.4 | 7104 | 5690.7 | 5684.2
   8 | 3387 | 7623.1 | 6642 | 7804
   16 | 5047.6 | 8262.6 | 8151 | 10437.7
   32 | 7683.6 | 7672.6 | 9358 | 11657.5
   64 | 10961.1 | 10370.2 | 10388 | 12898
   
   ### Test 2, Subset of 3 Columns, ~895MB Parquet File (release build)
   
   #### Execution Time (s)
   Parallelism | Main (single_file_output=false) | Main (single_file_output=true) | This PR (single_file_output=true, channel_limit=100) | This PR (single_file_output=true, channel_limit=4)
   -- | -- | -- | -- | --
   1 | 3.57 | 3.15 | 3.38 | 3.39
   4 | 1.78 | 2.37 | 2.53 | 2.3
   8 | 1.45 | 2.07 | 2.58 | 2.23
   16 | 1.54 | 2.09 | 2.06 | 1.71
   32 | 1.7 | 2.1 | 2.08 | 1.65
   64 | 1.89 | 2.72 | 2.76 | 2.07
   
   #### Peak Memory Consumption (MB)
   Parallelism | Main (single_file_output=false) | Main (single_file_output=true) | This PR (single_file_output=true, channel_limit=100) | This PR (single_file_output=true, channel_limit=4)
   -- | -- | -- | -- | --
   1 | 450.6 | 451.6 | 448.9 | 449.4
   4 | 584.5 | 1659.1 | 1257.0 | 1284.4
   8 | 759.4 | 1939.7 | 1225.7 | 1339.2
   16 | 1045.8 | 2051.2 | 1359.8 | 1438.5
   32 | 1564.6 | 1899.7 | 1445.5 | 1545.3
   64 | 2318.8 | 1726.1 | 1732.0 | 1735.3
   
   ## Are these changes tested?
   
   Yes, by existing tests and ad hoc benchmarking.
   
   ## Are there any user-facing changes?
   
   No
   
   ## Next Steps
   
   So far, parquet serialization is only parallelized across RowGroups. This 
limits parallelism to the number of RowGroups we want in our file, which can be 
as low as 1. Parquet files generally have a large number of columns, so we 
could additionally parallelize at the column level for a further speedup.
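
   To make the column-level idea concrete, here is a minimal sketch that encodes each column of a single row group on its own thread and returns the chunks in column order. It assumes plain `i64` columns and a toy little-endian encoding; `encode_column` and `encode_row_group_parallel` are hypothetical names and do not correspond to any `arrow-rs` or DataFusion API.

```rust
use std::thread;

/// Hypothetical column encoder: plain little-endian encoding of i64 values.
fn encode_column(values: &[i64]) -> Vec<u8> {
    values.iter().flat_map(|v| v.to_le_bytes()).collect()
}

/// Encode every column of one row group on its own scoped thread and
/// return the encoded chunks in the original column order.
fn encode_row_group_parallel(columns: &[Vec<i64>]) -> Vec<Vec<u8>> {
    thread::scope(|s| {
        // Spawn one encoder per column; scoped threads may borrow `columns`.
        let handles: Vec<_> = columns
            .iter()
            .map(|col| s.spawn(move || encode_column(col)))
            .collect();
        // Joining in spawn order keeps the chunks aligned with the schema.
        handles
            .into_iter()
            .map(|h| h.join().expect("column encoder panicked"))
            .collect()
    })
}
```

   With this shape, even a file with a single row group can use as many threads as it has columns.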
   
   We could also break free of the Parallelism=RowGroupNumber limit if it were 
possible to concatenate (`ArrowColumnChunk`, `ColumnCloseResult`) tuples 
together before writing them into a RowGroup. This might not be possible to do 
efficiently, since `ArrowColumnChunk`s are already compressed. Aggregating 
column min/max statistics would be trivial, but aggregating distinct value 
counts and bloom filters would not be.
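
   The min/max case can be sketched in a few lines. The `ChunkStats` type below is hypothetical and far simpler than the statistics types in the `parquet` crate; it only illustrates why merging min/max is a simple fold.

```rust
/// Hypothetical per-chunk statistics for an i64 column.
#[derive(Debug, Clone, Copy, PartialEq)]
struct ChunkStats {
    min: i64,
    max: i64,
}

/// Merge the statistics of several chunks of the same column.
/// Returns `None` when there are no chunks to merge.
fn merge_stats(chunks: &[ChunkStats]) -> Option<ChunkStats> {
    let mut iter = chunks.iter().copied();
    let first = iter.next()?;
    Some(iter.fold(first, |acc, s| ChunkStats {
        // The merged min/max is just the min/max of the per-chunk values.
        min: acc.min.min(s.min),
        max: acc.max.max(s.max),
    }))
}
```

   A distinct value count cannot be merged this way: chunks holding {1, 2} and {2, 3} each report 2 distinct values, but the combined column has 3, not 4.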


