westonpace commented on PR #34281:
URL: https://github.com/apache/arrow/pull/34281#issuecomment-1450516224

   > The challenge there is that you need to align the multiple columns 
somehow, and there is no guarantee that pages are aligned between columns. For 
example, you might have one column whose rows are partitioned into 50K-row 
pages ([0..50K][50K..100K]...) and another whose rows are partitioned into 
60K-row pages ([0..60K][60K..120K]...). Splitting pages over multiple threads 
doesn't work nicely for that reason.
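   The misalignment can be made concrete with a short Python sketch (the 50K/60K page sizes come from the example above; everything else here is hypothetical):

   ```python
   from math import lcm

   def page_boundaries(rows_per_page: int, total_rows: int) -> set[int]:
       """Row offsets at which a column's data pages end."""
       return set(range(rows_per_page, total_rows + 1, rows_per_page))

   total = 1_000_000
   col_a = page_boundaries(50_000, total)  # pages of 50K rows
   col_b = page_boundaries(60_000, total)  # pages of 60K rows

   # Boundaries shared by both columns occur only every lcm(50K, 60K) rows,
   # so a page-granular split across threads rarely lines up between columns.
   shared = sorted(col_a & col_b)
   print(shared)                # → [300000, 600000, 900000]
   print(lcm(50_000, 60_000))   # → 300000
   ```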
   
   In my head, the way this would be addressed is in stages:
   
    * I/O Stage: Each column reader independently reads ahead 8-16 MB worth of 
data pages.  This could be configurable and could even change dynamically 
based on how quickly a column is being read.
    * Decoding Stage: Decoding is parallel across columns but has batch 
semantics.  In other words, the "batch decoder" is asked to decode a batch of X 
rows (e.g. 1 million), and space for the output batch could be allocated at 
this point.  The batch decoder then delegates to the various column decoders, 
asking each to decode X items; one column might read 5 pages to fill the batch 
while another reads 50.
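   A minimal Python sketch of the decoding stage's batch semantics (names like `ColumnDecoder` and `decode_batch` are hypothetical, not Arrow API; the page sizes are reused from the example above):

   ```python
   from dataclasses import dataclass
   from math import ceil

   @dataclass
   class ColumnDecoder:
       """Hypothetical per-column decoder; tracks how many pages it consumed."""
       rows_per_page: int
       pages_read: int = 0

       def decode(self, num_rows: int) -> None:
           # Pull however many pages are needed to produce num_rows values.
           self.pages_read += ceil(num_rows / self.rows_per_page)

   def decode_batch(decoders: list[ColumnDecoder], num_rows: int) -> None:
       # The batch decoder could allocate output space for num_rows here,
       # then delegate so each column decoder produces exactly num_rows items.
       for d in decoders:
           d.decode(num_rows)

   cols = [ColumnDecoder(50_000), ColumnDecoder(60_000)]
   decode_batch(cols, 1_000_000)  # one 1M-row batch
   print([c.pages_read for c in cols])  # → [20, 17]: page counts differ per column
   ```

   The point of the sketch is that batch size, not page size, drives parallelism: each column consumes as many pages as it needs, so misaligned page boundaries stop mattering.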
   
   I have no idea how close Arrow's parquet-cpp reader is to this design or 
whether it is even feasible.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
