sachouche commented on issue #1420: Drill 6664: Limit the maximum parquet 
reader batch rows to 64k
URL: https://github.com/apache/drill/pull/1420#issuecomment-410401640
 
 
   *What Parquet Used to do*
   - The Parquet reader used to hardcode this limit, along with a comment:
   `DEFAULT_RECORDS_TO_READ_IF_FIXED_WIDTH = 64*1024 - 1; // 64K - 1, max SV2 can address`
   - Unfortunately, the SelectionVector2 implementation itself only mentions "64k" in a 
comment (there is no mention of 64k - 1)
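   The addressing constraint behind the old hardcoded value can be sketched as follows. This is a simplified illustration (the class and constant names below are mine, not Drill's): a SelectionVector2 entry is a two-byte index, so it can reference row offsets 0 through 65535, and the old constant `64*1024 - 1` happens to equal that maximum index.

   ```java
   // Hedged sketch: why a two-byte selection vector index bounds the batch row count.
   // Class and constant names here are illustrative, not Drill's actual code.
   public class Sv2Limit {
       // A two-byte (unsigned 16-bit) index can reference offsets 0 .. 65535,
       // i.e. at most 64K distinct rows.
       static final int MAX_ADDRESSABLE_ROWS = 1 << 16;          // 65536 (64K)
       static final int DEFAULT_RECORDS_TO_READ = 64 * 1024 - 1; // the old hardcoded limit

       public static void main(String[] args) {
           // char is Java's unsigned 16-bit type; it holds indices 0..65535.
           char maxIndex = (char) (MAX_ADDRESSABLE_ROWS - 1);
           System.out.println((int) maxIndex);          // 65535
           System.out.println(DEFAULT_RECORDS_TO_READ); // 65535
       }
   }
   ```

   Note the off-by-one ambiguity the bullet points at: a two-byte index addresses 64k rows (indices 0..64k-1), while the old constant capped the reader one row short of that, and the SelectionVector2 comment does not say which was intended.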
   
   *Memory Optimization*
   - The memory optimization for a VL column can save 64k of memory and avoids a 
reset and copy of the offset vector
   - Fixed-length columns would only waste a few bytes of space, since the last entry is 
unoccupied
   - Is this optimization super important? I would say no: the current 
default batch memory is set to 16MB, so in practice record batches will have fewer 
than 64k rows when VL columns are involved.
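   The VL-column saving above can be made concrete with a small sketch. This assumes a power-of-two allocation model for vector buffers and a four-byte offset entry per row plus one end offset (the method name is mine, for illustration): capping the batch at 64k - 1 rows keeps the offset vector inside a 64k-entry buffer, while one more row would force a doubling.

   ```java
   // Hedged sketch (assumed allocation model): a variable-length column's offset
   // vector needs rowCount + 1 four-byte entries, and vector buffers are assumed
   // to be allocated in power-of-two sizes. The method name is illustrative.
   public class OffsetVectorSizing {
       static long offsetBufferBytes(int rowCount) {
           int entries = rowCount + 1;                  // one extra entry for the end offset
           int capacity = Integer.highestOneBit(entries);
           if (capacity < entries) {
               capacity <<= 1;                          // round up to the next power of two
           }
           return capacity * 4L;                        // 4 bytes per offset entry
       }

       public static void main(String[] args) {
           System.out.println(offsetBufferBytes(64 * 1024 - 1)); // 262144 (256 KiB)
           System.out.println(offsetBufferBytes(64 * 1024));     // 524288 (512 KiB)
       }
   }
   ```

   Under these assumptions, the 64k-th row is the one that doubles the offset buffer, which is the allocation the 64k - 1 cap avoids.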
   
   *Conclusion*
   - I like @paul-rogers' suggestion to use ValueVector.MAX_ROW_COUNT, as it 
satisfies the goal of this JIRA (64k) and brings us one step closer to 
standardization. I'll update the changes shortly.
   
   Thanks for the feedback!
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
