@zivanfi, @gszadovszky, what is this fixing?

I don't think that what the description calls out (that there could be multiple
row groups per HDFS block) is a bad thing. Row group size ends up being
correlated with memory consumption, so it's reasonable to want several row
groups in the same HDFS block. Smaller row groups allow finer-grained splitting
if you have the parallelism, and they reduce memory consumption even when each
task processes a full block.

Also, when I looked through this, I saw a bunch of changes to memory management
and dictionary pages. How are those related to row group and block alignment?

[ Full content available at: https://github.com/apache/parquet-mr/pull/523 ]