[GitHub] [parquet-mr] rdblue commented on issue #523: PARQUET-1337: Current block alignment logic may lead to several row groups per block

GitHub Thu, 27 Sep 2018 16:06:02 -0700

@zivanfi, @gszadovszky, what is this fixing?

I don't think the situation in the description, multiple row groups per HDFS 
block, is a bad thing. Row group size is correlated with memory consumption, 
so it's reasonable to want several row groups in the same HDFS block. Smaller 
row groups allow finer-grained splitting if you have the parallelism, and they 
reduce memory consumption even if each task reads a full block.
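
To put rough numbers on that trade-off, here is a minimal sketch. The 128 MB 
HDFS block, the 1 GB file, and the helper names are assumptions for 
illustration, not values or code from the PR, and row group size is treated 
only as a rough proxy for per-task memory.

```python
# Illustrative arithmetic only: all sizes below are assumed, not taken from the PR.
MB = 1024 * 1024


def row_groups_per_block(hdfs_block_bytes: int, row_group_bytes: int) -> int:
    """Number of whole row groups that fit in one HDFS block (hypothetical helper)."""
    if hdfs_block_bytes <= 0 or row_group_bytes <= 0:
        raise ValueError("sizes must be positive")
    return hdfs_block_bytes // row_group_bytes


def max_splits(file_bytes: int, row_group_bytes: int) -> int:
    """Upper bound on parallel tasks if each task reads at least one row group."""
    if file_bytes <= 0 or row_group_bytes <= 0:
        raise ValueError("sizes must be positive")
    return -(-file_bytes // row_group_bytes)  # ceiling division


if __name__ == "__main__":
    hdfs_block = 128 * MB  # assumed HDFS block size
    file_size = 1024 * MB  # assumed file size
    for row_group in (128 * MB, 64 * MB, 32 * MB):
        print(
            f"row group {row_group // MB:>3} MB: "
            f"{row_groups_per_block(hdfs_block, row_group)} per block, "
            f"up to {max_splits(file_size, row_group)} splits"
        )
```

Under these assumptions, shrinking the row group from 128 MB to 32 MB puts 
four row groups in each block and quadruples the number of possible splits.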


Also, when I looked through this I saw a bunch of changes to memory management 
and dictionary pages. How is that related to row group and block alignment?

[ Full content available at: https://github.com/apache/parquet-mr/pull/523 ]
This message was relayed via gitbox.apache.org for [email protected]
