@zivanfi, @gszadovszky, what is this fixing? The description says that there could be multiple row groups per HDFS block, but I don't think that is a bad thing. Row group size ends up being correlated with memory consumption, so it's reasonable to want several row groups in the same HDFS block. Smaller row groups allow fine-grained splitting if you have the parallelism, and they reduce memory consumption even if each task processes a full block.
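To make the trade-off concrete, here is a rough sketch of the arithmetic. The sizes are assumed for illustration (a 128 MiB HDFS block and a few candidate row group sizes), and the class and method names are hypothetical, not parquet-mr code. It treats the row group size as a stand-in for how much data has to be held per row group, which is only an approximation.

```java
public class RowGroupsPerBlock {
    // Hypothetical helper: how many whole row groups fit in one HDFS block.
    public static long rowGroupsPerBlock(long hdfsBlockBytes, long rowGroupBytes) {
        if (hdfsBlockBytes <= 0 || rowGroupBytes <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        return hdfsBlockBytes / rowGroupBytes;
    }

    public static void main(String[] args) {
        final long mib = 1024L * 1024L;
        final long hdfsBlock = 128 * mib; // assumed HDFS block size

        // Assumed candidate row group sizes, largest first.
        long[] rowGroupSizes = {128 * mib, 64 * mib, 32 * mib};
        for (long rowGroup : rowGroupSizes) {
            long groups = rowGroupsPerBlock(hdfsBlock, rowGroup);
            // More row groups per block means more possible split points,
            // and each row group needs roughly rowGroup bytes at a time.
            System.out.printf("row group %d MiB: %d per block, ~%d MiB held per row group%n",
                    rowGroup / mib, groups, rowGroup / mib);
        }
    }
}
```

With these assumed numbers, a 32 MiB row group gives four possible split points per block, and a task that reads the whole block still only needs to hold about one 32 MiB row group at a time.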
Also, when I looked through this I saw a bunch of changes to memory management and dictionary pages. How is that related to row group and block alignment?
[ Full content available at: https://github.com/apache/parquet-mr/pull/523 ]
