Ryan Blue created PARQUET-306:
---------------------------------

             Summary: Improve alignment between row groups and HDFS blocks
                 Key: PARQUET-306
                 URL: https://issues.apache.org/jira/browse/PARQUET-306
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
            Reporter: Ryan Blue
            Assignee: Ryan Blue


Row groups should not span HDFS blocks, because a row group that crosses a 
block boundary can force a remote read. There are three ways to avoid this:
1. Set the next row group's target size to the bytes remaining in the current 
HDFS block
2. Use HDFS-3689 (variable-length HDFS blocks) when available
3. When a row group ends close to a block boundary, pad the rest of the block 
so that the next row group starts at the start of the next block
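
To make the arithmetic behind options 1 and 3 concrete, here is a rough sketch in Java. It is not the parquet-mr implementation: the class name, the method names, and the maxPadding threshold are all hypothetical, and it assumes the writer knows its current file position and the HDFS block size.

```java
// Illustrative sketch only; every name here is hypothetical, not parquet-mr API.
final class RowGroupAlignment {
    private RowGroupAlignment() {}

    // Bytes left before the next HDFS block boundary at file position pos.
    // A position exactly on a boundary has a whole block remaining.
    static long remainingInBlock(long pos, long blockSize) {
        return blockSize - (pos % blockSize);
    }

    // Option 3: number of padding bytes to write before the next row group.
    // Pads only when the space left in the block is at most maxPadding
    // (an assumed threshold); otherwise returns 0.
    static long paddingBefore(long pos, long blockSize, long maxPadding) {
        long remaining = remainingInBlock(pos, blockSize);
        if (remaining == blockSize) {
            return 0; // already aligned to a block boundary
        }
        return remaining <= maxPadding ? remaining : 0;
    }

    // Option 1: target size for the next row group, capped at the bytes
    // remaining in the block where the row group will start.
    static long nextRowGroupSize(long pos, long blockSize,
                                 long rowGroupSize, long maxPadding) {
        long start = pos + paddingBefore(pos, blockSize, maxPadding);
        return Math.min(rowGroupSize, remainingInBlock(start, blockSize));
    }
}
```

With a block size of 128, a configured row group size of 128, and maxPadding of 8 (small numbers chosen only for readability), a writer at position 100 would shrink the next row group to 28 bytes, while a writer at position 124 would pad 4 bytes and start a full-size row group at position 128.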



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
