Quentin Francois created PARQUET-344:
----------------------------------------

             Summary: Limit the number of rows per block and per split
                 Key: PARQUET-344
                 URL: https://issues.apache.org/jira/browse/PARQUET-344
             Project: Parquet
          Issue Type: Improvement
          Components: parquet-mr
            Reporter: Quentin Francois


We use Parquet to store raw metrics data and then query this data with 
Hadoop-Pig. 

The issue is that sometimes we end up with small Parquet files (~80 MB) that 
contain more than 300,000,000 rows, usually because a constant metric 
compresses extremely well. Too well. As a result, a very small number of maps 
each process up to 10x more rows than the other maps, and we lose the 
benefit of parallelization. 

I believe the fix has two components:
1. Be able to limit the number of rows per Parquet block (in addition to the 
size limit).
2. Be able to limit the number of rows per split.
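To illustrate component 1, here is a minimal sketch of the kind of check a 
writer could perform: flush the current block when either the existing size 
limit or a new row-count limit would be exceeded. This is a standalone 
illustration, not the actual parquet-mr writer API; the class name 
{{BlockLimiter}} and both limits are hypothetical.

```java
// Hypothetical sketch of a dual size/row-count block limit.
// Not the real parquet-mr API; names and limits are illustrative.
public class BlockLimiter {
    private final long maxBlockBytes; // existing size-based limit
    private final long maxBlockRows;  // proposed row-count limit
    private long currentBytes = 0;
    private long currentRows = 0;
    private int blocksFlushed = 0;

    public BlockLimiter(long maxBlockBytes, long maxBlockRows) {
        this.maxBlockBytes = maxBlockBytes;
        this.maxBlockRows = maxBlockRows;
    }

    // Record one written row of the given encoded size, flushing the
    // current block first if either limit would be exceeded.
    public void write(long encodedRowBytes) {
        if (currentRows > 0
                && (currentBytes + encodedRowBytes > maxBlockBytes
                    || currentRows + 1 > maxBlockRows)) {
            flush();
        }
        currentBytes += encodedRowBytes;
        currentRows += 1;
    }

    private void flush() {
        blocksFlushed++;
        currentBytes = 0;
        currentRows = 0;
    }

    // Flush any pending rows and return the total number of blocks.
    public int finish() {
        if (currentRows > 0) {
            flush();
        }
        return blocksFlushed;
    }
}
```

With a row limit of 3, writing 10 tiny rows would produce 4 blocks (3+3+3+1) 
even though the size limit is never reached, which is exactly the behavior 
needed for highly compressible constant metrics. Component 2 would apply the 
same idea at the split level in the input format.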



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
