Quentin Francois created PARQUET-344:
----------------------------------------
Summary: Limit the number of rows per block and per split
Key: PARQUET-344
URL: https://issues.apache.org/jira/browse/PARQUET-344
Project: Parquet
Issue Type: Improvement
Components: parquet-mr
Reporter: Quentin Francois
We use Parquet to store raw metrics data and then query this data with
Hadoop-Pig.
The issue is that sometimes we end up with small Parquet files (~80 MB) that
contain more than 300,000,000 rows, usually because a constant metric
compresses extremely well. Too well. As a result, a small number of maps each
process up to 10x more rows than the other maps, and we lose the benefits of
the parallelization.
I believe the fix has two components:
1. Be able to limit the number of rows per Parquet block (in addition to the
size limit).
2. Be able to limit the number of rows per split.
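For reference, parquet-mr already exposes a byte-size knob for the row-group
(block) size through the job configuration; the two row-count caps requested
here would presumably be parallel properties. A minimal sketch of what that
configuration could look like — the row-count property names below are purely
illustrative and do not exist in parquet-mr today:

```properties
# Existing knob in parquet-mr: target row-group (block) size in bytes.
parquet.block.size=134217728

# Hypothetical properties sketching the two limits proposed in this issue
# (illustrative names only, not an existing API):
# 1. Cap the number of rows per Parquet block, in addition to the size limit.
parquet.block.max.row.count=10000000
# 2. Cap the number of rows per split at read time.
parquet.split.max.row.count=10000000
```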
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)