[
https://issues.apache.org/jira/browse/PARQUET-344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14710193#comment-14710193
]
Daniel Weeks commented on PARQUET-344:
--------------------------------------
For bullet #1, I could see it being abused if set globally to something small,
but pretty much any setting could be abused this way.
For bullet #3: at least if you cap the number of rows per row group during
the write, you can adjust the parallelism on the read side (many tasks may
end up processing the same file). This doesn't require reading the footer
during split calculation (client side), but it does require tuning the split
size.
What might make more sense, in lieu of a number-of-rows limit (bullet #2), is
a raw data size limit per row group (prior to encoding/compression). At least
then you are capping the maximum amount of data a task would need to process,
which is really what we're trying to do. The raw size will somewhat normalize
the effectiveness of the encodings and compression algorithms. Obviously,
this will still be reduced by column projection, but that's no different from
what we do now (with the exception of client-side, metadata-based split
calculations like those available in Pig).
For me, this makes more sense than number of rows.
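The policy being proposed can be sketched roughly as follows. This is an
illustrative model only, not parquet-mr code: `plan_row_groups` and
`RAW_GROUP_LIMIT` are hypothetical names, and the byte sizes are toy values.
The point is that flushing on raw (pre-encoding) size keeps row groups to a
bounded amount of work even when a constant column compresses almost to
nothing.

```python
# Hypothetical sketch of a raw-size-based row group flush policy.
# A writer buffers rows and closes the current row group once the
# accumulated *raw* (uncompressed, unencoded) size reaches a cap.

RAW_GROUP_LIMIT = 1024  # illustrative cap, in raw bytes per row group


def plan_row_groups(rows, raw_limit=RAW_GROUP_LIMIT):
    """Split rows into groups, closing a group when its raw size hits the cap."""
    groups, current, size = [], [], 0
    for row in rows:
        current.append(row)
        size += len(row)  # raw size of the row, before encoding/compression
        if size >= raw_limit:
            groups.append(current)
            current, size = [], 0
    if current:
        groups.append(current)
    return groups


# A constant metric compresses extremely well, but its raw size still forces
# regular flushes, so each reader task gets a bounded unit of work.
constant_rows = [b"metric=0" * 4] * 1000  # 32 raw bytes per row
groups = plan_row_groups(constant_rows)
```

With a size-only (post-compression) limit, all 1000 constant rows could land
in one row group; capping on raw size splits them into many evenly sized
groups instead, which is what restores read-side parallelism.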
> Limit the number of rows per block and per split
> ------------------------------------------------
>
> Key: PARQUET-344
> URL: https://issues.apache.org/jira/browse/PARQUET-344
> Project: Parquet
> Issue Type: Improvement
> Components: parquet-mr
> Reporter: Quentin Francois
>
> We use Parquet to store raw metrics data and then query this data with
> Hadoop-Pig.
> The issue is that sometimes we end up with small Parquet files (~80 MB)
> that contain more than 300,000,000 rows, usually because of a constant
> metric which results in very good compression. Too good. As a result we
> have a very small number of maps that process up to 10x more rows than the
> other maps, and we lose the benefits of parallelization.
> The fix for that has two components I believe:
> 1. Be able to limit the number of rows per Parquet block (in addition to the
> size limit).
> 2. Be able to limit the number of rows per split.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)