Hello Parquet-eers,
I am studying how Parquet maps rowgroups to hdfs blocks, and I found some 
unexpected behavior (at least, I did not expect it ☺).
Here is the layout of a Parquet file with 12 rowgroups on hdfs, written using 
Hive, with an hdfs block size of 134217728 bytes and the rowgroup size also set 
to 134217728 bytes at write time. All values are in bytes.

offset        RG size       offset + RG size   end of hdfs block
4             141389243     141389247          134217728
141389247     129560117     270949364          268435456
270949364     137647948     408597312          402653184
408597312     136785886     545383198          536870912
545383198     124824992     670208190          671088640
671088640     139463692     810552332          805306368    <- alignment
810552332     137161048     947713380          939524096
947713380     128972798     1076686178         1073741824
1076686178    138875458     1215561636         1207959552
1215561636    128142960     1343704596         1342177280
1343704596    138192915     1481897511         1476395008
1481897511    1149147       1483046658         1610612736

Ideally, we would want each rowgroup on one and only one hdfs block. So I was 
expecting each rowgroup to be a little smaller than 134217728 bytes, to fit 
into a single hdfs block, and then to be padded to the end of that block so 
that the next rowgroup starts on the next block.
But what I see is that many rowgroups are actually bigger than 134217728, and 
there is only one instance of padding that realigns a rowgroup to an hdfs block 
boundary (see the row I tagged "alignment" above).
And even after this realignment, the next rowgroup is still bigger than 
134217728, so the following rowgroup again sits on 2 blocks. So in this 
example, almost all rowgroups (10 of 12) sit on 2 blocks, even though my 
intention as the user was to have each rowgroup on one hdfs block (hence 
setting the rowgroup size equal to the hdfs block size).
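
For anyone who wants to check the arithmetic, here is a small sketch that 
recomputes, from the offsets and sizes in the table above, which hdfs blocks 
each rowgroup touches and how much padding precedes it. The function names 
(blocks_spanned, summarize) are made up for this illustration and are not part 
of Parquet or Hive; the numbers are copied from the table.

```python
BLOCK_SIZE = 134217728  # hdfs block size in bytes, as stated in the post

# (offset, size) of each rowgroup, copied from the table in the post.
ROW_GROUPS = [
    (4, 141389243),
    (141389247, 129560117),
    (270949364, 137647948),
    (408597312, 136785886),
    (545383198, 124824992),
    (671088640, 139463692),
    (810552332, 137161048),
    (947713380, 128972798),
    (1076686178, 138875458),
    (1215561636, 128142960),
    (1343704596, 138192915),
    (1481897511, 1149147),
]


def blocks_spanned(offset, size, block_size=BLOCK_SIZE):
    """Return (first, last) block index touched by bytes [offset, offset + size)."""
    first = offset // block_size
    last = (offset + size - 1) // block_size
    return first, last


def summarize(row_groups, block_size=BLOCK_SIZE):
    """For each rowgroup, report the blocks it touches and the gap before it."""
    rows = []
    prev_end = None
    for offset, size in row_groups:
        first, last = blocks_spanned(offset, size, block_size)
        # Gap between the end of the previous rowgroup and this one's start.
        padding = 0 if prev_end is None else offset - prev_end
        rows.append({
            "offset": offset,
            "size": size,
            "first_block": first,
            "last_block": last,
            "spans_two_blocks": last > first,
            "padding_before": padding,
        })
        prev_end = offset + size
    return rows


if __name__ == "__main__":
    for i, r in enumerate(summarize(ROW_GROUPS), start=1):
        print(i, r["offset"], r["size"], r["first_block"], r["last_block"],
              "SPLIT" if r["spans_two_blocks"] else "ok", r["padding_before"])
```

Only the rowgroup starting at 671088640 has a non-zero gap before it, which 
matches the single realignment tagged in the table.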

So my question: does Parquet make any attempt to optimize the rowgroup to hdfs 
block mapping, so that each rowgroup sits in one and only one hdfs block 
(like the ORC stripe padding implemented in Hive 0.12)?
If yes, am I configuring something wrong that prevents the desired behavior?

Thanks in advance for the help,
Eric Owhadi
