Hello Parquet-eers,

I am studying Parquet behavior in terms of row group to HDFS block mapping, and I found some unexpected behavior (at least I did not expect it ☺). Here is a print of the layout of a Parquet file with 12 row groups on HDFS, written using Hive with an HDFS block size of 134217728 bytes (128 MiB) and the row group size also set to 134217728.

offset        RG size      Offset + RG size   end of hdfs block
4             141389243    141389247          134217728
141389247     129560117    270949364          268435456
270949364     137647948    408597312          402653184
408597312     136785886    545383198          536870912
545383198     124824992    670208190          671088640
671088640     139463692    810552332          805306368    <- alignment
810552332     137161048    947713380          939524096
947713380     128972798    1076686178         1073741824
1076686178    138875458    1215561636         1207959552
1215561636    128142960    1343704596         1342177280
1343704596    138192915    1481897511         1476395008
1481897511    1149147      1483046658         1610612736

Ideally, we would want each row group on one and only one HDFS block. So I was expecting each row group to be a little smaller than 134217728 bytes, to fit into a single HDFS block, and then to be padded to the end of that block so that the next row group starts on the next block.

But what I see is that many row groups are actually bigger than 134217728 bytes, and there is only one instance of padding that realigns a row group to an HDFS block boundary (see where I tagged "alignment" above). Even after this realignment, the next row group is still larger than 134217728 bytes, so the following row group again sits on 2 blocks.

So basically, in this example, nearly all row groups sit on 2 blocks, even though my intention as the user is to have each row group on one HDFS block (hence setting the row group size equal to the HDFS block size).

So, my questions:

Is there any attempt in the Parquet format to optimize the row group to HDFS block mapping, so that each row group sits in one and only one HDFS block (like the ORC stripe padding implemented in Hive 0.12)?

If yes, am I configuring something wrong to get the desired behavior?

Thanks in advance for the help,
Eric Owhadi
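
For anyone who wants to reproduce the block mapping from the numbers above, here is a minimal sketch in Python. It only does the arithmetic on the offsets and sizes listed in the table; the helper names (`blocks_spanned`, `straddling_row_groups`) are hypothetical and are not part of any Parquet or HDFS API, and the block size is assumed to be the 134217728 bytes stated in the message.

```python
# Sketch only: offsets and sizes are copied from the table in this message.
# The helper names are hypothetical, not part of any Parquet or HDFS API.
HDFS_BLOCK_SIZE = 134217728  # bytes, as stated in the message (128 MiB)

# (offset, row group size) pairs, in file order.
ROW_GROUPS = [
    (4, 141389243),
    (141389247, 129560117),
    (270949364, 137647948),
    (408597312, 136785886),
    (545383198, 124824992),
    (671088640, 139463692),  # the row tagged "alignment"
    (810552332, 137161048),
    (947713380, 128972798),
    (1076686178, 138875458),
    (1215561636, 128142960),
    (1343704596, 138192915),
    (1481897511, 1149147),
]


def blocks_spanned(offset, size, block_size=HDFS_BLOCK_SIZE):
    """Return the indices of the blocks touched by bytes [offset, offset + size)."""
    first = offset // block_size
    last = (offset + size - 1) // block_size
    return list(range(first, last + 1))


def straddling_row_groups(row_groups, block_size=HDFS_BLOCK_SIZE):
    """Return the positions of the row groups that touch more than one block."""
    return [
        i
        for i, (offset, size) in enumerate(row_groups)
        if len(blocks_spanned(offset, size, block_size)) > 1
    ]


if __name__ == "__main__":
    for i, (offset, size) in enumerate(ROW_GROUPS):
        print(i, offset, size, blocks_spanned(offset, size))
    print("row groups on more than one block:", straddling_row_groups(ROW_GROUPS))
```

On the numbers above, this reports that 10 of the 12 row groups touch two blocks; only the fifth row group and the small final one fit inside a single block.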
