[ https://issues.apache.org/jira/browse/HIVE-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048531#comment-14048531 ]
Gopal V commented on HIVE-7231:
-------------------------------

Tests on 1 TB show that this does cut down on padding, but it writes progressively smaller stripes within a block. I saw 12 MB and 8 MB stripes being written before the 3.2 MB stripe size trigger kicks in and forces a pad event.

{code}
Resetting stripe size via (1.0 - 0.000000) * (0.663954 * 66945840) = 44448964
Resetting stripe size via (1.0 - 0.000000) * (0.495154 * 44448964) = 22009074
Resetting stripe size via (1.0 - 0.000000) * (0.358696 * 22009074) = 7894571
Resetting stripe size via (1.0 - 0.000000) * (0.263782 * 7894571) = 2082443
Resetting stripe size via (1.0 - 0.000000) * (0.581675 * 2082443) = 1211304
Resetting stripe size via (1.0 - 0.000000) * (0.814780 * 1211304) = 986946
Resetting stripe size via (1.0 - 0.000000) * (0.772579 * 986946) = 762494
{code}

I think I might undo the "as a fraction of stripe size" bit and make the padding amount a fraction of the HDFS block size instead, so that stripe sizes stay as consistent as possible.

> Improve ORC padding
> -------------------
>
>                 Key: HIVE-7231
>                 URL: https://issues.apache.org/jira/browse/HIVE-7231
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 0.14.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: orcfile
>         Attachments: HIVE-7231.1.patch, HIVE-7231.2.patch, HIVE-7231.3.patch,
> HIVE-7231.4.patch, HIVE-7231.5.patch, HIVE-7231.6.patch
>
>
> Current ORC padding is not optimal because of fixed stripe sizes within
> block. The padding overhead will be significant in some cases. Also padding
> percentage relative to stripe size is not configurable.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
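
For reference, the "Resetting stripe size" log lines in the comment can be replayed with a few lines of Python. This is only a sketch of the arithmetic the log prints, not the patch's actual code: the function name `reset_stripe_size` and the parameter names `avail_ratio` and `padding_tolerance` are assumptions chosen here to label the three numbers in each log line. Because the log rounds the ratio to six decimals, the recomputed values differ from the logged ones by a few bytes.

```python
# Each tuple is one logged line: (ratio, old stripe size, logged new stripe size).
LOGGED = [
    (0.663954, 66945840, 44448964),
    (0.495154, 44448964, 22009074),
    (0.358696, 22009074, 7894571),
    (0.263782, 7894571, 2082443),
    (0.581675, 2082443, 1211304),
    (0.814780, 1211304, 986946),
    (0.772579, 986946, 762494),
]


def reset_stripe_size(avail_ratio, stripe_size, padding_tolerance=0.0):
    """Recompute a stripe size using the formula printed in the log.

    The parameter names are hypothetical labels for the logged numbers:
    (1.0 - padding_tolerance) * (avail_ratio * stripe_size).
    """
    return int((1.0 - padding_tolerance) * (avail_ratio * stripe_size))


if __name__ == "__main__":
    for ratio, old_size, logged_new in LOGGED:
        recomputed = reset_stripe_size(ratio, old_size)
        print(f"{old_size:>10} -> {recomputed:>10} (logged {logged_new})")
```

Since every reset multiplies the previous (already reduced) stripe size by a ratio below 1, the sizes shrink geometrically, which matches the behaviour described in the comment.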