[ 
https://issues.apache.org/jira/browse/HIVE-7231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14048531#comment-14048531
 ] 

Gopal V commented on HIVE-7231:
-------------------------------

Tests on 1 TB show that this does cut down on padding, but it writes 
progressively smaller stripes within a block.

I saw 12 MB and 8 MB stripes being written before the 3.2 MB stripe size 
trigger sets in and causes a pad event.

{code}
Resetting stripe size via (1.0 - 0.000000) * (0.663954 * 66945840) = 44448964
Resetting stripe size via (1.0 - 0.000000) * (0.495154 * 44448964) = 22009074
Resetting stripe size via (1.0 - 0.000000) * (0.358696 * 22009074) = 7894571
Resetting stripe size via (1.0 - 0.000000) * (0.263782 * 7894571) = 2082443
Resetting stripe size via (1.0 - 0.000000) * (0.581675 * 2082443) = 1211304
Resetting stripe size via (1.0 - 0.000000) * (0.814780 * 1211304) = 986946 
Resetting stripe size via (1.0 - 0.000000) * (0.772579 * 986946) = 762494 
{code}
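
Each log line above multiplies the current stripe size by the fraction that 
still fits in the block, so the shrink compounds. Here is a minimal Python 
sketch that replays that arithmetic, assuming the formula is exactly what the 
log prints. The function name replay_stripe_sizes is made up for illustration 
and is not from the patch. Because the logged ratios are rounded to six 
decimals, the results match the log only to within a few bytes.

```python
# Starting stripe size and per-reset ratios, copied from the log above.
START_STRIPE_SIZE = 66945840
LOGGED_RATIOS = [0.663954, 0.495154, 0.358696, 0.263782,
                 0.581675, 0.814780, 0.772579]


def replay_stripe_sizes(start, ratios, padding_tolerance=0.0):
    """Replay the logged formula: (1.0 - tolerance) * (ratio * current size).

    Hypothetical helper for illustration only; the log shows a tolerance
    term of 0.000000, so it defaults to 0.0 here.
    """
    sizes = []
    size = start
    for ratio in ratios:
        size = int((1.0 - padding_tolerance) * (ratio * size))
        sizes.append(size)
    return sizes


if __name__ == "__main__":
    for size in replay_stripe_sizes(START_STRIPE_SIZE, LOGGED_RATIOS):
        print(size, "bytes,", round(100.0 * size / START_STRIPE_SIZE, 2),
              "% of the starting stripe size")
```

After the seven resets logged here, the stripe size is down to roughly 1.1% 
of the starting 66945840 bytes.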

I think I might undo the "as a fraction of stripe size" bit and make the 
padding amount a fraction of the HDFS block size, to keep stripe sizes as 
consistent as possible.
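
To make the difference between the two bases concrete, here is a small Python 
sketch. Everything in it is an assumption for illustration: the 0.05 tolerance, 
the 64 MB stripe size, the 256 MB block size, and both function names are 
hypothetical, not taken from the patch. With those assumed values a 
stripe-based limit comes out near the 3.2 MB trigger mentioned above, while a 
block-based limit does not change when the stripe size is reset.

```python
MB = 1024 * 1024


def pad_limit_from_stripe(stripe_size, tolerance):
    """Padding limit as a fraction of the (possibly shrunken) stripe size."""
    return int(tolerance * stripe_size)


def pad_limit_from_block(block_size, tolerance):
    """Padding limit as a fraction of the HDFS block size (the idea floated above)."""
    return int(tolerance * block_size)


if __name__ == "__main__":
    tolerance = 0.05          # assumed value
    block_size = 256 * MB     # assumed HDFS block size
    for stripe_size in (64 * MB, 12 * MB, 8 * MB):  # assumed / from the comment
        print("stripe", stripe_size // MB, "MB:",
              "stripe-based limit", pad_limit_from_stripe(stripe_size, tolerance),
              "bytes, block-based limit", pad_limit_from_block(block_size, tolerance),
              "bytes")
```

The stripe-based limit shrinks along with every stripe size reset, whereas the 
block-based limit stays fixed for the whole block.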

> Improve ORC padding
> -------------------
>
>                 Key: HIVE-7231
>                 URL: https://issues.apache.org/jira/browse/HIVE-7231
>             Project: Hive
>          Issue Type: Improvement
>          Components: File Formats
>    Affects Versions: 0.14.0
>            Reporter: Prasanth J
>            Assignee: Prasanth J
>              Labels: orcfile
>         Attachments: HIVE-7231.1.patch, HIVE-7231.2.patch, HIVE-7231.3.patch, 
> HIVE-7231.4.patch, HIVE-7231.5.patch, HIVE-7231.6.patch
>
>
> Current ORC padding is not optimal because of fixed stripe sizes within a 
> block. The padding overhead will be significant in some cases. Also, the 
> padding percentage relative to stripe size is not configurable.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
