[ 
https://issues.apache.org/jira/browse/ORC-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108383#comment-16108383
 ] 

Prasanth Jayachandran commented on ORC-220:
-------------------------------------------

[~shardulm] That looks like a bug in how the stripe size is adjusted to meet 
the padding threshold. Do you have an ORC file dump that you can share, or a 
repro test case?

> Stripe size too small for wide tables
> -------------------------------------
>
>                 Key: ORC-220
>                 URL: https://issues.apache.org/jira/browse/ORC-220
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0
>            Reporter: Shardul Mahadik
>
> For a wide table with, e.g., 100 columns, I observed that very small 
> stripes were generated.
> As an example, for a table with 133 columns and a stripe size of 128MB with 
> ZLIB, Hive 1.1 generated 35k stripes of 0.03MB each; with Hive 2 the 
> situation improved to 1.2k stripes of 0.8MB each (mostly because Hive 2 
> selected a 64KB compression buffer size instead of the specified 256KB).
> I came across this PR https://github.com/apache/hive/pull/118 which was sent 
> to the Hive repo. The PR suggests using ByteBuffer.position() instead of 
> ByteBuffer.capacity() to estimate the stripe size. This is really useful for 
> wide tables, where the difference between the position and the capacity of 
> the buffers can add up significantly. In our case, with this patch, the 
> number of stripes went down to 115, each stripe being 8.3MB. The patch 
> reduced the value returned by estimateStripeSize() by approx. 15MB, which 
> delayed the flushing of the stripes.
> I would like to know your thoughts on this.
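
To illustrate why the two estimates diverge for wide tables, here is a minimal 
sketch, not the ORC writer's actual code. The class and helper names 
(StripeEstimateSketch, makeBuffers, estimateByCapacity, estimateByPosition) are 
hypothetical, and it assumes one buffer per column with 1KB written so far. 
The 133 columns and 256KB buffer size are taken from the description above.

```java
import java.nio.ByteBuffer;

class StripeEstimateSketch {
    // Hypothetical helper: allocate one buffer per column and write a few bytes to each.
    static ByteBuffer[] makeBuffers(int columns, int bufferSize, int bytesWritten) {
        ByteBuffer[] buffers = new ByteBuffer[columns];
        for (int i = 0; i < columns; i++) {
            buffers[i] = ByteBuffer.allocate(bufferSize);
            buffers[i].put(new byte[bytesWritten]); // advances position by bytesWritten
        }
        return buffers;
    }

    // Estimate based on allocated space, regardless of how much data was written.
    static long estimateByCapacity(ByteBuffer[] buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.capacity();
        }
        return total;
    }

    // Estimate based on the bytes actually written into each buffer so far.
    static long estimateByPosition(ByteBuffer[] buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.position();
        }
        return total;
    }

    public static void main(String[] args) {
        // Assumed scenario: 133 columns, 256KB buffers, 1KB written per buffer.
        ByteBuffer[] buffers = makeBuffers(133, 256 * 1024, 1024);
        System.out.println("capacity-based estimate: " + estimateByCapacity(buffers));
        System.out.println("position-based estimate: " + estimateByPosition(buffers));
    }
}
```

Under these assumptions the capacity-based estimate is 34,865,152 bytes (about 
33MB) while the position-based estimate is 136,192 bytes, so a writer comparing 
the capacity-based figure against the stripe size would flush much earlier than 
the written data warrants.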



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
