[ 
https://issues.apache.org/jira/browse/ORC-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111756#comment-16111756
 ] 

Shardul Mahadik commented on ORC-220:
-------------------------------------

I have uploaded files that reproduce a similar scenario to [this 
gist|https://gist.github.com/shardulm94/718ab21d5e1e150924529d4f2359a1b4]. As 
the ORC metadata dump of the files created with Hive shows, the first few 
stripes had 2M+ rows. After that, the row count per stripe kept decreasing 
until the writer produced 220+ stripes of 5k rows each, before resetting at 
stripe 307.

{noformat}
Stripe# NumRows DataLen IndexLen FooterLen  NumStreams
1       2405000 5798154 264275  637     289
2       2380000 5706299 260115  639     289
3       2290000 5474255 248562  639     289
4       2380000 5812073 259139  639     289
5       2380000 5706299 260115  639     289
6       2290000 5474255 248562  639     289
7       2380000 5812073 259139  639     289
8       2380000 5706299 260115  639     289
9       2290000 5474255 248562  639     289
10      2380000 5812073 259139  639     289
.
.
.
60      275000  840168  47795   615     289
61      250000  937725  41872   613     289
62      230000  832381  39600   616     289
63      205000  700129  39505   616     289
64      185000  660683  36812   616     289
65      165000  611735  35006   616     289
.
.
.
81      5000    45814   4745    574     289
82      5000    45814   4745    574     289
83      5000    45692   4745    574     289
84      5000    45814   4745    574     289
85      5000    45570   4745    573     289
.
.
.
304     5000    45814   4745    574     289
305     5000    45570   4745    573     289
306     5000    45692   4745    574     289
307     2265000 5613012 250762  638     289
308     2445000 6021628 269308  637     289
309     2380000 5706299 260115  639     289
{noformat}



> Stripe size too small for wide tables
> -------------------------------------
>
>                 Key: ORC-220
>                 URL: https://issues.apache.org/jira/browse/ORC-220
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0
>            Reporter: Shardul Mahadik
>
> For a wide table (e.g. 100 columns), I observed that very small 
> stripes were generated.
> As an example, for a table with 133 columns and stripe size = 128MB with ZLIB, 
> Hive 1.1 generated 35k stripes of 0.03MB; with Hive 2 the situation improved 
> to 1.2k stripes of 0.8MB (mostly because Hive 2 selected a 64KB compression 
> buffer size instead of the specified 256KB).
> I came across the PR https://github.com/apache/hive/pull/118, which was sent 
> to the Hive repo. It suggests using ByteBuffer.position() instead of 
> ByteBuffer.capacity() to estimate the stripe size. This is really useful for 
> wide tables, where the difference between position and capacity of the buffers 
> can add up significantly. In our case, with this patch, the number 
> of stripes went down to 115, each stripe being 8.3MB. The patch reduced the 
> value returned by estimateStripeSize() by approx. 15MB, which delayed the 
> flushing of the stripes.
> I would like to know your thoughts on this.
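
To illustrate the position-versus-capacity gap described in the issue, here is a minimal Java sketch. It is not the ORC writer code: the class StripeEstimateSketch and its methods are hypothetical, and it assumes one buffer per stream, using the 289 streams from the metadata dump and the 256KB buffer size from the description. The 1000 bytes written per buffer is an arbitrary illustrative value. It only shows how summing ByteBuffer.capacity() can overshoot summing ByteBuffer.position() when many buffers are mostly empty.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; the real ORC estimate has more components.
class StripeEstimateSketch {

    // Allocate one buffer per stream and write bytesWritten bytes into each.
    static List<ByteBuffer> makeBuffers(int streams, int bufferSize, int bytesWritten) {
        List<ByteBuffer> buffers = new ArrayList<>();
        for (int i = 0; i < streams; i++) {
            ByteBuffer buf = ByteBuffer.allocate(bufferSize);
            buf.put(new byte[bytesWritten]); // advances position() by bytesWritten
            buffers.add(buf);
        }
        return buffers;
    }

    // Estimate that counts the full allocated size of every buffer.
    static long estimateByCapacity(List<ByteBuffer> buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.capacity();
        }
        return total;
    }

    // Estimate that counts only the bytes actually written so far.
    static long estimateByPosition(List<ByteBuffer> buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.position();
        }
        return total;
    }

    public static void main(String[] args) {
        // 289 streams and 256KB buffers come from the issue; 1000 bytes is assumed.
        List<ByteBuffer> buffers = makeBuffers(289, 256 * 1024, 1000);
        System.out.println("capacity-based estimate: " + estimateByCapacity(buffers) + " bytes");
        System.out.println("position-based estimate: " + estimateByPosition(buffers) + " bytes");
    }
}
```

Under these assumptions the capacity-based sum is roughly 72MB even though well under 1MB has been written, which is in line with the writer reaching the stripe-size threshold far too early on wide tables.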



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
