[
https://issues.apache.org/jira/browse/ORC-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16111756#comment-16111756
]
Shardul Mahadik commented on ORC-220:
-------------------------------------
I have uploaded files that reproduce a similar scenario to [this
gist|https://gist.github.com/shardulm94/718ab21d5e1e150924529d4f2359a1b4]. As
the ORC metadata dump of the Hive-generated ORC files shows, the first few
stripes had 2M+ rows; after that, the number of rows per stripe started
decreasing, and the writer eventually produced 220+ stripes of 5k rows each
before resetting.
{noformat}
Stripe# NumRows DataLen IndexLen FooterLen NumStreams
1 2405000 5798154 264275 637 289
2 2380000 5706299 260115 639 289
3 2290000 5474255 248562 639 289
4 2380000 5812073 259139 639 289
5 2380000 5706299 260115 639 289
6 2290000 5474255 248562 639 289
7 2380000 5812073 259139 639 289
8 2380000 5706299 260115 639 289
9 2290000 5474255 248562 639 289
10 2380000 5812073 259139 639 289
.
.
.
60 275000 840168 47795 615 289
61 250000 937725 41872 613 289
62 230000 832381 39600 616 289
63 205000 700129 39505 616 289
64 185000 660683 36812 616 289
65 165000 611735 35006 616 289
.
.
.
81 5000 45814 4745 574 289
82 5000 45814 4745 574 289
83 5000 45692 4745 574 289
84 5000 45814 4745 574 289
85 5000 45570 4745 573 289
.
.
.
304 5000 45814 4745 574 289
305 5000 45570 4745 573 289
306 5000 45692 4745 574 289
307 2265000 5613012 250762 638 289
308 2445000 6021628 269308 637 289
309 2380000 5706299 260115 639 289
{noformat}
> Stripe size too small for wide tables
> -------------------------------------
>
> Key: ORC-220
> URL: https://issues.apache.org/jira/browse/ORC-220
> Project: ORC
> Issue Type: Bug
> Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0
> Reporter: Shardul Mahadik
>
> For a wide table (e.g. 100 columns), I observed that very small stripes
> were generated.
> As an example, for a table with 133 columns and a stripe size of 128MB with
> ZLIB, Hive 1.1 generated 35k stripes of 0.03MB; with Hive 2 the situation
> improved to 1.2k stripes of 0.8MB (mostly because Hive 2 selected a 64KB
> compression buffer size instead of the specified 256KB).
> I came across this PR, https://github.com/apache/hive/pull/118, which was
> sent to the Hive repo. The PR suggests using ByteBuffer.position() instead of
> ByteBuffer.capacity() to estimate the stripe size. This is especially useful
> for wide tables, where the difference between the position and capacity of
> the buffers can add up significantly. In our case, with this patch, the number
> of stripes went down to 115, each stripe being 8.3MB. The patch reduced the
> value returned by estimateStripeSize() by approx. 15MB, which delayed the
> flushing of the stripes.
> I would like to know your thoughts on this.
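
To illustrate why the choice between capacity and position matters for wide tables, here is a minimal sketch. It is not ORC's or Hive's actual estimateStripeSize() code; the class and method names are hypothetical. The stream count of 289 is borrowed from the NumStreams column of the dump above, while the 64KB buffer size and the 1KB of data written per buffer are purely illustrative assumptions.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: compares two ways of estimating buffered bytes.
public class StripeEstimateSketch {

    // Sums allocated sizes, so unused space in each buffer counts as data.
    static long estimateByCapacity(List<ByteBuffer> buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.capacity();
        }
        return total;
    }

    // Sums only the bytes written so far into each buffer.
    static long estimateByPosition(List<ByteBuffer> buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.position();
        }
        return total;
    }

    // Illustrative helper: one buffer per stream, each partially filled.
    static List<ByteBuffer> makeBuffers(int streams, int bufferSize, int bytesWritten) {
        List<ByteBuffer> buffers = new ArrayList<>();
        for (int i = 0; i < streams; i++) {
            ByteBuffer b = ByteBuffer.allocate(bufferSize);
            b.put(new byte[bytesWritten]); // advances position by bytesWritten
            buffers.add(b);
        }
        return buffers;
    }

    public static void main(String[] args) {
        // Assumed values: 289 streams, 64KB buffers, 1KB written to each.
        List<ByteBuffer> buffers = makeBuffers(289, 64 * 1024, 1024);
        System.out.println("capacity-based estimate: " + estimateByCapacity(buffers));
        System.out.println("position-based estimate: " + estimateByPosition(buffers));
    }
}
```

With many mostly empty buffers, the capacity-based sum grows with the number of streams regardless of how much data has been written, which is consistent with the early flushing described in the issue.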