[
https://issues.apache.org/jira/browse/ORC-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108383#comment-16108383
]
Prasanth Jayachandran commented on ORC-220:
-------------------------------------------
[~shardulm] That looks like a bug in the stripe size adjustment made to meet
the padding threshold. Do you have an ORC file dump that you can share, or
perhaps a repro test case?
> Stripe size too small for wide tables
> -------------------------------------
>
> Key: ORC-220
> URL: https://issues.apache.org/jira/browse/ORC-220
> Project: ORC
> Issue Type: Bug
> Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0
> Reporter: Shardul Mahadik
>
> For a wide table with, e.g., 100 columns, I observed that very small
> stripes were generated.
> As an example, for a table with 133 columns and Stripe Size=128MB with ZLIB,
> Hive 1.1 generated 35k stripes of 0.03MB; with Hive 2 the situation improved
> to 1.2k stripes of 0.8MB (mostly because Hive 2 selected a 64KB compression
> buffer size instead of the specified 256KB).
> I came across this PR https://github.com/apache/hive/pull/118 which was sent
> to the Hive repo. The PR suggests using ByteBuffer.position() instead of
> ByteBuffer.capacity() to estimate the stripe size. This is really useful for
> wide tables where the difference between position and capacity of the buffers
> can add up significantly. In our case, with this patch, I saw that the number
> of stripes went down to 115, each stripe being 8.3MB. The patch reduced the
> value returned by estimateStripeSize() by approx. 15MB, which delayed the
> flushing of the stripes.
> I would like to know your thoughts on this.
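
To illustrate why the choice between capacity() and position() matters for
wide tables, here is a minimal sketch. It is not the ORC writer code: the class
and method names are hypothetical, and it assumes one 256KB buffer per column
with only 1KB written to each, whereas a real writer keeps several streams per
column.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not taken from ORC or Hive.
class StripeEstimateSketch {

    // Allocate one buffer per column and write bytesWritten bytes into each.
    static ByteBuffer[] makeBuffers(int columns, int bufferSize, int bytesWritten) {
        ByteBuffer[] buffers = new ByteBuffer[columns];
        for (int i = 0; i < columns; i++) {
            buffers[i] = ByteBuffer.allocate(bufferSize);
            buffers[i].put(new byte[bytesWritten]); // advances position()
        }
        return buffers;
    }

    // Counts the full allocated size of every buffer, used or not.
    static long estimateByCapacity(ByteBuffer[] buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.capacity();
        }
        return total;
    }

    // Counts only the bytes actually written so far.
    static long estimateByPosition(ByteBuffer[] buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.position();
        }
        return total;
    }

    public static void main(String[] args) {
        // 133 columns, 256KB buffers, 1KB of data in each (assumed values).
        ByteBuffer[] buffers = makeBuffers(133, 256 * 1024, 1024);
        System.out.println("capacity-based estimate: " + estimateByCapacity(buffers));
        System.out.println("position-based estimate: " + estimateByPosition(buffers));
    }
}
```

With mostly empty buffers, the capacity-based total grows with the number of
columns regardless of how much data has been written, so an estimate built on
it can reach a flush threshold long before the stripe holds that much data.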
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)