[
https://issues.apache.org/jira/browse/ORC-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16108038#comment-16108038
]
Shardul Mahadik commented on ORC-220:
-------------------------------------
I am creating a table using Hive with the table properties for stripe size and
buffer size set. Then I run a Hive query to load data from one table to
another. I am not using dynamic partitioning.
In this case, I would assume I am not hitting the memory limit, as the scale
for the memory manager does not change. Yet with Hive 1.1, all the stripes
generated had only 5k rows. With Hive 2, which sets the buffer size to 64KB
[HIVE-11807], the initial 10-15 stripes were generated properly. After that,
as the writer approached the HDFS block boundary, it generated 400+ stripes of
5k rows each. This happened because, even though estimateStripeSize indicated
that the 128MB stripe size limit had been reached, the actual data in the
buffers was much smaller and resulted in a 28KB stripe when flushed. It took a
lot of such small stripes to reach the max padding ratio threshold and reset
the stripe size while starting a new HDFS block.
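To illustrate the gap between the estimate and the data actually buffered,
here is a minimal standalone sketch. It is not the ORC writer code: the class
name, the helper names, the choice of one buffer per column, and the 200 bytes
written per buffer are all assumptions made for illustration. It only compares
summing ByteBuffer.capacity() against summing ByteBuffer.position() over a set
of mostly empty buffers.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; this is not code from ORC or Hive.
public class StripeEstimateSketch {

    // Capacity-based estimate: counts every allocated byte, written or not.
    static long estimateByCapacity(List<ByteBuffer> buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.capacity();
        }
        return total;
    }

    // Position-based estimate: counts only the bytes written so far.
    static long estimateByPosition(List<ByteBuffer> buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.position();
        }
        return total;
    }

    // Allocates `count` buffers of `capacity` bytes and writes
    // `bytesWritten` zero bytes into each (assumed workload).
    static List<ByteBuffer> makeBuffers(int count, int capacity, int bytesWritten) {
        List<ByteBuffer> buffers = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            ByteBuffer b = ByteBuffer.allocate(capacity);
            b.put(new byte[bytesWritten]);
            buffers.add(b);
        }
        return buffers;
    }

    public static void main(String[] args) {
        // Assumption: one 256KB buffer per column of a 133-column table.
        List<ByteBuffer> buffers = makeBuffers(133, 256 * 1024, 200);
        System.out.println("capacity-based estimate: " + estimateByCapacity(buffers));
        System.out.println("position-based estimate: " + estimateByPosition(buffers));
    }
}
```

With these assumed numbers, the capacity-based sum is several orders of
magnitude larger than the position-based sum, which is the kind of
overestimate that would trigger an early flush.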
> Stripe size too small for wide tables
> -------------------------------------
>
> Key: ORC-220
> URL: https://issues.apache.org/jira/browse/ORC-220
> Project: ORC
> Issue Type: Bug
> Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0
> Reporter: Shardul Mahadik
>
> For a wide table with, e.g., 100 columns, I observed that really small
> stripes were generated.
> As an example, for a table with 133 columns and a stripe size of 128MB with
> ZLIB, Hive 1.1 generated 35k stripes of 0.03MB; with Hive 2 the situation
> improved, with 1.2k stripes of 0.8MB (mostly because Hive 2 selected a 64KB
> compression buffer size instead of the specified 256KB).
> I came across this PR https://github.com/apache/hive/pull/118 which was sent
> to the Hive repo. The PR suggests using ByteBuffer.position() instead of
> ByteBuffer.capacity() to estimate the stripe size. This is really useful for
> wide tables, where the difference between position and capacity of the
> buffers can add up significantly. In our case, with this patch, I saw that
> the number of stripes went down to 115, each stripe being 8.3MB. The patch
> reduced the value returned by estimateStripeSize() by approx. 15MB, which
> delayed the flushing of the stripes.
> Would like to know your thoughts on this.