Shardul Mahadik created ORC-220:
-----------------------------------
Summary: Stripe size too small for wide tables
Key: ORC-220
URL: https://issues.apache.org/jira/browse/ORC-220
Project: ORC
Issue Type: Bug
Affects Versions: 1.4.0, 1.3.0, 1.2.0, 1.1.0, 1.0.0
Reporter: Shardul Mahadik
For a wide table with, e.g., 100 columns, I observed that very small stripes
were generated.
As an example, for a table with 133 columns and Stripe Size=128MB with ZLIB,
Hive 1.1 generated 35k stripes of 0.03MB; with Hive 2 the situation improved to
1.2k stripes of 0.8MB (mostly because Hive 2 selected a 64KB compression buffer
size instead of the specified 256KB).
I came across this PR, https://github.com/apache/hive/pull/118, which was sent
to the Hive repo. The PR suggests using ByteBuffer.position() instead of
ByteBuffer.capacity() to estimate the stripe size. This is really useful for
wide tables, where the difference between the position and the capacity of the
buffers can add up significantly. In our case, with this patch, the number of
stripes went down to 115, each stripe being 8.3MB. The patch reduced the value
returned by estimateStripeSize() by approximately 15MB, which delayed the
flushing of the stripes.
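
To illustrate why the two estimates diverge for wide tables, here is a minimal
sketch, not the actual ORC or Hive writer code. The class and method names
(StripeEstimateSketch, estimateByCapacity, estimateByPosition, makeBuffers) are
hypothetical, and it assumes one partially filled 256KB buffer per column with
1KB written to each, which is a simplification of how the writer manages its
streams.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class StripeEstimateSketch {

    // Hypothetical helper: sums the allocated size of every buffer,
    // regardless of how much data has been written to it.
    public static long estimateByCapacity(List<ByteBuffer> buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.capacity();
        }
        return total;
    }

    // Hypothetical helper: sums only the bytes written so far
    // (the current write position of each buffer).
    public static long estimateByPosition(List<ByteBuffer> buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.position();
        }
        return total;
    }

    // Builds `count` buffers of `capacity` bytes and writes
    // `bytesWritten` zero bytes into each one.
    public static List<ByteBuffer> makeBuffers(int count, int capacity, int bytesWritten) {
        List<ByteBuffer> buffers = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            ByteBuffer b = ByteBuffer.allocate(capacity);
            b.put(new byte[bytesWritten]);
            buffers.add(b);
        }
        return buffers;
    }

    public static void main(String[] args) {
        // Assumed scenario: 133 columns, 256KB buffers, 1KB written to each.
        List<ByteBuffer> buffers = makeBuffers(133, 256 * 1024, 1024);
        System.out.println("capacity-based estimate: " + estimateByCapacity(buffers) + " bytes");
        System.out.println("position-based estimate: " + estimateByPosition(buffers) + " bytes");
    }
}
```

Under these assumptions the capacity-based sum counts the full 256KB for every
buffer even though each holds only 1KB, so the overestimate grows linearly with
the number of columns. An estimate inflated in this way would reach the
configured stripe size sooner and trigger earlier flushes.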
I would like to know your thoughts on this.
--
This message was sent by Atlassian JIRA
(v6.4.14#64029)