Shardul Mahadik created ORC-220:
-----------------------------------

             Summary: Stripe size too small for wide tables
                 Key: ORC-220
                 URL: https://issues.apache.org/jira/browse/ORC-220
             Project: ORC
          Issue Type: Bug
    Affects Versions: 1.4.0, 1.3.0, 1.2.0, 1.1.0, 1.0.0
            Reporter: Shardul Mahadik


For a wide table (e.g. one with 100 columns), I observed that very small 
stripes were generated.
As an example, for a table with 133 columns and a configured stripe size of 
128MB with ZLIB compression, Hive 1.1 generated 35k stripes of 0.03MB each; 
with Hive 2 the situation improved to 1.2k stripes of 0.8MB each (mostly 
because Hive 2 selected a 64KB compression buffer size instead of the 
specified 256KB).
I came across this PR, https://github.com/apache/hive/pull/118, which was sent 
to the Hive repo. The PR suggests using ByteBuffer.position() instead of 
ByteBuffer.capacity() to estimate the stripe size. This is really useful for 
wide tables, where the difference between the position and the capacity of the 
buffers can add up significantly. In our case, with this patch, I saw that the 
number of stripes went down to 115, each stripe being 8.3MB. The patch reduced 
the value returned by estimateStripeSize() by approximately 15MB, which delayed 
the flushing of the stripes.
Would like to know your thoughts on this.
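
To illustrate why the two estimates diverge for wide tables, here is a minimal 
standalone sketch. It is not the ORC writer code or the code from the PR: the 
class name, the helper methods, the one-buffer-per-column layout and the 4KB 
fill level are all assumptions chosen for illustration. Only 
ByteBuffer.capacity() and ByteBuffer.position() are real APIs.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

public class StripeSizeEstimateSketch {

    // Sums the allocated size of every buffer, whether or not it has been filled.
    static long estimateByCapacity(List<ByteBuffer> buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.capacity();
        }
        return total;
    }

    // Sums only the bytes written so far into each buffer.
    static long estimateByPosition(List<ByteBuffer> buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.position();
        }
        return total;
    }

    // Hypothetical setup: one buffer per column, each only partially filled.
    static List<ByteBuffer> partiallyFilledBuffers(int columns, int bufferSize, int bytesWritten) {
        List<ByteBuffer> buffers = new ArrayList<>();
        for (int i = 0; i < columns; i++) {
            ByteBuffer b = ByteBuffer.allocate(bufferSize);
            b.put(new byte[bytesWritten]); // advances position by bytesWritten
            buffers.add(b);
        }
        return buffers;
    }

    public static void main(String[] args) {
        // 133 columns and a 256KB buffer as in the report; 4KB written is an assumed value.
        List<ByteBuffer> buffers = partiallyFilledBuffers(133, 256 * 1024, 4 * 1024);
        System.out.println("capacity-based estimate: " + estimateByCapacity(buffers) + " bytes");
        System.out.println("position-based estimate: " + estimateByPosition(buffers) + " bytes");
    }
}
```

With many mostly empty buffers, the capacity-based sum counts the unused part 
of every buffer, so the estimate reaches the stripe size threshold long before 
that much data has actually been written.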




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
