[ 
https://issues.apache.org/jira/browse/ORC-220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16107938#comment-16107938
 ] 

Prasanth Jayachandran commented on ORC-220:
-------------------------------------------

[~shardulm] How are you generating the ORC files? Are you using Hive? The 
stripe size is affected when there is not much memory available for ORC 
writers, because concurrent writers share the available memory. For example, 
if you are using dynamic partitioning in Hive, reducers will keep many ORC 
writers open at the same time, reducing the stripe size of individual writers. 
You could provide more memory, reduce the stripe size, or enable 
hive.optimize.sort.dynamic.partition, which makes sure only one writer is open 
at a time in the case of dynamic partitioning. By default the ORC memory 
manager uses only 50% (hive.exec.orc.memory.pool) of heap memory, leaving some 
space for aggregation, sort buffers etc.
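
To illustrate the sharing described above, here is a rough sketch of how the 
per-writer stripe budget could shrink as more writers are opened. It is a 
simplified model, not the actual ORC memory manager code: the class and 
method names are hypothetical, the 1 GB heap and 100 writers are made-up 
inputs, and it assumes the pool is split in proportion to each writer's 
requested stripe size.

```java
// Simplified model of writers sharing one memory pool.
// Hypothetical names; this is NOT the real ORC MemoryManager implementation.
final class StripeBudgetSketch {

    // Scale factor applied to every writer's stripe size: 1.0 while the
    // total requested size fits in the pool, otherwise pool / total.
    static double allocationScale(long poolBytes, long totalRequestedBytes) {
        if (totalRequestedBytes <= poolBytes) {
            return 1.0;
        }
        return (double) poolBytes / totalRequestedBytes;
    }

    // Effective stripe size per writer when `openWriters` writers, each
    // configured with `stripeBytes`, share `poolFraction` of the heap.
    static long effectiveStripeBytes(long heapBytes, double poolFraction,
                                     long stripeBytes, int openWriters) {
        long pool = (long) (heapBytes * poolFraction);
        long total = stripeBytes * openWriters;
        return (long) (stripeBytes * allocationScale(pool, total));
    }

    public static void main(String[] args) {
        final long mb = 1024L * 1024L;
        long heap = 1024 * mb;    // assumed 1 GB heap
        long stripe = 128 * mb;   // configured stripe size
        // One writer keeps its full stripe; many writers get a small share.
        System.out.println(effectiveStripeBytes(heap, 0.5, stripe, 1));
        System.out.println(effectiveStripeBytes(heap, 0.5, stripe, 100));
    }
}
```

Under these assumptions a single writer keeps its full 128MB stripe, while 
100 concurrent writers are each scaled down to roughly 5MB.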

I don't think using ByteBuffer.position() would be correct here, because the 
size has to account for memory usage on the heap. Whether or not an ORC stream 
uses its buffer fully, the memory manager has to account for the total 
allocation.
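
The difference between the two measures can be seen with plain 
java.nio.ByteBuffer objects. The sketch below is only an illustration: the 
class name is hypothetical, and it assumes one 256KB buffer per column with 
1KB written to each, which is much simpler than the real ORC stream layout.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration of position() vs capacity() accounting.
final class BufferAccountingSketch {

    // Bytes reserved on the heap, regardless of how much has been written.
    static long sumCapacity(ByteBuffer[] buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.capacity();
        }
        return total;
    }

    // Bytes actually written so far.
    static long sumPosition(ByteBuffer[] buffers) {
        long total = 0;
        for (ByteBuffer b : buffers) {
            total += b.position();
        }
        return total;
    }

    // Allocates one buffer per column and writes `writtenBytes` into each.
    static ByteBuffer[] fillBuffers(int columns, int bufferBytes, int writtenBytes) {
        ByteBuffer[] buffers = new ByteBuffer[columns];
        for (int i = 0; i < columns; i++) {
            buffers[i] = ByteBuffer.allocate(bufferBytes);
            buffers[i].put(new byte[writtenBytes]);
        }
        return buffers;
    }

    public static void main(String[] args) {
        // 133 columns, 256KB buffer each, 1KB written to each (assumed values).
        ByteBuffer[] buffers = fillBuffers(133, 256 * 1024, 1024);
        System.out.println("capacity sum: " + sumCapacity(buffers));
        System.out.println("position sum: " + sumPosition(buffers));
    }
}
```

The capacity sum reflects what the JVM has actually allocated (about 33MB 
here), while the position sum reflects only the data written so far (about 
133KB).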

> Stripe size too small for wide tables
> -------------------------------------
>
>                 Key: ORC-220
>                 URL: https://issues.apache.org/jira/browse/ORC-220
>             Project: ORC
>          Issue Type: Bug
>    Affects Versions: 1.0.0, 1.1.0, 1.2.0, 1.3.0, 1.4.0
>            Reporter: Shardul Mahadik
>
> For a wide table with, e.g., 100 columns, I observed that very small 
> stripes were generated.
> As an example, for a table with 133 columns and Stripe Size=128MB with ZLIB, 
> Hive 1.1 generated 35k stripes of 0.03MB; with Hive 2 the situation improved 
> to 1.2k stripes of 0.8MB (mostly because Hive 2 selected a 64KB compression 
> buffer size instead of the specified 256KB).
> I came across this PR https://github.com/apache/hive/pull/118 which was sent 
> to the Hive repo. The PR suggests using ByteBuffer.position() instead of 
> ByteBuffer.capacity() to estimate the stripe size. This is really useful for 
> wide tables where the difference between position and capacity of the buffers 
> can add up significantly. In our case, with this patch, I saw that the number 
> of stripes went down to 115, each stripe being 8.3MB. The patch reduced the 
> value returned by estimateStripeSize() by approx 15MB, which delayed the 
> flushing of the stripes.
> Would like to know your thoughts on this.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
