[
https://issues.apache.org/jira/browse/HIVE-11807?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14744025#comment-14744025
]
Owen O'Malley commented on HIVE-11807:
--------------------------------------
Ok, there are a couple of changes that I'd propose:
* Use the stripe size rather than the available memory. This is more important
because the stripe will be flushed when the buffering reaches the stripe size.
* Count all of the columns, not just the top-level ones.
* Most of the columns have at most 2 large streams, so if we use 20 buffers,
that will give us a reasonable balance between internal fragmentation and
throughput (see the sketch below).
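
As a rough sketch of that heuristic (the class and method names, the power-of-two
rounding, and the clamping bounds below are illustrative assumptions, not the
actual patch; the "20 buffers" and "2 large streams" figures come from the
proposal above):

{code:java}
public class OrcBufferSizeEstimator {
  // Illustrative clamping bounds; the real writer's limits may differ.
  private static final int MIN_BUFFER_SIZE = 4 * 1024;    // 4K
  private static final int MAX_BUFFER_SIZE = 256 * 1024;  // 256K

  /**
   * Give each column roughly 20 buffers within one stripe, so that a
   * column's couple of large streams still get several buffers each
   * before the buffered data reaches the stripe size and is flushed.
   */
  static int estimateBufferSize(long stripeSize, int totalColumns) {
    long target = stripeSize / (20L * totalColumns);
    // Round down to a power of two and clamp (both are assumptions).
    int rounded = Integer.highestOneBit((int) Math.min(target, MAX_BUFFER_SIZE));
    return Math.max(MIN_BUFFER_SIZE, rounded);
  }

  public static void main(String[] args) {
    // The 64MB stripe / 54-column case from the report below: ~32K instead of 256K.
    System.out.println(estimateBufferSize(64L * 1024 * 1024, 54));
  }
}
{code}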
> Set ORC buffer size in relation to set stripe size
> --------------------------------------------------
>
> Key: HIVE-11807
> URL: https://issues.apache.org/jira/browse/HIVE-11807
> Project: Hive
> Issue Type: Improvement
> Components: File Formats
> Reporter: Owen O'Malley
> Assignee: Owen O'Malley
>
> A customer produced ORC files with very small stripes (10k rows/stripe)
> by setting a small stripe size of 64MB and a buffer size of 256K for a
> 54-column table. At that size, each of the streams only gets a buffer or two
> before the stripe size is reached. The current code uses the available memory
> instead of the stripe size, and thus doesn't shrink the buffer size when the
> JVM has much more memory than the stripe size.
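
To make the arithmetic in the description above concrete (the figure of roughly
2 large streams per column is an assumption; the other numbers come from the
report):

{code:java}
public class StripeBufferMath {
  public static void main(String[] args) {
    long stripeSize = 64L * 1024 * 1024;  // configured stripe size (64MB)
    int bufferSize = 256 * 1024;          // configured buffer size (256K)
    int columns = 54;                     // columns in the table
    int streamsPerColumn = 2;             // assumption: roughly 2 large streams per column

    long totalBuffers = stripeSize / bufferSize;  // 256 buffers fill the whole stripe
    double perStream = (double) totalBuffers / (columns * streamsPerColumn);
    System.out.println(perStream);        // ~2.4 buffers per stream before the stripe flushes
  }
}
{code}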
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)