[
https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526023#comment-14526023
]
Selina Zhang commented on HIVE-10036:
-------------------------------------
[~gopalv] Thank you! I added the io.netty dependency to ql/pom.xml and
uploaded a new patch.
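For reference, the dependency declaration in ql/pom.xml looks along these
lines (coordinates and version property shown are illustrative; the patch
pins whatever the rest of Hive already uses):

{code:xml}
<dependency>
  <groupId>io.netty</groupId>
  <artifactId>netty</artifactId>
  <version>${netty.version}</version>
</dependency>
{code}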
> Writing ORC format big table causes OOM - too many fixed sized stream buffers
> -----------------------------------------------------------------------------
>
> Key: HIVE-10036
> URL: https://issues.apache.org/jira/browse/HIVE-10036
> Project: Hive
> Issue Type: Improvement
> Reporter: Selina Zhang
> Assignee: Selina Zhang
> Labels: orcfile
> Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch,
> HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, HIVE-10036.7.patch
>
>
> The ORC writer keeps multiple output streams for each column, and each output
> stream is allocated a fixed-size ByteBuffer (configurable, 256 KB by default).
> For a big table the memory cost is unbearable, especially when HCatalog
> dynamic partitioning is involved and several hundred files may be open for
> writing at the same time (FileSinkOperator has the same problem).
> The global ORC memory manager controls the buffer size, but it only kicks in
> every 5,000 rows. That check could be enhanced, but the real problem is that
> shrinking the buffer size leads to worse compression and more I/O on the read
> path, and sacrificing read performance is never a good choice.
> I changed the fixed-size ByteBuffer to a dynamically growing buffer bounded
> above by the existing configurable buffer size. Most streams do not need a
> large buffer, so performance improved significantly: compared to Facebook's
> hive-dwrf, I measured a 2x performance gain with this fix. Solving OOM for
> ORC completely may take a lot of effort, but this is definitely low-hanging
> fruit.
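For illustration, the growth strategy described in the quoted report amounts
to something like the following sketch. This is not the actual HIVE-10036
patch; the class name, starting size, and overflow behavior are assumptions
made for the example.

{code:java}
import java.nio.ByteBuffer;

// Minimal sketch of a stream buffer that starts small and doubles on demand,
// bounded above by the configured ORC buffer size (256 KB by default).
public class GrowableStreamBuffer {
    private static final int INITIAL_SIZE = 4 * 1024; // assumed starting size
    private final int maxSize;                        // configured upper bound
    private ByteBuffer buffer;

    public GrowableStreamBuffer(int maxSize) {
        this.maxSize = maxSize;
        this.buffer = ByteBuffer.allocate(Math.min(INITIAL_SIZE, maxSize));
    }

    // Appends bytes, growing the backing buffer as needed up to maxSize.
    public void write(byte[] src, int off, int len) {
        if (len > buffer.remaining()) {
            grow(buffer.position() + len);
        }
        buffer.put(src, off, len);
    }

    private void grow(int required) {
        if (required > maxSize) {
            // In the real writer this would be the point to flush/spill the
            // stream instead; the sketch just marks the boundary.
            throw new IllegalStateException("stream buffer reached maxSize");
        }
        int newCapacity = buffer.capacity();
        while (newCapacity < required) {
            newCapacity = Math.min(newCapacity * 2, maxSize);
        }
        ByteBuffer bigger = ByteBuffer.allocate(newCapacity);
        buffer.flip();      // switch old buffer to read mode
        bigger.put(buffer); // copy existing contents
        buffer = bigger;
    }
}
{code}

Since most columns never fill their buffers, total memory tracks the data
actually written rather than (number of streams) x (fixed 256 KB), which is
where the reported gain over per-stream fixed allocation comes from.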