[ 
https://issues.apache.org/jira/browse/HIVE-10036?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14526018#comment-14526018
 ] 

Selina Zhang commented on HIVE-10036:
-------------------------------------

[~owen.omalley] I agree that buffer copies are not the best solution. But given 
that the maximum buffer size defaults to 256K, allowing copies of small buffers 
is not that bad compared to pre-allocating a big chunk of memory for each 
output stream. Also, with compression there is no way to predict the exact 
compressed size. There may be a solution that defines a different fixed buffer 
size for each stream, but it would not handle the sparse case. I think the 
current solution is straightforward and an easy fix. It does not conflict with 
the overall ORC design. 
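
To put a rough number on the copy cost, here is a back-of-the-envelope sketch. 
It assumes the buffer grows by doubling and starts at 4K; both are assumptions 
made for illustration, not details taken from the patch. The class and method 
names are hypothetical. 

```java
/** Hypothetical helper: estimates copy overhead for a doubling buffer. */
final class CopyCost {
    /**
     * Total bytes copied while a buffer grows from initial to max by doubling,
     * assuming every growth step copies a completely full buffer (worst case).
     */
    static long bytesCopied(int initial, int max) {
        long copied = 0;
        long cap = initial;
        while (cap < max) {
            copied += cap;                 // old contents are copied once per step
            cap = Math.min(cap * 2, max);  // never grow past the configured bound
        }
        return copied;
    }
}
```

Under those assumptions, a stream that grows all the way from 4K to 256K copies 
252K in total, which is less than one full buffer. Streams that stay small copy 
little or nothing. 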

> Writing ORC format big table causes OOM - too many fixed sized stream buffers
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-10036
>                 URL: https://issues.apache.org/jira/browse/HIVE-10036
>             Project: Hive
>          Issue Type: Improvement
>            Reporter: Selina Zhang
>            Assignee: Selina Zhang
>              Labels: orcfile
>         Attachments: HIVE-10036.1.patch, HIVE-10036.2.patch, 
> HIVE-10036.3.patch, HIVE-10036.5.patch, HIVE-10036.6.patch, HIVE-10036.7.patch
>
>
> The ORC writer keeps multiple output streams for each column. Each output 
> stream is allocated a fixed-size ByteBuffer (configurable, 256K by default). 
> For a big table, the memory cost is unbearable. Especially when HCatalog 
> dynamic partitioning is involved, several hundred files may be open and being 
> written at the same time (FileSinkOperator has the same problem). 
> The global ORC memory manager controls the buffer size, but it only kicks in 
> at 5000-row intervals. An enhancement could be made there, but the problem is 
> that reducing the buffer size leads to worse compression and more I/O on the 
> read path. Sacrificing read performance is never a good choice. 
> I changed the fixed-size ByteBuffer to a dynamically growing buffer whose 
> upper bound is the existing configurable buffer size. Most of the streams do 
> not need a large buffer, so performance improved significantly. Compared to 
> Facebook's hive-dwrf, I observed a 2x performance gain with this fix. 
> Solving OOM for ORC completely may need a lot of effort, but this is 
> definitely low-hanging fruit. 
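
For readers unfamiliar with the idea, a dynamically growing buffer bounded by 
the configured size might look roughly like the sketch below. This is not the 
code from the attached patches: the class name GrowableBuffer, the doubling 
growth policy, and the choice of initial capacity are all assumptions made for 
illustration. 

```java
import java.nio.ByteBuffer;

/** Hypothetical sketch of a bounded, growable buffer; not the HIVE-10036 patch. */
final class GrowableBuffer {
    private final int maxCapacity; // upper bound, e.g. the configured 256K
    private ByteBuffer buf;

    GrowableBuffer(int initialCapacity, int maxCapacity) {
        if (initialCapacity <= 0 || initialCapacity > maxCapacity) {
            throw new IllegalArgumentException("need 0 < initialCapacity <= maxCapacity");
        }
        this.maxCapacity = maxCapacity;
        this.buf = ByteBuffer.allocate(initialCapacity);
    }

    /** Appends one byte; returns false if the buffer is full at maxCapacity. */
    boolean put(byte b) {
        if (!buf.hasRemaining() && !grow()) {
            return false; // caller would flush (and compress) the buffer here
        }
        buf.put(b);
        return true;
    }

    /** Doubles the capacity (assumed policy), capped at maxCapacity. */
    private boolean grow() {
        int current = buf.capacity();
        if (current >= maxCapacity) {
            return false;
        }
        int next = (int) Math.min((long) current * 2, maxCapacity);
        ByteBuffer bigger = ByteBuffer.allocate(next);
        buf.flip();      // switch the old buffer to read mode
        bigger.put(buf); // copy the existing bytes into the larger buffer
        buf = bigger;
        return true;
    }

    int size() { return buf.position(); }

    int capacity() { return buf.capacity(); }
}
```

A stream that only ever receives a few bytes keeps its small initial 
allocation, while a busy stream grows until it reaches the same bound that the 
fixed-size buffer used before. 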



