[ https://issues.apache.org/jira/browse/HIVE-19206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16460293#comment-16460293 ]

Prasanth Jayachandran commented on HIVE-19206:
----------------------------------------------

Rebased patch.

> Automatic memory management for open streaming writers
> ------------------------------------------------------
>
>                 Key: HIVE-19206
>                 URL: https://issues.apache.org/jira/browse/HIVE-19206
>             Project: Hive
>          Issue Type: Sub-task
>          Components: Streaming
>    Affects Versions: 3.0.0, 3.1.0
>            Reporter: Prasanth Jayachandran
>            Assignee: Prasanth Jayachandran
>            Priority: Major
>         Attachments: HIVE-19206.1.patch, HIVE-19206.2.patch, 
> HIVE-19206.3.patch
>
>
> Problem:
>  When there are hundreds of record updaters open, the amount of memory required 
> by ORC writers keeps growing because of ORC's internal buffers. This can lead 
> to high GC pressure or OOM errors during streaming ingest.
> Solution:
>  The high-level idea is for the streaming connection to remember all the open 
> record updaters and flush them periodically (at some interval). The number of 
> records written to each record updater can be used as the metric for choosing 
> candidate record updaters to flush. 
>  If the ORC stripe size is 64MB, the default memory management check 
> happens only after every 5000 rows, which may be too late when there 
> are many concurrent writers in a single process. For example, with 100 
> writers open and each of them holding an almost full 64MB stripe of buffered 
> data, memory usage is 100*64MB ~= 6.4GB. When all of the record writers 
> flush, memory usage drops to 100*~2MB, which is just ~200MB.
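
For illustration, here is a minimal sketch of the periodic-flush idea described above. All names (OpenWriterTracker, RecordUpdaterHandle, etc.) are hypothetical and are not the actual patch or the Hive streaming API; the sketch only shows the shape of the approach: track open record updaters, check total buffered memory on a timer, and flush the heaviest writers first using rows written as the metric.

import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class OpenWriterTracker {

  /** Hypothetical handle to one open record updater; not the real Hive interface. */
  public interface RecordUpdaterHandle {
    long rowsSinceLastFlush();      // rows buffered since the last flush
    long estimatedBufferedBytes();  // approximate size of ORC's internal buffers
    void flush();                   // flush buffered data to the ORC file
  }

  private final List<RecordUpdaterHandle> openUpdaters = new ArrayList<>();
  private final long memoryLimitBytes;

  public OpenWriterTracker(long memoryLimitBytes) {
    this.memoryLimitBytes = memoryLimitBytes;
  }

  /** The streaming connection registers every record updater it opens. */
  public synchronized void register(RecordUpdaterHandle updater) {
    openUpdaters.add(updater);
  }

  /**
   * Called periodically (e.g. from a scheduled executor). If the total buffered
   * memory across all open updaters exceeds the limit, flush the updaters with
   * the most buffered rows first until usage falls back under the limit.
   */
  public synchronized void maybeFlush() {
    long totalBuffered = 0;
    for (RecordUpdaterHandle updater : openUpdaters) {
      totalBuffered += updater.estimatedBufferedBytes();
    }
    if (totalBuffered <= memoryLimitBytes) {
      return;
    }
    // Rows written since the last flush is the metric for picking candidates:
    // the writers holding the most rows are flushed first.
    openUpdaters.sort(
        Comparator.comparingLong(RecordUpdaterHandle::rowsSinceLastFlush).reversed());
    for (RecordUpdaterHandle updater : openUpdaters) {
      if (totalBuffered <= memoryLimitBytes) {
        break;
      }
      totalBuffered -= updater.estimatedBufferedBytes();
      updater.flush();
    }
  }
}

A scheduled task in the streaming connection could call maybeFlush() at a fixed interval, so the memory check no longer depends on ORC's per-5000-rows check inside each individual writer.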



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
