[ https://issues.apache.org/jira/browse/HIVE-4421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646783#comment-13646783 ]

Phabricator commented on HIVE-4421:
-----------------------------------

ashutoshc has requested changes to the revision "HIVE-4421 [jira] Improve 
memory usage by ORC dictionaries".

  The logic in the patch mostly looks good. I am just requesting more comments, 
though ORC already has pretty good comments. Also, I didn't understand the 
changes in RedBlackTree. I assume you have improved the memory accounting for 
it, but it would be great if you could spell out what the earlier problem was 
that this patch fixes.
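For what it's worth, the kind of accounting gap I have in mind looks roughly 
like this (a minimal sketch with invented names, not the actual RedBlackTree or 
DynamicIntArray code): a growable buffer retains its full allocated capacity, 
so a size estimate based on the logical length alone undercounts the memory 
actually held.

```java
// Illustrative sketch only: a growable int array that reports the memory
// it actually retains (allocated capacity) rather than its logical length.
// Class and method names are invented for this example.
public class DynamicIntArraySketch {
  private static final int DEFAULT_CAPACITY = 8 * 1024;
  private int[] data = new int[DEFAULT_CAPACITY];
  private int length = 0;

  void add(int value) {
    if (length == data.length) {
      // Double the capacity; retained memory jumps even if few values follow.
      int[] bigger = new int[data.length * 2];
      System.arraycopy(data, 0, bigger, 0, length);
      data = bigger;
    }
    data[length++] = value;
  }

  int size() {
    return length;            // logical number of elements
  }

  long getSizeInBytes() {
    return 4L * data.length;  // retained capacity in bytes, not logical length
  }
}
```

With ten elements added, `size()` reports 10 while `getSizeInBytes()` reports 
the full 32 KB backing array, which is the number that matters for a memory 
budget.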

INLINE COMMENTS
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:44 "added 
to.." is repeated.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:47 I think 
it's better to define this in HiveConf as well, so that we can tune this value 
up or down without needing to recompile Hive. Especially since the size of a 
row is unbounded, the size of 5K rows is very much data dependent; e.g., I 
recently saw a table which had more than 100 string columns.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:141 I think 
it will be good to add a note in the comment about the usage of the 
synchronized keyword, i.e., the scenario where this method might be invoked 
from different threads.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:94 It will be 
good to add a comment on when oldVal could possibly be null. On first reading 
of the code, it wasn't obvious to me.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/DynamicIntArray.java:138 Better 
name: getSizeInBytes?
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/MemoryManager.java:142 It will 
be good to add a comment saying that every 5000 rows added across all writers, 
we request each writer to flush its content to disk if it is using memory 
beyond its quota.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OutStream.java:233 I didn't get 
when the current ByteBuffer could be null. It will always be non-null when this 
method is invoked, isn't it? It will be good to add a comment if the case is 
otherwise.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OutStream.java:236 Just for my 
own clarity: this will be null when compression is off, right?
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/OutStream.java:37 It will be 
good to add a comment for all three of these ByteBuffers describing what kind 
of data they hold.
  ql/src/java/org/apache/hadoop/hive/ql/io/orc/WriterImpl.java:687 Pardon my 
ignorance. I didn't get what countOutput was meant for earlier and why it is no 
longer required.
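To make the check-and-flush policy from the MemoryManager.java:142 comment 
above concrete, here is a minimal sketch of that scheme under my own 
assumptions (all names here, e.g. Callback and checkMemory, are illustrative 
and not necessarily the patch's actual API):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the policy described in the review: every ROWS_BETWEEN_CHECKS
// rows added across all writers, each registered writer is asked to flush
// if it exceeds its scaled share of the total memory pool. Names are
// invented for illustration, not taken from the HIVE-4421 patch.
public class MemoryManagerSketch {

  interface Callback {
    // Asks the writer to check its memory usage against its scaled
    // allocation; returns true if it flushed a stripe to disk.
    boolean checkMemory(double newScale);
  }

  private static final int ROWS_BETWEEN_CHECKS = 5000;
  private final long totalMemoryPool;
  private final Map<String, Callback> writers = new HashMap<>();
  private long totalAllocation = 0;
  private int rowsAddedSinceCheck = 0;

  MemoryManagerSketch(long poolSizeBytes) {
    this.totalMemoryPool = poolSizeBytes;
  }

  // synchronized because multiple writers on different threads may
  // register and report rows concurrently.
  synchronized void addWriter(String path, long requestedAllocation,
                              Callback callback) {
    writers.put(path, callback);
    totalAllocation += requestedAllocation;
  }

  synchronized void addedRow() {
    if (++rowsAddedSinceCheck >= ROWS_BETWEEN_CHECKS) {
      rowsAddedSinceCheck = 0;
      if (totalAllocation == 0) {
        return; // no writers registered, nothing to check
      }
      // Scale every writer's quota so the sum fits the pool.
      double scale = Math.min(1.0,
          (double) totalMemoryPool / totalAllocation);
      for (Callback writer : writers.values()) {
        writer.checkMemory(scale);
      }
    }
  }
}
```

The point of the requested comment is exactly the non-obvious part here: the 
check fires on a global row count, not per writer, and over-subscription is 
handled by scaling every writer's quota down proportionally.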

REVISION DETAIL
  https://reviews.facebook.net/D10545

BRANCH
  h-4421

ARCANIST PROJECT
  hive

To: JIRA, ashutoshc, omalley

                
> Improve memory usage by ORC dictionaries
> ----------------------------------------
>
>                 Key: HIVE-4421
>                 URL: https://issues.apache.org/jira/browse/HIVE-4421
>             Project: Hive
>          Issue Type: Bug
>          Components: Serializers/Deserializers
>            Reporter: Owen O'Malley
>            Assignee: Owen O'Malley
>             Fix For: 0.11.0
>
>         Attachments: HIVE-4421.D10545.1.patch, HIVE-4421.D10545.2.patch, 
> HIVE-4421.D10545.3.patch
>
>
> Currently, for tables with many string columns, it is possible to 
> significantly underestimate the memory used by the ORC dictionaries and cause 
> the query to run out of memory in the task. 

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
