[
https://issues.apache.org/jira/browse/ORC-299?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Chris Drome updated ORC-299:
----------------------------
Description:
Recently a user ran into the following failure:
{noformat}
Caused by: java.lang.NullPointerException
	at java.lang.System.arraycopy(Native Method)
	at org.apache.hadoop.hive.ql.io.orc.DynamicByteArray.add(DynamicByteArray.java:115)
	at org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.addNewKey(StringRedBlackTree.java:48)
	at org.apache.hadoop.hive.ql.io.orc.StringRedBlackTree.add(StringRedBlackTree.java:55)
	at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StringTreeWriter.write(WriterImpl.java:1250)
	at org.apache.hadoop.hive.ql.io.orc.WriterImpl$StructTreeWriter.write(WriterImpl.java:1797)
	at org.apache.hadoop.hive.ql.io.orc.WriterImpl.addRow(WriterImpl.java:2469)
	at org.apache.hadoop.hive.ql.io.orc.OrcOutputFormat$OrcRecordWriter.write(OrcOutputFormat.java:86)
	at org.apache.hadoop.hive.ql.exec.FileSinkOperator.process(FileSinkOperator.java:753)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
	at org.apache.hadoop.hive.ql.exec.SelectOperator.process(SelectOperator.java:88)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
	at org.apache.hadoop.hive.ql.exec.FilterOperator.process(FilterOperator.java:122)
	at org.apache.hadoop.hive.ql.exec.Operator.forward(Operator.java:838)
	at org.apache.hadoop.hive.ql.exec.TableScanOperator.process(TableScanOperator.java:110)
	at org.apache.hadoop.hive.ql.exec.MapOperator$MapOpCtx.forward(MapOperator.java:165)
	at org.apache.hadoop.hive.ql.exec.MapOperator.process(MapOperator.java:536)
	... 18 more
{noformat}
I tracked this down to the following field in DynamicByteArray.java, which is
used to build the dictionary for a particular column:
{noformat}
private int length;
{noformat}
Because length is a signed 32-bit int, this has the side effect of capping the
memory available to the dictionary at Integer.MAX_VALUE bytes (2GB). Once the
counter overflows, DynamicByteArray.add apparently ends up copying into a chunk
that was never allocated, which surfaces as the NullPointerException from
System.arraycopy above.
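To make the failure mode concrete, here is a minimal standalone sketch (not ORC
code) of how a signed 32-bit byte counter wraps negative once the accumulated
dictionary bytes pass Integer.MAX_VALUE:
{noformat}
// Standalone illustration (not ORC code): a signed 32-bit byte counter
// silently wraps to a negative value once it passes Integer.MAX_VALUE (~2GB).
public class LengthOverflowDemo {
  public static void main(String[] args) {
    int length = Integer.MAX_VALUE - 10; // buffer already near the 2GB cap
    int valueLength = 100;               // next column value to append
    length += valueLength;               // overflows without any error
    System.out.println(length);          // prints -2147483559
    // Any chunk index or offset derived from this negative length is
    // invalid, which is consistent with the arraycopy failure above.
  }
}
{noformat}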
Given the size of the column values in this use case, and the fact that the
user exceeded this 2GB limit, there should probably be some heuristic that
bails out of dictionary creation early, so the limit is never reached. With
the amount of data required to hit the limit, a dictionary is unlikely to be
useful at that point anyway.
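For illustration, a minimal sketch of the kind of early bail-out this suggests;
the class, method, and threshold here are hypothetical, not ORC's actual writer
API:
{noformat}
// Hypothetical sketch of an early bail-out heuristic: abandon the dictionary
// once its raw byte size makes dictionary encoding implausible, long before
// the 2GB counter cap can be reached.
public class DictionarySizeGuard {
  // Assumed threshold; a real implementation would make this configurable.
  private static final long MAX_DICTIONARY_BYTES = 64L * 1024 * 1024;

  private long dictionaryBytes = 0;
  private boolean useDictionary = true;

  /** Returns false once dictionary encoding should be abandoned. */
  public boolean add(byte[] value) {
    if (useDictionary) {
      dictionaryBytes += value.length;
      if (dictionaryBytes > MAX_DICTIONARY_BYTES) {
        useDictionary = false; // fall back to direct encoding
      }
    }
    return useDictionary;
  }
}
{noformat}
The writer could consult such a guard as values are added and switch the column
to direct encoding for the remainder of the stripe when it trips.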
> Improve heuristics for bailing on dictionary encoding
> -----------------------------------------------------
>
> Key: ORC-299
> URL: https://issues.apache.org/jira/browse/ORC-299
> Project: ORC
> Issue Type: Improvement
> Reporter: Chris Drome
> Priority: Major
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)