[jira] [Updated] (ORC-1060) batch read with Java interface uses high memory when reading ORC string dictionary encoding column

xiaoli (Jira) Mon, 13 Dec 2021 23:26:36 -0800


     [ https://issues.apache.org/jira/browse/ORC-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


xiaoli updated ORC-1060:
------------------------
    Description: 
We are upgrading Spark from 2.2 to 3.0. During this work, we found that 
Spark 3.0 uses more memory than Spark 2.2 when reading ORC columns that use 
string dictionary encoding.

The reason is:

Spark 2.2 uses Hive's library to read ORC: 
[https://github.com/aixuebo/hive1.2.1.ql/blob/master/java/org/apache/hadoop/hive/ql/io/orc/TreeReaderFactory.java]
In this code, the StringDictionaryTreeReader class, used through the row read 
interface, holds only one string dictionary in memory when reading across 
multiple file stripes.

Spark 3.0 uses the ORC library to read ORC:

[https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java]
In this code, the StringDictionaryTreeReader class, used through the batch 
read interface, can hold three string dictionaries in memory when reading 
across multiple file stripes:

two copies of the current stripe's dictionary data (the dictionaryBuffer and 
dictionaryBufferInBytesCache variables) and one copy of the next stripe's 
dictionary data (the dictionaryBuffer variable, when the nextBatch method of 
the RecordReaderImpl class calls the advanceToNextRow method).
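
To make the accounting above concrete, here is a minimal, self-contained 
sketch of the two retention patterns. It is not the ORC or Hive code: the 
class DictionaryMemorySketch and the methods decodeDictionary, liveCopies, 
peakRowStyle and peakBatchStyle are hypothetical, and plain byte arrays stand 
in for the real dictionary structures. Only the field names dictionaryBuffer 
and dictionaryBufferInBytesCache are taken from the description. The model 
assumes the row-style reader lets go of the old dictionary before decoding the 
next one, and that the batch-style reader decodes the next stripe's dictionary 
while both copies of the current one are still referenced.

```java
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.Set;

class DictionaryMemorySketch {

    /** Hypothetical stand-in for decoding one stripe's string dictionary. */
    static byte[] decodeDictionary(int sizeBytes) {
        return new byte[sizeBytes];
    }

    /** Counts distinct non-null arrays, i.e. dictionary copies reachable at once. */
    static int liveCopies(byte[]... refs) {
        Set<byte[]> distinct =
            Collections.newSetFromMap(new IdentityHashMap<byte[], Boolean>());
        for (byte[] ref : refs) {
            if (ref != null) {
                distinct.add(ref);
            }
        }
        return distinct.size();
    }

    /** Row-style model (assumption): one dictionary, released before the next is decoded. */
    static int peakRowStyle(int stripes, int dictBytes) {
        byte[] dictionary = null;
        int peak = 0;
        for (int s = 0; s < stripes; s++) {
            dictionary = null;                        // drop the previous stripe's dictionary
            dictionary = decodeDictionary(dictBytes); // decode the new stripe's dictionary
            peak = Math.max(peak, liveCopies(dictionary));
        }
        return peak;
    }

    /** Batch-style model (assumption): buffer plus byte cache, next stripe decoded early. */
    static int peakBatchStyle(int stripes, int dictBytes) {
        byte[] dictionaryBuffer = null;
        byte[] dictionaryBufferInBytesCache = null;
        int peak = 0;
        for (int s = 0; s < stripes; s++) {
            // Advancing to the next stripe decodes its dictionary while the
            // current stripe's buffer and cache are still referenced.
            byte[] next = decodeDictionary(dictBytes);
            peak = Math.max(peak,
                liveCopies(dictionaryBuffer, dictionaryBufferInBytesCache, next));
            dictionaryBuffer = next;
            // Serving batches from this stripe materializes a second copy.
            dictionaryBufferInBytesCache = dictionaryBuffer.clone();
            peak = Math.max(peak,
                liveCopies(dictionaryBuffer, dictionaryBufferInBytesCache));
        }
        return peak;
    }

    public static void main(String[] args) {
        System.out.println("row-style peak copies: " + peakRowStyle(3, 1024));
        System.out.println("batch-style peak copies: " + peakBatchStyle(3, 1024));
    }
}
```

Under these assumptions the row-style model never holds more than one copy, 
while the batch-style model reaches three copies at each stripe boundary after 
the first. With a single stripe the batch-style model peaks at two copies, 
since there is no next stripe to decode.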

  was:
We are upgrading Spark from 2.2 to 3.0. During this work, we found that 
Spark 3.0 uses more memory than Spark 2.2 when reading ORC columns that use 
string dictionary encoding.

The reason is:

Spark 2.2 uses Hive's library to read ORC: 
[https://github.com/aixuebo/hive1.2.1.ql/blob/master/java/org/apache/hadoop/hive/ql/io/orc/TreeReaderFactory.java]
In this code, the StringDictionaryTreeReader class, used through the row read 
interface, holds only one string dictionary in memory when reading across 
multiple file stripes.

Spark 3.0 uses the ORC library to read ORC:

[https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java]
In this code, the StringDictionaryTreeReader class, used through the batch 
read interface, can hold three string dictionaries in memory when reading 
across multiple file stripes:

two copies of the current stripe's dictionary data (the dictionaryBuffer and 
dictionaryBufferInBytesCache variables)

and one copy of the next stripe's dictionary data (the dictionaryBuffer 
variable, when the nextBatch method of the RecordReaderImpl class calls the 
advanceToNextRow method).


> batch read with Java interface uses high memory when reading ORC string 
> dictionary encoding column
> --------------------------------------------------------------------------------------------------
>
>                 Key: ORC-1060
>                 URL: https://issues.apache.org/jira/browse/ORC-1060
>             Project: ORC
>          Issue Type: Improvement
>          Components: Java, Reader
>    Affects Versions: 1.5.13
>            Reporter: xiaoli
>            Priority: Minor



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
