[
https://issues.apache.org/jira/browse/ORC-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
xiaoli updated ORC-1060:
------------------------
Description:
We are upgrading Spark from 2.2 to 3.0. During this work, we found that
Spark 3.0 uses more memory than Spark 2.2 when reading ORC columns that use
string dictionary encoding.
The reason is:
Spark 2.2 uses Hive's library to read ORC:
[https://github.com/aixuebo/hive1.2.1.ql/blob/master/java/org/apache/hadoop/hive/ql/io/orc/TreeReaderFactory.java]
In this code, the StringDictionaryTreeReader class, with its row read
interface, holds only one string dictionary in memory when reading across
multiple file stripes.
Spark 3.0 uses the ORC library to read ORC:
[https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java]
In this code, the StringDictionaryTreeReader class, with its batch read
interface, can hold 3 string dictionaries in memory when reading across
multiple file stripes:
2 copies of the current stripe's dictionary data (the dictionaryBuffer
variable and the dictionaryBufferInBytesCache variable) and 1 copy of the next
stripe's dictionary data (the dictionaryBuffer variable, when the
advanceToNextRow method is called in the RecordReaderImpl class's nextBatch
method).
> batch read with Java interface uses high memory when reading ORC string
> dictionary encoding column
> --------------------------------------------------------------------------------------------------
>
> Key: ORC-1060
> URL: https://issues.apache.org/jira/browse/ORC-1060
> Project: ORC
> Issue Type: Improvement
> Components: Java, Reader
> Affects Versions: 1.5.13
> Reporter: xiaoli
> Priority: Minor
>
> We are upgrading Spark from 2.2 to 3.0. During this work, we found that
> Spark 3.0 uses more memory than Spark 2.2 when reading ORC columns that use
> string dictionary encoding.
> The reason is:
> Spark 2.2 uses Hive's library to read ORC:
> [https://github.com/aixuebo/hive1.2.1.ql/blob/master/java/org/apache/hadoop/hive/ql/io/orc/TreeReaderFactory.java]
> In this code, the StringDictionaryTreeReader class, with its row read
> interface, holds only one string dictionary in memory when reading across
> multiple file stripes.
> Spark 3.0 uses the ORC library to read ORC:
> [https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java]
> In this code, the StringDictionaryTreeReader class, with its batch read
> interface, can hold 3 string dictionaries in memory when reading across
> multiple file stripes:
> 2 copies of the current stripe's dictionary data (the dictionaryBuffer
> variable and the dictionaryBufferInBytesCache variable) and 1 copy of the
> next stripe's dictionary data (the dictionaryBuffer variable, when the
> advanceToNextRow method is called in the RecordReaderImpl class's nextBatch
> method).
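
To make the accounting in the description concrete, here is a toy Java model
of the peak dictionary footprint under the two behaviours described. It is
only a sketch: the class and method names (DictionaryRetentionSketch,
peakRowReader, peakBatchReader) are hypothetical, it does not call any ORC or
Hive code, and it assumes each copy of a dictionary costs exactly the
dictionary's size in bytes.

```java
class DictionaryRetentionSketch {

    // Row read path as described for Hive's reader: only the current
    // stripe's dictionary is held, so the peak is the largest dictionary.
    static long peakRowReader(long[] stripeDictBytes) {
        long peak = 0;
        for (long size : stripeDictBytes) {
            peak = Math.max(peak, size);
        }
        return peak;
    }

    // Batch read path as described in the issue: two copies of the current
    // stripe's dictionary (dictionaryBuffer and dictionaryBufferInBytesCache)
    // plus one copy of the next stripe's dictionary once the reader has
    // advanced to the next stripe inside nextBatch.
    static long peakBatchReader(long[] stripeDictBytes) {
        long peak = 0;
        for (int i = 0; i < stripeDictBytes.length; i++) {
            long currentCopies = 2L * stripeDictBytes[i];
            long nextCopy = (i + 1 < stripeDictBytes.length)
                    ? stripeDictBytes[i + 1]
                    : 0L;
            peak = Math.max(peak, currentCopies + nextCopy);
        }
        return peak;
    }

    public static void main(String[] args) {
        // Hypothetical per-stripe dictionary sizes in bytes.
        long[] sizes = {100L, 200L, 50L};
        System.out.println("row read peak:   " + peakRowReader(sizes));
        System.out.println("batch read peak: " + peakBatchReader(sizes));
    }
}
```

With stripes whose dictionaries are all the same size, this model gives a
batch read peak of three times the row read peak, which matches the "3 string
dictionaries" figure above.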
--
This message was sent by Atlassian Jira
(v8.20.1#820001)