[jira] [Commented] (ORC-1060) batch read with Java interface uses high memory when reading ORC string dictionary encoding column

Dongjoon Hyun (Jira) Wed, 29 Dec 2021 12:28:04 -0800


    [ 
https://issues.apache.org/jira/browse/ORC-1060?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17466585#comment-17466585
 ]


Dongjoon Hyun commented on ORC-1060:
------------------------------------

I'm reconsidering this issue so that it can be handled as a regression from 
the perspective of downstream projects.

> batch read with Java interface uses high memory when reading ORC string 
> dictionary encoding column
> --------------------------------------------------------------------------------------------------
>
>                 Key: ORC-1060
>                 URL: https://issues.apache.org/jira/browse/ORC-1060
>             Project: ORC
>          Issue Type: Bug
>          Components: Java, Reader
>    Affects Versions: 1.7.0, 1.8.0, 1.7.1, 1.7.2
>            Reporter: xiaoli
>            Assignee: xiaoli
>            Priority: Major
>             Fix For: 1.8.0
>
>
> We are upgrading Spark from 2.2 to 3.0. During this work, we found that 
> Spark 3.0 uses more memory than Spark 2.2 when reading an ORC string column 
> that uses dictionary encoding.
> The reason is:
> Spark 2.2 uses Hive's library to read ORC: 
> [https://github.com/aixuebo/hive1.2.1.ql/blob/master/java/org/apache/hadoop/hive/ql/io/orc/TreeReaderFactory.java]
>   In this code, the StringDictionaryTreeReader class, used through the row 
> read interface, holds only one string dictionary in memory when reading 
> across multiple file stripes.
> Spark 3.0 uses the ORC library to read ORC:
> [https://github.com/apache/orc/blob/main/java/core/src/java/org/apache/orc/impl/TreeReaderFactory.java]
>  In this code, the StringDictionaryTreeReader class, used through the batch 
> read interface, can hold 3 string dictionaries in memory when reading across 
> multiple file stripes: 2 copies of the current stripe's dictionary data (the 
> dictionaryBuffer and dictionaryBufferInBytesCache variables) and 1 copy of 
> the next stripe's dictionary data (the dictionaryBuffer variable, when the 
> advanceToNextRow method is called in the RecordReaderImpl class's nextBatch 
> method).
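
The arithmetic behind the description above can be sketched with a small toy model. This is not ORC or Hive code: the class name DictionaryFootprint, both method names, and the stripe sizes are hypothetical, and the model only assumes what the report states, namely that the row reader keeps one dictionary alive while the batch reader may keep two copies of the current stripe's dictionary plus one copy of the next stripe's dictionary at a stripe boundary.

```java
public final class DictionaryFootprint {

    private DictionaryFootprint() {
    }

    /**
     * Peak dictionary bytes if only one stripe's dictionary is alive at a time
     * (the row-read behaviour described in the report).
     */
    static long peakSingleCopy(long[] dictBytesPerStripe) {
        long peak = 0;
        for (long bytes : dictBytesPerStripe) {
            peak = Math.max(peak, bytes);
        }
        return peak;
    }

    /**
     * Peak dictionary bytes under the batch-read behaviour described in the
     * report: two copies of the current stripe's dictionary plus one copy of
     * the next stripe's dictionary while crossing a stripe boundary.
     * This is an assumption-based model, not a measurement.
     */
    static long peakBatchCopies(long[] dictBytesPerStripe) {
        long peak = 0;
        for (int i = 0; i < dictBytesPerStripe.length; i++) {
            long current = 2 * dictBytesPerStripe[i];
            long next = (i + 1 < dictBytesPerStripe.length) ? dictBytesPerStripe[i + 1] : 0;
            peak = Math.max(peak, current + next);
        }
        return peak;
    }

    public static void main(String[] args) {
        // Hypothetical dictionary sizes for three stripes: 64 MiB, 64 MiB, 32 MiB.
        long[] stripes = {64L << 20, 64L << 20, 32L << 20};
        System.out.println("row read peak bytes:   " + peakSingleCopy(stripes));
        System.out.println("batch read peak bytes: " + peakBatchCopies(stripes));
    }
}
```

With equally sized dictionaries the modelled batch-read peak is three times the row-read peak, which matches the "3 string dictionaries" figure in the description.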



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
