[ https://issues.apache.org/jira/browse/ARROW-3772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16887610#comment-16887610 ]
Micah Kornfield commented on ARROW-3772:
----------------------------------------

"I'm looking at this. This is not a small project – the assumption that values are fully materialized is pretty deeply baked into the library. We also have to deal with the "fallback" case where a column chunk starts out dictionary encoded and switches mid-stream because the dictionary got too big"

I don't have context on how we originally decided to designate an entire column as dictionary encoded vs. a chunk/record-batch column, but it seems like this might be another use case where the proposal on encoding/compression could make things easier to code (i.e., specify dictionary encoding only on SparseRecordBatches where it makes sense, and fall back to dense encoding where it no longer makes sense).

> [C++] Read Parquet dictionary encoded ColumnChunks directly into an Arrow
> DictionaryArray
> -----------------------------------------------------------------------------------------
>
>                 Key: ARROW-3772
>                 URL: https://issues.apache.org/jira/browse/ARROW-3772
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: C++
>            Reporter: Stav Nir
>            Assignee: Wes McKinney
>            Priority: Major
>              Labels: parquet
>             Fix For: 1.0.0
>
> Dictionary data is very common in Parquet. In the current implementation,
> parquet-cpp always decodes dictionary-encoded data before creating a plain
> Arrow array. This process is wasteful, since we could use Arrow's
> DictionaryArray directly and achieve several benefits:
> # Smaller memory footprint - both in the decoding process and in the
> resulting Arrow table - especially when the dictionary values are large.
> # Better decoding performance - mostly a result of the first point: fewer
> memory fetches and fewer allocations.
> I think these benefits could yield significant runtime improvements.
> My direction for the implementation is to read the indices (through the
> DictionaryDecoder, after the RLE decoding) and the values separately into
> two arrays and create a DictionaryArray from them.
> There are some questions to discuss:
> # Should this be the default behavior for dictionary-encoded data?
> # Should it be controlled with a parameter in the API?
> # What should the policy be in case some of the chunks are dictionary
> encoded and some are not?
> I started implementing this but would like to hear your opinions.

--
This message was sent by Atlassian JIRA
(v7.6.14#76016)