Thanks Wes,

Glad to hear this is in your plan.

I probably should have done this earlier...but here are some JIRA tickets that 
seem to cover this:

https://issues.apache.org/jira/browse/ARROW-3772
https://issues.apache.org/jira/browse/ARROW-3325
https://issues.apache.org/jira/browse/ARROW-3769



On 1/24/19, 4:27 PM, "Wes McKinney" <[email protected]> wrote:

    hi Hatem,
    
    There are several issues open about this already (I'll have to dig
    them up), so this is something that we have desired for a long time,
    but have not gotten around to implementing.
    
    Since many Parquet writers use dictionary encoding, it would make the
    most sense to have an option to return DictionaryArray (which can be
    converted to pandas.Categorical) from any column, and internally we
    will perform the conversion from the encoded Parquet format as
    efficiently as possible.
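    
    To make this concrete, here is a minimal pyarrow sketch of the desired
    user-facing behavior; the read_dictionary option and the column name
    are hypothetical, since no such option exists yet:
    
        import pyarrow as pa
        import pyarrow.parquet as pq
    
        # Build a dictionary-encoded column and write it to Parquet.
        arr = pa.array(["apple", "banana", "apple"]).dictionary_encode()
        table = pa.Table.from_arrays([arr], names=["fruit"])
        pq.write_table(table, "fruits.parquet")
    
        # Hypothetical option: ask the reader to return this column as
        # DictionaryArray instead of decoding it to a dense string array.
        result = pq.read_table("fruits.parquet", read_dictionary=["fruit"])
        print(result.column("fruit").type)  # dictionary<values=string, indices=int32>
    
        # DictionaryArray maps directly onto pandas.Categorical.
        print(result.to_pandas()["fruit"].dtype)  # category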
    
    There are many cases to consider (see the sketch after this list):
    
    * Dictionary encoded, but different dictionaries in each row group
    (this is actually the most likely scenario)
    * Dictionary encoded, but the same dictionary in all row groups
    * PLAIN encoded data that we pass through DictionaryBuilder as it is
    decoded to yield DictionaryArray
    * Dictionary encoded, but switch over to PLAIN encoding mid-stream
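    
    A minimal pyarrow sketch of the first and third cases, assuming a
    recent pyarrow in which chunks of one column may carry different
    dictionaries and unify_dictionaries is available:
    
        import pyarrow as pa
    
        # First case: one chunk per row group, each with its own dictionary.
        chunk1 = pa.DictionaryArray.from_arrays(
            pa.array([0, 1, 0], type=pa.int32()), pa.array(["a", "b"]))
        chunk2 = pa.DictionaryArray.from_arrays(
            pa.array([0, 1], type=pa.int32()), pa.array(["b", "c"]))
        col = pa.chunked_array([chunk1, chunk2])
    
        # Producing a single pandas.Categorical requires unifying the
        # per-chunk dictionaries into one shared dictionary.
        unified = col.unify_dictionaries()
        print(unified.chunk(0).dictionary)  # ["a", "b", "c"]
    
        # Third case: PLAIN-encoded data is analogous to dictionary-encoding
        # dense values on the fly as they are decoded.
        plain = pa.array(["a", "b", "a"]).dictionary_encode()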
    
    Having column metadata to automatically "opt in" to the
    DictionaryArray conversion sounds reasonable for usability (so long
    as Arrow readers have a way to opt out, probably via a global flag
    to ignore such custom metadata fields).
    
    Part of the reason this work was not done in the past was because some
    of our hash table machinery was a bit immature. Antoine has recently
    improved things significantly, so it should be a lot easier now to do
    this work. This is quite a large project, though, and one that affects
    a _lot_ of users, so I would be willing to take an initial pass on the
    implementation.
    
    Along with completing the nested data read/write path, I would say
    this is the second-highest-priority project in parquet-cpp for Arrow
    users.
    
    - Wes
    
    On Thu, Jan 24, 2019 at 9:59 AM Hatem Helal <[email protected]> wrote:
    >
    > Hi everyone,
    >
    > I wanted to gauge interest and feasibility for adding support for
    > natively reading an arrow::DictionaryArray from a parquet file.
    > Currently, an arrow::DictionaryArray that is written to a parquet file
    > is read back as the native index type [0].  I came across a prior
    > discussion of this problem in the context of pandas [1], but I think
    > this would be useful for other arrow clients (C++ or otherwise).
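    >
    > A minimal pyarrow demonstration of the current behavior (the column
    > and file names are arbitrary):
    >
    >     import pyarrow as pa
    >     import pyarrow.parquet as pq
    >
    >     arr = pa.array(["x", "y", "x"]).dictionary_encode()
    >     table = pa.Table.from_arrays([arr], names=["col"])
    >     print(table.column("col").type)  # dictionary<values=string, indices=int32>
    >
    >     pq.write_table(table, "roundtrip.parquet")
    >     result = pq.read_table("roundtrip.parquet")
    >     print(result.column("col").type)  # string -- the encoding is not restored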
    >
    > The solution I had in mind would be to add arrow type information as
    > column metadata.  This metadata would then be used when reading back
    > the parquet file to determine which arrow type to create for the
    > column data.
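    >
    > As a sketch of this idea (the arrow_type metadata key is made up, and
    > it assumes pyarrow round-trips field-level metadata through parquet):
    >
    >     import pyarrow as pa
    >     import pyarrow.parquet as pq
    >
    >     field = pa.field("col", pa.string(),
    >                      metadata={"arrow_type": "dictionary"})  # hypothetical key
    >     table = pa.Table.from_arrays([pa.array(["x", "y", "x"])],
    >                                  schema=pa.schema([field]))
    >     pq.write_table(table, "hinted.parquet")
    >
    >     # On read, the metadata hint tells us which columns to re-encode.
    >     result = pq.read_table("hinted.parquet")
    >     meta = result.schema.field("col").metadata or {}
    >     if meta.get(b"arrow_type") == b"dictionary":
    >         col = result.column("col").dictionary_encode()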
    >
    > I’m willing to contribute this feature, but first wanted to get some
    > feedback on whether this would be generally useful and if the
    > high-level proposed solution would make sense.
    >
    > Thanks!
    >
    > Hatem
    >
    >
    > [0] This test demonstrates this behavior:
    > https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/arrow-reader-writer-test.cc#L1848
    > [1] https://github.com/apache/arrow/issues/1688
    
