[jira] [Commented] (ARROW-3246) [Python][Parquet] direct reading/writing of pandas categoricals in parquet

Hatem Helal (JIRA) Wed, 07 Aug 2019 04:20:05 -0700


    [ 
https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901988#comment-16901988
 ]


Hatem Helal commented on ARROW-3246:
------------------------------------

Adding {{TypedColumnWriter<T>::WriteArrow(const ::arrow::Array&)}} makes a lot 
of sense to me. [~wesmckinn] do you have a list of cases that you know can be 
optimized? The main one I'm aware of is the [dictionary 
array|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.cc#L1079]
 case, but I'm curious whether there are other arrow types that could be 
handled more efficiently.

As an aside, has anyone considered automatically tuning the size of the 
dictionary page? I think for the limited case of writing an 
{{arrow::DictionaryArray}} we might want to ensure that the encoder doesn't 
fall back to plain encoding. That could be handled as a separate feature.
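
To illustrate the fallback concern, here is a minimal pure-Python sketch of a dictionary encoder with a size limit. It is not the parquet-cpp encoder: the function name {{encode_column}}, the {{dict_page_limit_bytes}} parameter, and the returned dict layout are all hypothetical, and the sketch simplifies by re-encoding the whole column as plain, whereas a real writer would typically only switch encoding for the pages that follow.

```python
def encode_column(values, dict_page_limit_bytes):
    """Toy dictionary encoder for a list of strings.

    Hypothetical helper, not a real Parquet API. Falls back to plain
    encoding once the dictionary would exceed `dict_page_limit_bytes`.
    """
    dictionary = []   # distinct values in first-seen order
    index_of = {}     # value -> position in `dictionary`
    indices = []      # one dictionary index per input value
    dict_bytes = 0    # running size of the dictionary payload

    for v in values:
        if v not in index_of:
            dict_bytes += len(v.encode("utf-8"))
            if dict_bytes > dict_page_limit_bytes:
                # Simplification: give up on the dictionary entirely.
                return {"encoding": "PLAIN", "values": list(values)}
            index_of[v] = len(dictionary)
            dictionary.append(v)
        indices.append(index_of[v])

    return {
        "encoding": "DICTIONARY",
        "dictionary": dictionary,
        "indices": indices,
    }


column = ["a", "b", "a", "c"]
roomy = encode_column(column, dict_page_limit_bytes=1024)
tight = encode_column(column, dict_page_limit_bytes=2)
print(roomy["encoding"], tight["encoding"])
```

With the larger limit the column stays dictionary-encoded; with the 2-byte limit the third distinct value pushes the dictionary over the limit and the sketch falls back to plain. Tuning the limit for an {{arrow::DictionaryArray}} would amount to choosing a limit at least as large as the dictionary that is already known up front.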

> [Python][Parquet] direct reading/writing of pandas categoricals in parquet
> --------------------------------------------------------------------------
>
>                 Key: ARROW-3246
>                 URL: https://issues.apache.org/jira/browse/ARROW-3246
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Martin Durant
>            Assignee: Wes McKinney
>            Priority: Minor
>              Labels: parquet
>             Fix For: 1.0.0
>
>
> Parquet supports "dictionary encoding" of column data in a manner very 
> similar to the concept of Categoricals in pandas. It is natural to use this 
> encoding for a column which originated as a categorical. Conversely, when 
> loading, if the file metadata says that a given column came from a pandas (or 
> arrow) categorical, then we can trust that the whole of the column is 
> dictionary-encoded and load the data directly into a categorical column, 
> rather than expanding the labels upon load and recategorising later.
> If the data does not have the pandas metadata, then the guarantee cannot 
> hold, and we cannot assume either that the whole column is dictionary encoded 
> or that the labels are the same throughout. In this case, the current 
> behaviour is fine.
>  
> (please forgive that some of this has already been mentioned elsewhere; this 
> is one of the entries in the list at 
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful 
> in fastparquet)
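
The read-side condition in the description above can be sketched in pure Python as follows. This is only an illustration under stated assumptions: {{load_column}} and the chunk layout (dicts with {{encoding}}, {{dictionary}}, {{indices}}, {{values}} keys) are hypothetical and do not correspond to any pyarrow or fastparquet API. The column is loaded directly as a categorical only when every chunk is dictionary-encoded with the same labels; otherwise the labels are expanded.

```python
def load_column(chunks):
    """Load a column from toy 'column chunks'.

    Hypothetical helper: each chunk is a dict with an "encoding" key,
    plus "dictionary"/"indices" (DICTIONARY) or "values" (PLAIN).
    """
    all_dict = all(c["encoding"] == "DICTIONARY" for c in chunks)
    same_labels = all_dict and all(
        c["dictionary"] == chunks[0]["dictionary"] for c in chunks
    )

    if chunks and same_labels:
        # Safe to reuse the indices as categorical codes directly.
        codes = [i for c in chunks for i in c["indices"]]
        return {
            "kind": "categorical",
            "categories": list(chunks[0]["dictionary"]),
            "codes": codes,
        }

    # Otherwise expand every chunk back to its labels.
    values = []
    for c in chunks:
        if c["encoding"] == "DICTIONARY":
            values.extend(c["dictionary"][i] for i in c["indices"])
        else:
            values.extend(c["values"])
    return {"kind": "plain", "values": values}


chunk_a = {"encoding": "DICTIONARY", "dictionary": ["x", "y"], "indices": [0, 1, 1]}
chunk_b = {"encoding": "DICTIONARY", "dictionary": ["x", "y"], "indices": [1, 0]}
chunk_c = {"encoding": "PLAIN", "values": ["z"]}

direct = load_column([chunk_a, chunk_b])
expanded = load_column([chunk_a, chunk_c])
print(direct["kind"], expanded["kind"])
```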



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)
