[
https://issues.apache.org/jira/browse/ARROW-3246?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16901506#comment-16901506
]
Wes McKinney commented on ARROW-3246:
-------------------------------------
I've been looking at what's required to write {{arrow::DictionaryArray}}
directly into the appropriate lower-level ColumnWriter class. The trouble with
the way the software is layered right now is that there is a "Chinese wall"
between {{TypedColumnWriter<T>}} and the Arrow write layer. We can only
communicate with this class using the Parquet C types such as {{ByteArray}} and
{{FixedLenByteArray}}. This is also a performance issue, since we cannot write
directly into the writer from {{arrow::BinaryArray}} or from similar types
where doing so would make sense.
I think the only way to fix the current situation is to add a
{{TypedColumnWriter<T>::WriteArrow(const ::arrow::Array&)}} method and "push
down" a lot of the logic that's currently in parquet/arrow/writer.cc into the
{{TypedColumnWriter<T>}} implementation. This will enable us to do various
write performance optimizations and also address the direct dictionary write
issue. This is not a small project, but I would say that it's overdue and will
put us on a better footing going forward.
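To make the layering problem concrete, here is a minimal, self-contained
sketch. It does not use the real Parquet or Arrow classes: {{ByteArray}} is a
simplified stand-in for the Parquet C type, while {{BinaryArrayLike}},
{{ByteArrayWriterSketch}} and {{ToByteArrays}} are hypothetical names invented
for illustration. It only shows the shape of the idea: the existing-style
entry point needs a temporary array of {{ByteArray}} structs, whereas a
{{WriteArrow}}-style entry point could walk the offsets buffer directly.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Simplified stand-in for the Parquet C type: a non-owning view of bytes.
struct ByteArray {
  uint32_t len;
  const uint8_t* ptr;
};

// Hypothetical stand-in for arrow::BinaryArray: value i occupies
// data[offsets[i], offsets[i + 1]).
struct BinaryArrayLike {
  std::vector<int32_t> offsets;  // length + 1 entries
  std::string data;
  int64_t length() const { return static_cast<int64_t>(offsets.size()) - 1; }
};

// The conversion step forced by the current layering: materialize one
// ByteArray struct per value before the writer can be called.
std::vector<ByteArray> ToByteArrays(const BinaryArrayLike& arr) {
  std::vector<ByteArray> out;
  const auto* base = reinterpret_cast<const uint8_t*>(arr.data.data());
  for (int64_t i = 0; i < arr.length(); ++i) {
    out.push_back({static_cast<uint32_t>(arr.offsets[i + 1] - arr.offsets[i]),
                   base + arr.offsets[i]});
  }
  return out;
}

// Hypothetical writer sketch; NOT the real parquet::TypedColumnWriter<T>.
class ByteArrayWriterSketch {
 public:
  // Existing-style entry point: accepts only Parquet C types.
  void WriteBatch(int64_t num_values, const ByteArray* values) {
    for (int64_t i = 0; i < num_values; ++i) {
      Append(values[i].ptr, values[i].len);
    }
  }

  // Proposed-style entry point: reads the Arrow-like layout directly,
  // with no intermediate ByteArray allocation.
  void WriteArrow(const BinaryArrayLike& arr) {
    const auto* base = reinterpret_cast<const uint8_t*>(arr.data.data());
    for (int64_t i = 0; i < arr.length(); ++i) {
      assert(arr.offsets[i + 1] >= arr.offsets[i]);
      Append(base + arr.offsets[i],
             static_cast<uint32_t>(arr.offsets[i + 1] - arr.offsets[i]));
    }
  }

  const std::vector<uint8_t>& sink() const { return sink_; }

 private:
  // PLAIN-style BYTE_ARRAY layout: 4-byte little-endian length, then bytes.
  void Append(const uint8_t* ptr, uint32_t len) {
    for (int shift = 0; shift < 32; shift += 8) {
      sink_.push_back(static_cast<uint8_t>((len >> shift) & 0xFF));
    }
    sink_.insert(sink_.end(), ptr, ptr + len);
  }

  std::vector<uint8_t> sink_;
};
```

Both paths produce the same bytes in this sketch; the difference is that the
{{WriteArrow}} path skips the temporary {{std::vector<ByteArray>}}. The same
kind of entry point is where a dictionary-aware write path could live.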
cc [~xhochy] [~hatem] for any thoughts
> [Python][Parquet] direct reading/writing of pandas categoricals in parquet
> --------------------------------------------------------------------------
>
> Key: ARROW-3246
> URL: https://issues.apache.org/jira/browse/ARROW-3246
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Python
> Reporter: Martin Durant
> Assignee: Wes McKinney
> Priority: Minor
> Labels: parquet
> Fix For: 1.0.0
>
>
> Parquet supports "dictionary encoding" of column data in a manner very
> similar to the concept of Categoricals in pandas. It is natural to use this
> encoding for a column which originated as a categorical. Conversely, when
> loading, if the file metadata says that a given column came from a pandas (or
> arrow) categorical, then we can trust that the whole of the column is
> dictionary-encoded and load the data directly into a categorical column,
> rather than expanding the labels upon load and recategorising later.
> If the data does not have the pandas metadata, then the guarantee cannot
> hold, and we cannot assume either that the whole column is dictionary encoded
> or that the labels are the same throughout. In this case, the current
> behaviour is fine.
>
> (please forgive that some of this has already been mentioned elsewhere; this
> is one of the entries in the list at
> [https://github.com/dask/fastparquet/issues/374] as a feature that is useful
> in fastparquet)