[
https://issues.apache.org/jira/browse/ARROW-8421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332252#comment-17332252
]
Andrew Lamb commented on ARROW-8421:
------------------------------------
Migrated to github: https://github.com/apache/arrow-rs/issues/216
> [Rust] [Parquet] Implement parquet writer
> -----------------------------------------
>
> Key: ARROW-8421
> URL: https://issues.apache.org/jira/browse/ARROW-8421
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust
> Reporter: Andy Grove
> Assignee: Neville Dipale
> Priority: Major
> Labels: pull-request-available
> Fix For: 5.0.0
>
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> This is the parent story. See subtasks for more information.
> Notes from [~wesm] :
> A couple of initial things to keep in mind
> * Writes of both Nullable (OPTIONAL) and non-nullable (REQUIRED) fields
> * You can optimize the special case where a nullable field's data has no
> nulls
> * A good amount of code is required to handle converting from the Arrow
> physical form of various logical types to the Parquet equivalent one, see
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc]
> for details
> * It would be worth thinking up front about how dictionary-encoded data is
> handled both on the Arrow write and Arrow read paths. In parquet-cpp we
> initially discarded Arrow DictionaryArrays on write (casting e.g. Dictionary
> to dense String), and through real world need I was forced to revisit this
> (quite painfully) to enable Arrow dictionaries to survive roundtrips to
> Parquet format, and also achieve better performance and memory use in both
> reads and writes. You can certainly do a dictionary-to-dense conversion like
> we did, but you may someday find yourselves doing the same painful refactor
> that I did to make dictionary write and read not only more efficient but also
> dictionary order preserving.
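The first two bullets above (OPTIONAL vs. REQUIRED fields, and the no-nulls fast path) can be sketched roughly as follows. This is a minimal illustration for flat, non-nested columns only. The names `DefLevels`, `definition_levels` and `non_null_values` are hypothetical and are not part of the parquet or arrow crates; the validity mask is modeled as a plain `&[bool]` rather than an Arrow bitmap.

```rust
/// Hypothetical result type (not from the parquet crate) describing which
/// definition levels a flat column needs.
#[derive(Debug, PartialEq)]
enum DefLevels {
    /// REQUIRED field: the max definition level is 0, so nothing is stored.
    NotNeeded,
    /// OPTIONAL field without nulls: every level is 1, so the writer can
    /// emit a constant run instead of inspecting each slot.
    AllDefined { count: usize },
    /// OPTIONAL field with at least one null: one level per slot
    /// (1 = value present, 0 = null).
    Explicit(Vec<i16>),
}

/// Computes definition levels for a flat column of `len` slots.
/// Assumption: when `validity` is `Some`, it has exactly `len` entries.
fn definition_levels(nullable: bool, validity: Option<&[bool]>, len: usize) -> DefLevels {
    if !nullable {
        return DefLevels::NotNeeded;
    }
    match validity {
        // No validity mask at all: the data cannot contain nulls.
        None => DefLevels::AllDefined { count: len },
        // Mask present but every slot is valid: same fast path.
        Some(mask) if mask.iter().all(|&valid| valid) => DefLevels::AllDefined { count: len },
        // General case: translate the mask slot by slot.
        Some(mask) => DefLevels::Explicit(mask.iter().map(|&valid| i16::from(valid)).collect()),
    }
}

/// Parquet data pages hold only non-null values, so null slots are
/// dropped before the values are handed to a column writer.
fn non_null_values<T: Copy>(values: &[T], validity: Option<&[bool]>) -> Vec<T> {
    match validity {
        None => values.to_vec(),
        Some(mask) => values
            .iter()
            .zip(mask)
            .filter(|&(_, &valid)| valid)
            .map(|(&value, _)| value)
            .collect(),
    }
}
```

With a mask of `[true, false, true]`, `definition_levels` yields explicit levels `[1, 0, 1]` and `non_null_values` keeps only the first and third values. Nested types would also need repetition levels, which this sketch does not attempt.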
> Notes from [~sunchao] :
> I skimmed through the C++ implementation and think that, at a high level,
> we need to do the following:
> # implement a method similar to {{WriteArrow}} in
> [column_writer.cc|https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc].
> We can further break this up into smaller pieces, such as
> dictionary/non-dictionary, primitive types, booleans, timestamps, dates, and
> so on.
> # implement an arrow writer in the parquet crate
> [here|https://github.com/apache/arrow/tree/master/rust/parquet/src/arrow].
> This needs to offer APIs similar to those in
> [writer.h|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h].
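One way to picture the per-type breakdown in step 1 is a single dispatch from the Arrow-side logical type to the Parquet physical type. The sketch below is an assumption-laden illustration, not the C++ `WriteArrow` logic or the eventual Rust implementation: `ColumnData`, `PhysicalBatch`, `to_physical` and `densify` are hypothetical stand-ins that use plain `Vec`s instead of Arrow arrays, and the dictionary indices are left as raw integers rather than being RLE/bit-packed.

```rust
/// Hypothetical, simplified stand-in for an Arrow array (not the arrow crate).
#[derive(Debug, Clone, PartialEq)]
enum ColumnData {
    Boolean(Vec<bool>),
    Int32(Vec<i32>),
    /// Days since the UNIX epoch.
    Date32(Vec<i32>),
    /// Milliseconds since the UNIX epoch.
    TimestampMillis(Vec<i64>),
    Utf8(Vec<String>),
    /// Dictionary-encoded strings: `keys[i]` indexes into `values`.
    Dictionary { keys: Vec<u32>, values: Vec<String> },
}

/// Hypothetical subset of Parquet physical representations.
#[derive(Debug, Clone, PartialEq)]
enum PhysicalBatch {
    Boolean(Vec<bool>),
    Int32(Vec<i32>),
    Int64(Vec<i64>),
    ByteArray(Vec<Vec<u8>>),
    /// Dictionary kept intact: dictionary entries plus per-row indices.
    DictByteArray { dictionary: Vec<Vec<u8>>, indices: Vec<u32> },
}

/// Maps each logical type to its Parquet physical form.
fn to_physical(col: &ColumnData) -> PhysicalBatch {
    match col {
        ColumnData::Boolean(v) => PhysicalBatch::Boolean(v.clone()),
        // Parquet DATE is an annotation on INT32, so dates share the INT32 path.
        ColumnData::Int32(v) | ColumnData::Date32(v) => PhysicalBatch::Int32(v.clone()),
        // Millisecond timestamps are stored as INT64.
        ColumnData::TimestampMillis(v) => PhysicalBatch::Int64(v.clone()),
        // Strings are stored as BYTE_ARRAY.
        ColumnData::Utf8(v) => {
            PhysicalBatch::ByteArray(v.iter().map(|s| s.as_bytes().to_vec()).collect())
        }
        // Dictionary path: entries and keys pass through unchanged, so the
        // dictionary order survives the write.
        ColumnData::Dictionary { keys, values } => PhysicalBatch::DictByteArray {
            dictionary: values.iter().map(|s| s.as_bytes().to_vec()).collect(),
            indices: keys.clone(),
        },
    }
}

/// The simpler alternative: expand a dictionary to dense strings before
/// writing. Panics if a key is out of range for `values`.
fn densify(keys: &[u32], values: &[String]) -> Vec<String> {
    keys.iter().map(|&k| values[k as usize].clone()).collect()
}
```

Keeping a separate dictionary variant from the start is what allows the order-preserving round trip described in the notes from wesm; routing everything through `densify` is less code but discards the original dictionary.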