[ 
https://issues.apache.org/jira/browse/ARROW-8421?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17332252#comment-17332252
 ] 

Andrew Lamb commented on ARROW-8421:
------------------------------------

Migrated to GitHub: https://github.com/apache/arrow-rs/issues/216

> [Rust] [Parquet] Implement parquet writer
> -----------------------------------------
>
>                 Key: ARROW-8421
>                 URL: https://issues.apache.org/jira/browse/ARROW-8421
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Rust
>            Reporter: Andy Grove
>            Assignee: Neville Dipale
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 5.0.0
>
>          Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> This is the parent story. See subtasks for more information.
> Notes from [~wesm] :
> A couple of initial things to keep in mind:
>  * Writes of both Nullable (OPTIONAL) and non-nullable (REQUIRED) fields
>  * You can optimize the special case where a nullable field's data has no 
> nulls
>  * A good amount of code is required to handle converting from the Arrow 
> physical form of various logical types to the Parquet equivalent one, see 
> [https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc]
>  for details
>  * It would be worth thinking up front about how dictionary-encoded data is 
> handled both on the Arrow write and Arrow read paths. In parquet-cpp we 
> initially discarded Arrow DictionaryArrays on write (casting e.g. Dictionary 
> to dense String), and through real-world need I was forced to revisit this 
> (quite painfully) to enable Arrow dictionaries to survive roundtrips to 
> Parquet format, and also achieve better performance and memory use in both 
> reads and writes. You can certainly do a dictionary-to-dense conversion like 
> we did, but you may someday find yourselves doing the same painful refactor 
> that I did to make dictionary write and read not only more efficient but also 
> dictionary order preserving.
> Notes from [~sunchao] :
> I roughly skimmed through the C++ implementation and think that, at a high 
> level, we need to do the following:
>  # implement a method similar to {{WriteArrow}} in 
> [column_writer.cc|https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc].
>  We can further break this up into smaller pieces such as: 
> dictionary/non-dictionary, primitive types, booleans, timestamps, dates, and 
> so on.
>  # implement an arrow writer in the parquet crate 
> [here|https://github.com/apache/arrow/tree/master/rust/parquet/src/arrow]. 
> This needs to offer APIs similar to 
> [writer.h|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h].
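
As an illustration of the first two notes above (OPTIONAL vs REQUIRED fields, and the fast path for a nullable field whose data has no nulls), here is a minimal sketch in plain Rust. It is not the parquet crate's API: `Column`, `Levels` and `to_parquet_levels` are hypothetical names invented for this example, and it assumes a flat (non-nested) column, where the maximum definition level is 1 for an OPTIONAL field and no definition levels are written for a REQUIRED field.

```rust
/// Hypothetical, simplified stand-in for a flat Arrow column: a value buffer
/// with one slot per row, plus an optional validity mask of the same length
/// (`None` means no nulls were recorded).
struct Column {
    values: Vec<i32>,
    validity: Option<Vec<bool>>,
}

/// What a flat Parquet column needs from the writer: definition levels
/// (absent for REQUIRED fields) and only the non-null values.
struct Levels {
    def_levels: Option<Vec<i16>>,
    values: Vec<i32>,
}

fn to_parquet_levels(col: &Column, nullable: bool) -> Result<Levels, String> {
    let has_nulls = col
        .validity
        .as_ref()
        .map_or(false, |mask| mask.iter().any(|ok| !ok));

    match (nullable, has_nulls) {
        // A REQUIRED field cannot represent a null.
        (false, true) => Err("REQUIRED field contains nulls".to_string()),
        // REQUIRED: no definition levels, values pass through unchanged.
        (false, false) => Ok(Levels {
            def_levels: None,
            values: col.values.clone(),
        }),
        // Fast path: OPTIONAL field with no nulls. Every definition level
        // is 1 and the value buffer needs no filtering.
        (true, false) => Ok(Levels {
            def_levels: Some(vec![1; col.values.len()]),
            values: col.values.clone(),
        }),
        // General path: one definition level per row, nulls dropped from
        // the value buffer.
        (true, true) => {
            let validity = col
                .validity
                .as_ref()
                .expect("has_nulls implies a validity mask");
            let mut def_levels = Vec::with_capacity(col.values.len());
            let mut values = Vec::new();
            for (&v, &ok) in col.values.iter().zip(validity.iter()) {
                def_levels.push(if ok { 1 } else { 0 });
                if ok {
                    values.push(v);
                }
            }
            Ok(Levels {
                def_levels: Some(def_levels),
                values,
            })
        }
    }
}
```

A real writer would avoid the copies and handle nested types, where definition levels can exceed 1 and repetition levels are also needed.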

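
The dictionary note above can be made concrete with a similar sketch. Again this is plain Rust with hypothetical names (`DictColumn`, `to_dense`, `reencode`, `to_dictionary_page`), not the Arrow or Parquet API. It contrasts a dictionary-to-dense conversion with passing the dictionary through unchanged, and shows why the dense route is not dictionary order preserving: re-encoding the dense values by first appearance is one plausible strategy, assumed here for illustration, and it generally yields a different dictionary order.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for an Arrow DictionaryArray of strings:
/// each key indexes into `dictionary`.
struct DictColumn {
    keys: Vec<usize>,
    dictionary: Vec<String>,
}

/// Dictionary-to-dense conversion: materialize one string per row.
/// Returns `None` if any key is out of bounds.
fn to_dense(col: &DictColumn) -> Option<Vec<String>> {
    col.keys
        .iter()
        .map(|&k| col.dictionary.get(k).cloned())
        .collect()
}

/// Re-encode dense values, assigning keys in order of first appearance.
/// The resulting dictionary order depends on the data, not on the
/// original dictionary.
fn reencode(dense: &[String]) -> DictColumn {
    let mut index: HashMap<&str, usize> = HashMap::new();
    let mut dictionary = Vec::new();
    let mut keys = Vec::with_capacity(dense.len());
    for s in dense {
        let next = dictionary.len();
        let k = *index.entry(s.as_str()).or_insert(next);
        if k == next {
            dictionary.push(s.clone());
        }
        keys.push(k);
    }
    DictColumn { keys, dictionary }
}

/// Dictionary-preserving path: hand the dictionary and keys through
/// unchanged (after a bounds check), so order survives the roundtrip
/// and no per-row strings are materialized.
fn to_dictionary_page(col: &DictColumn) -> Option<(&[String], &[usize])> {
    if col.keys.iter().all(|&k| k < col.dictionary.len()) {
        Some((col.dictionary.as_slice(), col.keys.as_slice()))
    } else {
        None
    }
}
```

With keys `[2, 0, 2, 1]` over the dictionary `["a", "b", "c"]`, the dense route followed by `reencode` produces the dictionary `["c", "a", "b"]`, while `to_dictionary_page` keeps `["a", "b", "c"]`.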


--
This message was sent by Atlassian Jira
(v8.3.4#803005)