[
https://issues.apache.org/jira/browse/ARROW-8421?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Andy Grove updated ARROW-8421:
------------------------------
Description:
This is the parent story. See subtasks for more information.
Notes from [~wesm]:
A couple of initial things to keep in mind
* Writes of both nullable (OPTIONAL) and non-nullable (REQUIRED) fields
* You can optimize the special case where a nullable field's data contains no nulls (see the sketch after this list)
* A good amount of code is required to handle converting from the Arrow physical form of various logical types to the Parquet equivalents; see [https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc] for details
* It would be worth thinking up front about how dictionary-encoded data is handled on both the Arrow write and Arrow read paths. In parquet-cpp we initially discarded Arrow DictionaryArrays on write (casting e.g. Dictionary to dense String), and through real-world need I was forced to revisit this (quite painfully) to enable Arrow dictionaries to survive round trips to Parquet format, and also to achieve better performance and memory use in both reads and writes. You can certainly do a dictionary-to-dense conversion like we did, but you may someday find yourselves doing the same painful refactor that I did to make dictionary write and read not only more efficient but also dictionary-order preserving.
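As a minimal, self-contained illustration of the first two bullets (not code from the parquet crate), the sketch below turns a flat nullable column into the (values, definition levels) pair a Parquet column writer consumes, including the null-free fast path. {{NullableColumn}} and {{to_parquet_batch}} are hypothetical names for illustration; for a flat OPTIONAL field the maximum definition level is 1, and a REQUIRED field needs no definition levels at all.
{code:rust}
/// Stand-in for an Arrow array with an optional validity bitmap;
/// this is not an actual arrow crate type.
struct NullableColumn {
    values: Vec<i32>,
    validity: Option<Vec<bool>>, // None means the column has no nulls at all
}

/// Returns (non-null values, definition levels) for a flat column, ready
/// for a Parquet column writer. For a flat OPTIONAL field the maximum
/// definition level is 1: level 1 = value present, level 0 = null.
fn to_parquet_batch(col: &NullableColumn) -> (Vec<i32>, Vec<i16>) {
    match &col.validity {
        // Fast path: a nullable field without a validity bitmap has no
        // nulls, so every definition level is 1 and the values buffer
        // passes through untouched.
        None => (col.values.clone(), vec![1; col.values.len()]),
        Some(validity) => {
            let mut values = Vec::with_capacity(col.values.len());
            let mut levels = Vec::with_capacity(col.values.len());
            for (v, &valid) in col.values.iter().zip(validity) {
                if valid {
                    values.push(*v); // non-null slot: emit value + level 1
                    levels.push(1);
                } else {
                    levels.push(0); // null slot: level only, no value
                }
            }
            (values, levels)
        }
    }
}

fn main() {
    let col = NullableColumn {
        values: vec![1, 0, 3], // slot 1 is null; its value slot is arbitrary
        validity: Some(vec![true, false, true]),
    };
    let (values, def_levels) = to_parquet_batch(&col);
    assert_eq!(values, vec![1, 3]);
    assert_eq!(def_levels, vec![1, 0, 1]);
}
{code}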
Notes from [~sunchao]:
I roughly skimmed through the C++ implementation and think that, at a high level, we need to do the following:
# implement a method similar to {{WriteArrow}} in [column_writer.cc|https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_writer.cc]. We can further break this up into smaller pieces such as dictionary/non-dictionary, primitive types, booleans, timestamps, dates, and so on.
# implement an arrow writer in the parquet crate [here|https://github.com/apache/arrow/tree/master/rust/parquet/src/arrow]. This needs to offer APIs similar to those in [writer.h|https://github.com/apache/arrow/blob/master/cpp/src/parquet/arrow/writer.h]; a rough API sketch follows below.
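As a rough sketch of what such an API could look like in Rust: all names here ({{ArrowWriter}}, {{try_new}}, {{write}}, {{close}}) and the stub types are assumptions for illustration, loosely mirroring {{FileWriter::Open}}, {{WriteTable}}, and {{Close}} in writer.h. The real implementation would build on the parquet crate's existing file writer and Arrow's {{RecordBatch}} rather than these stubs.
{code:rust}
use std::io::Write;

// Stub types so the sketch stands alone; the real writer would use
// arrow::record_batch::RecordBatch, arrow::datatypes::Schema, and the
// parquet crate's file writer instead.
pub struct Schema;
pub struct RecordBatch;
pub struct WriterProperties;
pub type Result<T> = std::result::Result<T, String>;

/// Hypothetical Arrow-to-Parquet writer, analogous to FileWriter in writer.h.
pub struct ArrowWriter<W: Write> {
    sink: W,
    schema: Schema,
    props: WriterProperties,
}

impl<W: Write> ArrowWriter<W> {
    /// Convert the Arrow schema to a Parquet schema and open the
    /// underlying Parquet file writer (analogous to FileWriter::Open).
    pub fn try_new(sink: W, schema: Schema, props: WriterProperties) -> Result<Self> {
        Ok(ArrowWriter { sink, schema, props })
    }

    /// Write one RecordBatch as a row group: for each leaf column,
    /// compute def/rep levels and hand the values to a type-specific
    /// column writer (the WriteArrow analogue from step 1).
    pub fn write(&mut self, _batch: &RecordBatch) -> Result<()> {
        Ok(()) // per-column dispatch elided in this sketch
    }

    /// Flush any open row group and write the Parquet footer
    /// (analogous to FileWriter::Close).
    pub fn close(mut self) -> Result<()> {
        self.sink.flush().map_err(|e| e.to_string())
    }
}
{code}
Usage would then be along the lines of {{let mut w = ArrowWriter::try_new(file, schema, props)?; w.write(&batch)?; w.close()?;}}.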
> [Rust] [Parquet] Implement parquet writer
> -----------------------------------------
>
> Key: ARROW-8421
> URL: https://issues.apache.org/jira/browse/ARROW-8421
> Project: Apache Arrow
> Issue Type: Improvement
> Components: Rust
> Reporter: Andy Grove
> Priority: Major
> Fix For: 1.0.0
>
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)