adampinky85 commented on issue #43682:
URL: https://github.com/apache/arrow/issues/43682#issuecomment-2314431937
Thanks @mapleFU. With our current non-streaming approach (see below), we build an Arrow schema that declares the string dictionary fields, and these are then preserved in the Parquet file via the `store_schema` approach you suggest.
With the streaming API (see the top of the issue), we are unable to set
dictionary fields. Am I right that it will therefore always store these fields
as byte-array strings rather than as dictionary-encoded integers?
```cpp
// schema and builder
auto schema = arrow::schema({
    arrow::field("foo", arrow::timestamp(arrow::TimeUnit::MILLI)),
    arrow::field("bar", arrow::dictionary(arrow::int8(), arrow::utf8())),
    arrow::field("baz", arrow::float64()),
});

auto foo_builder = arrow::TimestampBuilder(
    arrow::timestamp(arrow::TimeUnit::MILLI), arrow::default_memory_pool());
auto bar_builder = arrow::StringDictionaryBuilder{};
auto baz_builder = arrow::DoubleBuilder();
...

// parquet file output
const auto parquet_properties =
    parquet::WriterProperties::Builder()
        .compression(arrow::Compression::SNAPPY)
        ->data_page_version(parquet::ParquetDataPageVersion::V2)
        ->enable_dictionary()
        ->encoding(parquet::Encoding::DELTA_BINARY_PACKED)
        ->version(parquet::ParquetVersion::PARQUET_2_6)
        ->build();

// required to store the Arrow schema so pandas can retrieve categorical types
const auto arrow_properties =
    parquet::ArrowWriterProperties::Builder().store_schema()->build();
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]