adampinky85 commented on issue #43682:
URL: https://github.com/apache/arrow/issues/43682#issuecomment-2314431937
Thanks @mapleFU. With our current non-streaming approach (see below), we build an Arrow schema that declares the string dictionary fields, and these are then preserved in the Parquet file via the `store_schema` approach you suggest.
With the streaming API (see the top of the issue), we are unable to set
dictionary fields. Am I right that it will therefore always store these fields
as byte-array strings rather than as dictionary-encoded integers?
```cpp
// schema and builder
auto schema = arrow::schema({
    arrow::field("foo", arrow::timestamp(arrow::TimeUnit::MILLI)),
    arrow::field("bar", arrow::dictionary(arrow::int8(), arrow::utf8())),
    arrow::field("baz", arrow::float64()),
});

auto foo_builder = arrow::TimestampBuilder(
    arrow::timestamp(arrow::TimeUnit::MILLI), arrow::default_memory_pool());
auto bar_builder = arrow::StringDictionaryBuilder{};
auto baz_builder = arrow::DoubleBuilder();
...

// parquet file output
const auto parquet_properties =
    parquet::WriterProperties::Builder()
        .compression(arrow::Compression::SNAPPY)
        ->data_page_version(parquet::ParquetDataPageVersion::V2)
        ->enable_dictionary()
        ->encoding(parquet::Encoding::DELTA_BINARY_PACKED)
        ->version(parquet::ParquetVersion::PARQUET_2_6)
        ->build();

// required to store the Arrow schema so pandas can retrieve categorical types
const auto arrow_properties =
    parquet::ArrowWriterProperties::Builder().store_schema()->build();
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]