lesterfan opened a new pull request, #45685:
URL: https://github.com/apache/arrow/pull/45685

   ### Rationale for this change
   This fixes the bug reported in GitHub issue https://github.com/apache/arrow/issues/30302.
   
   ### What changes are included in this PR?
   This PR does the following:
   1. Replaces the hardcoded `int32` inferred dictionary index type with the faithful index type stored in the parquet file metadata.
   1. Replaces a `View` call with a more permissive `ViewOrCastChunkedArray` in `TransferDictionary`, where the built dictionary is transferred to the caller. This function falls back to an explicit cast if the `View` fails, allowing e.g. `dictionary<values=string, indices=int32, ordered=0>` to be interpreted as `dictionary<values=string, indices=int8, ordered=0>` even though the layouts aren't equal (a rough sketch of this fallback is shown after this list).
   1. Adds a unit test which passes on this branch and fails on main (because the schema types `dictionary<values=string, indices=int8, ordered=0>` and `dictionary<values=string, indices=int32, ordered=0>` are not equal there).
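   
   For reference, here is a minimal sketch of the view-then-cast fallback idea, expressed with the public `ChunkedArray::View` and `arrow::compute::Cast` APIs. The helper name and signature below are illustrative only and are not the exact code added to `TransferDictionary`:
   ```
   #include <arrow/api.h>
   #include <arrow/compute/api.h>
   
   // Illustrative helper (hypothetical name): try a zero-copy View first and
   // fall back to an explicit Cast when the physical layouts don't match.
   arrow::Result<std::shared_ptr<arrow::ChunkedArray>> ViewOrCast(
           const std::shared_ptr<arrow::ChunkedArray>& chunked,
           const std::shared_ptr<arrow::DataType>& target_type) {
       auto viewed = chunked->View(target_type);
       if (viewed.ok()) {
           return viewed;
       }
       // View failed (e.g. the dictionary index widths differ), so do a real cast.
       ARROW_ASSIGN_OR_RAISE(
               arrow::Datum casted,
               arrow::compute::Cast(arrow::Datum(chunked), target_type));
       return casted.chunked_array();
   }
   ```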
   
   Note that there is a performance overhead when casting. I did the following 
experiment to measure this.
   
   With the code:
   ```
   #include <stdio.h>
   
   #include <arrow/api.h>
   #include <arrow/io/api.h>
   #include <parquet/arrow/reader.h>
   #include <parquet/arrow/writer.h>
   
   arrow::Status arrow_test() {
       arrow::MemoryPool* pool = arrow::default_memory_pool();
   
       std::shared_ptr<arrow::Array> values;
       arrow::DictionaryBuilder<arrow::StringType> dict_builder(pool);
       for (int i = 0; i < 100'000'000; ++i) {
           ARROW_RETURN_NOT_OK(dict_builder.Append("abc"));
           ARROW_RETURN_NOT_OK(dict_builder.Append("def"));
           ARROW_RETURN_NOT_OK(dict_builder.Append("ghi"));
       }
       ARROW_RETURN_NOT_OK(dict_builder.Finish(&values));
   
       auto dict_type = dict_builder.type();
       auto schema = arrow::schema({arrow::field("x", dict_type)});
   
       auto table = arrow::Table::Make(schema, {values});
   
       // Replace with a real filepath.
       const char* filepath = "/not/a/real/filepath";
       std::shared_ptr<arrow::io::FileOutputStream> outfile;
       ARROW_ASSIGN_OR_RAISE(outfile, arrow::io::FileOutputStream::Open(filepath));
   
       auto writer_props = parquet::WriterProperties::Builder().build();
       // This is important; if store_schema() is not set, the writer won't add
       // the "ARROW:schema" (base64-encoded schema) metadata to the parquet
       // file, and the reader will read the column as a string rather than a
       // dictionary.
       auto arrow_writer_props =
               parquet::ArrowWriterProperties::Builder().store_schema()->build();
   
       if (auto write_table_status = parquet::arrow::WriteTable(
                   *table, pool, outfile, 1024, writer_props, arrow_writer_props);
           !write_table_status.ok()) {
           printf("Error writing table to file: %s\n",
                  write_table_status.ToString().c_str());
           return write_table_status;
       }
   
       std::shared_ptr<arrow::io::ReadableFile> infile;
       ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open(filepath));
       std::unique_ptr<parquet::arrow::FileReader> reader;
       ARROW_ASSIGN_OR_RAISE(reader, parquet::arrow::OpenFile(infile, pool));
   
       std::shared_ptr<arrow::Table> table_round_trip;
       ARROW_RETURN_NOT_OK(reader->ReadTable(&table_round_trip));
   
       printf("table: %s\n", table->ToString().c_str());
       printf("table_round_trip: %s\n", table_round_trip->ToString().c_str());
   
       return arrow::Status::OK();
   }
   
   int main() {
       auto status = arrow_test();
       if (!status.ok()) {
           printf("Not ok!, status: %s\n", status.ToString().c_str());
           return 1;
       }
       printf("ok\n");
   }
   ```
   I wrote a parquet file with 300 million rows but only three unique values 
(which fit into an `int8_t`). I then commented out the code which writes the 
parquet file to purely measure the read performance. Here are the results:
   ```
   (without the change locally, 300M rows):
   ./reproduce_experiment  125.04s user 1.01s system 98% cpu 2:07.80 total
   (with the change locally, 300M rows):
   ./reproduce_experiment  139.79s user 0.98s system 99% cpu 2:20.95 total
   ``` 
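   
   (Aside: the numbers above come from timing the whole binary with the shell's `time`. A minimal sketch for isolating just the read path in-process, assuming the same file written by the snippet above, could look like the following; the `TimeRead` helper is purely illustrative.)
   ```
   #include <chrono>
   #include <stdio.h>
   
   #include <arrow/api.h>
   #include <arrow/io/api.h>
   #include <parquet/arrow/reader.h>
   
   // Time only the ReadTable() call for a previously written parquet file.
   arrow::Status TimeRead(const char* filepath) {
       std::shared_ptr<arrow::io::ReadableFile> infile;
       ARROW_ASSIGN_OR_RAISE(infile, arrow::io::ReadableFile::Open(filepath));
       std::unique_ptr<parquet::arrow::FileReader> reader;
       ARROW_ASSIGN_OR_RAISE(
               reader, parquet::arrow::OpenFile(infile, arrow::default_memory_pool()));
   
       const auto start = std::chrono::steady_clock::now();
       std::shared_ptr<arrow::Table> table;
       ARROW_RETURN_NOT_OK(reader->ReadTable(&table));
       const double seconds =
               std::chrono::duration<double>(std::chrono::steady_clock::now() - start).count();
   
       printf("ReadTable took %.2f s for %lld rows\n", seconds,
              static_cast<long long>(table->num_rows()));
       return arrow::Status::OK();
   }
   ```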
   The performance impact is modest (about 10% more wall-clock time on this 300M-row read) and only occurs when a cast is actually necessary (i.e. there would be no impact if the written type already had an `int32` index). I would argue this is preferable to the status quo, where the returned type is incorrect (i.e. not faithful to the original type that was written).
   
   An alternative implementation that would avoid this cast entirely would be to template the `::arrow::BinaryDictionary32Builder` [here](https://github.com/apache/arrow/blob/784aa6faf69f5cf135e09976a281dea9ebf58166/cpp/src/parquet/column_reader.cc#L2176) on the index type (instead of hard-coding `int32`) so that the correct bit width is used while reading the file. This would be a more involved change; a rough sketch of the general idea is shown below.
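   
   To illustrate that idea (independently of the parquet reader internals), the sketch below builds a dictionary with the index width requested by the stored schema instead of building with `int32` indices and casting afterwards. It only uses public Arrow builder APIs; the `BuildDictionaryWithIndexType` helper is hypothetical and not part of this PR:
   ```
   #include <string>
   #include <vector>
   
   #include <arrow/api.h>
   
   // Illustrative only: assemble a dictionary array whose index type is chosen
   // at compile time (e.g. arrow::Int8Type) rather than hard-coded to int32.
   template <typename IndexType>
   arrow::Result<std::shared_ptr<arrow::Array>> BuildDictionaryWithIndexType(
           const std::vector<std::string>& dictionary_values,
           const std::vector<typename IndexType::c_type>& indices) {
       arrow::StringBuilder values_builder;
       ARROW_RETURN_NOT_OK(values_builder.AppendValues(dictionary_values));
       ARROW_ASSIGN_OR_RAISE(auto dictionary, values_builder.Finish());
   
       arrow::NumericBuilder<IndexType> index_builder;
       ARROW_RETURN_NOT_OK(index_builder.AppendValues(indices));
       ARROW_ASSIGN_OR_RAISE(auto index_array, index_builder.Finish());
   
       auto dict_type = arrow::dictionary(
               arrow::TypeTraits<IndexType>::type_singleton(), arrow::utf8());
       return arrow::DictionaryArray::FromArrays(dict_type, index_array, dictionary);
   }
   
   // Usage:
   //   BuildDictionaryWithIndexType<arrow::Int8Type>({"abc", "def", "ghi"},
   //                                                 {0, 1, 2, 0, 1, 2});
   ```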
   
   ### Are these changes tested?
   Yes, though the test is fairly basic; let me know if you guys have any 
suggestions on how to improve this!
   
   ### Are there any user-facing changes?
   Yes, this PR implements a bug fix to a user-facing API.
   

