[GitHub] [arrow] agoncharuk opened a new issue, #35595: [C++] Dataset: Invalid output schema when both projected schema and filter are set

via GitHub Mon, 15 May 2023 10:23:21 -0700


agoncharuk opened a new issue, #35595:
URL: https://github.com/apache/arrow/issues/35595


   ### Describe the bug, including details regarding any error messages, version, and platform.
   
   Hello, Arrow community!
   I believe I have found a bug in Arrow 12 when using `ParquetFileFormat`. The following test reproduces it (the Parquet file was generated with the DuckDB TPC-H extension):
   ```
   #include <cassert>
   #include <memory>
   #include <string>
   #include <utility>
   #include <vector>
   
   #include <gtest/gtest.h>
   
   #include <arrow/filesystem/api.h>
   #include <arrow/dataset/file_parquet.h>
   
   std::shared_ptr<arrow::Schema> makeTestSchema(
       const std::vector<std::string>& colNames,
       const std::vector<std::shared_ptr<arrow::DataType>>& colTypes) {
       assert(colNames.size() == colTypes.size());
       arrow::FieldVector fields;
       // Pair each column name with its type, preserving the given order.
       for (size_t f = 0; f < colNames.size(); ++f) {
           fields.emplace_back(arrow::field(colNames[f], colTypes[f]));
       }
       return std::make_shared<arrow::Schema>(std::move(fields));
   }
   
   TEST(FileFormatTest, TestProjectionAndFilter) {
       auto schema = makeTestSchema(
           {
               "c_custkey",
               "c_name",
               "c_address",
               "c_nationkey",
               "c_phone",
               "c_acctbal",
               "c_mktsegment",
               "c_comment",
           },
           {
               arrow::int32(),
               arrow::utf8(),
               arrow::utf8(),
               arrow::int32(),
               arrow::utf8(),
               arrow::decimal128(15, 2),
               arrow::utf8(),
               arrow::utf8(),
           });
   
       auto descr = arrow::dataset::ProjectionDescr::FromNames(
           {"c_custkey", "c_nationkey", "c_name", "c_address"}, *schema).ValueOrDie();
   
       auto scanOpts = std::make_shared<arrow::dataset::ScanOptions>();
       scanOpts->projected_schema = descr.schema;
       scanOpts->projection = descr.expression;
   
       auto unbound = arrow::compute::call(
           "equal",
           {arrow::compute::field_ref(arrow::FieldRef{"c_name"}),
            arrow::compute::literal("Customer#000001186")});
       scanOpts->filter = unbound.Bind(*schema).ValueOrDie();
   
       auto fs = std::make_shared<arrow::fs::LocalFileSystem>();
       auto format = std::make_shared<arrow::dataset::ParquetFileFormat>();
       auto file = "testing/tpch/customer/part.0.parquet";
   
       arrow::dataset::FileSource source(file, fs);
       auto fragment = format->MakeFragment(source, schema).ValueOrDie();
   
       auto batchGenerator = fragment->ScanBatchesAsync(scanOpts).ValueOrDie();
       auto batch = batchGenerator().result().ValueOrDie();
       ASSERT_TRUE(batch != nullptr);
       EXPECT_EQ(4, batch->columns().size());
       EXPECT_TRUE(arrow::int32()->Equals(*batch->column(0)->type()));
       EXPECT_TRUE(arrow::int32()->Equals(*batch->column(1)->type()));
       EXPECT_TRUE(arrow::utf8()->Equals(*batch->column(2)->type()));
       EXPECT_TRUE(arrow::utf8()->Equals(*batch->column(3)->type()));
   } 
   ```
   The test fails because the returned batch has column types `{string, int32, int32, string}` instead of the expected `{int32, int32, string, string}`.
   After some quick debugging, I see that `InferColumnProjection` in `file_parquet.cpp` returns duplicated projected columns because it does not handle duplicates from `ScanOptions::MaterializedFields()`, which in turn returns a union of the fields used in the filter and in the projection, in that order (this is the expected behavior according to the documentation).
   Another thing that is not clear to me: `InferColumnProjection` returns indices for 5 fields, while the resulting batch generator produces batches with 4 columns. I could not find where the extra column is dropped.
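
To make the suspected mechanism concrete, here is a minimal sketch that models it with plain standard-library containers. It is not Arrow's code: `NaiveMaterializedFields` and `NaiveColumnIndices` are hypothetical stand-ins for the behavior described above, and the only assumption is that names are concatenated filter-first and then mapped to file column indices without deduplication.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for the union described above: filter fields first,
// then projection fields, with no deduplication.
std::vector<std::string> NaiveMaterializedFields(
    const std::vector<std::string>& filter_fields,
    const std::vector<std::string>& projection_fields) {
  std::vector<std::string> out(filter_fields);
  out.insert(out.end(), projection_fields.begin(), projection_fields.end());
  return out;
}

// Hypothetical stand-in for a column-index lookup: maps each field name to
// its position in the file schema and keeps any duplicates it is given.
std::vector<std::size_t> NaiveColumnIndices(
    const std::vector<std::string>& file_columns,
    const std::vector<std::string>& fields) {
  std::vector<std::size_t> indices;
  for (const auto& name : fields) {
    auto it = std::find(file_columns.begin(), file_columns.end(), name);
    assert(it != file_columns.end());  // every field must exist in the file
    indices.push_back(static_cast<std::size_t>(it - file_columns.begin()));
  }
  return indices;
}
```

With the eight `customer` columns from the test, a filter on `c_name`, and the projection `{c_custkey, c_nationkey, c_name, c_address}`, this model yields the five indices `1, 0, 3, 1, 2`, whose types are string, int32, int32, string, string. The first four match the type order reported by the failing test, although whether the real scanner simply keeps the first four columns is only a guess.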
   
   A few questions:
    * Is this indeed a bug, and is my use of the API correct? Are there any workarounds?
    * Where is the logic that truncates the 5 fields of the inferred schema to the 4 fields returned from the batch generator?
    * If this is a bug, what would be the correct fix (I do not mind contributing one)? I assume that `InferColumnProjection` should take duplicated column refs into account, and that `ScanOptions::MaterializedFields()` should return projected columns first and filter columns last.
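
The ordering proposed in the last question could look roughly like the sketch below. It uses plain strings rather than `arrow::FieldRef`, and `ProjectionFirstUnique` is a hypothetical helper name, not an existing Arrow function. It only illustrates the idea: projected columns first, filter-only columns appended afterwards, each name kept once.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Hypothetical helper, not part of Arrow: returns projected fields first,
// followed by fields used only by the filter, dropping repeated names while
// preserving first-seen order.
std::vector<std::string> ProjectionFirstUnique(
    const std::vector<std::string>& projection_fields,
    const std::vector<std::string>& filter_fields) {
  std::vector<std::string> out;
  std::unordered_set<std::string> seen;
  auto append_unique = [&](const std::vector<std::string>& names) {
    for (const auto& name : names) {
      // insert() reports whether the name was new; keep only first sightings.
      if (seen.insert(name).second) out.push_back(name);
    }
  };
  append_unique(projection_fields);
  append_unique(filter_fields);
  return out;
}
```

For the test above, `ProjectionFirstUnique({"c_custkey", "c_nationkey", "c_name", "c_address"}, {"c_name"})` returns exactly the four projected names in projection order, so a prefix of the materialized columns would line up with the projected schema. A filter on a column outside the projection would be appended at the end, where it could be dropped after filtering.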
   
   ### Component(s)
   
   C++, Parquet


