[GitHub] [arrow] westonpace commented on a change in pull request #12664: ARROW-15658: [C++] Parquet pushdown filtering fails if the filter expression uses numeric field references

GitBox Thu, 24 Mar 2022 14:46:12 -0700


westonpace commented on a change in pull request #12664:
URL: https://github.com/apache/arrow/pull/12664#discussion_r834752675




##########
File path: cpp/examples/arrow/dataset_parquet_scan_example.cc
##########
@@ -62,9 +71,11 @@ struct Configuration {
 
   // Indicates the filter by which rows will be filtered. This optimization can
   // make use of partition information and/or file metadata if possible.
-  cp::Expression filter =
-      cp::greater(cp::field_ref("total_amount"), cp::literal(1000.0f));
-
+  // Equivalent filter with field_name instead of field_index

Review comment:
       ```suggestion
   // fields can be referenced by name or by index.  This example assumes the 
schema:
   ```

##########
File path: cpp/examples/arrow/dataset_parquet_scan_example.cc
##########
@@ -39,6 +39,15 @@ namespace ds = arrow::dataset;
 
 namespace cp = arrow::compute;
 
+// @brief Run Example

Review comment:
       ```suggestion
   // \brief Run Example
   ```

##########
File path: cpp/examples/arrow/dataset_parquet_scan_example.cc
##########
@@ -186,5 +197,9 @@ int main(int argc, char** argv) {
   auto table = GetTableFromScanner(scanner);
   std::cout << "Table size: " << table->num_rows() << "\n";
 
+  std::cout << "Data : " << std::endl;

Review comment:
       ```suggestion
     std::cout << "Data: " << std::endl;
   ```

##########
File path: cpp/examples/arrow/dataset_parquet_scan_example.cc
##########
@@ -62,9 +71,11 @@ struct Configuration {
 
   // Indicates the filter by which rows will be filtered. This optimization can

Review comment:
       ```suggestion
     // Indicates the expression by which rows will be filtered. This 
optimization can
   ```

##########
File path: cpp/src/arrow/dataset/file_parquet.cc
##########
@@ -160,18 +160,34 @@ Status ResolveOneFieldRef(
     const std::unordered_map<std::string, const SchemaField*>& field_lookup,
     const std::unordered_set<std::string>& duplicate_fields,
     std::vector<int>* columns_selection) {
-  if (const std::string* name = field_ref.name()) {
+  auto resolve_field_ref = [&](const std::string* name) -> Status {
     auto it = field_lookup.find(*name);
     if (it != field_lookup.end()) {
       AddColumnIndices(*it->second, columns_selection);
     } else if (duplicate_fields.find(*name) != duplicate_fields.end()) {
-      // We shouldn't generally get here because SetProjection will reject 
such references
+      // We shouldn't generally get here because SetProjection will reject such
+      // references
       return Status::Invalid("Ambiguous reference to column '", *name,
                              "' which occurs more than once");
     }
     // "Virtual" column: field is not in file but is in the ScanOptions.
     // Ignore it here, as projection will pad the batch with a null column.
     return Status::OK();
+  };
+
+  if (const std::string* name = field_ref.name()) {
+    return resolve_field_ref(name);
+  } else if (const FieldPath* path = field_ref.field_path()) {
+    auto indices = path->indices();
+    if (indices.size() > 1) {
+      return Status::NotImplemented("Provided FieldRef ", field_ref.ToString(),
+                                    ", Nested FieldPaths are not supported!");
+    }
+    int index = indices[0];
+    auto schema_field = manifest.schema_fields[index];

Review comment:
       `manifest.schema_fields` will give us the fields in file order.  But, an 
integral field reference is actually a reference into the dataset schema.  So I 
think we want to be using the dataset schema here instead of the file manifest.
   
   We should add a test case for this as it is a rather subtle point...
   
   ```
   Parquet File 1:
   x | y | z
   1 | 2 | 3
   ```
   
   ```
   Parquet File 2:
   z | y | x
   3 | 2 | 1
   ```
   
   ```
   Dataset Schema:
    - x: int32
    - y: int32
    - z: int32
   ```
   
   `field_ref(0)` should yield `[1, 1]`




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace commented on a change in pull request #12664: ARROW-15658: [C++] Parquet pushdown filtering fails if the filter expression uses numeric field references

Reply via email to