[GitHub] [arrow] westonpace commented on a diff in pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

via GitHub Tue, 13 Jun 2023 07:23:23 -0700


westonpace commented on code in PR #35798:
URL: https://github.com/apache/arrow/pull/35798#discussion_r1228209673



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -248,7 +283,12 @@ Result<std::vector<int>> InferColumnProjection(const 
parquet::arrow::FileReader&
   }
 
   std::vector<int> columns_selection;
-  for (const auto& ref : field_refs) {
+  for (auto& ref : field_refs) {
+    // In the (unlikely) absence of a known dataset schema, we require that all
+    // materialized refs are named.
+    if (options.dataset_schema) {

Review Comment:
   Heh, at this point, if we don't have a dataset schema, I think something has 
gone very wrong.



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -224,6 +224,41 @@ Status ResolveOneFieldRef(
   return Status::OK();
 }
 
+bool IsNamedFieldRef(const FieldRef& ref) {
+  if (ref.IsName()) return true;
+  if (const auto* nested_refs = ref.nested_refs()) {
+    for (const auto& nested_ref : *nested_refs) {
+      if (!nested_ref.IsName()) return false;
+    }
+    return true;
+  }
+  return false;
+}

Review Comment:
   A minor thing but I wonder if we might want to add this directly to 
`FieldRef`?



##########
cpp/src/arrow/dataset/file_parquet_test.cc:
##########
@@ -694,5 +694,55 @@ TEST(TestParquetStatistics, NullMax) {
   EXPECT_EQ(stat_expression->ToString(), "(x >= 1)");
 }
 
+// Tests round-trip projection with nested/indexed FieldRefs
+// https://github.com/apache/arrow/issues/35579
+TEST(TestRoundTrip, ProjectedFieldRefs) {

Review Comment:
   We have some base classes / base tests that do some of this boilerplate 
already.  However, I don't recall off the top of my head how extensible they 
are (e.g. can you use them to test with this custom schema).  Did you take a 
look at the existing tests to see if you could modify one instead of the entire 
round trip?



##########
cpp/src/arrow/dataset/file_parquet.cc:
##########
@@ -224,6 +224,41 @@ Status ResolveOneFieldRef(
   return Status::OK();
 }
 
+bool IsNamedFieldRef(const FieldRef& ref) {
+  if (ref.IsName()) return true;
+  if (const auto* nested_refs = ref.nested_refs()) {
+    for (const auto& nested_ref : *nested_refs) {
+      if (!nested_ref.IsName()) return false;
+    }
+    return true;
+  }
+  return false;
+}
+
+// Converts a field ref into a position-independent ref (containing only a 
sequence of
+// names) based on the dataset schema. Returns `false` if no conversion was 
needed.
+Status MaybeConvertFieldRef(const FieldRef& ref, const Schema& dataset_schema,

Review Comment:
   Minor nit: Can we return `Result<FieldRef>` instead?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow] westonpace commented on a diff in pull request #35798: GH-35579: [C++] Support non-named FieldRefs in Parquet scanner

Reply via email to