Copilot commented on code in PR #50157:
URL: https://github.com/apache/arrow/pull/50157#discussion_r3401111504


##########
cpp/src/parquet/schema.h:
##########
@@ -382,6 +382,12 @@ class PARQUET_EXPORT ColumnDescriptor {
     return la ? GetSortOrder(la, pt) : GetSortOrder(converted_type(), pt);
   }
 
+  // Whether ColumnOrder-governed min/max values have a supported ordering.
+  bool can_use_stats() const {
+    return column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER &&
+           sort_order() != SortOrder::UNKNOWN;
+  }

Review Comment:
   `ColumnDescriptor::can_use_stats()` is easy to misinterpret as “any 
statistics are usable”. However it currently returns false for 
`ColumnOrder::UNDEFINED`, even though legacy chunk min/max statistics can still 
be used in that case. Since this is a new public method, consider renaming it 
to reflect that it only indicates whether *ordered / V2 min/max* can be trusted 
(TYPE_DEFINED_ORDER + known SortOrder).



##########
cpp/src/parquet/schema_test.cc:
##########
@@ -660,6 +660,22 @@ TEST(TestColumnDescriptor, TestAttrs) {
   ASSERT_EQ(expected_descr, descr2.ToString());
 }
 
+TEST(TestColumnDescriptor, CanUseStats) {
+  NodePtr node = Int32("name");
+  ColumnDescriptor descr(node, 0, 0);
+  EXPECT_TRUE(descr.can_use_stats());
+
+  auto primitive_node = std::static_pointer_cast<PrimitiveNode>(node);
+  primitive_node->SetColumnOrder(ColumnOrder::undefined_);
+  EXPECT_FALSE(ColumnDescriptor(node, 0, 0).can_use_stats());
+
+  primitive_node->SetColumnOrder(ColumnOrder::unknown_);
+  EXPECT_FALSE(ColumnDescriptor(node, 0, 0).can_use_stats());
+
+  node = PrimitiveNode::Make("name", Repetition::REQUIRED, Type::INT96);
+  EXPECT_FALSE(ColumnDescriptor(node, 0, 0).can_use_stats());
+}

Review Comment:
   Test uses the old `can_use_stats()` name. If the helper is renamed for 
clarity, the test should be updated accordingly.



##########
cpp/src/parquet/metadata.cc:
##########
@@ -336,12 +359,8 @@ class ColumnChunkMetaData::ColumnChunkMetaDataImpl {
     {
       const std::lock_guard<std::mutex> guard(stats_mutex_);
       if (possible_encoded_stats_ == nullptr) {
-        possible_encoded_stats_ =
-            
std::make_shared<EncodedStatistics>(FromThrift(column_metadata_->statistics));
-        if (descr_->sort_order() == SortOrder::UNKNOWN) {
-          // If the column SortOrder is Unknown we can't trust max/min.
-          possible_encoded_stats_->ClearMinMax();
-        }
+        possible_encoded_stats_ = std::make_shared<EncodedStatistics>(
+            FromThrift(column_metadata_->statistics, 
GetStatisticsMinMaxField(*descr_)));
       }

Review Comment:
   PR description mentions ignoring *page index* min/max when column order is 
missing/unsupported or sort order is unknown, but the new guard here only 
affects chunk statistics (EncodedStatistics / Statistics). I don’t see any 
corresponding ColumnOrder/SortOrder gating in the page index reader/decoding 
path, so page index min/max may still be trusted incorrectly for 
UNKNOWN/UNDEFINED ColumnOrder. Either implement the page-index-side guard or 
adjust the PR description/scope.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to