Copilot commented on code in PR #50157:
URL: https://github.com/apache/arrow/pull/50157#discussion_r3401111504
##########
cpp/src/parquet/schema.h:
##########
@@ -382,6 +382,12 @@ class PARQUET_EXPORT ColumnDescriptor {
return la ? GetSortOrder(la, pt) : GetSortOrder(converted_type(), pt);
}
+ // Whether ColumnOrder-governed min/max values have a supported ordering.
+ bool can_use_stats() const {
+ return column_order().get_order() == ColumnOrder::TYPE_DEFINED_ORDER &&
+ sort_order() != SortOrder::UNKNOWN;
+ }
Review Comment:
`ColumnDescriptor::can_use_stats()` is easy to misinterpret as “any
statistics are usable”. However it currently returns false for
`ColumnOrder::UNDEFINED`, even though legacy chunk min/max statistics can still
be used in that case. Since this is a new public method, consider renaming it
to reflect that it only indicates whether *ordered / V2 min/max* can be trusted
(TYPE_DEFINED_ORDER + known SortOrder).
##########
cpp/src/parquet/schema_test.cc:
##########
@@ -660,6 +660,22 @@ TEST(TestColumnDescriptor, TestAttrs) {
ASSERT_EQ(expected_descr, descr2.ToString());
}
+TEST(TestColumnDescriptor, CanUseStats) {
+ NodePtr node = Int32("name");
+ ColumnDescriptor descr(node, 0, 0);
+ EXPECT_TRUE(descr.can_use_stats());
+
+ auto primitive_node = std::static_pointer_cast<PrimitiveNode>(node);
+ primitive_node->SetColumnOrder(ColumnOrder::undefined_);
+ EXPECT_FALSE(ColumnDescriptor(node, 0, 0).can_use_stats());
+
+ primitive_node->SetColumnOrder(ColumnOrder::unknown_);
+ EXPECT_FALSE(ColumnDescriptor(node, 0, 0).can_use_stats());
+
+ node = PrimitiveNode::Make("name", Repetition::REQUIRED, Type::INT96);
+ EXPECT_FALSE(ColumnDescriptor(node, 0, 0).can_use_stats());
+}
Review Comment:
Test uses the old `can_use_stats()` name. If the helper is renamed for
clarity, the test should be updated accordingly.
##########
cpp/src/parquet/metadata.cc:
##########
@@ -336,12 +359,8 @@ class ColumnChunkMetaData::ColumnChunkMetaDataImpl {
{
const std::lock_guard<std::mutex> guard(stats_mutex_);
if (possible_encoded_stats_ == nullptr) {
- possible_encoded_stats_ =
-
std::make_shared<EncodedStatistics>(FromThrift(column_metadata_->statistics));
- if (descr_->sort_order() == SortOrder::UNKNOWN) {
- // If the column SortOrder is Unknown we can't trust max/min.
- possible_encoded_stats_->ClearMinMax();
- }
+ possible_encoded_stats_ = std::make_shared<EncodedStatistics>(
+ FromThrift(column_metadata_->statistics,
GetStatisticsMinMaxField(*descr_)));
}
Review Comment:
PR description mentions ignoring *page index* min/max when column order is
missing/unsupported or sort order is unknown, but the new guard here only
affects chunk statistics (EncodedStatistics / Statistics). I don’t see any
corresponding ColumnOrder/SortOrder gating in the page index reader/decoding
path, so page index min/max may still be trusted incorrectly for
UNKNOWN/UNDEFINED ColumnOrder. Either implement the page-index-side guard or
adjust the PR description/scope.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]