wgtmac commented on code in PR #35989:
URL: https://github.com/apache/arrow/pull/35989#discussion_r1236228812
##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType>
{
comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
TypedStatisticsImpl::Reset();
has_null_count_ = true;
- has_distinct_count_ = true;
+ has_distinct_count_ = false;
Review Comment:
nit: add a comment here to explain why has_null_count_ and
has_distinct_count_ have different defaults as we have discussed here
##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType>
{
comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
TypedStatisticsImpl::Reset();
has_null_count_ = true;
- has_distinct_count_ = true;
+ has_distinct_count_ = false;
}
+ // Currently, this function will tend to be called by testing.
Review Comment:
Ditto, we can say it is used to create stats from provided values.
##########
cpp/src/parquet/statistics.cc:
##########
@@ -463,6 +463,8 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
public:
using T = typename DType::c_type;
+ // Currently, this function will be called by ColumnWriter to create the
+ // statistics collector during write.
Review Comment:
I'd rather not to say who will call this. Instead, we can simply say it is
used to create an empty stats and update it from follow-up writes.
##########
cpp/src/parquet/statistics.cc:
##########
@@ -472,34 +474,40 @@ class TypedStatisticsImpl : public TypedStatistics<DType>
{
comparator_ = std::static_pointer_cast<TypedComparator<DType>>(comp);
TypedStatisticsImpl::Reset();
has_null_count_ = true;
- has_distinct_count_ = true;
+ has_distinct_count_ = false;
}
+ // Currently, this function will tend to be called by testing.
TypedStatisticsImpl(const T& min, const T& max, int64_t num_values, int64_t
null_count,
int64_t distinct_count)
: pool_(default_memory_pool()),
min_buffer_(AllocateBuffer(pool_, 0)),
max_buffer_(AllocateBuffer(pool_, 0)) {
TypedStatisticsImpl::IncrementNumValues(num_values);
TypedStatisticsImpl::IncrementNullCount(null_count);
- IncrementDistinctCount(distinct_count);
+ SetDistinctCount(distinct_count);
Copy(min, &min_, min_buffer_.get());
Copy(max, &max_, max_buffer_.get());
has_min_max_ = true;
}
+ // Currently, this function will tend to be called by ColumnChunkMetaData
during read.
Review Comment:
Ditto, we can say it is used to create stats from a thrift Statistics object.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]