wgtmac commented on code in PR #34355:
URL: https://github.com/apache/arrow/pull/34355#discussion_r1118201762
##########
cpp/src/parquet/statistics.cc:
##########
@@ -495,9 +495,12 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
bool has_null_count, bool has_distinct_count,
MemoryPool* pool)
: TypedStatisticsImpl(descr, pool) {
TypedStatisticsImpl::IncrementNumValues(num_values);
+ // Currently, `has_null_count` argument is not used.
+ // Internal has_null_count_ would always be true.
Review Comment:
IMO, if `has_null_count` is not used we should remove it to reduce confusion
and future misuse. Otherwise, we should respect it from the input argument.
##########
cpp/src/parquet/statistics.cc:
##########
@@ -620,7 +626,11 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
private:
const ColumnDescriptor* descr_;
bool has_min_max_ = false;
+ // Currently, has_null_count_ is always true, and would
Review Comment:
This is confusing because the default value below is `false`
##########
cpp/src/parquet/statistics_test.cc:
##########
@@ -377,6 +377,35 @@ class TestStatistics : public PrimitiveTypedTest<TestType>
{
ASSERT_EQ(total->max(), std::max(statistics1->max(), statistics2->max()));
}
+ void TestMergeEmpty() {
Review Comment:
It would be better to add a case where either side does not have a valid
stats (min/max/null_count/distinct_count) meaning that the merged stats is
dropped.
##########
cpp/src/parquet/statistics.cc:
##########
@@ -620,7 +626,11 @@ class TypedStatisticsImpl : public TypedStatistics<DType> {
private:
const ColumnDescriptor* descr_;
bool has_min_max_ = false;
+ // Currently, has_null_count_ is always true, and would
+ // be collected and encoded to `EncodedStatistics`.
bool has_null_count_ = false;
+ // Currently, has_distinct_count_ would not be encoded
Review Comment:
It seems that `statistics_` already has a set of `hasXXX` variables. Can we
reuse those directly?
```cpp
class PARQUET_EXPORT EncodedStatistics {
std::shared_ptr<std::string> max_, min_;
bool is_signed_ = false;
public:
EncodedStatistics()
: max_(std::make_shared<std::string>()),
min_(std::make_shared<std::string>()) {}
int64_t null_count = 0;
int64_t distinct_count = 0;
bool has_min = false;
bool has_max = false;
bool has_null_count = false;
bool has_distinct_count = false;
```
##########
cpp/src/parquet/statistics.cc:
##########
@@ -553,10 +556,13 @@ class TypedStatisticsImpl : public TypedStatistics<DType>
{
void Merge(const TypedStatistics<DType>& other) override {
this->num_values_ += other.num_values();
if (other.HasNullCount()) {
- this->statistics_.null_count += other.null_count();
+ this->IncrementNullCount(other.null_count());
}
- if (other.HasDistinctCount()) {
- this->statistics_.distinct_count += other.distinct_count();
+ if (HasDistinctCount() && other.HasDistinctCount()) {
+ this->IncrementDistinctCount(other.distinct_count());
+ } else {
+ this->has_distinct_count_ = false;
+ this->has_distinct_count_ = 0;
Review Comment:
```suggestion
statistics_.distinct_count = 0;
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]