etseidl commented on code in PR #45285:
URL: https://github.com/apache/arrow/pull/45285#discussion_r1918852772


##########
cpp/src/parquet/size_statistics.cc:
##########
@@ -64,14 +64,19 @@ void SizeStatistics::IncrementUnencodedByteArrayDataBytes(int64_t value) {
 }
 
 void SizeStatistics::Validate(const ColumnDescriptor* descr) const {
-  if (repetition_level_histogram.size() !=
-      static_cast<size_t>(descr->max_repetition_level() + 1)) {
-    throw ParquetException("Repetition level histogram size mismatch");
-  }
-  if (definition_level_histogram.size() !=
-      static_cast<size_t>(descr->max_definition_level() + 1)) {
-    throw ParquetException("Definition level histogram size mismatch");
-  }
+  auto validate_histogram = [](const std::vector<int64_t>& histogram, int16_t max_level,
+                               const std::string& name) {
+    if (max_level == 0 && histogram.empty()) {
+      return;
+    }
+    if (histogram.size() != static_cast<size_t>(max_level + 1)) {
+      throw ParquetException(name + " level histogram size mismatch");
+    }
+  };
+  validate_histogram(repetition_level_histogram, descr->max_repetition_level(),
+                     "Repetition");
+  validate_histogram(definition_level_histogram, descr->max_definition_level(),
+                     "Definition");

Review Comment:
    You may want to add a threshold argument to `validate_histogram`, because the definition level histogram may be omitted when the max definition level is 0 _or_ 1. arrow-rs and cudf both use a cutoff of `0` on write (not sure what parquet-java does, but you wrote that 😄), but down the road a writer might follow the spec exactly and omit the histogram for max def level 1, and this validation would then fail.
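
    Roughly what I have in mind (untested sketch; the `omit_threshold` parameter name is just for illustration, not an existing API):

    ```cpp
      // Sketch: allow the histogram to be omitted up to a configurable max level,
      // so a writer that follows the spec (omit the definition level histogram
      // when max def level <= 1) still validates cleanly.
      auto validate_histogram = [](const std::vector<int64_t>& histogram, int16_t max_level,
                                   int16_t omit_threshold, const std::string& name) {
        if (max_level <= omit_threshold && histogram.empty()) {
          return;
        }
        if (histogram.size() != static_cast<size_t>(max_level + 1)) {
          throw ParquetException(name + " level histogram size mismatch");
        }
      };
      validate_histogram(repetition_level_histogram, descr->max_repetition_level(),
                         /*omit_threshold=*/0, "Repetition");
      validate_histogram(definition_level_histogram, descr->max_definition_level(),
                         /*omit_threshold=*/1, "Definition");
    ```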


