wgtmac commented on issue #14870:
URL: https://github.com/apache/arrow/issues/14870#issuecomment-1424793827

   @pitrou Could you please reopen this issue? One part is still missing: the 
C++ Parquet reader does not extract statistics from data page headers 
correctly. It reads only the deprecated `min`/`max` fields and ignores 
`min_value`/`max_value`: 
https://github.com/apache/arrow/blob/master/cpp/src/parquet/column_reader.cc#L214
   ```cpp
   // Extracts encoded statistics from V1 and V2 data page headers
   template <typename H>
   EncodedStatistics ExtractStatsFromHeader(const H& header) {
     EncodedStatistics page_statistics;
     if (!header.__isset.statistics) {
       return page_statistics;
     }
     const format::Statistics& stats = header.statistics;
     if (stats.__isset.max) {
       page_statistics.set_max(stats.max);
     }
     if (stats.__isset.min) {
       page_statistics.set_min(stats.min);
     }
     if (stats.__isset.null_count) {
       page_statistics.set_null_count(stats.null_count);
     }
     if (stats.__isset.distinct_count) {
       page_statistics.set_distinct_count(stats.distinct_count);
     }
     return page_statistics;
   }
   
   ```
   
   It should check `__isset.min_value` and `__isset.max_value` first and fall 
back to `min`/`max` only when they are unset, similar to what parquet-mr does: 
https://github.com/apache/parquet-mr/blob/5290bd5e0ee5dc30db0576e2bfc6eea335c465cf/parquet-hadoop/src/main/java/org/apache/parquet/format/converter/ParquetMetadataConverter.java#L797
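
A minimal sketch of the ordering suggested above, not the actual Arrow patch. To keep it self-contained it uses hypothetical stand-in types: `MockStatistics` mimics the Thrift-generated `format::Statistics` (with `isset` standing in for Thrift's `__isset`), and `MockEncodedStatistics` mimics only the `set_min`/`set_max` part of `EncodedStatistics`. The helper name `ExtractMinMax` is also made up for illustration.

```cpp
#include <string>

// Hypothetical stand-in for the Thrift-generated __isset flags.
struct MockIsSet {
  bool max = false;
  bool min = false;
  bool max_value = false;
  bool min_value = false;
};

// Hypothetical stand-in for format::Statistics (min/max fields only).
struct MockStatistics {
  std::string max;        // deprecated field
  std::string min;        // deprecated field
  std::string max_value;  // newer field
  std::string min_value;  // newer field
  MockIsSet isset;
};

// Hypothetical stand-in for EncodedStatistics (min/max setters only).
struct MockEncodedStatistics {
  std::string max;
  std::string min;
  bool has_max = false;
  bool has_min = false;
  void set_max(const std::string& v) { max = v; has_max = true; }
  void set_min(const std::string& v) { min = v; has_min = true; }
};

// Prefer min_value/max_value; use the deprecated min/max only when
// neither of the newer fields is present.
MockEncodedStatistics ExtractMinMax(const MockStatistics& stats) {
  MockEncodedStatistics out;
  if (stats.isset.max_value || stats.isset.min_value) {
    if (stats.isset.max_value) out.set_max(stats.max_value);
    if (stats.isset.min_value) out.set_min(stats.min_value);
  } else if (stats.isset.max || stats.isset.min) {
    // Assumption: a real reader would likely also need to check the
    // column's sort order before trusting these deprecated fields.
    if (stats.isset.max) out.set_max(stats.max);
    if (stats.isset.min) out.set_min(stats.min);
  }
  return out;
}
```

With this ordering, a page header that carries both pairs yields the `min_value`/`max_value` contents, while a header that only has the deprecated pair still produces statistics.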

