suxiaogang223 opened a new issue, #2079:
URL: https://github.com/apache/orc/issues/2079

   # Issue Description:
   > From Hive 1.1.0 onwards, the column statistics will also record if there 
are any null values within the row group by setting the hasNull flag. The 
hasNull flag is used by ORC’s predicate pushdown to better answer ‘IS NULL’ 
queries.
   
   We encountered an issue with the C++ implementation of the ORC reader when 
handling ORC files written with version 0.12. Specifically, files written in 
this version do not include the hasNull field in the column statistics 
metadata. While the Java implementation of the ORC reader handles this 
gracefully by defaulting hasNull to true when the field is absent, the C++ 
implementation does not handle this scenario correctly.
   **This issue makes it unsafe to push predicates like IS NULL down to the 
ORC reader. Because the missing hasNull field is read as false, every row 
group appears to contain no nulls, so all rows in the file are filtered out, 
leading to incorrect query results.**
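   
   To make the failure mode concrete, here is a deliberately simplified model 
of how an IS NULL predicate could be checked against row-group statistics. The 
names `TruthValue`, `evaluateIsNull` and `canSkipRowGroup` are hypothetical 
and are not taken from the ORC code base; the sketch only assumes that a row 
group is skipped when the statistics say it cannot contain a match.
   ```cpp
   // Simplified, hypothetical model of statistics-based pruning for IS NULL.
   // These names are illustrative only and do not come from the ORC sources.
   enum class TruthValue { No, Maybe };

   // With hasNull == false the statistics claim the row group has no nulls,
   // so IS NULL can never match; otherwise the row group has to be read.
   TruthValue evaluateIsNull(bool hasNull) {
       return hasNull ? TruthValue::Maybe : TruthValue::No;
   }

   // A row group is skipped only when the predicate certainly cannot match.
   bool canSkipRowGroup(TruthValue result) {
       return result == TruthValue::No;
   }
   ```
   Under this model, a file whose statistics omit hasNull and are read as 
false would have every row group skipped for IS NULL, even if nulls are 
present.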
   # Steps to Reproduce:
   1.   Use an ORC file written with ORC version 0.12 (without hasNull in its 
column statistics).
   2.   Attempt to read the file using the C++ ORC reader.
   ## Expected Behavior:
   The C++ ORC reader should default the hasNull field to true when it is 
absent, ensuring compatibility with older file versions.
   ## Observed Behavior:
   The C++ ORC reader defaults the hasNull field to false, resulting in 
incorrect metadata interpretation.
   
   # Comparison with Java Implementation:
   The Java implementation includes the following logic:
   ```java
   if (stats.hasHasNull()) {
       hasNull = stats.getHasNull();
   } else {
       hasNull = true;
   }
   ```
   In contrast, the C++ implementation directly uses the has_null value without 
any fallback logic:
   ```c++
   ColumnStatisticsImpl::ColumnStatisticsImpl(const proto::ColumnStatistics& pb) {
       stats_.setNumberOfValues(pb.number_of_values());
       stats_.setHasNull(pb.has_null());
   }
   ```
   # Suggested Fix:
   Introduce fallback logic in the C++ reader to set hasNull to true when the 
field is missing, similar to the Java implementation.
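   
   A minimal sketch of what that fallback could look like, not a tested patch. 
It assumes the generated protobuf class exposes a presence accessor named 
`has_has_null()` (protobuf's usual `has_<field>()` naming for an optional 
field called `has_null`); that name should be checked against the generated 
header. `FakeColumnStatistics` and `resolveHasNull` are hypothetical stand-ins 
so the example compiles on its own.
   ```cpp
   // Hypothetical stand-in for proto::ColumnStatistics so the sketch is
   // self-contained. The real class is generated by protobuf.
   struct FakeColumnStatistics {
       bool fieldPresent = false;  // true when the writer stored hasNull
       bool fieldValue = false;    // stored value, meaningful only if present

       // Assumed presence accessor, following protobuf's has_<field>() naming.
       bool has_has_null() const { return fieldPresent; }
       // Like a protobuf getter, returns the default (false) when absent.
       bool has_null() const { return fieldPresent ? fieldValue : false; }
   };

   // Mirrors the Java behaviour: an absent field is treated as "may contain
   // nulls", so no row group is wrongly pruned for IS NULL.
   template <typename Pb>
   bool resolveHasNull(const Pb& pb) {
       return pb.has_has_null() ? pb.has_null() : true;
   }
   ```
   In the constructor shown earlier, the call would then become something 
like `stats_.setHasNull(resolveHasNull(pb));`, again assuming the accessor 
name above is correct.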
   

