suxiaogang223 opened a new pull request, #2082:
URL: https://github.com/apache/orc/pull/2082

   ### What changes were proposed in this pull request?
   Closes issue: #2079
   Related PR: #2055
   Introduce fallback logic in the C++ reader to set hasNull to true when the 
field is missing, similar to the Java implementation.
   The Java implementation includes the following logic:
   ```java
   if (stats.hasHasNull()) {
       hasNull = stats.getHasNull();
   } else {
       hasNull = true;
   }
   ```
   In contrast, the C++ implementation directly uses the has_null value without 
any fallback logic:
   ```c++
   ColumnStatisticsImpl::ColumnStatisticsImpl(const proto::ColumnStatistics& pb) {
       stats_.setNumberOfValues(pb.number_of_values());
       stats_.setHasNull(pb.has_null());
   }
   ```
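
For illustration, the fallback could look roughly like the sketch below. It is not the actual patch: `FakeColumnStatistics` and `resolveHasNull` are hypothetical names, and the presence accessor `has_has_null()` is assumed to follow protobuf's usual `has_<field>()` naming for the `has_null` field.

```cpp
#include <optional>

// Hypothetical stand-in for the protobuf-generated proto::ColumnStatistics.
// Only the two accessors relevant to this change are modelled.
struct FakeColumnStatistics {
  std::optional<bool> hasNullField;  // empty when the writer omitted the field

  // Presence check; the name assumes protobuf's has_<field>() convention.
  bool has_has_null() const { return hasNullField.has_value(); }

  // Value accessor; like a protobuf optional bool, it yields false when absent.
  bool has_null() const { return hasNullField.value_or(false); }
};

// Mirrors the Java logic: a missing field is treated as "may contain nulls".
inline bool resolveHasNull(const FakeColumnStatistics& pb) {
  return pb.has_has_null() ? pb.has_null() : true;
}
```

With this shape, the constructor would pass `resolveHasNull(pb)` (or the equivalent inline expression) to `stats_.setHasNull(...)` instead of `pb.has_null()`.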
   ### Why are the changes needed?
   We encountered an issue with the C++ implementation of the ORC reader when 
handling ORC files written with version 0.12. Specifically, files written in 
this version do not include the hasNull field in the column statistics 
metadata. While the Java implementation of the ORC reader handles this 
gracefully by defaulting hasNull to true when the field is absent, the C++ 
implementation does not handle this scenario correctly.
   **This issue prevents predicates like IS NULL from being pushed down to the 
ORC reader correctly: the absent field reads as hasNull = false, so all rows in 
the file are filtered out, leading to incorrect query results.**
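
To make the failure mode concrete, here is a simplified sketch of how an IS NULL predicate might be evaluated against column statistics. The `Truth` enum and both functions are hypothetical and are not the ORC reader's actual search-argument code; the sketch only assumes that a reader skips data when statistics say no row can match.

```cpp
// Hypothetical evaluation result: No means "no row can match".
enum class Truth { No, Maybe };

// Evaluate "col IS NULL" using only the hasNull statistic.
inline Truth evaluateIsNull(bool hasNull) {
  // Without recorded nulls, the predicate cannot match any row.
  return hasNull ? Truth::Maybe : Truth::No;
}

// A reader can skip the data whenever the predicate cannot match.
inline bool canSkip(Truth result) { return result == Truth::No; }
```

Under these assumptions, a file that omits the field and is read with hasNull = false yields `Truth::No`, so rows that actually contain nulls are skipped. Defaulting to true yields `Truth::Maybe`, and the rows are read and checked.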
   ### How was this patch tested?
   I tested this using the [Doris](https://github.com/apache/doris) external 
pipeline: 
   https://github.com/apache/doris/pull/45104
   https://github.com/apache/doris-thirdparty/pull/259
   
   ### Was this patch authored or co-authored using generative AI tooling?
   No
   

