[GitHub] [spark] sadikovi commented on pull request #37419: [SPARK-39833][SQL] Fix a rare correctness issue with count() in the case of overlapping partition and data columns in Parquet DSv1

GitBox Thu, 11 Aug 2022 23:02:04 -0700


sadikovi commented on PR #37419:
URL: https://github.com/apache/spark/pull/37419#issuecomment-1212757388


   Not exactly: the filter references columns that do exist in the file.
   Apparently it is the projection that matters in the code.
   
   Here is what they have in the javadoc:
   ```
      * @param paths
      *          the paths of the columns used in the actual projection; a column not being part of the projection will be
      *          handled as containing {@code null} values only even if the column has values written in the file
   ```
   
   https://github.com/apache/parquet-mr/blob/0819356a9dafd2ca07c5eab68e2bffeddc3bd3d9/parquet-column/src/main/java/org/apache/parquet/internal/filter2/columnindex/ColumnIndexFilter.java#L80
   
   I am not very familiar with the implementation, but I think the library should return all rows instead of no rows.
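
To make the javadoc's rule concrete, here is a toy model in plain Java. It is only a sketch of the documented behaviour, not parquet-mr's implementation: the class `ProjectionNullSketch` and the helper `countMatching` are hypothetical names, and the "file" is just an in-memory map. It assumes an equality predicate on a single integer column.

```java
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ProjectionNullSketch {

    /**
     * Hypothetical helper (not a parquet-mr API): counts rows where
     * filterColumn == expected, applying the rule from the javadoc that a
     * column outside the projection is handled as containing only nulls.
     */
    public static long countMatching(Map<String, List<Integer>> fileColumns,
                                     Set<String> projection,
                                     String filterColumn,
                                     int expected) {
        // Column is not projected: treated as null-only, so an equality
        // predicate cannot match any row, even if the file has values.
        if (!projection.contains(filterColumn)) {
            return 0L;
        }
        List<Integer> values = fileColumns.get(filterColumn);
        if (values == null) {
            return 0L; // column is genuinely absent from the file
        }
        return values.stream().filter(v -> v != null && v == expected).count();
    }

    public static void main(String[] args) {
        // Column "a" has values written in the file.
        Map<String, List<Integer>> file = Map.of("a", List.of(1, 1, 2));

        // "a" is part of the projection: the filter sees the real values.
        System.out.println(countMatching(file, Set.of("a"), "a", 1)); // prints 2

        // "a" is not part of the projection: it is treated as all nulls.
        System.out.println(countMatching(file, Set.of("b"), "a", 1)); // prints 0
    }
}
```

In this model the second call returns 0 even though the file contains matching values, which is the kind of empty result described above when the filter column is left out of the projection.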


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


