danny0405 commented on code in PR #10389:
URL: https://github.com/apache/hudi/pull/10389#discussion_r1438465730


##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala:
##########
@@ -272,11 +272,13 @@ class ColumnStatsIndexSupport(spark: SparkSession,
                   // NOTE: This could occur in either of the following cases:
                   //    1. Particular file does not have this particular column (which is indexed by Column Stats Index):
                   //       in this case we're assuming missing column to essentially contain exclusively
-                  //       null values, we set min/max values as null and null-count to be equal to value-count (this
+                  //       null values, we set min/max and null-count values as null (this
                   //       behavior is consistent with reading non-existent columns from Parquet)
+                  //    2. When evaluating a non-null index condition, we also check whether null-count is null:
+                  //       a null null-count means we cannot tell whether the column is empty, so the condition evaluates to true
                   //
                   // This is a way to determine current column's index without explicit iteration (we're adding 3 stats / column)
-                  acc ++= Seq(null, null, valueCount)
+                  acc ++= Seq(null, null, null)
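
The pruning semantics the new comment lines describe can be sketched outside of Hudi. This is an illustrative model only: `ColStats` and `mayKeepFileForNotNull` are hypothetical names, not part of `ColumnStatsIndexSupport`. The point is that for a `col IS NOT NULL` filter, a null null-count must be treated as "unknown", so the file cannot be pruned:

```scala
// Hypothetical sketch; `ColStats` and `mayKeepFileForNotNull` are made-up
// names, not Hudi's actual API.
case class ColStats(min: Option[Any], max: Option[Any],
                    nullCount: Option[Long], valueCount: Long)

// File-pruning check for a `col IS NOT NULL` predicate over one file's stats.
def mayKeepFileForNotNull(stats: ColStats): Boolean = stats.nullCount match {
  // null-count is known: keep the file only if some values are non-null
  case Some(n) => n < stats.valueCount
  // null-count is null (the new encoding): we cannot tell whether the
  // column is empty, so the condition must evaluate to true (keep the file)
  case None => true
}

// Old encoding for a missing column, (null, null, valueCount): file is pruned.
val oldEncoding = ColStats(None, None, Some(10L), valueCount = 10L)
// New encoding, (null, null, null): uncertainty, so the file is retained.
val newEncoding = ColStats(None, None, None, valueCount = 10L)
```

This mirrors why the accumulated triple changes from `Seq(null, null, valueCount)` to `Seq(null, null, null)`: the old encoding asserts the column is provably all-null, while the new one records that we do not know.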

Review Comment:
   > behavior is consistent with reading non-existent columns from Parquet
   
   Did you ever dig into why we followed the Parquet behavior here in the first place?



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
