majian1998 commented on code in PR #10389:
URL: https://github.com/apache/hudi/pull/10389#discussion_r1444122277
##########
hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/hudi/ColumnStatsIndexSupport.scala:
##########
@@ -272,11 +272,13 @@ class ColumnStatsIndexSupport(spark: SparkSession,
// NOTE: This could occur in either of the following cases:
// 1. Particular file does not have this particular column (which is indexed by Column Stats Index):
// in this case we're assuming missing column to essentially contain exclusively
- // null values, we set min/max values as null and null-count to be equal to value-count (this
+ // null values, we set min/max and null-count values as null (this
// behavior is consistent with reading non-existent columns from Parquet)
+ // 2. When evaluating non-null index conditions, a condition has been added to check if null-count equals null;
+ // this suggests that we are uncertain whether the column is empty or not, and if so, we return True.
//
// This is a way to determine current column's index without explicit iteration (we're adding 3 stats / column)
- acc ++= Seq(null, null, valueCount)
+ acc ++= Seq(null, null, null)
Review Comment:
Old behavior: For a field without column stats, the null count was set to
its value count. The index filter implements non-null checks as
null_count < value_count, so this kept those checks consistent: a column that
could be indexed but had no column stats was treated as empty (all null).
New behavior: For a field without column stats, the null count is set to null.
In addition, non-null checks in the index filter now include an extra
condition: if the null count is null, the non-null check returns true.
Reason for the change: when the index columns are not specified, columns of
unsupported types may still be present in the list of indexed columns while
having no column stats. Under the old behavior, these columns were incorrectly
judged to be empty when in fact they are not.
These are the complete behavioral changes introduced by this patch.
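
To make the difference concrete, here is a minimal sketch in plain Scala. It is not the code from this PR: ColStats, mayContainNonNullOld and mayContainNonNullNew are hypothetical names, and Option[Long] stands in for a nullable null-count column. It only models how the non-null check decides whether a file might hold non-null values.

```scala
// Hypothetical model of one column's stats for one file (not Hudi's classes).
// None stands in for a null null-count, i.e. "no column stats available".
final case class ColStats(nullCount: Option[Long], valueCount: Long)

object NonNullCheckSketch {
  // Old check: null_count < value_count only. A missing null count never passes.
  def mayContainNonNullOld(s: ColStats): Boolean =
    s.nullCount.exists(_ < s.valueCount)

  // New check: a null (unknown) null count passes, so the file is not pruned.
  def mayContainNonNullNew(s: ColStats): Boolean =
    s.nullCount.forall(_ < s.valueCount)

  // Stats written for a column without column stats, before and after the patch.
  def missingStatsOld(valueCount: Long): ColStats = ColStats(Some(valueCount), valueCount)
  def missingStatsNew(valueCount: Long): ColStats = ColStats(None, valueCount)

  def main(args: Array[String]): Unit = {
    // Before: the column looks all-null, so the file would be skipped.
    println(mayContainNonNullOld(missingStatsOld(100L)))
    // After: the null count is unknown, so the file is kept.
    println(mayContainNonNullNew(missingStatsNew(100L)))
  }
}
```

In this model, a column with real stats behaves the same under both checks; only the "no stats" case changes from "treated as empty" to "cannot tell, so keep the file".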