linliu-code opened a new issue, #18791:
URL: https://github.com/apache/hudi/issues/18791
## Describe the problem you faced
When Hudi's data-skipping (column-stats index) is enabled, a `LIKE
'prefix%'` predicate (Spark Catalyst `StartsWith`) silently drops rows in
`1.1.x` and `master`. The same query worked correctly in `0.15.0` and
`0.15.1-rc1`, so this is a regression introduced in the 1.x line.
Root cause is in the predicate translation: for `StartsWith(col, 'X')` Hudi
generates `colMin <= 'X' AND 'X' <= colMax`, which only matches files where the
**single-character literal** `'X'` happens to fall lexicographically inside
`[min, max]`. For any file that contains multi-character values starting with
`'X'`, the min is *greater* than `'X'` (because `'X_anything'.compareTo('X') >
0`), so the file is pruned even though it contains matching rows.
This is silent data loss at query time — no error, no warning, just an empty
result set.
## To Reproduce
Single-file pyspark script — no Docker required.
```bash
export HUDI_BUNDLE=/path/to/hudi-spark3.4-bundle_2.12-1.1.1.jar
spark-submit \
  --master 'local[2]' \
  --jars "$HUDI_BUNDLE" \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.kryo.registrator=org.apache.spark.HoodieSparkKryoRegistrar \
  --conf spark.sql.extensions=org.apache.spark.sql.hudi.HoodieSparkSessionExtension \
  --conf spark.sql.catalog.spark_catalog=org.apache.spark.sql.hudi.catalog.HoodieCatalog \
  repro.py
```
```python
"""Reproduce StartsWith translation bug in Hudi data-skipping.
The table has NO NaN, NO null, NO truncated values, NO schema evolution.
Three files, each with 10 string values starting with a single distinct
character.
Query: LIKE 'a%' — should match the 10 rows in file 0.
"""
import os, tempfile
from pyspark.sql import SparkSession
from pyspark.sql.types import StructType, StructField, IntegerType, StringType
ROOT = tempfile.mkdtemp(prefix="hudi_startswith_")
spark = (SparkSession.builder.appName("repro")
         .config("spark.sql.shuffle.partitions", "1").getOrCreate())
spark.sparkContext.setLogLevel("WARN")
schema = StructType([
    StructField("rk", IntegerType(), False),
    StructField("p", StringType(), False),
    StructField("s_str", StringType(), True),
])
opts = {
    "hoodie.table.name": "startswith_repro",
    "hoodie.datasource.write.recordkey.field": "rk",
    "hoodie.datasource.write.partitionpath.field": "p",
    "hoodie.datasource.write.precombine.field": "rk",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.parquet.small.file.limit": "0",
    "hoodie.metadata.enable": "true",
    "hoodie.metadata.index.column.stats.enable": "true",
    "hoodie.metadata.index.column.stats.column.list": "s_str",
}
# 3 files in 1 partition. NO NaN / NO null. Every value starts with 'a' / 'b' / 'c'.
files = [
    [(10000+k, "P", "a_" + format(k, "02d")) for k in range(10)],
    [(20000+k, "P", "b_" + format(k, "02d")) for k in range(10)],
    [(30000+k, "P", "c_" + format(k, "02d")) for k in range(10)],
]
# One write per list, so each list lands in its own base file.
for i, rows in enumerate(files):
    (spark.createDataFrame(rows, schema).write.format("hudi").options(**opts)
        .mode("overwrite" if i == 0 else "append").save(ROOT))
df_on = spark.read.format("hudi").option("hoodie.enable.data.skipping", "true").load(ROOT)
df_off = spark.read.format("hudi").option("hoodie.enable.data.skipping", "false").load(ROOT)
on1 = df_on.where("s_str LIKE 'a%'").count()
off1 = df_off.where("s_str LIKE 'a%'").count()
on2 = df_on.where("s_str = 'a_00'").count()
off2 = df_off.where("s_str = 'a_00'").count()
print(f"\n s_str LIKE 'a%' ON={on1} OFF={off1} (expected 10)")
print(f" s_str = 'a_00' ON={on2} OFF={off2} (expected 1)")
spark.stop()
```
## Expected behavior
With `hoodie.enable.data.skipping=true`, `LIKE 'a%'` should never return
fewer rows than with `=false`. Data-skipping is a transparent performance
optimization — it must never change query results.
```
s_str LIKE 'a%' ON=10 OFF=10 (expected 10)
s_str = 'a_00' ON=1 OFF=1 (expected 1)
```
## Actual behavior — silent zero-row result on 1.1.x and master
Against `hudi-spark3.4-bundle_2.12-1.1.1.jar`:
```
s_str LIKE 'a%' ON=0 OFF=10 (expected 10) <<< BUG (silent wrong result)
s_str = 'a_00' ON=1 OFF=1 (expected 1)
```
Equality (`= 'a_00'`) works correctly. Only the prefix-match `LIKE 'a%'` is
broken. The same script returns the correct `ON=10` against `0.15.0` and
`0.15.1-rc1`; `1.1.0`-class bundles I have not yet verified. See the
Cross-version Matrix.
## Cross-version Matrix
Same script, same Spark 3.4.3, only swapping `--jars`:
| Bundle | `LIKE 'a%'` ON | `LIKE 'a%'` OFF | Verdict |
|---|---|---|---|
| `hudi-spark3.4-bundle_2.12-0.15.0.jar` | **10** | 10 | works ✓ |
| `hudi-spark3.4-bundle_2.12-0.15.1-rc1.jar` | **10** | 10 | works ✓ |
| `hudi-spark3.4-bundle_2.12-1.1.1.jar` | **0** | 10 | **reproduces silent wrong result** |
| `master HEAD` (1.3.0-SNAPSHOT) | **0** | 10 | **reproduces** |
So this is a **regression in 1.x**, not a long-standing latent bug. Earlier
0.x releases returned correct results for the same query.
## Root cause
In `hudi-spark-datasource/hudi-spark-common/src/main/scala/org/apache/spark/sql/hudi/DataSkippingUtils.scala`,
around line 325 (master HEAD), the `StartsWith` case translates the predicate
using `genColumnValuesEqualToExpression`, the same helper used for `EqualTo`:
```scala
// Filter "colA like 'xxx%'"
// Translates to "colA_minValue <= xxx AND xxx <= colA_maxValue" for index lookup
//
// NOTE: Since a) this operator matches strings by prefix and b) given that this column is going to be ordered
//       lexicographically, we essentially need to check that provided literal falls w/in min/max bounds of the
//       given column
case StartsWith(sourceExpr @ AllowedTransformationExpression(attrRef), v @ Literal(_: UTF8String, _)) =>
  getTargetIndexedColumnName(attrRef, indexedCols)
    .map { colName =>
      val targetExprBuilder: Expression => Expression = swapAttributeRefInExpr(sourceExpr, attrRef, _)
      // produces colMin <= V AND V <= colMax
      genColumnValuesEqualToExpression(colName, v, targetExprBuilder)
    }.orElse(Option.empty)
```
Which produces:
```
colMin <= 'a' AND 'a' <= colMax
```
That checks whether **the prefix literal itself** is in `[min, max]`. But
for prefix matching, the file matches if **any value** in the column starts
with the prefix. A file with `min='a_00'` and `max='a_09'` clearly matches
`LIKE 'a%'`, but:
- `min='a_00' <= 'a'` → `FALSE` (because `'a_00' > 'a'` lexicographically —
sharing the `'a'` prefix and extending further)
- The `AND` evaluates to false → file pruned
The translation keeps a file only when the prefix literal itself falls inside
`[min, max]`, for example when a stored value equals the prefix exactly (a
single-character prefix matching single-character values) or when the file also
holds values that sort before the prefix. For a file whose values all extend
the prefix, it produces silently-wrong results.
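The pruning decision can be replayed outside Spark. The sketch below is plain Python, not Hudi code: it assumes the column-stats index holds the untruncated min/max of each file written by the repro script, and it relies on Python comparing strings by code point, which for these ASCII values gives the same ordering. The names `current_translation_keeps` and `kept_files` are hypothetical.

```python
# Min/max of s_str per file, derived from the repro data (assumed untruncated).
file_stats = {
    "file0": ("a_00", "a_09"),
    "file1": ("b_00", "b_09"),
    "file2": ("c_00", "c_09"),
}


def current_translation_keeps(col_min, col_max, literal):
    """Hypothetical stand-in for the generated predicate: colMin <= V AND V <= colMax."""
    return col_min <= literal <= col_max


def kept_files(literal):
    """Names of the files that survive pruning for the given literal."""
    return [name for name, (lo, hi) in file_stats.items()
            if current_translation_keeps(lo, hi, literal)]


print(kept_files("a"))     # literal taken from LIKE 'a%'
print(kept_files("a_00"))  # literal taken from = 'a_00'
```

Under these assumptions no file survives for the prefix literal `'a'`, while `file0` survives for the equality literal `'a_00'`, which is consistent with the `ON=0` and `ON=1` counts reported above.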
## Suggested fix
For prefix `P`, a file with sorted range `[min, max]` may contain values
starting with `P` iff its range overlaps `[P, successor(P))`. That is:
```
max >= P AND min < successor(P)
```
where `successor(P)` is `P` with its last code point incremented (with carry
into preceding characters if the last is the max code point). For a single
ASCII letter `'a'`, `successor('a') = 'b'`.
A conservative simplification that's always safe (no false pruning) but
prunes less aggressively:
```
max >= P
```
This loses pruning on files whose `min` is greater than or equal to
`successor(P)` (i.e. files whose values all lie lexicographically beyond the
prefix range). But it never wrongly prunes.
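To make the trade-off concrete, here is a small comparison of the two forms on the min/max ranges from the repro. It is an illustrative sketch in plain Python, not Hudi code: the function names are hypothetical, the stats are assumed untruncated, and `successor('a')` is passed in by hand as `'b'`.

```python
# Min/max of s_str per file, derived from the repro data (assumed untruncated).
file_stats = {
    "file0": ("a_00", "a_09"),
    "file1": ("b_00", "b_09"),
    "file2": ("c_00", "c_09"),
}


def full_form_keeps(col_min, col_max, prefix, upper):
    """max >= P AND min < successor(P); `upper` is successor(P), supplied by the caller."""
    return col_max >= prefix and col_min < upper


def conservative_keeps(col_max, prefix):
    """max >= P only: never prunes a matching file, but prunes less."""
    return col_max >= prefix


full = [name for name, (lo, hi) in file_stats.items()
        if full_form_keeps(lo, hi, "a", "b")]
conservative = [name for name, (lo, hi) in file_stats.items()
                if conservative_keeps(hi, "a")]

print(full)          # files kept by the full form for LIKE 'a%'
print(conservative)  # files kept by the conservative form for LIKE 'a%'
```

For `LIKE 'a%'` the full form keeps only `file0`, while the conservative form keeps all three files because every `max` is at least `'a'`. Both retain the file that actually holds the matching rows.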
A correct full implementation would compute `successor(P)` for arbitrary
UTF-8 strings, handling the case where the last code point is `0x10FFFF` (the
maximum Unicode code point) by carrying into the previous character — or fall
back to the conservative `max >= P` form when overflow occurs.
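A minimal sketch of `successor(P)` with carry and the overflow fallback is shown below. It is written in Python over code points purely for illustration; `successor` and `may_contain_prefix` are hypothetical names, and a real fix in `DataSkippingUtils` would have to use whatever ordering the column stats were computed with. Skipping the surrogate range `U+D800..U+DFFF` is an added assumption so that the bound stays a valid Unicode string.

```python
MAX_CODE_POINT = 0x10FFFF


def successor(prefix):
    """Smallest string greater than every string starting with `prefix`.

    Returns None when no such bound exists (empty prefix, or every
    code point is already the maximum).
    """
    chars = list(prefix)
    while chars:
        cp = ord(chars[-1])
        if cp < MAX_CODE_POINT:
            nxt = cp + 1
            if 0xD800 <= nxt <= 0xDFFF:
                nxt = 0xE000  # skip surrogates, which are not valid scalar values
            chars[-1] = chr(nxt)
            return "".join(chars)
        chars.pop()  # last code point is the maximum: carry into the previous one
    return None


def may_contain_prefix(col_min, col_max, prefix):
    """True if a file with range [col_min, col_max] may hold values starting with prefix."""
    upper = successor(prefix)
    if upper is None:
        return col_max >= prefix  # overflow: fall back to the conservative form
    return col_max >= prefix and col_min < upper


print(successor("a"))
print(may_contain_prefix("a_00", "a_09", "a"))
print(may_contain_prefix("b_00", "b_09", "a"))
```

Dropping a trailing maximum code point and incrementing the one before it is what "carry" means here: every string starting with `'a\U0010FFFF'` still sorts below `'b'`.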
The same translation issue applies to `Not(StartsWith(...))` (around line
338 of the same file): the corrected inversion should likewise reason about
prefix ranges, not the literal as a value.
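One way to reason about the negated case, offered as a sketch rather than a description of the existing code: the strings starting with `P` form one contiguous range, so if both `min` and `max` start with `P`, every value in between does too. Only then can `NOT LIKE 'P%'` skip the file. The function name below is hypothetical, and exact (untruncated) min/max stats are assumed.

```python
def not_prefix_may_match(col_min, col_max, prefix):
    """True if a file may hold at least one value that does NOT start with prefix.

    If min and max both start with the prefix, every value between them does as
    well, so the file cannot contribute rows to NOT LIKE 'prefix%'.
    """
    return not (col_min.startswith(prefix) and col_max.startswith(prefix))


print(not_prefix_may_match("a_00", "a_09", "a"))  # all values start with 'a'
print(not_prefix_may_match("a_00", "b_09", "a"))  # range extends past the prefix
print(not_prefix_may_match("b_00", "b_09", "a"))  # no value starts with 'a'
```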
## Environment Description
- Hudi version: **1.1.1** (current GA from Maven Central;
`hudi-spark3.4-bundle_2.12-1.1.1.jar`). Reproduces identically on master HEAD.
- Spark version: 3.4.3
- Hadoop version: 3 (bundled Spark distribution)
- Storage: local FS — bug is in the predicate translator and is
storage-independent
- Running on Docker?: optional
## Additional context
- This is a different bug from #18754 (NaN col-stats corruption). The two
were initially observed together but are independent: this bug reproduces with
**zero NaN** values, with **no nulls**, and with **no truncated strings**. The
values are well-behaved, short, ASCII strings.
- Equality predicates (`col = 'X'`) work correctly because for an exact
match the literal IS expected to be in `[min, max]`.
- Range predicates (`col > 'X'`) work correctly because they use only a
one-sided bound (`colMax > 'X'`, or `colMin < 'X'` for `col < 'X'`) without the
bracketing.
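For contrast, the min/max checks for equality and range predicates can be sketched as follows. This is illustrative Python with hypothetical function names, not the Hudi helpers themselves. The equality and `>` forms follow the translations described in this issue; the `<` form is the mirrored bound and is an assumption here.

```python
def equal_keeps(col_min, col_max, v):
    """col = V: the literal must lie inside [min, max]."""
    return col_min <= v <= col_max


def greater_than_keeps(col_max, v):
    """col > V: some value exceeds V only if the max does."""
    return col_max > v


def less_than_keeps(col_min, v):
    """col < V: some value is below V only if the min is."""
    return col_min < v


# file0 from the repro has min='a_00' and max='a_09' (assumed untruncated).
print(equal_keeps("a_00", "a_09", "a_00"))
print(greater_than_keeps("a_09", "a"))
print(less_than_keeps("a_00", "a"))
```

In each case the literal is compared as a value against one or both bounds, which is sound for these operators. The prefix literal of `LIKE 'P%'` is not a value the column is required to contain, which is why reusing the equality bracket fails.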
## Workaround available today
Disable data-skipping at query time:
```python
spark.read.format("hudi").option("hoodie.enable.data.skipping","false").load(path)
```
This defeats the purpose of the col-stats feature but guarantees correctness
for `LIKE 'X%'` queries.
## Stacktrace
n/a — silent wrong result, no exception, no warning, no log line.