cccs-jc opened a new issue, #14995:
URL: https://github.com/apache/iceberg/issues/14995
### Apache Iceberg version
1.9.1
### Query engine
Spark
### Please describe the bug 🐞
## Description
There appears to be an issue in Apache Iceberg's data file pruning when
using the SQL `LIKE` operator containing escaped underscore (`_`) characters.
Unlike when using the Spark `startswith` function or simpler `LIKE` patterns
without underscores, the presence of escaped special characters in the `LIKE`
expression leads to improper query optimizations. As a result, all data files
are scanned instead of pruning irrelevant files.
The appropriate file pruning occurs for queries using the Spark `startswith`
function or `LIKE` expressions without escaped underscores but fails when
underscores are escaped manually (with `\_`) or using an explicit escape
character.
---
## Steps to Reproduce
1. Create an Iceberg table (e.g., `prod_catalog.admin.inventory`) with a
`TRUNCATE(128)` transformation on the `file_path` column:
```sql
ALTER TABLE prod_catalog.admin.inventory SET TBLPROPERTIES
('write.metadata.metrics.column.file_path'='truncate(128)');
```
2. Run the following query using the Spark `startswith` operator:
```sql
SELECT count(*)
FROM prod_catalog.admin.inventory
WHERE
loaded_at = DATE('2026-01-05') AND
startswith(file_path, 'warehouse/iceberg/good_facts/fact1_table/');
```
- **Execution Plan Observation:** File pruning is successful. Only 11
data files are scanned (`number of result data files: 11`).
3. Replace `startswith` with the SQL `LIKE` operator:
```sql
SELECT count(*)
FROM prod_catalog.admin.inventory
WHERE
loaded_at = DATE('2026-01-05') AND
file_path LIKE 'warehouse/iceberg/good_facts/fact1_table/%';
```
- **Execution Plan Observation:** File pruning fails, and all data files
are scanned.
4. Escape special characters in the `LIKE` expression:
- Using a backslash:
```sql
SELECT count(*)
FROM prod_catalog.admin.inventory
WHERE
loaded_at = DATE('2026-01-05') AND
file_path LIKE 'warehouse/iceberg/good\_facts/fact1\_table/%';
```
- Using an explicit escape character (`ESCAPE 'X'`):
```sql
SELECT count(*)
FROM prod_catalog.admin.inventory
WHERE
loaded_at = DATE('2026-01-05') AND
file_path LIKE 'warehouse/iceberg/goodX_facts/fact1X_table/%'
ESCAPE 'X';
```
- **Execution Plan Observation:** In both cases, file pruning still
fails, and all data files are scanned.
5. Finally, remove the underscore to see what happens:
```sql
SELECT count(*)
FROM prod_catalog.admin.inventory
WHERE
loaded_at = DATE('2026-01-05') AND
file_path LIKE 'warehouse/iceberg/goodfacts/fact1table/%';
```
- **Execution Plan Observation:** File pruning works correctly. The
`LIKE` operator is converted into `StartsWith` during the Filter phase, and
only relevant data files are scanned.
---
## Expected Behavior
- The presence of escaped underscores in the `LIKE` operator (via `\_` or
`ESCAPE`) should not prevent Iceberg from optimizing data file pruning. Iceberg
should recognize the `LIKE` pattern and convert it to an equivalent
`StartsWith` expression for efficient pruning.
---
## Observed Behavior
- Iceberg does not optimize file pruning when the `LIKE` operator includes
escaped special characters like underscores (`_`), leading to all files being
scanned. This behavior is inconsistent with both `startswith` and simpler
`LIKE` patterns that do not contain escaped characters.
---
## Additional Information
- When the pattern contains no underscores, Iceberg correctly converts the
`LIKE` expression to a `StartsWith` during the Filter phase.
- Example of an unoptimized filter condition:
```
(3) Filter [codegen id : 1]
Input [2]: [loaded_at#811, file_path#817]
Condition : ((isnotnull(file_path#817) AND (loaded_at#811 = 2026-01-05)) AND
file_path#817 LIKE warehouse/iceberg/goodX_facts/fact1X_table/ ESCAPE 'X')
```
- Example of an optimized filter condition when underscores are removed:
```
(3) Filter [codegen id : 1]
Input [2]: [loaded_at#844, file_path#850]
Condition : ((isnotnull(file_path#850) AND (loaded_at#844 = 2026-01-05)) AND
StartsWith(file_path#850, warehouse/iceberg/goodfacts/fact1table/))
```
### Willingness to contribute
- [ ] I can contribute a fix for this bug independently
- [ ] I would be willing to contribute a fix for this bug with guidance from
the Iceberg community
- [x] I cannot contribute a fix for this bug at this time
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]