[
https://issues.apache.org/jira/browse/SPARK-53684?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Kazuyuki Tanimura updated SPARK-53684:
--------------------------------------
Description:
The Spark UI shows the following excerpt of the physical plan:
(5) Filter [codegen id : 1]
Input [16]: [a#241, b#243L, c#248, d#250, e#251, f#252, g#253L, h#258, i#259,
j#272, k#277, l#286, m#326, n#388, o#394, p#404]
Condition : ((((NOT b#243L IN (0,-1) AND CASE WHEN isnull(i#259) THEN false
WHEN (i#259 = 0) THEN false WHEN (i#259 = 1) THEN false WHEN (i#259 = 2) THEN
true ELSE false END) AND (isnull(o#394) OR ((NOT Contains(o#394, TAG4) AND NOT
Contains(o#394, TAG3)) AND NOT Contains(o#394, TAG2)))) AND (isnull(p#404) OR
(((NOT Contains(p#404, TAG4) AND NOT Contains(p#404, TAG3)) AND NOT
Contains(p#404, TAG2)) AND NOT Contains(p#404, TAG1)))) AND
(date_format(gettimestamp(date_format(gettimestamp(date_format(cast(n#388 as
timestamp), yyyy-MM-dd, Some(UTC)), yyyy-MM-dd, TimestampType, Some(UTC),
false), yyyy-MM-dd, Some(UTC)), yyyy-MM-dd, TimestampType, Some(UTC), false),
yyyy-MM-dd, Some(UTC)) >= 2025-08-26))
The last part of the filter should be reducible further:
date_format(gettimestamp(date_format(gettimestamp(date_format(cast(n#388 as
timestamp), yyyy-MM-dd, Some(UTC)), yyyy-MM-dd, TimestampType, Some(UTC),
false), yyyy-MM-dd, Some(UTC)), yyyy-MM-dd, TimestampType, Some(UTC), false),
yyyy-MM-dd, Some(UTC)) >= 2025-08-26
-> date_format(cast(n#388 as timestamp), yyyy-MM-dd, Some(UTC)) >= 2025-08-26
The simplification happens for Parquet data, but not for Iceberg.
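To illustrate why the nested expression should collapse to a single date_format call, here is a minimal sketch in plain Python. It is not Spark code: the helpers date_format and get_timestamp are hypothetical stand-ins for Spark's date_format and gettimestamp, and it assumes the same yyyy-MM-dd pattern and UTC zone at every level, as in the plan above. Under those assumptions, formatting, parsing, and formatting again returns the same string, so the repeated round-trips add nothing.

```python
from datetime import datetime, timezone

# Assumption: "%Y-%m-%d" plays the role of Spark's yyyy-MM-dd pattern.
FMT = "%Y-%m-%d"


def date_format(ts: datetime) -> str:
    """Stand-in for date_format(ts, yyyy-MM-dd, Some(UTC))."""
    return ts.astimezone(timezone.utc).strftime(FMT)


def get_timestamp(s: str) -> datetime:
    """Stand-in for gettimestamp(s, yyyy-MM-dd, TimestampType, Some(UTC), false)."""
    return datetime.strptime(s, FMT).replace(tzinfo=timezone.utc)


def nested(ts: datetime) -> str:
    # Mirrors the shape of the expression in the reported physical plan.
    return date_format(get_timestamp(date_format(get_timestamp(date_format(ts)))))


def simplified(ts: datetime) -> str:
    # The form the report expects after simplification.
    return date_format(ts)


samples = [
    datetime(2025, 8, 25, 23, 59, 59, tzinfo=timezone.utc),
    datetime(2025, 8, 26, 0, 0, 0, tzinfo=timezone.utc),
    datetime(2025, 9, 1, 12, 30, tzinfo=timezone.utc),
]

# Each entry: (nested result, simplified result, filter outcome).
results = [
    (nested(ts), simplified(ts), simplified(ts) >= "2025-08-26")
    for ts in samples
]
```

For every sample the nested and simplified forms agree, so the filter outcome is the same either way. Whether Spark can apply this rewrite in all cases (for example, when parsing could yield null) is not established here.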
was:
According to Spark UI, the following excerpt of the physical plan is shown:
(5) Filter [codegen id : 1]
Input [16]: [a#241, b#243L, c#248, d#250, e#251, f#252, g#253L, h#258, i#259,
j#272, k#277, l#286, m#326, n#388, o#394, p#404]
Condition : ((((NOT b#243L IN (0,-1) AND CASE WHEN isnull(i#259) THEN false
WHEN (i#259 = 0) THEN false WHEN (i#259 = 1) THEN false WHEN (i#259 = 2) THEN
true ELSE false END) AND (isnull(o#394) OR ((NOT Contains(o#394, ML_) AND NOT
Contains(o#394, TDD_)) AND NOT Contains(o#394, POLICY_)))) AND (isnull(p#404)
OR (((NOT Contains(p#404, ML_) AND NOT Contains(p#404, TDD_)) AND NOT
Contains(p#404, POLICY_)) AND NOT Contains(p#404, NEW_USER_3_DAYS)))) AND
(date_format(gettimestamp(date_format(gettimestamp(date_format(cast(n#388 as
timestamp), yyyy-MM-dd, Some(UTC)), yyyy-MM-dd, TimestampType, Some(UTC),
false), yyyy-MM-dd, Some(UTC)), yyyy-MM-dd, TimestampType, Some(UTC), false),
yyyy-MM-dd, Some(UTC)) >= 2025-08-26))
The last part of the filter should be able to be reduced further:
(date_format(gettimestamp(date_format(gettimestamp(date_format(cast(n#388 as
timestamp), yyyy-MM-dd, Some(UTC)), yyyy-MM-dd, TimestampType, Some(UTC),
false), yyyy-MM-dd, Some(UTC)), yyyy-MM-dd, TimestampType, Some(UTC), false),
yyyy-MM-dd, Some(UTC)) >= 2025-08-26
—> date_format(cast(n#388 as timestamp), yyyy-MM-dd, Some(UTC) >= 2025-08-26
The simplification happens for Parquet data, but not for iceberg
> Spark is not simplifying some expressions for Iceberg
> -----------------------------------------------------
>
> Key: SPARK-53684
> URL: https://issues.apache.org/jira/browse/SPARK-53684
> Project: Spark
> Issue Type: Improvement
> Components: SQL
> Affects Versions: 4.1.0, 3.5.7, 4.0.2
> Reporter: Kazuyuki Tanimura
> Priority: Major