Livia Zhu created SPARK-50492:
---------------------------------
Summary: Fix java.util.NoSuchElementException when watermark
column is dropped after dropDuplicatesWithinWatermark
Key: SPARK-50492
URL: https://issues.apache.org/jira/browse/SPARK-50492
Project: Spark
Issue Type: Task
Components: Structured Streaming
Affects Versions: 4.0.0
Reporter: Livia Zhu
Consider the following query:
```
val result = inputData.toDF()
  .select("_1", "_2")
  .withColumn("timestamp", to_timestamp($"_2", "yyyy-MM-dd HH:mm:ss"))
  .withWatermark("timestamp", "24 hours")
  .dropDuplicatesWithinWatermark("timestamp")
  .select("_1")
```
Currently, the ColumnPruning optimization prunes the `timestamp` column
because the final Project does not select it. DeduplicateWithinWatermarkExec
then throws a `java.util.NoSuchElementException` when it tries to look up the
event time column.
We need to update the references for the DeduplicateWithinWatermark logical
plan node so that the event time column is included in the references.
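
The pruning behaviour described above can be illustrated with a toy model.
This is only a minimal sketch and does not use Spark's actual classes: `Plan`,
`Source`, `Project`, `DedupWithinWatermark`, `PruningSketch` and the
`includeEventTime` flag are all hypothetical names, and the deduplication key
`_1` is chosen purely for illustration. It assumes that a pruning rule keeps a
child column only if some ancestor node lists it in its references.

```scala
// Toy logical plan: each node reports the columns it produces and the
// columns it needs from its child. These are NOT Spark's Catalyst classes.
sealed trait Plan {
  def output: Set[String]
  def references: Set[String]
}

final case class Source(output: Set[String]) extends Plan {
  def references: Set[String] = Set.empty
}

final case class Project(columns: Set[String], child: Plan) extends Plan {
  def output: Set[String] = columns
  def references: Set[String] = columns
}

// Hypothetical stand-in for the deduplication node. The flag switches between
// "references = keys only" and "references = keys plus the event time column".
final case class DedupWithinWatermark(
    keys: Set[String],
    eventTime: String,
    includeEventTime: Boolean,
    child: Plan) extends Plan {
  def output: Set[String] = child.output
  def references: Set[String] =
    if (includeEventTime) keys + eventTime else keys
}

object PruningSketch {
  // Simplified column pruning: a child keeps only the columns required by
  // its parent plus the columns the current node itself references.
  def prune(plan: Plan, required: Set[String]): Plan = plan match {
    case Source(out) => Source(out.intersect(required))
    case Project(cols, child) =>
      val kept = cols.intersect(required)
      Project(kept, prune(child, kept))
    case d: DedupWithinWatermark =>
      d.copy(child = prune(d.child, required ++ d.references))
  }

  // Builds Source -> DedupWithinWatermark -> Project("_1"), prunes it, and
  // looks up the event time column in the deduplication node's input.
  def eventTimeAfterPruning(includeEventTime: Boolean): Option[String] = {
    val source = Source(Set("_1", "_2", "timestamp"))
    val dedup = DedupWithinWatermark(Set("_1"), "timestamp", includeEventTime, source)
    val query = Project(Set("_1"), dedup)
    prune(query, query.output) match {
      case Project(_, d: DedupWithinWatermark) =>
        d.child.output.find(_ == d.eventTime)
      case _ => None
    }
  }

  def main(args: Array[String]): Unit = {
    println(eventTimeAfterPruning(includeEventTime = false)) // None
    println(eventTimeAfterPruning(includeEventTime = true))  // Some(timestamp)
  }
}
```

In this sketch, calling `.get` on the `None` result would raise a
`java.util.NoSuchElementException`. Whether the physical operator fails
through exactly that path is an assumption here, but it shows why adding the
event time column to the node's references keeps it from being pruned.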