bersprockets opened a new pull request #35635:
URL: https://github.com/apache/spark/pull/35635
### What changes were proposed in this pull request?
Pass an `IndexedSeq` (likely a `Vector`) to
`ExtractWindowExpressions.extract` and `ExtractWindowExpressions.addWindow`
rather than whatever sequence type was specified by the user (in the
`Dataset.select` method).
To accomplish this, we only need to pass an `IndexedSeq` to
`ExtractWindowExpressions.extract`. `ExtractWindowExpressions.extract` will
then return another `IndexedSeq` that we will pass on to
`ExtractWindowExpressions.addWindow`.
### Why are the changes needed?
Consider this query:
```
val df = spark.range(0, 20).map { x =>
  (x % 4, x + 1, x + 2)
}.toDF("a", "b", "c")

import org.apache.spark.sql.expressions._

val w = Window.partitionBy("a").orderBy("b")
val selectExprs = Stream(
  sum("c").over(w.rowsBetween(Window.unboundedPreceding,
    Window.currentRow)).as("sumc"),
  avg("c").over(w.rowsBetween(Window.unboundedPreceding,
    Window.currentRow)).as("avgc")
)
df.select(selectExprs: _*).show(false)
```
It fails with
```
org.apache.spark.sql.AnalysisException: Resolved attribute(s) avgc#23
missing from c#16L,a#14L,b#15L,sumc#21L in operator !Project [c#16L, a#14L,
b#15L, sumc#21L, sumc#21L, avgc#23].;
```
If you change the `Stream` to a `Vector` (or even a `List`), it succeeds.
As with SPARK-38221, this is due to the use of this code pattern:
```
def someMethod(seq: Seq[xxx]) {
  ...
  val outerDataStructure = <create outer data structure>
  val newSeq = seq.map { x =>
    ...
    code that puts something in outerDataStructure
    ...
  }
  ...
  code that uses outerDataStructure (and expects it to be populated)
  ...
}
```
If `seq` is a `Stream`, `seq.map` might be evaluated lazily, in which case
`outerDataStructure` will not be fully populated before it is used.
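A minimal, Spark-free sketch of the pitfall follows. The names `LazyMapPitfall`, `collectSideEffects`, and `seen` are hypothetical and only stand in for the pattern above; they are not taken from the Spark code base.

```scala
import scala.collection.mutable.ArrayBuffer

object LazyMapPitfall {
  // Hypothetical stand-in for the pattern: `seen` plays the role of
  // outerDataStructure and is filled as a side effect of `map`.
  def collectSideEffects(seq: Seq[Int]): (Int, Seq[Int]) = {
    val seen = ArrayBuffer.empty[Int]
    val mapped = seq.map { x =>
      seen += x // side effect on the outer buffer
      x * 2
    }
    // Read the buffer before anything forces `mapped`,
    // just as the pattern reads outerDataStructure after the map.
    val sizeBeforeForcing = seen.size
    (sizeBeforeForcing, mapped)
  }
}
```

With a `Vector` of three elements, the returned size is 3. With a three-element `Stream`, the buffer is still incomplete at that point, because the tail of the mapped stream has not been evaluated yet.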
Both `ExtractWindowExpressions.extract` and
`ExtractWindowExpressions.addWindow` use this pattern, but the above example
failure is due to the pattern's use in `ExtractWindowExpressions.addWindow`
(`extractedWindowExprBuffer` does not get fully populated, so the Window
operator does not produce the output expected by its parent projection).
I chose `IndexedSeq` not for its efficient indexing, but because `map` will
eagerly iterate over it.
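As an illustration of why an `IndexedSeq` avoids the problem, here is a small sketch that converts the input before mapping. It assumes the conversion is done with `toIndexedSeq`; the description above does not show the exact call used in the patch, and `EagerMapFix`, `collectEagerly`, and `seen` are hypothetical names.

```scala
import scala.collection.mutable.ArrayBuffer

object EagerMapFix {
  // Hypothetical helper: same side-effecting map as in the pattern,
  // but the input is first converted to a strict IndexedSeq.
  def collectEagerly(seq: Seq[Int]): (Int, Seq[Int]) = {
    val seen = ArrayBuffer.empty[Int]
    val mapped = seq.toIndexedSeq.map { x =>
      seen += x // runs for every element before `map` returns
      x * 2
    }
    (seen.size, mapped)
  }
}
```

Even when the caller passes a `Stream`, `toIndexedSeq` materializes every element, so the buffer is fully populated by the time it is read.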
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
New unit test.