The GitHub Actions job "License Binary Checker" on texera.git/release/v1.2 has failed.
Run started by GitHub user xuang7 (triggered by xuang7).

Head commit for run:
b876b475d93c37696b8da0436d7ad3e7263e6317 / Suyash Jain <[email protected]>
fix(workflow-operator): no null padding in reservoir sampling (#5606)

### What changes were proposed in this PR?

`ReservoirSamplingOpExec` allocates a fixed-size reservoir of length
`count` (the per-worker share of `k`). When a worker receives fewer
tuples than `count`, only the first `n` slots are filled, but `onFinish`
returned the whole array, yielding `count - n` trailing `null` entries.
The nulls are currently swallowed by a distant null-guard in
`DataProcessor`, so the bug is latent. Even so, the operator violates
the "do not emit null tuples" contract and would leak nulls downstream
if that guard were ever narrowed or bypassed.

```
Before:  input < k  ->  onFinish emits [t0 .. tn-1, null, ..., null]  (engine guard hides them)
After:   input < k  ->  onFinish emits [t0 .. tn-1]                   (no nulls emitted at all)
```

The fix emits only the filled prefix:

```scala
override def onFinish(port: Int): Iterator[TupleLike] =
  reservoir.iterator.take(n)
```

`take(n)` is a no-op when `n >= count` (the worker received at least
its share of `k`), so the sampled output is unchanged in the normal
case.
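
To make the before/after behavior concrete, here is a minimal, self-contained sketch of a fixed-size reservoir. It is not the actual `ReservoirSamplingOpExec`: the class `ReservoirSketch`, the `String` tuples, the methods `process`, `finishPadded`, and `finishTrimmed`, and the textbook Algorithm R replacement step are all assumptions made for illustration. Only the names `reservoir`, `count`, and `n` come from the description above.

```scala
import scala.util.Random

// Hypothetical stand-in for the operator; not the real Texera class.
final class ReservoirSketch(count: Int, rng: Random = new Random(0)) {
  // Fixed-size array: every slot starts out as null.
  private val reservoir: Array[String] = new Array[String](count)
  // Number of tuples this worker has seen so far.
  private var n: Int = 0

  def process(t: String): Unit = {
    if (n < count) {
      // Fill phase: the first `count` tuples go in arrival order.
      reservoir(n) = t
    } else {
      // Assumed textbook Algorithm R: keep the new tuple with
      // probability count / (n + 1) by overwriting a random slot.
      val j = rng.nextInt(n + 1)
      if (j < count) reservoir(j) = t
    }
    n += 1
  }

  // Old behavior: exposes every slot, including unfilled (null) ones.
  def finishPadded: Iterator[String] = reservoir.iterator

  // Fixed behavior: only the filled prefix; a no-op once n >= count.
  def finishTrimmed: Iterator[String] = reservoir.iterator.take(n)
}

object ReservoirSketchDemo {
  def main(args: Array[String]): Unit = {
    val s = new ReservoirSketch(count = 4)
    Seq("t0", "t1").foreach(s.process)
    println(s.finishPadded.toList)  // List(t0, t1, null, null)
    println(s.finishTrimmed.toList) // List(t0, t1)
  }
}
```

Once the worker has seen at least `count` tuples, both iterators in the sketch yield the same elements, which is why the fix leaves the normal case untouched.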

### Any related issues, documentation, discussions?

Closes #5592

### How was this PR tested?

Added three regression cases to `ReservoirSamplingOpExecSpec`:

| Case | Asserts |
| --- | --- |
| `input size < k` | only the received tuples are emitted, in order, no nulls |
| empty input | `onFinish` emits nothing |
| skewed partitioning (`k=10`, 3 workers, worker 0 gets 2 tuples) | no null padding for an under-filled worker share |

All three fail against the old `reservoir.iterator` and pass with
`reservoir.iterator.take(n)`; the 9 pre-existing cases stay green (TDD
red → green verified by stashing the source fix).
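
The skewed-partitioning case shows that a single worker can be under-filled even when the total input is at least `k`. The sketch below works through that arithmetic. The even split in `shares` (remainder given to the lowest-indexed workers) is a hypothetical rule chosen for illustration and may not match how Texera actually divides `k`. `ShareSketch` and `nullPadding` are likewise illustrative names.

```scala
object ShareSketch {
  // Hypothetical split of k across workers: an even division, with the
  // remainder handed to the lowest-indexed workers.
  def shares(k: Int, workers: Int): Vector[Int] =
    Vector.tabulate(workers)(i => k / workers + (if (i < k % workers) 1 else 0))

  // Number of trailing nulls an un-trimmed reservoir of size `share`
  // would expose after the worker received `received` tuples.
  def nullPadding(share: Int, received: Int): Int =
    math.max(0, share - received)

  def main(args: Array[String]): Unit = {
    val perWorker = shares(k = 10, workers = 3)
    println(perWorker)                                  // Vector(4, 3, 3)
    println(nullPadding(perWorker(0), received = 2))    // 2
  }
}
```

Under this assumed split, worker 0 holds a reservoir of 4 slots but fills only 2, so the un-trimmed iterator would have carried 2 nulls regardless of how many tuples the other workers received.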

```
sbt "WorkflowOperator/testOnly org.apache.texera.amber.operator.reservoirsampling.ReservoirSamplingOpExecSpec"
# Tests: succeeded 12, failed 0, canceled 0, ignored 0, pending 0
```

`sbt WorkflowOperator/scalafixAll` and
`sbt WorkflowOperator/scalafmtAll` produce no further diff.

### Was this PR authored or co-authored using generative AI tooling?

Yes, partially. I (Suyash Jain) worked on this PR together with Claude
Code as a pair-programming assistant. I reviewed the final diff, ran the
spec locally, and verified the red → green behavior of the new
regression tests myself before opening the PR.

Generated-by: Claude Code (Claude Opus 4.7)

(backported from commit d5f5e12fb6879f15dbcf0c9cf6aaae3b532784e6)

Co-authored-by: Xuan Gu <[email protected]>

Report URL: https://github.com/apache/texera/actions/runs/27444261750

With regards,
GitHub Actions via GitBox
