Ma77Ball opened a new pull request, #5384:
URL: https://github.com/apache/texera/pull/5384

   ### What changes were proposed in this PR?
   - Add `ReservoirSamplingOpExecSpec`, the first dedicated spec for the 
streaming reservoir-sampling executor.
   - Cover `processTuple` buffering (it emits nothing per tuple) and `onFinish` 
returning all input tuples in order when the input size equals k.
   - Cover keeping exactly k unique samples drawn from the input when the input 
exceeds k, determinism of the seeded RNG (with a check that replacement actually 
happens), multi-worker partitioning of k via `equallyPartitionGoal` (k=10 across 
3 workers yields 4,3,3), and `open` resetting state for executor reuse.
   - Characterize the input-size-below-k edge case, where the unfilled 
fixed-size reservoir emits null padding on finish (flagged in the spec as a 
likely bug for a follow-up fix).
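
For readers unfamiliar with the behaviors listed above, here is a minimal, self-contained sketch of a seeded reservoir sampler. It is not the Texera executor: `ReservoirSketch`, its `String` tuples, and the `partitionGoal` helper are hypothetical names made up for illustration, and the real `equallyPartitionGoal` may have a different signature. The sketch only assumes what this description states: tuples are buffered, the reservoir is a fixed-size array, and k is split as evenly as possible across workers.

```scala
import scala.util.Random

// Hypothetical stand-in for the executor; not Texera's actual class or API.
final class ReservoirSketch(k: Int, seed: Long) {
  private var reservoir = new Array[String](k) // unfilled slots stay null
  private var seen = 0
  private var rng = new Random(seed)

  // Reset all state so the same instance can be reused.
  def open(): Unit = {
    reservoir = new Array[String](k)
    seen = 0
    rng = new Random(seed)
  }

  // Buffer the tuple; nothing is emitted per tuple.
  def processTuple(tuple: String): Iterator[String] = {
    if (seen < k) {
      reservoir(seen) = tuple // fill phase keeps input order
    } else {
      val j = rng.nextInt(seen + 1) // uniform index in [0, seen]
      if (j < k) reservoir(j) = tuple // replace an existing sample
    }
    seen += 1
    Iterator.empty
  }

  // Emits the whole fixed-size array, so inputs shorter than k
  // come out padded with null.
  def onFinish(): Iterator[String] = reservoir.iterator
}

object ReservoirSketch {
  // Assumed splitting rule: the first (goal % workers) workers get one extra.
  def partitionGoal(goal: Int, workers: Int): Seq[Int] =
    (0 until workers).map(i => goal / workers + (if (i < goal % workers) 1 else 0))
}
```

With this sketch, feeding a single tuple `"a"` into `new ReservoirSketch(3, 42L)` and calling `onFinish()` yields `"a"` followed by two nulls, which mirrors the null-padding case the spec characterizes, and `ReservoirSketch.partitionGoal(10, 3)` gives 4, 3, 3.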
   ### Any related issues, documentation, or discussions?
   Closes: #5383
   ### How was this PR tested?
   - Run `sbt "WorkflowOperator/testOnly *ReservoirSamplingOpExecSpec"` and 
expect all 7 examples to pass.
   - This is a test-only PR (no production code changed), so the spec itself is 
the verification; the input-below-k example intentionally asserts the current 
null-padding behavior, so a green run confirms that characterization rather 
than a fix.
   - Local caveat: a full local run is blocked by JaCoCo 0.8.11 failing to 
instrument `JsonSchemaDraft.class` under JDK 25, so locally the spec was only 
verified to compile via `WorkflowOperator/Test/compile`; it runs on CI's 
supported JDK.
   ### Was this PR authored or co-authored using generative AI tooling?
   Co-authored with Claude Opus 4.8 in compliance with ASF


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
