Ma77Ball opened a new issue, #5383: URL: https://github.com/apache/texera/issues/5383
### Task Summary `common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/reservoirsampling/ReservoirSamplingOpExec.scala` has no dedicated spec. Add unit tests for the streaming reservoir-sampling executor covering: - **`processTuple`**: buffers silently and emits nothing per tuple; sampling results are only produced on finish. - **`onFinish` (input size == k)**: returns all input tuples in order when exactly `k` tuples arrive. - **`onFinish` (input size > k)**: keeps exactly `k` tuples, all drawn from the input stream, with no duplicates and no null padding. - **Determinism**: the RNG is seeded (`new Random(workerCount)`), so identical input yields identical samples across runs, and the result is not simply the first `k` tuples (replacement actually happens). - **Multi-worker partitioning**: `k` is split across workers via `equallyPartitionGoal` (e.g. `k=10`, `3` workers produces `4,3,3`), and the per-worker reservoirs together hold the requested `k`. - **`open`**: resets `n` and the reservoir so a reused executor re-samples from scratch. - **Edge case (input size < k)**: characterizes the current behavior. When fewer than `k` tuples arrive, the fixed-size reservoir is never fully filled, so `onFinish` emits the buffered tuples followed by `null` padding (`reservoir.iterator` over unfilled slots). This is very likely a bug, since downstream operators receive `null` tuples; a follow-up fix should drop/filter the unfilled slots (e.g. `reservoir.iterator.take(n)`). ### Task Type - [x] Testing / QA -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected]
