Ma77Ball opened a new issue, #5383:
URL: https://github.com/apache/texera/issues/5383

   ### Task Summary
   
   
`common/workflow-operator/src/main/scala/org/apache/texera/amber/operator/reservoirsampling/ReservoirSamplingOpExec.scala`
 has no dedicated spec. Add unit tests for the streaming reservoir-sampling 
executor covering:
   
   - **`processTuple`**: buffers silently and emits nothing per tuple; sampling 
results are only produced on finish.
   - **`onFinish` (input size == k)**: returns all input tuples in order when 
exactly `k` tuples arrive.
   - **`onFinish` (input size > k)**: keeps exactly `k` tuples, all drawn from 
the input stream, with no duplicates and no null padding.
   - **Determinism**: the RNG is seeded (`new Random(workerCount)`), so 
identical input yields identical samples across runs, and the result is not 
simply the first `k` tuples (replacement actually happens).
   - **Multi-worker partitioning**: `k` is split across workers via 
`equallyPartitionGoal` (e.g. `k=10`, `3` workers produces `4,3,3`), and the 
per-worker reservoirs together hold the requested `k`.
   - **`open`**: resets `n` and the reservoir so a reused executor re-samples 
from scratch.
   - **Edge case (input size < k)**: characterizes the current behavior. When 
fewer than `k` tuples arrive, the fixed-size reservoir is never fully filled, 
so `onFinish` emits the buffered tuples followed by `null` padding 
(`reservoir.iterator` over unfilled slots). This is very likely a bug, since 
downstream operators receive `null` tuples; a follow-up fix should drop/filter 
the unfilled slots (e.g. `reservoir.iterator.take(n)`).
   
   ### Task Type
   
   - [x] Testing / QA


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to