zhli1142015 commented on PR #12097:
URL: https://github.com/apache/gluten/pull/12097#issuecomment-4459250747

   @marin-ma I re-checked the current code paths. `VeloxResizeBatchesExec` can 
address a similar symptom, but I do not think it is an exact replacement for 
this reader-side merge. In our internal evaluations, we have seen cases where 
the cost of `VeloxResizeBatchesExec` is higher than its benefit and causes 
regressions, so we have not been relying on it for those workloads.
   
   The issue we are trying to address here is not really a shuffle regression 
by itself. After #10499, the hash shuffle reader can emit one `RowVector` / 
`ColumnarBatch` per payload. When the shuffle output contains many small 
payloads, the downstream operators after shuffle have to consume many small 
vectors, and that is where the overhead shows up. Restoring coalescing inside 
the hash shuffle reader reduces this fragmentation before those vectors reach 
the downstream operators.
   
   I also re-checked the side-effect risk in this PR. The merge is local to the 
hash shuffle reader, capped by `batchSize`, flushed at Spark shuffle stream 
boundaries, and skipped for dictionary / complex payloads. If the next payload 
cannot be merged, the code carries it over instead of crossing the size limit. 
So the behavior remains conservative and should not change ordering or mix data 
across streams.
   
   `VeloxResizeBatchesExec` can still be useful when it is enabled, especially 
because it can combine across stream boundaries, but this PR does not remove or 
block that path. These two mechanisms can coexist: the reader-side fast path 
handles the common plain hash-shuffle payload fragmentation early, and 
`VeloxResizeBatchesExec` can still do additional resizing afterwards if a 
deployment chooses to enable it. So I do not see this as an either-or choice; 
keeping this targeted reader-side coalescing looks reasonable to me.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to