FelixYBW commented on PR #6009: URL: https://github.com/apache/incubator-gluten/pull/6009#issuecomment-2159745003
> around shuffle split processing. We may want to figure it out later to avoid doing such batch coalesce operations that intro

That is caused by the initialization in the current split function. We currently split with three nested loops (per column, per reducer, per row). As long as the column data stays in cache, this is the best way to scale with the number of reducers. To make it work, however, we have to do a lot of initialization to create several vectors. If the input batch is small, that initialization overhead dominates, and it can be even larger than the cost of copying into bigger batches. Another issue is that if the data is too large and exceeds the cache size, performance will be very poor.
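
For illustration, here is a minimal sketch of the loop order described above (per column, then per reducer, then per row). It is written in Python for brevity and is not Gluten's actual split code. The names `build_row_index` and `split_batch`, and the dict-of-lists batch layout, are assumptions made up for this example.

```python
from typing import Any, Dict, List, Sequence


def build_row_index(partition_ids: Sequence[int], num_reducers: int) -> List[List[int]]:
    """Initialization: group row numbers by target reducer.

    One list is created per reducer even if the batch has only a few rows,
    which is the fixed setup cost that hurts small input batches.
    """
    rows_of: List[List[int]] = [[] for _ in range(num_reducers)]
    for row, pid in enumerate(partition_ids):
        rows_of[pid].append(row)
    return rows_of


def split_batch(
    columns: Dict[str, List[Any]],
    partition_ids: Sequence[int],
    num_reducers: int,
) -> List[Dict[str, List[Any]]]:
    """Split a columnar batch into one columnar batch per reducer."""
    rows_of = build_row_index(partition_ids, num_reducers)
    # More initialization: one output buffer per (reducer, column) pair.
    out: List[Dict[str, List[Any]]] = [
        {name: [] for name in columns} for _ in range(num_reducers)
    ]
    for name, values in columns.items():      # loop 1: per column
        for reducer in range(num_reducers):   # loop 2: per reducer
            dst = out[reducer][name]
            for row in rows_of[reducer]:      # loop 3: per row
                dst.append(values[row])
    return out


batch = {"id": [10, 11, 12, 13], "name": ["a", "b", "c", "d"]}
parts = split_batch(batch, partition_ids=[1, 0, 1, 0], num_reducers=2)
```

In this sketch one source column is read repeatedly by the inner two loops before moving on to the next column, so it only performs well while that column fits in cache. The setup work in `build_row_index` and the `out` allocation grows with the number of reducers and columns, not with the number of rows, which is why it weighs so heavily on small batches.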
