zhengchenyu commented on PR #1916: URL: https://github.com/apache/incubator-uniffle/pull/1916#issuecomment-2232950095
@zuston Tez and MR have been thoroughly tested on our internal clusters with production jobs. Spark, however, has not yet been tested with enough production jobs, and two issues still need to be resolved for it:

* (1) Spark SQL may still spill to disk. Some aggregators assume that the records passed to them are unsorted, so they may need a large amount of memory and may then spill to disk. In Hive, the aggregator assumes that the records passed to it are sorted, so they can be aggregated in memory. For Spark SQL, we need to ensure that the shuffle is sorted and provide an in-memory aggregator that also supports merge sort.
* (2) Remote merge requires the records within a block to be sorted. Spark's WriteBufferManager does not require sorted records within a block, so WriteBufferManager needs to support sorting the records within a block.

For these two reasons, the Spark-related code will not be submitted for now. Kryo is only used for Spark, so KryoSerializer support will be added when the Spark-related code is submitted.
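To illustrate why points (1) and (2) go together, here is a minimal sketch in Python, not Uniffle or Spark code. The names `sort_block` and `merge_and_aggregate` are hypothetical and only serve the illustration. It assumes each block is a list of `(key, value)` records. Once every block is sorted by key, the blocks can be merged as streams and records with equal keys arrive next to each other, so an aggregator only needs one running value per key at a time instead of a table of all keys.

```python
import heapq
from itertools import groupby
from operator import itemgetter


def sort_block(records):
    """Sort the (key, value) records of one block by key (point 2)."""
    return sorted(records, key=itemgetter(0))


def merge_and_aggregate(sorted_blocks, combine):
    """Merge key-sorted blocks and combine the values of equal keys (point 1).

    heapq.merge reads the blocks lazily, and equal keys are adjacent in the
    merged stream, so only one running aggregate is kept at any time.
    """
    merged = heapq.merge(*sorted_blocks, key=itemgetter(0))
    for key, group in groupby(merged, key=itemgetter(0)):
        values = (value for _, value in group)
        total = next(values)
        for value in values:
            total = combine(total, value)
        yield key, total


# Two unsorted blocks, as a write buffer might produce them today.
blocks = [
    sort_block([("b", 2), ("a", 1)]),
    sort_block([("a", 3), ("c", 5), ("b", 4)]),
]
result = list(merge_and_aggregate(blocks, lambda x, y: x + y))
```

If the blocks were not sorted first, the merged stream would not be ordered by key, and the aggregator would have to keep state for every key it has seen, which is the case that may need a large amount of memory and spill to disk.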
