Re: [PR] [#1745] feat(remote merge): Introduce a common serializer. [incubator-uniffle]

via GitHub Wed, 17 Jul 2024 03:15:44 -0700


zhengchenyu commented on PR #1916:
URL: 
https://github.com/apache/incubator-uniffle/pull/1916#issuecomment-2232950095


   @zuston 
   Tez and MR have been thoroughly tested on our internal clusters with 
production jobs. However, Spark has not yet been tested with enough production jobs.
   
   There are still two issues to be resolved for Spark.
   
   * (1) Spark SQL may still spill to disk. 
   Some aggregators assume that the records passed to them are not sorted, so 
they may need a large amount of memory and may then spill to disk.
   For Hive, the aggregator assumes that the records passed to it are sorted, so 
they can be aggregated in memory. 
   For Spark SQL, we need to ensure that the shuffle is sorted and provide an 
in-memory aggregator while supporting merge sort.
   * (2) For remote merge, the records within a block need to be sorted.
   Spark's WriteBufferManager does not require the records within a block to be 
sorted, but remote merge does. WriteBufferManager needs to support sorting the 
records within a block. 
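
As a rough illustration of the two points above (this is not code from this PR, and every class and method name here is hypothetical), the sketch below sorts the records of each block by key, merges the sorted blocks with a k-way merge, and then aggregates the merged stream while holding only one key's running sum in memory. It assumes string keys and a simple sum as the aggregation.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical sketch: none of these names exist in Uniffle or Spark.
final class RemoteMergeSketch {

  /** A key/value record; stands in for a serialized shuffle record. */
  static final class Rec {
    final String key;
    final long value;

    Rec(String key, long value) {
      this.key = key;
      this.value = value;
    }

    @Override
    public String toString() {
      return key + "=" + value;
    }
  }

  /** Writer side: sort the records of one block by key before it is sent. */
  static List<Rec> sortBlock(List<Rec> block) {
    List<Rec> sorted = new ArrayList<>(block);
    sorted.sort(Comparator.comparing((Rec r) -> r.key));
    return sorted;
  }

  /** Merge side: k-way merge of blocks that are each already sorted by key. */
  static List<Rec> mergeSortedBlocks(List<List<Rec>> blocks) {
    // Each heap entry is {blockIndex, offsetInBlock}, ordered by the key it points at.
    Comparator<int[]> byKey = Comparator.comparing(c -> blocks.get(c[0]).get(c[1]).key);
    PriorityQueue<int[]> heap = new PriorityQueue<>(byKey);
    for (int i = 0; i < blocks.size(); i++) {
      if (!blocks.get(i).isEmpty()) {
        heap.add(new int[] {i, 0});
      }
    }
    List<Rec> out = new ArrayList<>();
    while (!heap.isEmpty()) {
      int[] cursor = heap.poll();
      List<Rec> block = blocks.get(cursor[0]);
      out.add(block.get(cursor[1]));
      if (cursor[1] + 1 < block.size()) {
        heap.add(new int[] {cursor[0], cursor[1] + 1});
      }
    }
    return out;
  }

  /** Reader side: sum values per key over sorted input, one key at a time. */
  static List<Rec> aggregateSorted(List<Rec> sorted) {
    List<Rec> out = new ArrayList<>();
    String current = null;
    long sum = 0;
    for (Rec r : sorted) {
      if (current != null && !current.equals(r.key)) {
        out.add(new Rec(current, sum)); // key changed, so the previous key is complete
        sum = 0;
      }
      current = r.key;
      sum += r.value;
    }
    if (current != null) {
      out.add(new Rec(current, sum));
    }
    return out;
  }

  public static void main(String[] args) {
    List<Rec> blockA = sortBlock(Arrays.asList(new Rec("b", 2), new Rec("a", 1)));
    List<Rec> blockB = sortBlock(Arrays.asList(new Rec("c", 4), new Rec("a", 3)));
    List<Rec> merged = mergeSortedBlocks(Arrays.asList(blockA, blockB));
    System.out.println(aggregateSorted(merged)); // prints [a=4, b=2, c=4]
  }
}
```

If a block were left unsorted, `mergeSortedBlocks` would no longer produce key-ordered output, and `aggregateSorted` would emit the same key more than once. That is why the writer has to sort within each block.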
   
   For these two reasons, the Spark-related code will not be submitted for now. 
Kryo is only used for Spark, so KryoSerializer support will be added when the 
Spark-related code is submitted.

