maobaolong commented on code in PR #1889:
URL: 
https://github.com/apache/incubator-uniffle/pull/1889#discussion_r1703504320


##########
docs/server_guide.md:
##########
@@ -125,7 +125,7 @@ This document will introduce how to deploy Uniffle shuffle servers.
 | rss.server.health.checker.script.execute.timeout | 5000 | Timeout for `HealthScriptChecker` execute health script.(ms) |
 
 ### Huge Partition Optimization
-A huge partition is a common problem for Spark/MR and so on, caused by data skew. And it can cause the shuffle server to become unstable. To solve this, we introduce some mechanisms to limit the writing of huge partitions to avoid affecting regular partitions, more details can be found in [ISSUE-378](https://github.com/apache/incubator-uniffle/issues/378). The basic rules for limiting large partitions are memory usage limits and flushing individual buffers directly to persistent storage.
+A huge partition is a common problem for Spark/MR and other engines, caused by data skew, and it can make the shuffle server unstable. To solve this, we introduce some mechanisms that limit the writing of huge partitions to avoid affecting regular partitions, and a hard-limit config that rejects extremely huge partitions. More details can be found in [ISSUE-378](https://github.com/apache/incubator-uniffle/issues/378). The basic rules for limiting huge partitions are memory usage limits and flushing individual buffers directly to persistent storage.

Review Comment:
   done



##########
docs/server_guide.md:
##########
@@ -144,6 +144,11 @@ For HADOOP FS, the conf value of `rss.server.single.buffer.flush.threshold` shou
 
 Finally, to improve the speed of writing to HDFS for a single partition, the value of `rss.server.max.concurrency.of.per-partition.write` and `rss.server.flush.hdfs.threadPool.size` could be increased to 50 or 100.
 
+#### Hard limit
+Once a huge partition reaches the hard-limit size, configured by `rss.server.huge-partition.size.hard.limit`, the server rejects the sendShuffleData request and the client does not retry. This lets the client fail fast, so users can modify their SQL or job to avoid exceeding the partition hard limit.

Review Comment:
   done
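
   For readers following this thread, here is a minimal server conf sketch combining the hard-limit config from this PR with the related settings quoted above. The threshold values are purely illustrative examples, not recommendations, and the exact value syntax should follow your own server conf conventions:

   ```properties
   # Illustrative fragment only -- all sizes/counts below are example values.

   # Flush an individual partition buffer directly to persistent storage once
   # it exceeds this size (the huge-partition mitigation described in the doc).
   rss.server.single.buffer.flush.threshold 128m

   # Increase per-partition write concurrency and HDFS flush threads, as the
   # section above suggests (e.g. 50 or 100).
   rss.server.max.concurrency.of.per-partition.write 50
   rss.server.flush.hdfs.threadPool.size 50

   # New in this PR: once a partition grows past this hard limit, the server
   # rejects sendShuffleData and the client fails fast without retrying.
   rss.server.huge-partition.size.hard.limit 100g
   ```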



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

