[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

GitBox Tue, 17 Jan 2023 03:21:04 -0800


advancedxy commented on code in PR #494:
URL: https://github.com/apache/incubator-uniffle/pull/494#discussion_r1072085066



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -574,11 +584,20 @@ public void removeResources(String appId) {
           new AppPurgeEvent(appId, getUserByAppId(appId), new 
ArrayList<>(shuffleToCachedBlockIds.keySet()))
       );
     }
+    if (shuffleTaskInfo.hasHugePartition()) {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.dec();
+      
ShuffleServerMetrics.gaugeHugePartitionNum.dec(shuffleTaskInfo.getHugePartitionSize());
+    }
     LOG.info("Finish remove resource for appId[" + appId + "] cost " + 
(System.currentTimeMillis() - start) + " ms");
   }
 
   public void refreshAppId(String appId) {
-    shuffleTaskInfos.computeIfAbsent(appId, x -> new 
ShuffleTaskInfo()).setCurrentTimes(System.currentTimeMillis());
+    shuffleTaskInfos.computeIfAbsent(
+        appId,
+        x -> {
+          ShuffleServerMetrics.counterTotalAppNum.inc();

Review Comment:
   > This might be inaccurate
   
   This problem is not addressed.
   `refreshAppId` could be called by multiple calls. For example, a spark app 
finishes an stage(with shuffle), then some local computing then submit new 
stage(with shuffle). The shuffle server will account this app multiple times.
   
   I can think of two proposals to fix this problem:
   1. Modify the `ShuffleRegisterRequest` in proto to include a `firstShuffle` 
boolean field to indicate this is the first shuffle register, mapping to the 
app.
   2. Shuffle server maintains another data structure to maintain which appIds 
has been seen. But it's not evicted until the configured size is reached, let's 
say 100_000?
   Both two proposals require the metric inc should be happened in 
`registerShuffle` call.



##########
server/src/main/java/org/apache/uniffle/server/ShuffleTaskManager.java:
##########
@@ -574,11 +584,20 @@ public void removeResources(String appId) {
           new AppPurgeEvent(appId, getUserByAppId(appId), new 
ArrayList<>(shuffleToCachedBlockIds.keySet()))
       );
     }
+    if (shuffleTaskInfo.hasHugePartition()) {
+      ShuffleServerMetrics.gaugeAppWithHugePartitionNum.dec();
+      
ShuffleServerMetrics.gaugeHugePartitionNum.dec(shuffleTaskInfo.getHugePartitionSize());
+    }
     LOG.info("Finish remove resource for appId[" + appId + "] cost " + 
(System.currentTimeMillis() - start) + " ms");
   }
 
   public void refreshAppId(String appId) {
-    shuffleTaskInfos.computeIfAbsent(appId, x -> new 
ShuffleTaskInfo()).setCurrentTimes(System.currentTimeMillis());
+    shuffleTaskInfos.computeIfAbsent(
+        appId,
+        x -> {
+          ShuffleServerMetrics.counterTotalAppNum.inc();

Review Comment:
   > This might be inaccurate
   
   This problem is not addressed.
   `refreshAppId` could be called by multiple calls. For example, a spark app 
finishes an stage(with shuffle), then some local computing then submit new 
stage(with shuffle). The shuffle server will account this app multiple times.
   
   I can think of two proposals to fix this problem:
   1. Modify the `ShuffleRegisterRequest` in proto to include a `firstShuffle` 
boolean field to indicate this is the first shuffle register, mapping to the 
app.
   2. Shuffle server maintains another data structure to maintain which appIds 
has been seen. But it's not evicted until the configured size is reached, let's 
say 100_000?
   3. 
   Both two proposals require the metric inc should be happened in 
`registerShuffle` call.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [incubator-uniffle] advancedxy commented on a diff in pull request #494: [ISSUE-378][HugePartition][Part-3] Introduce more metrics about huge partition

Reply via email to