Ngone51 commented on a change in pull request #33446:
URL: https://github.com/apache/spark/pull/33446#discussion_r675318136



##########
File path: core/src/main/scala/org/apache/spark/internal/config/package.scala
##########
@@ -1214,6 +1214,29 @@ package object config {
       .checkValue(_ > 0, "The max no. of blocks in flight cannot be non-positive.")
       .createWithDefault(Int.MaxValue)
 
+  private[spark] val REDUCER_SHUFFLE_FETCH_SLOW_LOG_THRESHOLD_MS =
+    ConfigBuilder("spark.reducer.shuffleFetchSlowLogThreshold.time")
+      .doc("When fetching blocks from an external shuffle service is slower 
than expected, the " +
+        "fetch will be logged to allow for subsequent investigation. A fetch 
is determined " +
+        "to be slow if it has a total duration of at least this value, and a 
transfer rate " +
+        // cannot reference val REDUCER_SHUFFLE_FETCH_SLOW_LOG_THRESHOLD_BPS 
since its uninitialized

Review comment:
       Why not declare `REDUCER_SHUFFLE_FETCH_SLOW_LOG_THRESHOLD_BPS` first?
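
       For illustration, a minimal sketch of the suggested reordering, assuming the BPS threshold is a plain bytes config (the `1m` default below is a placeholder, not from the PR; the `1s` time default follows the docs comment later in this review):

```scala
// Sketch only, inside org.apache.spark.internal.config's package object,
// where ConfigBuilder, ByteUnit and TimeUnit are already in scope.
// Declaring the BPS threshold first lets the time threshold's doc string
// reference it as an already-initialized val.
private[spark] val REDUCER_SHUFFLE_FETCH_SLOW_LOG_THRESHOLD_BPS =
  ConfigBuilder("spark.reducer.shuffleFetchSlowLogThreshold.bytesPerSec")
    .doc("Transfer-rate threshold below which a sufficiently long shuffle " +
      "fetch is logged as slow.")
    .bytesConf(ByteUnit.BYTE)
    .createWithDefaultString("1m") // placeholder default

private[spark] val REDUCER_SHUFFLE_FETCH_SLOW_LOG_THRESHOLD_MS =
  ConfigBuilder("spark.reducer.shuffleFetchSlowLogThreshold.time")
    .doc("When fetching blocks from an external shuffle service is slower than " +
      "expected, the fetch will be logged to allow for subsequent investigation. " +
      "A fetch is determined to be slow if it has a total duration of at least " +
      "this value, and a transfer rate less than " +
      s"${REDUCER_SHUFFLE_FETCH_SLOW_LOG_THRESHOLD_BPS.key}.")
    .timeConf(TimeUnit.MILLISECONDS)
    .createWithDefaultString("1s")
```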

##########
File path: core/src/main/scala/org/apache/spark/TestUtils.scala
##########
@@ -448,6 +451,34 @@ private[spark] object TestUtils {
       EnumSet.of(OWNER_READ, OWNER_EXECUTE, OWNER_WRITE))
     file.getPath
   }
+
+  /**
+   * A log4j-specific log capturing mechanism. Provided with a class name, it will add a new
+   * appender to the log for that class to capture the output. log4j must be properly configured
+   * on the classpath (e.g., with slf4j-log4j12) for this to work. This should be closed when it is
+   * done to remove the temporary appender.
+   */
+  class Log4jCapture(val loggerName: String) extends AutoCloseable {

Review comment:
       We can use `SparkFunSuite.withLogAppender()`.
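
       For reference, a rough sketch of the test shape with that helper. `fetchBlocks()` is a hypothetical stand-in for whatever drives a fetch past both thresholds, and the target logger is assumed (not confirmed by this PR) to be `ShuffleBlockFetcherIterator`'s:

```scala
// Rough sketch inside a suite extending SparkFunSuite (log4j 1.x era APIs);
// LogAppender and withLogAppender are provided by SparkFunSuite.
test("slow shuffle fetches are logged") {
  val appender = new LogAppender("slow shuffle fetch")
  withLogAppender(appender,
      loggerName = Some(classOf[ShuffleBlockFetcherIterator].getName)) {
    fetchBlocks() // hypothetical: trigger a fetch exceeding both thresholds
  }
  // The temporary appender is removed automatically when the closure returns.
  assert(appender.loggingEvents.exists(_.getRenderedMessage.contains("slow")))
}
```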

##########
File path: docs/configuration.md
##########
@@ -2064,6 +2064,28 @@ Apart from these, the following properties are also available, and may be useful
   </td>
   <td>1.2.0</td>
 </tr>
+<tr>
+  <td><code>spark.reducer.shuffleFetchSlowLogThreshold.time</code></td>
+  <td>value of <code>spark.reducer.shuffleFetchSlowLogThreshold.time</code></td>
+  <td>
+    When fetching blocks from an external shuffle service is slower than expected, the
+    fetch will be logged to allow for subsequent investigation. A fetch is determined
+    to be slow if it has a total duration of at least this value, and a transfer rate
+    less than <code>spark.reducer.shuffleFetchSlowLogThreshold.bytesPerSec</code>.
+  </td>
+  <td>3.2.0</td>
+</tr>
+<tr>
+  <td><code>spark.reducer.shuffleFetchSlowLogThreshold.bytesPerSec</code></td>
+  <td>value of <code>spark.reducer.shuffleFetchSlowLogThreshold.bytesPerSec</code></td>

Review comment:
       same here.

##########
File path: docs/configuration.md
##########
@@ -2064,6 +2064,28 @@ Apart from these, the following properties are also available, and may be useful
   </td>
   <td>1.2.0</td>
 </tr>
+<tr>
+  <td><code>spark.reducer.shuffleFetchSlowLogThreshold.time</code></td>
+  <td>value of <code>spark.reducer.shuffleFetchSlowLogThreshold.time</code></td>

Review comment:
       This cell is for the default value. I think it's `1s` here.
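
       (That is, assuming `1s` is indeed the chosen default, the second cell would read `<td>1s</td>` rather than repeating the property name.)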



