HeartSaVioR commented on a change in pull request #25552: [SPARK-28849][CORE]
Add a number to control transferTo calls to avoid infinite loop in some
occasional cases
URL: https://github.com/apache/spark/pull/25552#discussion_r317032369
##########
File path: core/src/main/scala/org/apache/spark/util/Utils.scala
##########
@@ -417,16 +418,19 @@ private[spark] object Utils extends Logging {
       input: FileChannel,
       output: WritableByteChannel,
       startPosition: Long,
-      bytesToCopy: Long): Unit = {
+      bytesToCopy: Long,
+      numTransferToCalls: Int): Unit = {
     val outputInitialState = output match {
       case outputFileChannel: FileChannel =>
         Some((outputFileChannel.position(), outputFileChannel))
       case _ => None
     }
     var count = 0L
+    var num = 0
     // In case transferTo method transferred less data than we have required.
-    while (count < bytesToCopy) {
+    while (count < bytesToCopy && num < numTransferToCalls) {
       count += input.transferTo(count + startPosition, bytesToCopy - count, output)
Review comment:
Thanks @srowen for providing a good example. That's exactly what I meant in my first sentence.
> Also it is not so good to add wait here, it will delay the shuffle write critical path, and in a such scenario adding delay will not recover anything, only just lower the system usage.
If we assume I/O will never recover, then yes, maybe it doesn't help, but we don't have any evidence to back that up. We're just reasoning that if the loop really runs forever, the returned count should be 0 continuously. If that is the behavior we suspect, checking for 0 addresses the suspicion more directly, right? The advantage of checking `count` is that we're not restricted to comparing it with `0`, as I'll also describe below.
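To make this concrete, here is a rough, untested sketch of what checking the returned count against 0 could look like. The method name and the `maxConsecutiveZeroReturns` parameter are purely illustrative, not something proposed in this PR:

```scala
import java.io.IOException
import java.nio.channels.{FileChannel, WritableByteChannel}

// Hypothetical sketch, not this PR's change: give up only when transferTo keeps
// returning 0 bytes in a row, rather than capping the total number of calls.
def copyWithZeroReturnGuard(
    input: FileChannel,
    output: WritableByteChannel,
    startPosition: Long,
    bytesToCopy: Long,
    maxConsecutiveZeroReturns: Int): Unit = {
  var count = 0L
  var zeroReturns = 0
  while (count < bytesToCopy) {
    val transferred = input.transferTo(count + startPosition, bytesToCopy - count, output)
    if (transferred == 0L) {
      zeroReturns += 1
      if (zeroReturns >= maxConsecutiveZeroReturns) {
        throw new IOException(s"transferTo returned 0 bytes $zeroReturns times in a row " +
          s"at position ${count + startPosition}; giving up")
      }
    } else {
      zeroReturns = 0 // progress was made, reset the counter
    }
    count += transferred
  }
}
```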
If you're concerned about the shuffle write being on the critical path and worry about latency, you probably have some kind of SLA in mind (like N MiB/s), and measuring the actual throughput would give you more information. You may want to just kill the task when it's writing a huge shuffle but Spark is writing at 1 KiB/s or less for 10 minutes: at that rate it writes less than 1 MiB in 10 minutes, and writing 1 GiB of shuffle with such throughput is effectively stuck. At the very least you may want to log a debug or warning message when the throughput over the last N minutes doesn't make sense (I wouldn't say which number is good; maybe some heuristic comes in, I don't know).
We can still get some hints by figuring out the relation between the returned count and the latency of `transferTo`. If they are tightly related, we may have a chance to leverage that fact. If not, the elapsed time to finish the loop after N iterations is unpredictable, and you may still waste minutes or even hours waiting for the loop count to reach the configured number.
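Even throwaway instrumentation like the following would tell us whether small returned counts correlate with long `transferTo` calls (again just a sketch, assuming the same local variables as above; `logDebug` comes from `Logging`):

```scala
// Hypothetical instrumentation: record the returned byte count and the latency of
// each transferTo call so we can see whether the two are actually related.
var count = 0L
while (count < bytesToCopy) {
  val startNs = System.nanoTime()
  val transferred = input.transferTo(count + startPosition, bytesToCopy - count, output)
  val elapsedMicros = (System.nanoTime() - startNs) / 1000
  logDebug(s"transferTo returned $transferred bytes in $elapsedMicros us")
  count += transferred
}
```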
We will never reach a good answer unless some meaningful numbers are measured during the error situation. Everything for now is just assumption.