Saisai Shao created SPARK-28849:
-----------------------------------

             Summary: Spark's UnsafeShuffleWriter may run into infinite loop in 
transferTo occasionally
                 Key: SPARK-28849
                 URL: https://issues.apache.org/jira/browse/SPARK-28849
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.3.1
            Reporter: Saisai Shao


Spark's {{UnsafeShuffleWriter}} may occasionally run into an infinite loop
when calling {{transferTo}}. We observed that, while merging shuffle temp
files, the task hung for several hours until it was killed manually. Here is
the log; there is no output for several hours after the shuffle files were
spilled to disk.

Here is the thread dump; it shows that the thread is calling the native
method {{size0}}.

We also used strace to trace the system calls and found that this thread is
continuously calling {{fstat}}; here is the screenshot.

We have not found the root cause; I guess it might be related to a file
system or disk issue. In any case, we should figure out a way to fail fast in
such a scenario.
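
One possible shape for such a fail-fast check is sketched below. It assumes
the hang comes from a copy loop that keeps calling {{transferTo}} while it
returns 0 bytes, which would be consistent with the repeated {{fstat}} calls
but is not confirmed here. The class {{FailFastTransfer}}, the method
{{copy}} and the parameter {{maxStalledAttempts}} are hypothetical names used
only for illustration; they are not part of Spark.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;

/** Hypothetical helper, not Spark code: a transferTo copy loop that gives up when stalled. */
public final class FailFastTransfer {
    private FailFastTransfer() {}

    /**
     * Copies bytesToCopy bytes from input, starting at startPosition, to output.
     * Throws an IOException once transferTo has returned 0 bytes
     * maxStalledAttempts times in a row, instead of looping forever.
     */
    public static long copy(FileChannel input, WritableByteChannel output,
                            long startPosition, long bytesToCopy,
                            int maxStalledAttempts) throws IOException {
        long copied = 0L;
        int stalled = 0;
        while (copied < bytesToCopy) {
            long n = input.transferTo(startPosition + copied, bytesToCopy - copied, output);
            if (n > 0) {
                copied += n;
                stalled = 0; // progress was made, so reset the stall counter
            } else if (++stalled >= maxStalledAttempts) {
                // Assumption: repeated zero-byte transfers mean the copy will never finish.
                throw new IOException("transferTo made no progress after " + stalled
                    + " attempts; copied " + copied + " of " + bytesToCopy + " bytes");
            }
        }
        return copied;
    }
}
```

A guard like this only helps if control returns to the loop between
attempts; if a single native call blocks, a timeout or watchdog outside the
task thread would be needed instead.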


