[ https://issues.apache.org/jira/browse/SPARK-6962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14501054#comment-14501054 ]

Aaron Davidson commented on SPARK-6962:
---------------------------------------

Thanks for those log excerpts. It is likely significant that each IP appeared 
exactly once in a connection exception among the executors. Given that warning, 
but no corresponding "Still have X requests outstanding when connection from 
10.106.143.39 is closed" error, I would also be inclined to conclude that only 
the TransportServer side of the socket is timing out, and that for some reason 
the connection exception is not reaching the client side of the socket (which 
would have caused the outstanding fetch requests to fail promptly).

If this situation can arise, then each client could be waiting indefinitely for 
a response from some other server that will never come. Is your cluster in any 
sort of unusual network configuration?

Even so, this could only explain why the hang is indefinite, not why all 
communication is paused for the 20 minutes leading up to it.

To diagnose this further, it would be very useful if you could turn on 
TRACE-level logging for org.apache.spark.storage.ShuffleBlockFetcherIterator 
and org.apache.spark.network (this should look like 
{{log4j.logger.org.apache.spark.network=TRACE}} in log4j.properties).
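
As a minimal sketch, assuming the stock conf/log4j.properties shipped with 
Spark (the existing rootLogger/appender setup is left unchanged), the two 
additions would look like this:

{code}
# Enable TRACE logging for the shuffle fetch path and the network layer.
# Only these two lines are added; the rest of conf/log4j.properties stays as-is.
log4j.logger.org.apache.spark.storage.ShuffleBlockFetcherIterator=TRACE
log4j.logger.org.apache.spark.network=TRACE
{code}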

> Netty BlockTransferService hangs in the middle of SQL query
> -----------------------------------------------------------
>
>                 Key: SPARK-6962
>                 URL: https://issues.apache.org/jira/browse/SPARK-6962
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 1.2.0, 1.2.1, 1.3.0
>            Reporter: Jon Chase
>         Attachments: jstacks.txt
>
>
> Spark SQL queries (though this appears to be a Spark Core issue - I'm just 
> using queries in the REPL to surface it, so I mention Spark SQL) hang 
> indefinitely under certain (not fully understood) circumstances.
> This is resolved by setting spark.shuffle.blockTransferService=nio, which 
> seems to point to Netty as the issue.  Netty became the default for the 
> block transport layer in 1.2.0, which is when this issue started.  Setting 
> the service to nio allows queries to complete normally.
> I do not see this problem when running queries over smaller datasets (~20 
> 5MB files).  When I increase the scope to include more data (several hundred 
> ~5MB files), the queries get through several steps but eventually hang 
> indefinitely.
> Here's the email chain regarding this issue, including stack traces:
> http://mail-archives.apache.org/mod_mbox/spark-user/201503.mbox/<cae61spfqt2y7d5vqzomzz2dmr-jx2c2zggcyky40npkjjx4...@mail.gmail.com>
> For context, here's the announcement regarding the block transfer service 
> change: 
> http://mail-archives.apache.org/mod_mbox/spark-dev/201411.mbox/<cabpqxssl04q+rbltp-d8w+z3atn+g-um6gmdgdnh-hzcvd-...@mail.gmail.com>
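
For reference, a minimal sketch of the workaround mentioned in the description 
above, assuming the setting is applied through conf/spark-defaults.conf (the 
same key can also be passed with --conf on spark-submit):

{code}
# Fall back to the nio block transfer service instead of the netty default
# introduced in 1.2.0. This is a workaround only; it does not address the
# underlying hang being investigated here.
spark.shuffle.blockTransferService  nio
{code}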



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
