[ https://issues.apache.org/jira/browse/TEZ-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ming Ma updated TEZ-3263:
-------------------------
    Description: 
Maybe the fix could be something similar to MAPREDUCE-5891. Here is one 
exception found during an NM rolling restart. At least for the unordered case, 
it seems the fetcher is able to resubmit the request to the queue and 
eventually succeed, but the fetcher still sends an InputReadErrorEvent to the 
AM for each retry, which could cause the AM to mark the source task as bad if 
enough destination tasks complain.

{noformat}
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:653)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1325)
        at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:247)
        at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnection(Fetcher.java:464)
{noformat}
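
One possible shape for such a fix, as a rough sketch only: keep a per-source-host retry window and suppress the report to the AM while connection failures are still inside that window. The class FetchRetryWindow, its methods, and the window length are hypothetical names chosen for illustration; they are not existing Tez APIs, and the actual patch may look quite different.

```java
import java.net.ConnectException;

/**
 * Hypothetical helper (not part of Tez): tracks how long connection failures
 * to a single source host have been going on, and decides whether a failure
 * should be reported to the AM or silently retried.
 */
final class FetchRetryWindow {
    private final long retryWindowMs;
    // Timestamp of the first failure in the current streak, or -1 if none.
    private long firstFailureMs = -1;

    FetchRetryWindow(long retryWindowMs) {
        this.retryWindowMs = retryWindowMs;
    }

    /**
     * Returns true if the failure should be reported to the AM, false if the
     * fetcher should just resubmit the request without reporting.
     */
    synchronized boolean shouldReport(Exception failure, long nowMs) {
        if (!(failure instanceof ConnectException)) {
            // Assumption: only connection failures are treated as possibly
            // caused by an NM restart; everything else is reported at once.
            return true;
        }
        if (firstFailureMs < 0) {
            firstFailureMs = nowMs; // start of a new failure streak
        }
        // Report only once the host has been unreachable for the full window.
        return nowMs - firstFailureMs >= retryWindowMs;
    }

    /** Resets the streak after a successful fetch from the host. */
    synchronized void onSuccess() {
        firstFailureMs = -1;
    }
}
```

With something like this, the fetcher would call shouldReport() before sending an InputReadErrorEvent and onSuccess() after a successful fetch, so a short NM restart would not count against the source task.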

  was:
Maybe the fix could be something similar to MAPREDUCE-5891. Here is one 
exception found during NM rolling restart.

{noformat}
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:579)
        at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
        at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
        at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:653)
        at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1325)
        at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:247)
        at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnection(Fetcher.java:464)
{noformat}


> Improved shuffle error handling across NM restarts
> --------------------------------------------------
>
>                 Key: TEZ-3263
>                 URL: https://issues.apache.org/jira/browse/TEZ-3263
>             Project: Apache Tez
>          Issue Type: Improvement
>            Reporter: Ming Ma
>
> Maybe the fix could be something similar to MAPREDUCE-5891. Here is one 
> exception found during an NM rolling restart. At least for the unordered 
> case, it seems the fetcher is able to resubmit the request to the queue and 
> eventually succeed, but the fetcher still sends an InputReadErrorEvent to the 
> AM for each retry, which could cause the AM to mark the source task as bad if 
> enough destination tasks complain.
> {noformat}
> java.net.ConnectException: Connection refused
>       at java.net.PlainSocketImpl.socketConnect(Native Method)
>       at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
>       at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
>       at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
>       at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
>       at java.net.Socket.connect(Socket.java:579)
>       at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
>       at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
>       at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
>       at sun.net.www.http.HttpClient.parseHTTP(HttpClient.java:653)
>       at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1325)
>       at org.apache.tez.http.HttpConnection.getInputStream(HttpConnection.java:247)
>       at org.apache.tez.runtime.library.common.shuffle.Fetcher.setupConnection(Fetcher.java:464)
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
