[ https://issues.apache.org/jira/browse/TEZ-2630?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Hitesh Shah reassigned TEZ-2630:
--------------------------------

    Assignee: Hitesh Shah

> TezChild receives IP address instead of FQDN 
> ---------------------------------------------
>
>                 Key: TEZ-2630
>                 URL: https://issues.apache.org/jira/browse/TEZ-2630
>             Project: Apache Tez
>          Issue Type: Bug
>    Affects Versions: 0.5.0, 0.6.0, 0.7.0
>            Reporter: Rajat Jain
>            Assignee: Hitesh Shah
>            Priority: Critical
>         Attachments: TEZ-2630.2.patch, TEZ-2630.3.patch, TEZ-2630.patch
>
>
> I am running a YARN cluster on AWS. The slave nodes (NMs) are all configured 
> to listen on their private DNS names. For example, a sample NodeManager 
> listens on ip-10-16-141-168.ec2.internal:8042.
> When I try to run a Tez job (even a simple one such as select count(*) from 
> nation), it fails because the child tasks are unable to connect to the AM. 
> The issue is that they try to connect to the IP address instead of the 
> private DNS name. 
> Here are some sample log lines (a couple of them added by me for debugging):
> {code}
> 2015-07-21 17:08:21,919 INFO [main] task.TezChild: TezChild starting
> 2015-07-21 17:08:22,310 INFO [main] task.TezChild: Using socket factory class: org.apache.hadoop.net.StandardSocketFactory
> 2015-07-21 17:08:22,336 INFO [main] task.TezChild: PID, containerIdentifier:  3699, container_1437498369268_0001_01_000002
> 2015-07-21 17:08:22,418 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
> 2015-07-21 17:08:23,025 INFO [main] task.TezChild: Got host:port: 10.16.141.168:37949
> 2015-07-21 17:08:23,035 INFO [main] task.TezChild: address variables: 10.16.141.168:37949
> 2015-07-21 17:08:23,143 INFO [TezChild] task.ContainerReporter: Attempting to fetch new task
> 2015-07-21 17:08:24,201 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 0 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-07-21 17:08:25,202 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 1 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-07-21 17:08:26,757 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 2 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> 2015-07-21 17:08:27,758 INFO [TezChild] ipc.Client: Retrying connect to server: 10.16.141.168/10.16.141.168:37949. Already tried 3 time(s); retry policy is RetryUpToMaximumCountWithFixedSleep(maxRetries=50, sleepTime=1000 MILLISECONDS)
> {code}
> The AM is listening at the right address, but TezChild receives the IP 
> address instead of the private DNS name. 
> AM logs:
> {code}
> 2015-07-21 18:09:27,906 INFO [ServiceThread:org.apache.tez.dag.app.TaskAttemptListenerImpTezDag] app.TaskAttemptListenerImpTezDag: Listening at address: ip-10-234-2-80.ec2.internal:49967
> {code}
> TezChild logs:
> {code}
> 2015-07-21 18:09:35,353 INFO [main] task.TezChild: TezChild starting
> 2015-07-21 18:09:35,379 INFO [main] task.TezChild: Args: 10.234.2.80,49967,container_1437501941642_0001_01_000002,application_1437501941642_0001,1
> 2015-07-21 18:09:35,770 INFO [main] task.TezChild: Using socket factory class: org.apache.hadoop.net.StandardSocketFactory
> 2015-07-21 18:09:35,785 INFO [main] task.TezChild: PID, containerIdentifier:  8670, container_1437501941642_0001_01_000002
> 2015-07-21 18:09:35,864 INFO [main] Configuration.deprecation: fs.default.name is deprecated. Instead, use fs.defaultFS
> 2015-07-21 18:09:36,403 INFO [main] task.TezChild: Got host:port: 10.234.2.80:49967
> 2015-07-21 18:09:36,413 INFO [main] task.TezChild: address variables: 10.234.2.80:49967
> {code}
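
For illustration only: the AM log and the TezChild args above show the same endpoint in two string forms. The standalone Java sketch below is not Tez code; the class and method names (AmAddressSketch, byHostName, byIp) are hypothetical. It builds an address from the host name and IP seen in the logs, without any DNS lookup, and prints both forms. It only shows how the two strings can come from the same address object, and makes no claim about where Tez picks the IP form.

```java
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.UnknownHostException;

// Hypothetical illustration class; none of these names exist in Tez.
final class AmAddressSketch {

    // host:port using the host name carried by the address (no reverse lookup).
    static String byHostName(InetSocketAddress addr) {
        return addr.getHostString() + ":" + addr.getPort();
    }

    // host:port using the raw IP literal of the same address.
    static String byIp(InetSocketAddress addr) {
        return addr.getAddress().getHostAddress() + ":" + addr.getPort();
    }

    public static void main(String[] args) throws UnknownHostException {
        // Pair the host name and IP from the logs explicitly, so that
        // no DNS resolution is needed to run this sketch.
        InetAddress am = InetAddress.getByAddress(
                "ip-10-234-2-80.ec2.internal",
                new byte[] {10, (byte) 234, 2, 80});
        InetSocketAddress amAddress = new InetSocketAddress(am, 49967);

        System.out.println(byHostName(amAddress)); // form seen in the AM log
        System.out.println(byIp(amAddress));       // form seen in the TezChild args
    }
}
```

Run as written, this prints ip-10-234-2-80.ec2.internal:49967 followed by 10.234.2.80:49967, matching the AM log and the TezChild args respectively.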



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
