Have you searched the logs for 'f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4' ?

In the snippet you posted, I don't see a registration for that executor.

Cheers

On Tue, Dec 29, 2015 at 12:43 PM, Adrian Bridgett <adr...@opensignal.com>
wrote:

> We're seeing an "Executor is not registered" error on a Spark (1.6.0rc4,
> mesos-0.26) cluster.  It seems as if the logic in
> MesosExternalShuffleService.scala isn't working for some reason (new in 1.6
> I believe).
>
> spark application sees this:
> ...
> 15/12/29 18:49:41 INFO MesosExternalShuffleClient: Successfully registered
> app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 with external shuffle service.
> 15/12/29 18:49:41 INFO MesosExternalShuffleClient: Successfully registered
> app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 with external shuffle service.
> 15/12/29 18:49:43 INFO CoarseMesosSchedulerBackend: Registered executor
> NettyRpcEndpointRef(null) (ip-10-1-201-165.ec2.internal:37660) with ID
> f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6
> 15/12/29 18:49:43 INFO ExecutorAllocationManager: New executor
> f02cb67a-3519-4655-b23a-edc0dd082bf1-S3/1 has registered (new total is 1)
> 15/12/29 18:49:43 INFO BlockManagerMasterEndpoint: Registering block
> manager ip-10-1-201-165.ec2.internal:53854 with 13.8 GB RAM,
> BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6,
> ip-10-1-201-165.ec2.internal, 53854)
> 15/12/29 18:49:43 INFO BlockManagerMasterEndpoint: Registering block
> manager ip-10-1-201-132.ec2.internal:12793 with 13.8 GB RAM,
> BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S3/1,
> ip-10-1-201-132.ec2.internal, 12793)
> 15/12/29 18:49:43 INFO ExecutorAllocationManager: New executor
> f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6 has registered (new total is 2)
> ...
> 15/12/29 18:58:06 INFO BlockManagerInfo: Added broadcast_6_piece0 in
> memory on ip-10-1-201-165.ec2.internal:53854 (size: 5.2KB, free: 13.8GB)
> 15/12/29 18:58:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map
> output locations for shuffle 1 to ip-10-1-202-121.ec2.internal:59734
> 15/12/29 18:58:06 INFO MapOutputTrackerMaster: Size of output statuses for
> shuffle 1 is 1671814 bytes
> 15/12/29 18:58:06 INFO MapOutputTrackerMasterEndpoint: Asked to send map
> output locations for shuffle 1 to ip-10-1-201-165.ec2.internal:37660
> ...
> 15/12/29 18:58:07 INFO TaskSetManager: Starting task 63.0 in stage 1.0
> (TID 2191, ip-10-1-200-232.ec2.internal, partition 63,PROCESS_LOCAL, 2171
> bytes)
> 15/12/29 18:58:07 WARN TaskSetManager: Lost task 21.0 in stage 1.0 (TID
> 2149, ip-10-1-200-232.ec2.internal):
> FetchFailed(BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4,
> ip-10-1-202-114.ec2.internal, 7337), shuffleId=1, mapId=5, reduceId=21,
> message=
> org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException:
> Executor is not registered
> (appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014,
> execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S1/4)
>         at
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)
> ...
> 15/12/29 18:58:07 INFO DAGScheduler: Resubmitting ShuffleMapStage 0
> (reduceByKey at
> /home/ubuntu/ajay/name-mapper/kpis/namemap_kpi_processor.py:48) and
> ShuffleMapStage 1 (reduceByKey at
> /home/ubuntu/ajay/name-mapper/kpis/namemap_kpi_processor.py:50) due to
> fetch failure
> 15/12/29 18:58:07 WARN TaskSetManager: Lost task 12.0 in stage 1.0 (TID
> 2140, ip-10-1-200-232.ec2.internal):
> FetchFailed(BlockManagerId(f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6,
> ip-10-1-201-165.ec2.internal, 7337), shuffleId=1, mapId=6, reduceId=12,
> message=
> org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException:
> Executor is not registered
> (appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014,
> execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6)
>         at
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)
>
>
> The shuffle service itself (on the executor's IP) sees:
> 15/12/29 18:49:41 INFO MesosExternalShuffleBlockHandler: Received
> registration request from app a9344e17-f767-4b1e-a32e-e98922d6ca43-0014
> (remote address /10.1.200.165:37889).  (that's the driver IP)
> 15/12/29 18:49:43 WARN MesosExternalShuffleBlockHandler: Unknown /
> 10.1.201.165:52562 disconnected. (executor IP)
> 15/12/29 18:51:41 INFO MesosExternalShuffleBlockHandler: Application
> a9344e17-f767-4b1e-a32e-e98922d6ca43-0014 disconnected (address was /
> 10.1.200.165:37889). (driver IP again)
> 15/12/29 18:58:07 ERROR TransportRequestHandler: Error while invoking
> RpcHandler#receive() on RPC id 6244044000322436908
> java.lang.RuntimeException: Executor is not registered
> (appId=a9344e17-f767-4b1e-a32e-e98922d6ca43-0014,
> execId=f02cb67a-3519-4655-b23a-edc0dd082bf1-S9/6) (executor IP)
>         at
> org.apache.spark.network.shuffle.ExternalShuffleBlockResolver.getBlockData(ExternalShuffleBlockResolver.java:183)
>
> At first I wondered if reducing spark.shuffle.io.numConnectionsPerPeer
> back down to 1 (from 4) would help - maybe the service wasn't keeping
> track of the number of connections.  With the extra debug, however, I now
> suspect that the driver disconnects after 2 minutes and the Mesos shuffle
> service takes this as a sign that the whole app has finished, so it tidies
> up its executor registrations.
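
If that guess is right, the failure mode could be modelled roughly like the
sketch below. This is an illustration only - the class, method names and
cleanup logic here are my assumptions about the behaviour, not the actual
MesosExternalShuffleService.scala code:

```scala
// Hypothetical model of the suspected behaviour (NOT Spark's real code):
// the service maps each driver connection to an appId and, when that
// connection closes, drops every executor registered under the app, so
// any later shuffle fetch fails with "Executor is not registered".
class ShuffleServiceModel {
  private var connectedApps = Map.empty[String, String]          // driver address -> appId
  private var registeredExecutors = Set.empty[(String, String)]  // (appId, execId)

  def registerDriver(address: String, appId: String): Unit =
    connectedApps += (address -> appId)

  def registerExecutor(appId: String, execId: String): Unit =
    registeredExecutors += ((appId, execId))

  // Suspected cleanup: a closed driver channel is treated as "app finished".
  def connectionClosed(address: String): Unit =
    connectedApps.get(address).foreach { appId =>
      connectedApps -= address
      registeredExecutors = registeredExecutors.filterNot(_._1 == appId)
    }

  def getBlockData(appId: String, execId: String): Unit =
    if (!registeredExecutors.contains((appId, execId)))
      throw new RuntimeException(
        s"Executor is not registered (appId=$appId, execId=$execId)")
}
```

With this model, an idle driver connection being dropped after ~120s would
explain both the "Application ... disconnected" log line and the later
FetchFailedException, even though the app is still running.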
>
> I couldn't increase the timeout beyond 120s even when I set the
> following on the driver (I'll retry setting them on the shuffle service):
> spark.network.timeout 180s
> spark.shuffle.io.connectionTimeout 240s
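
For the shuffle-service side, this is roughly what I plan to try. It assumes
the standalone Mesos shuffle service picks up conf/spark-defaults.conf from
the SPARK_HOME it is launched from on each agent - I haven't verified that
yet:

```shell
# Assumption: the Mesos shuffle service reads conf/spark-defaults.conf
# from its own SPARK_HOME on each agent, so set the timeout there and
# restart the service.
echo "spark.network.timeout 240s" >> "$SPARK_HOME/conf/spark-defaults.conf"
"$SPARK_HOME/sbin/stop-mesos-shuffle-service.sh"
"$SPARK_HOME/sbin/start-mesos-shuffle-service.sh"
```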
>
> Adrian
> --
> *Adrian Bridgett*
>
