[ https://issues.apache.org/jira/browse/SPARK-17468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15476952#comment-15476952 ]
Sean Owen commented on SPARK-17468:
-----------------------------------

Doesn't the worker then die? I'm not clear in this case why you'd have workers still running for any significant period of time.

> Cluster workers crash when the master's network is bad for more than one WORKER_TIMEOUT_MS
> -------------------------------------------------------------------------------------------
>
>                 Key: SPARK-17468
>                 URL: https://issues.apache.org/jira/browse/SPARK-17468
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.6.1
>         Environment: CentOS 6.5, Spark standalone, 15 machines: 15 workers and 2 masters; a worker, a master, and the driver run on the same machine
>            Reporter: zhangzhiyan
>            Priority: Critical
>              Labels: Spark, WORKER_TIMEOUT_MS, crush, standalone
>   Original Estimate: 168h
>  Remaining Estimate: 168h
>
> I'm from China. My production Spark standalone cluster crashed during the 9.9 sales promotion; please help me understand how to solve this problem, thanks.
> The master log is below:
> 16/09/09 09:49:57 WARN Master: Removing worker-20160814124907-10.205.130.37-16590 because we got no heartbeat in 60 seconds
> 16/09/09 09:49:57 WARN Master: Removing worker-20160814113016-10.205.130.13-57487 because we got no heartbeat in 60 seconds
> 16/09/09 09:49:57 WARN Master: Removing worker-20160814134926-10.205.130.39-11430 because we got no heartbeat in 60 seconds
> 16/09/09 09:49:57 WARN Master: Removing worker-20160814131257-10.205.130.38-32160 because we got no heartbeat in 60 seconds
> 16/09/09 09:49:57 WARN Master: Removing worker-20160814161444-10.205.136.19-14196 because we got no heartbeat in 60 seconds
> 16/09/09 09:49:57 WARN Master: Removing worker-20160814141654-10.205.130.42-49707 because we got no heartbeat in 60 seconds
> 16/09/09 09:49:57 WARN Master: Removing worker-20160814115125-10.205.130.14-38381 because we got no heartbeat in 60 seconds
> 16/09/09 09:49:57 WARN Master: Removing worker-20160814152146-10.205.136.10-24730 because we got no heartbeat in 60 seconds
> 16/09/09 09:49:57 WARN Master: Removing worker-20160814122817-10.205.130.36-54348 because we got no heartbeat in 60 seconds
> 16/09/09 09:49:57 WARN Master: Removing worker-20160814170452-10.205.136.34-9921 because we got no heartbeat in 60 seconds
> 16/09/09 09:49:58 WARN Master: Removing worker-20160814154744-10.205.136.12-12399 because we got no heartbeat in 60 seconds
> 16/09/09 09:49:58 WARN Master: Removing worker-20160814150355-10.205.130.44-5792 because we got no heartbeat in 60 seconds
> 16/09/09 09:49:58 WARN Master: Removing worker-20160814143901-10.205.130.43-2223 because we got no heartbeat in 60 seconds
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814124907-10.205.130.37-16590. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814170452-10.205.136.34-9921. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814141654-10.205.130.42-49707. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814115125-10.205.130.14-38381. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814134926-10.205.130.39-11430. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814131257-10.205.130.38-32160. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814150355-10.205.130.44-5792. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814154744-10.205.136.12-12399. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814161444-10.205.136.19-14196. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814113016-10.205.130.13-57487. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814152146-10.205.136.10-24730. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814143901-10.205.130.43-2223. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814122817-10.205.130.36-54348. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814124907-10.205.130.37-16590. Asking it to re-register.
> 16/09/09 09:49:58 WARN Master: Got heartbeat from unregistered worker worker-20160814170452-10.205.136.34-9921. Asking it to re-register.
>
> I think the code below may be wrong. When the master's network is bad for more than WORKER_TIMEOUT_MS, the master removes the worker and executor information from its memory. When the workers then quickly re-establish their connection, their old state has already been erased on the master's side, so even though they are still running their old executors, the master allocates more resources than the workers can afford, and that crashes my workers.
> So I tried increasing WORKER_TIMEOUT_MS to 3 minutes. Is that OK? Can you give me some advice?
> Code location: org.apache.spark.deploy.master.Master, line 1023
>
> /** Check for, and remove, any timed-out workers */
> private def timeOutDeadWorkers() {
>   // Copy the workers into an array so we don't modify the hashset while iterating through it
>   val currentTime = System.currentTimeMillis()
>   val toRemove = workers.filter(_.lastHeartbeat < currentTime - WORKER_TIMEOUT_MS).toArray
>   for (worker <- toRemove) {
>     if (worker.state != WorkerState.DEAD) {
>       logWarning("Removing %s because we got no heartbeat in %d seconds".format(
>         worker.id, WORKER_TIMEOUT_MS / 1000))
>       removeWorker(worker)
>     } else {
>       if (worker.lastHeartbeat < currentTime - ((REAPER_ITERATIONS + 1) * WORKER_TIMEOUT_MS)) {
>         workers -= worker // we've seen this DEAD worker in the UI, etc. for long enough; cull it
>       }
>     }
>   }
> }
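For reference, the heartbeat timeout that feeds WORKER_TIMEOUT_MS in standalone mode is exposed as the spark.worker.timeout property (in seconds, 60 by default) and is read by the Master. A minimal sketch of raising it to the 3 minutes proposed above, assuming the setting is applied on the master node before the Master process is (re)started:

    # conf/spark-defaults.conf on the master node (value in seconds; 180 s = 3 minutes)
    spark.worker.timeout    180

    # or equivalently via conf/spark-env.sh, passed as a JVM system property to the Master
    export SPARK_MASTER_OPTS="-Dspark.worker.timeout=180"

Note that raising the timeout only delays the removal: if the master's network outage lasts longer than the new timeout, the same remove / re-register / over-allocation sequence described above can still occur.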