Re: Mesos Executor Failing
What version of Mesos are you using? (Just based on the word "slave" in that error message, I'm guessing 0.28 or older.)

The "Failed to synchronize" error is something that can occur while the agent is launching the executor. During the launch, the agent creates a pipe to the executor subprocess, and the executor makes a blocking read on this pipe. The agent then writes a value to the pipe to signal the executor to proceed. If the agent restarts or the pipe breaks at this point in the launch, you'll see this error message.

On Thu, May 18, 2017 at 9:44 PM, Chawla,Sumit wrote:

> Hi
>
> I am facing a peculiar issue on one of the slave nodes of our cluster. I
> have a spark cluster with 40+ nodes. On one of the nodes, all tasks fail
> with exit code 0.
>
> ExecutorLostFailure (executor e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S76
> exited caused by one of the running tasks) Reason: Unknown executor exit
> code (0)
>
>
> I cannot seem to find anything in mesos-slave.logs, and there is nothing
> being written to stdout/stderr. Are there any debugging utilities that I
> can use to debug what can be going wrong on that particular slave?
>
> I tried running the following but got stuck at:
>
>
> /mesos-containerizer launch
> --command='{"environment":{},"shell":true,"value":"ls
> -ltr"}' --directory=/var/tmp/mesos/slaves/e6745c67-32e8-41ad-
> b6eb-8fa4d2539da7-S77/frameworks/e6745c67-32e8-41ad-
> b6eb-8fa4d2539da7-0312/executors/e6745c67-32e8-41ad-
> b6eb-8fa4d2539da7-S77/runs/45aa784c-f485-46a6-aeb8-997e82b80c4f
> --help=false --pipe_read=0 --pipe_write=0 --user=smi
>
> Failed to synchronize with slave (it's probably exited)
>
>
> Would appreciate pointers to any debugging methods/documentation to
> diagnose these kinds of problems.
>
> Regards
> Sumit Chawla
>
>
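To make the pipe handshake described in the reply above more concrete, here is a minimal sketch in Python. It is not the Mesos implementation: the function names `wait_for_agent` and `launch` and the returned strings are made up for illustration, and both sides run in one process for simplicity. It only shows the general pattern: the executor side blocks on a read, and a closed write end (the agent going away before signalling) shows up as end-of-file.

```python
import os


def wait_for_agent(read_fd: int) -> str:
    """Executor side (hypothetical): block until one byte arrives or the pipe closes."""
    data = os.read(read_fd, 1)  # blocks until data is available or EOF
    os.close(read_fd)
    if not data:
        # EOF: the write end was closed without the "go ahead" byte.
        return "Failed to synchronize with agent (it's probably exited)"
    return "proceed"


def launch(agent_survives: bool) -> str:
    """Agent side (hypothetical): create the pipe and optionally signal the executor."""
    read_fd, write_fd = os.pipe()
    if agent_survives:
        os.write(write_fd, b"\0")  # the value that tells the executor to proceed
    os.close(write_fd)  # without a prior write, the reader sees EOF
    return wait_for_agent(read_fd)
```

Calling `launch(True)` follows the normal path, while `launch(False)` imitates an agent that disappears before writing to the pipe.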
Isolating metrics collection from master/agent slowness
Hi,

I'd like to start a conversation about the behavior of the metrics collection endpoints (especially `/metrics/snapshot`).

Right now, these endpoints are served from the same libprocess as the master/agent, and they make extensive use of `gauge` to chain further callbacks that collect various metrics (the DRF allocator in particular adds several metrics per role).

This causes a problem when the system is under load: when the master/allocator libprocess becomes busy, stats collection itself becomes slow too. Flying dark when the system is under load is especially painful for an operator.

I would like to explore ways to isolate metrics collection so that it keeps working even when the master is slow. A couple of ideas:

- (short term) reduce usage of gauge and prefer counter (since I believe they are less affected);
- an alternative implementation of `gauge` which does not contend on the master/allocator's event queue;
- serving metrics collection from a different libprocess routine.

Any thoughts on these?

--
Cheers,
Zhitao Li
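As a rough illustration of the concern raised in the message above, here is a small Python sketch. It does not use libprocess or any Mesos API: `Actor`, `Counter`, and `snapshot` are hypothetical stand-ins. The assumption is that a gauge is evaluated by dispatching a callback onto the actor's event queue, while a counter is a plain value that can be read directly, so only the gauge stalls when the queue is backed up.

```python
import queue
import threading
from concurrent.futures import Future, TimeoutError as FutureTimeout


class Actor:
    """Hypothetical stand-in for an actor: one thread draining a FIFO event queue."""

    def __init__(self):
        self._events = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn, fut = self._events.get()
            fut.set_result(fn())  # events are handled strictly one at a time

    def dispatch(self, fn) -> Future:
        fut = Future()
        self._events.put((fn, fut))
        return fut


class Counter:
    """A counter that can be read without going through the actor's queue."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self._value += 1

    def value(self) -> int:
        with self._lock:
            return self._value


def snapshot(actor: Actor, gauge_fn, counter: Counter, timeout: float) -> dict:
    """Collect one gauge (queued behind other events) and one counter (read directly)."""
    fut = actor.dispatch(gauge_fn)
    try:
        gauge = fut.result(timeout=timeout)
    except FutureTimeout:
        gauge = None  # the actor was too busy to evaluate the gauge in time
    return {"gauge": gauge, "counter": counter.value()}
```

If the actor is stuck on a long-running event, `snapshot` still returns the counter but reports `None` for the gauge once the timeout expires. That is the "flying dark" situation the message describes, in miniature.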
[GitHub] mesos pull request #214: Update vendored protobuf tar.gz to 3.3.0.
GitHub user zhitaoli opened a pull request:

    https://github.com/apache/mesos/pull/214

    Update vendored protobuf tar.gz to 3.3.0.

    The content of `3rdparty/protobuf-3.2.0.tar.gz` is generated by:

    - On a Mac OS, download and untar protobuf v3.3.0 source at https://github.com/google/protobuf/archive/v3.3.0.tar.gz;
    - Run `./autogen.sh`;
    - Recompress and tar by `tar -czvf`.

    Review: https://reviews.apache.org/r/58358

    This is submitted because reviewboard does not work well with binary patch. See above review for other changes.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/zhitaoli/mesos public/zhitao/protobuf_330_binary_only

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/mesos/pull/214.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #214

commit 000a1e1b75da1897edfbd29c98f99319d991f01a
Author: Zhitao Li
Date: 2017-04-06T20:51:57Z

    Update vendored protobuf tar.gz to 3.3.0.
[RESULT][VOTE] Release Apache Mesos 1.1.2 (rc2)
Hi all,

The vote for Mesos 1.1.2 (rc2) has passed with the following votes.

+1 (Binding)
--
Vinod Kone
Till Tönshoff
Alex Rukletsov

There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.1.2

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.2

The mesos-1.1.2.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect this release.

Thanks,
Alex & Till
Coverity Scan: Analysis completed for Mesos
Your request for analysis of Mesos has been completed successfully. The results are available at
https://u2389337.ct.sendgrid.net/wf/click?upn=08onrYu34A-2BWcWUl-2F-2BfV0V05UPxvVjWch-2Bd2MGckcRZ-2B0hUmbDL5L44V5w491gwG_yCAaqzzx-2F-2BA2mRMpk03t3x9hscHw355FKzcsrEtTtpGcYsYiVQewL6zikCpjaQhDLJl645MNfyngLpGyMM3aC2bKqZwhUujdTuGNtGc8ry1AFRmgblQmqEH7pVpFwX7zcdu1qD7I2NFTL8FwlVwucaHnYXfOtkkFWLwZU-2FmJMki-2FUWxxPplTws1dUN-2Fj0Uf0XFSltQRRjhKnxo53xYkznsFUngVyJBO-2B-2BtMdr9qpHxY-3D

Analysis Summary:
   New defects found: 0
   Defects eliminated: 0