Re: Mesos Executor Failing

2017-05-19 Thread Joseph Wu
What version of Mesos are you using?  (Just based on the word "slave" in
that error message, I'm guessing 0.28 or older.)

The "Failed to synchronize" error can occur while the agent is launching
the executor.  During the launch, the agent creates a pipe to the executor
subprocess, and the executor makes a blocking read on this pipe.  The agent
then writes a value to the pipe to signal the executor to proceed.  If the
agent restarts or the pipe breaks at this point in the launch, the read
fails and you'll see this error message.
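
To make the handshake concrete, here is a minimal sketch in Python.  It is
not the Mesos implementation: the names `executor_wait` and `launch` are
made up for illustration, and a thread stands in for the executor
subprocess.  It only shows why a closed write end surfaces as a failed
synchronization on the reading side.

```python
import os
import threading

ERROR = "Failed to synchronize with slave (it's probably exited)"


def executor_wait(read_fd):
    """Block until the launcher writes one byte; True means 'proceed'."""
    data = os.read(read_fd, 1)   # blocks until data arrives or EOF
    os.close(read_fd)
    return len(data) == 1        # an empty read means the write end closed


def launch(agent_survives):
    """Hypothetical launcher: create the pipe, start the reader, signal it."""
    read_fd, write_fd = os.pipe()
    result = {}
    reader = threading.Thread(
        target=lambda: result.update(ok=executor_wait(read_fd)))
    reader.start()
    if agent_survives:
        os.write(write_fd, b"\0")  # the "go ahead" signal
    os.close(write_fd)             # without a prior write, the reader sees EOF
    reader.join()
    return "proceeding" if result["ok"] else ERROR
```

Calling `launch(True)` models the normal path; `launch(False)` models the
agent going away before it writes, which leaves the reader with an empty
read.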

On Thu, May 18, 2017 at 9:44 PM, Chawla, Sumit wrote:

> Hi
>
> I am facing a peculiar issue on one of the slave nodes of our cluster.  I
> have a spark cluster with 40+ nodes.  On one of the nodes, all tasks fail
> with exit code 0.
>
> ExecutorLostFailure (executor e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S76
> exited caused by one of the running tasks) Reason: Unknown executor exit
> code (0)
>
>
> I cannot seem to find anything in mesos-slave.logs, and there is nothing
> being written to stdout/stderr.  Are there any debugging utilities that I
> can use to debug what might be going wrong on that particular slave?
>
> I tried running the following but got stuck at:
>
>
> /mesos-containerizer launch \
>   --command='{"environment":{},"shell":true,"value":"ls -ltr"}' \
>   --directory=/var/tmp/mesos/slaves/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/frameworks/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-0312/executors/e6745c67-32e8-41ad-b6eb-8fa4d2539da7-S77/runs/45aa784c-f485-46a6-aeb8-997e82b80c4f \
>   --help=false --pipe_read=0 --pipe_write=0 --user=smi
>
> Failed to synchronize with slave (it's probably exited)
>
>
> I would appreciate pointers to any debugging methods/documentation to
> diagnose these kinds of problems.
>
> Regards
> Sumit Chawla
>
>


Isolating metrics collection from master/agent slowness

2017-05-19 Thread Zhitao Li
Hi,

I'd like to start a conversation about the behavior of the metrics
collection endpoints (especially `/metrics/snapshot`).

Right now, these endpoints are served from the same libprocess as the
master/agent, and they make extensive use of `gauge` to chain further
callbacks to collect various metrics (the DRF allocator specifically adds
several metrics per role).

This causes a problem when the system is under load: when the
master/allocator libprocess becomes busy, stats collection itself becomes
slow too. Flying dark when the system is under load is especially painful
for an operator.

I would like to explore ways to keep metrics collection responsive even
when the master is slow. A couple of ideas:

- (short term) reduce usage of gauge and prefer counter (since I believe
they are less affected);
- alternative implementation of `gauge` which does not contend on
master/allocator's event queue;
- serving metrics collection from a different libprocess routine.
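
To illustrate the difference the first two ideas rely on, here is a toy
model in Python. It is not libprocess code: `Actor`, `Counter`, `Gauge` and
`snapshot` are hypothetical stand-ins, and it assumes (as described above)
that a gauge is evaluated by a callback queued on the owning process, while
a counter can be read directly.

```python
import queue
import threading
from concurrent.futures import Future, TimeoutError


class Actor:
    """Toy single-threaded event loop standing in for a busy process."""

    def __init__(self):
        self._events = queue.Queue()
        threading.Thread(target=self._run, daemon=True).start()

    def _run(self):
        while True:
            fn, fut = self._events.get()
            fut.set_result(fn())      # events run strictly one at a time

    def dispatch(self, fn):
        fut = Future()
        self._events.put((fn, fut))
        return fut


class Counter:
    """Value read directly under a lock; never touches the event queue."""

    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def increment(self):
        with self._lock:
            self._value += 1

    def value(self):
        with self._lock:
            return self._value


class Gauge:
    """Value computed by a callback queued behind the actor's other events."""

    def __init__(self, actor, fn):
        self._actor, self._fn = actor, fn

    def value(self, timeout):
        return self._actor.dispatch(self._fn).result(timeout=timeout)


def snapshot(metrics, timeout):
    """Collect all metrics; a gauge that misses the timeout reports None."""
    out = {}
    for name, metric in metrics.items():
        try:
            if isinstance(metric, Gauge):
                out[name] = metric.value(timeout)
            else:
                out[name] = metric.value()
        except TimeoutError:
            out[name] = None
    return out
```

In this model, if a long-running event is dispatched to the actor before
`snapshot` is called with a short timeout, the gauge entry comes back as
`None` while the counter entry is still reported.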

Any thoughts on these?

-- 
Cheers,

Zhitao Li


[GitHub] mesos pull request #214: Update vendored protobuf tar.gz to 3.3.0.

2017-05-19 Thread zhitaoli
GitHub user zhitaoli opened a pull request:

https://github.com/apache/mesos/pull/214

Update vendored protobuf tar.gz to 3.3.0.

The content of `3rdparty/protobuf-3.2.0.tar.gz` is generated by:
- On a Mac OS, download and untar protobuf v3.3.0 source at
  https://github.com/google/protobuf/archive/v3.3.0.tar.gz;
- Run `./autogen.sh`;
- Recompress and tar by `tar -czvf`.
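
For readers who want to script the last step, a rough Python equivalent of
`tar -czvf` is sketched below. The function name `repack` and its arguments
are hypothetical; it assumes the source has already been downloaded,
untarred, and had `./autogen.sh` run in it.

```python
import os
import tarfile


def repack(source_dir, archive_path):
    """Write a gzip-compressed tarball of source_dir to archive_path.

    The archive keeps source_dir's own name as the top-level directory.
    """
    top = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w:gz") as tar:
        tar.add(source_dir, arcname=top)
    return archive_path
```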

Review: https://reviews.apache.org/r/58358

This is submitted as a pull request because Review Board does not work well
with binary patches. See the review above for the other changes.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/zhitaoli/mesos public/zhitao/protobuf_330_binary_only

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mesos/pull/214.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #214


commit 000a1e1b75da1897edfbd29c98f99319d991f01a
Author: Zhitao Li 
Date:   2017-04-06T20:51:57Z

Update vendored protobuf tar.gz to 3.3.0.

The content of `3rdparty/protobuf-3.2.0.tar.gz` is generated by:
- On a Mac OS, download and untar protobuf v3.3.0 source at
  https://github.com/google/protobuf/archive/v3.3.0.tar.gz;
- Run `./autogen.sh`;
- Recompress and tar by `tar -czvf`.

Review: https://reviews.apache.org/r/58358






[RESULT][VOTE] Release Apache Mesos 1.1.2 (rc2)

2017-05-19 Thread Alex Rukletsov
Hi all,

The vote for Mesos 1.1.2 (rc2) has passed with the following votes.

+1 (Binding)
--
Vinod Kone
Till Tönshoff
Alex Rukletsov

There were no 0 or -1 votes.

Please find the release at:
https://dist.apache.org/repos/dist/release/mesos/1.1.2

It is recommended to use a mirror to download the release:
http://www.apache.org/dyn/closer.cgi

The CHANGELOG for the release is available at:
https://git-wip-us.apache.org/repos/asf?p=mesos.git;a=blob_plain;f=CHANGELOG;hb=1.1.2

The mesos-1.1.2.jar has been released to:
https://repository.apache.org

The website (http://mesos.apache.org) will be updated shortly to reflect
this release.

Thanks,
Alex & Till


Coverity Scan: Analysis completed for Mesos

2017-05-19 Thread scan-admin

Your request for analysis of Mesos has been completed successfully.
The results are available at 
https://u2389337.ct.sendgrid.net/wf/click?upn=08onrYu34A-2BWcWUl-2F-2BfV0V05UPxvVjWch-2Bd2MGckcRZ-2B0hUmbDL5L44V5w491gwG_yCAaqzzx-2F-2BA2mRMpk03t3x9hscHw355FKzcsrEtTtpGcYsYiVQewL6zikCpjaQhDLJl645MNfyngLpGyMM3aC2bKqZwhUujdTuGNtGc8ry1AFRmgblQmqEH7pVpFwX7zcdu1qD7I2NFTL8FwlVwucaHnYXfOtkkFWLwZU-2FmJMki-2FUWxxPplTws1dUN-2Fj0Uf0XFSltQRRjhKnxo53xYkznsFUngVyJBO-2B-2BtMdr9qpHxY-3D

Analysis Summary:
   New defects found: 0
   Defects eliminated: 0