Tasks not getting killed

2018-04-03 Thread Venkat Morampudi
Hi,

We have a framework that launches Spark jobs on our Mesos cluster. We are
currently having an issue where Spark jobs get stuck due to some
timeout issue. We have cancel functionality that sends a task_kill
message to the master. When a job gets stuck, the Spark driver task is not
killed even though the agent on the node where the driver is running gets the
kill request. Is there any timeout that I can set so that the Mesos agent can
force kill the task in this scenario? Really appreciate your help.

Thanks,
Venkat


Log entry from agent logs:

I0404 03:44:47.367276 55066 slave.cpp:2035] Asked to kill task 79668.0.0 of framework 35e600c2-6f43-402c-856f-9084c0040187-002
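
For reference, the Mesos scheduler HTTP API allows a KILL call to carry a `kill_policy` whose `grace_period` overrides any kill policy previously set for the task. The sketch below only builds such a request body. The helper name `build_kill_call` and the 5-second grace period are illustrative assumptions, and the IDs are copied from the log line above. Whether this unblocks a stuck Spark driver depends on the executor running the task, which the thread does not state.

```python
import json

NANOS_PER_SECOND = 1_000_000_000


def build_kill_call(framework_id, task_id, grace_period_seconds, agent_id=None):
    """Build the JSON body of a scheduler-API KILL call with a kill policy.

    Hypothetical helper: only the field names (framework_id, type, kill,
    task_id, agent_id, kill_policy, grace_period, nanoseconds) come from
    the Mesos scheduler API; everything else is illustrative.
    """
    kill = {
        "task_id": {"value": task_id},
        "kill_policy": {
            # DurationInfo is expressed in nanoseconds.
            "grace_period": {
                "nanoseconds": int(grace_period_seconds * NANOS_PER_SECOND)
            }
        },
    }
    if agent_id is not None:
        kill["agent_id"] = {"value": agent_id}
    return {
        "framework_id": {"value": framework_id},
        "type": "KILL",
        "kill": kill,
    }


if __name__ == "__main__":
    # IDs taken from the agent log entry; the grace period is an assumed value.
    call = build_kill_call(
        "35e600c2-6f43-402c-856f-9084c0040187-002", "79668.0.0", 5
    )
    print(json.dumps(call, indent=2))
```

The body would be POSTed to the master's `/api/v1/scheduler` endpoint by the framework's existing scheduler connection; that part is omitted here because it depends on how the framework is subscribed.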



[GitHub] mesos issue #280: Fix wrong example of executor message.

2018-04-03 Thread carlonelong
Github user carlonelong commented on the issue:

https://github.com/apache/mesos/pull/280
  
Sorry, my bad. Check again pls. @xujyan 


---


Re: This Month in Mesos - March 2018

2018-04-03 Thread Judith Malnick
Thanks for the update!

On Sat, Mar 31, 2018 at 7:44 AM Gilbert Song wrote:

> Thanks for the awesome update, Greg!
>
> - Gilbert
>
> On Fri, Mar 30, 2018 at 6:20 PM, Vinod Kone wrote:
>
> > Thanks for the update Greg!
> >
> > Sent from my phone
> >
> > > On Mar 30, 2018, at 3:08 PM, Greg Mann wrote:
> > >
> > > Oh hai there Apache Mesos Community!
> > >
> > > > Back again with your monthly update on current events in the Mesosverse:
> > >
> > >
> > > *Working Groups*
> > >
> > > > Below you'll find a brief summary of the group meetings from this past
> > > > month, as well as some info about related work that's been happening in
> > > > the project. Working group meetings can be found on the Mesos community
> > > > calendar, and you should feel free to add agenda items beforehand!
> > >
> > >
> > > *API Working Group*
> > >
> > > [Agenda Doc
> > >  > IBWw1f_Ler6fLM/edit#>
> > > ]
> > >
> > > Next Meeting: April 3 @ 11am PST
> > >
> > > > In March we held the first two meetings of the new API working group!
> > > > This has brought about a revival of our perennial discussion on the
> > > > preferred Mesos release cadence; you can expect an updated release policy
> > > > in our documentation shortly. It's looking like the new policy will be in
> > > > line with what we have been doing in practice for the last few releases,
> > > > so no big changes there.
> > >
> > >
> > > > Zhitao also presented his ongoing work on new operations which will
> > > > allow the growing/shrinking of persistent volumes. You can find his design
> > > > doc
> > > here
> > >  > 6EOaPzPtwYNVQUQ/edit>
> > > .
> > >
> > >
> > > *Containerization Working Group*
> > >
> > > [Agenda Doc
> > >  > c2nHR89skFXSpU/edit>
> > > ]
> > >
> > > Next meeting: April 5 @ 9am PST
> > >
> > > Two big items in the containerization space this month:
> > >
> > >
> > > >   - Improvements to the Docker containerizer/executor to more gracefully
> > > >     handle bugs in the Docker daemon: MESOS-8572
> > > >   - Configurable network namespaces for nested containers: MESOS-8534
> > >
> > > *Community Working Group*
> > >
> > > [Agenda Doc
> > >  > 3JG4f3qg-N5En-4ubg/edit#>
> > > ]
> > >
> > > Next Meeting: April 9 @ 10:30am PST
> > >
> > > > Community working group had a preliminary discussion about the next
> > > > quarterly doc-a-thon, and discussed the possibility of spinning up a new
> > > > Releases Working Group. We also discussed plans for the next MesosCon, and
> > > > how we may want to evolve that event going forward.
> > >
> > >
> > > *Performance Working Group*
> > >
> > > [Agenda Doc
> > >  > 4bodagrlNGCuQU/edit>
> > > ]
> > >
> > > Next meeting: April 18 @ 10am PST
> > >
> > > We now have a performance dashboard
> > >  > rapidView=238=planning>
> > > which lets you view tickets in ASF JIRA which have been marked as
> > > performance-related - take a look!
> > >
> > >
> > > > Some additional copy elimination patches have been merged, with more yet
> > > > to come. The group also discussed the near-term performance roadmap, which
> > > > includes optimization of authentication/authorization, master state
> > > > computation, and the libprocess HTTP code; see the agenda document for
> > > > more details.
> > >
> > >
> > >
> > > Until next time,
> > > -Greg
> >
>
-- 
Judith Malnick
Community Manager
310-709-1517


[GitHub] mesos issue #279: WIP: Remove unknown unreachable tasks when agent re-regist...

2018-04-03 Thread m9a
Github user m9a commented on the issue:

https://github.com/apache/mesos/pull/279
  
@Gilbert88, @xujyan Sure, I will check about the rbt issue on slack.


---


[GitHub] mesos issue #279: WIP: Remove unknown unreachable tasks when agent re-regist...

2018-04-03 Thread m9a
Github user m9a commented on the issue:

https://github.com/apache/mesos/pull/279
  
The JIRA for this PR: https://issues.apache.org/jira/browse/MESOS-8750
Since @xujyan is shepherding it I intended to set him as the reviewer but 
it doesn't look like I can change those fields on the PR.


---


[GitHub] mesos issue #280: Fix wrong example of executor message.

2018-04-03 Thread xujyan
Github user xujyan commented on the issue:

https://github.com/apache/mesos/pull/280
  
Thanks @carlonelong for the PR. The PR includes some extraneous commits
(see above); could you clean it up?


---


[API WG] Meeting today

2018-04-03 Thread Greg Mann
Hi all,
The API working group will be meeting today at 11am PST. We'll be
discussing HTTP return codes in Mesos [1]. If you have any other items for
discussion, add them to the agenda! [2]

Cheers,
Greg


[1] https://issues.apache.org/jira/browse/MESOS-7697
[2] https://docs.google.com/document/d/1JrF7pA6gcBZ6iyeP5YgDG62ifn0cZIBWw1f_Ler6fLM/edit#heading=h.jvt42epwk1q7e


Re: Adding a `FLAKY` label to flaky unit tests

2018-04-03 Thread Benno Evers
Hi,

> 1) What would be the criteria for removing `FLAKY` label from a test? Who
> will take care of removing this label?
The process would be exactly the same as for removing the `DISABLED` label
today, i.e. whoever feels confident that they fixed the test can remove the
label.

> 2) Do we expect that most of our tests will eventually get `FLAKY` label?
I wouldn't expect that. If we get to the point where the majority of tests
are flaky, I assume we will be able to identify some systematic causes of
flaky tests that we can fix.

> Would the CI run FLAKY tests or will it filter it out?
That's the beauty of this proposal: every CI operator can decide for
themselves what the better solution for their needs will be. For us,
ideally I would like to run them 10 times and report an error only if a test
failed more than once, but I'm not sure how hard this would be to implement
in Jenkins. If it turns out to be too hard, I would suggest disabling them
so we can get back to a stable, green state and have failures be meaningful
again.
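
As a rough illustration of the "run them 10 times and report an error only if it failed more than once" idea, here is a minimal sketch. It assumes the `FLAKY` label ends up in the gtest test name (as `ROOT` and `DISABLED` do today), so that `--gtest_filter=*FLAKY*` selects those tests. The function names and the test binary path are hypothetical, and the sketch counts failures per run of the whole filtered set rather than per individual test.

```python
import subprocess


def should_report_failure(outcomes, max_failures=1):
    """Decide whether repeated runs count as a real failure.

    outcomes: list of booleans, True meaning that run passed.
    Returns True only if more than max_failures runs failed.
    """
    failures = sum(1 for passed in outcomes if not passed)
    return failures > max_failures


def run_flaky_tests(test_binary, runs=10, max_failures=1):
    """Run the FLAKY-labelled tests several times and apply the policy.

    test_binary is the path to a gtest executable (an assumption here).
    A gtest binary exits non-zero if any selected test failed.
    """
    outcomes = []
    for _ in range(runs):
        result = subprocess.run([test_binary, "--gtest_filter=*FLAKY*"])
        outcomes.append(result.returncode == 0)
    return should_report_failure(outcomes, max_failures)
```

A CI job could call `run_flaky_tests` with the path to its test binary and fail the build only when it returns `True`. Tracking failures per test would need the gtest XML or JSON output instead of the exit code.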

> What are the other reasons tests are DISABLED today?
I don't have an exhaustive list, but at least some were disabled as a
result of API changes, with the intention of fixing them later, e.g.
MESOS-8711

Best regards,

On Thu, Mar 29, 2018 at 9:22 PM, Vinod Kone wrote:

> Would the CI run FLAKY tests or will it filter it out? I'm assuming it
> still does based on your observation above.
>
> What are the other reasons tests are DISABLED today?
>
> On Thu, Mar 29, 2018 at 10:35 AM, Meng Zhu wrote:
>
> > +1, the advantages are appealing.
> >
> > Though I am afraid that this will probably reduce the incentive to fix
> > flaky tests.
> >
> > -Meng
> >
> > On Thu, Mar 29, 2018 at 9:45 AM, Benno Evers wrote:
> >
> > > Hi all,
> > >
> > > > if you're regularly running Mesos unit tests, e.g. because you've set up
> > > > a CI system, you probably noticed that there is a lot of noise in the
> > > > results due to flaky tests.
> > >
> > > > As a measure to ease the pain, what do you think about adding a `FLAKY`
> > > > label to known flaky unit tests, similar to how we have `ROOT`,
> > > > `INTERNET`, `DISABLED`, etc. right now?
> > >
> > > The advantages, in my opinion, would be:
> > > >  - Looking at test results, it would be immediately visible whether a
> > > >    test failure was known flaky or not without going to JIRA
> > > >  - People who want to reduce noise can disable all known flaky tests by a
> > > >    simple gtest filter
> > > >  - People who want to can still run the flaky tests easier than if they
> > > >    get disabled outright
> > > >  - With a little bit of scripting, it would be possible to add logic like
> > > >    "for flaky tests, run them 10 times and only report a failure if more
> > > >    than x% of the runs fail."
> > >
> > > What do you think?
> > >
> > > Best regards,
> > > --
> > > Benno Evers
> > > Software Engineer, Mesosphere
> > >
> >
>



-- 
Benno Evers
Software Engineer, Mesosphere
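
The "simple gtest filter" mentioned in the original proposal could look roughly like the sketch below. It assumes, as with the existing labels, that `FLAKY` would be part of the test name. The helper `gtest_filter` is hypothetical; the only real interface used is gtest's `--gtest_filter` syntax, where positive patterns are separated by `:` and negative patterns follow a `-`.

```python
def gtest_filter(include=(), exclude=()):
    """Build a --gtest_filter argument from include and exclude patterns.

    With no include patterns, every test is selected ("*") before the
    exclude patterns are applied.
    """
    positive = ":".join(include) if include else "*"
    if not exclude:
        return "--gtest_filter=" + positive
    return "--gtest_filter=" + positive + "-" + ":".join(exclude)


# Skip all known flaky tests:
skip_flaky = gtest_filter(exclude=["*FLAKY*"])

# Run only the known flaky tests:
only_flaky = gtest_filter(include=["*FLAKY*"])
```

Here `skip_flaky` is `--gtest_filter=*-*FLAKY*` and `only_flaky` is `--gtest_filter=*FLAKY*`; either string can be passed to the test binary as a command-line argument.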


[GitHub] mesos pull request #280: Fix wrong example of executor message.

2018-04-03 Thread carlonelong
GitHub user carlonelong opened a pull request:

https://github.com/apache/mesos/pull/280

 Fix wrong example of executor message.

The example of executor message in executor-http-api.md is wrong. This 
patch fixes this.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/carlonelong/mesos master

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/mesos/pull/280.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #280


commit d37da700032990e525c50cb469325e012b027e0a
Author: longfei 
Date:   2017-12-27T13:29:16Z

Add Bytedance into powered-by-mesos

Change-Id: I13b7ac228462e3cd0ab59b40f37599b255b5c762

commit b9c519c152a1a5ed00e3fd5aa0451ac5d6cae74f
Author: Carlone 
Date:   2018-04-03T09:34:44Z

Merge pull request #1 from apache/master

Update

commit 0e5e85ece7a4c27a34ceaf372297112fc8594286
Author: longfei 
Date:   2018-04-03T09:39:45Z

Fix wrong example for executor message.

Change-Id: I6cef034b232228951aa93672a6342557444f81dc




---