[jira] [Commented] (MESOS-2695) Add master flag to enable/disable oversubscription

2015-09-05 Thread Klaus Ma (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-2695?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731919#comment-14731919
 ] 

Klaus Ma commented on MESOS-2695:
-

One case is disabling the oversubscription feature because of some unknown issues
in the DC; it's hard for operators to restart all slaves to reset
{{--resource_estimator}}.

> Add master flag to enable/disable oversubscription
> --
>
> Key: MESOS-2695
> URL: https://issues.apache.org/jira/browse/MESOS-2695
> Project: Mesos
>  Issue Type: Task
>Reporter: Vinod Kone
>  Labels: twitter
>
> This flag lets an operator control cluster level oversubscription. 
> The master should send revocable offers to framework if this flag is enabled 
> and the framework opts in to receive them.
> Master should ignore revocable resources from slaves if the flag is disabled.
> Need tests for all these scenarios.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (MESOS-3022) export additional metrics from scheduler driver

2015-09-05 Thread Yong Qiao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731932#comment-14731932
 ] 

Yong Qiao Wang commented on MESOS-3022:
---

[~klausma1982] and [~haosd...@gmail.com], thank you so much for your important
comments.

So this ticket only needs to add a count of messages by message type; the
version information can be retrieved from the HTTP endpoint once MESOS-1841 is
applied.
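The per-message-type counters discussed above can be sketched in miniature (a hypothetical illustration, not the actual Mesos C++ implementation; the `scheduler/messages/<type>` metric naming is an assumption, not the scheme Mesos uses):

```python
from collections import Counter

# Hypothetical sketch: count scheduler driver messages by type and
# render them the way /metrics/snapshot renders metrics. The
# 'scheduler/messages/<type>' key format is an assumption.
message_counts = Counter()

def on_message(message_type):
    """Bump the counter for one incoming message."""
    message_counts[message_type] += 1

def metrics_snapshot():
    """Return the counters as a flat metrics dictionary."""
    return {"scheduler/messages/" + t: n for t, n in message_counts.items()}

for m in ["REGISTERED", "STATUS_UPDATE", "STATUS_UPDATE"]:
    on_message(m)
```

A framework operator could then watch, say, the reconciliation-related counter to spot a feature-specific problem, as the original description suggests.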

> export additional metrics from scheduler driver
> ---
>
> Key: MESOS-3022
> URL: https://issues.apache.org/jira/browse/MESOS-3022
> Project: Mesos
>  Issue Type: Improvement
>Reporter: David Robinson
>Assignee: Yong Qiao Wang
>Priority: Minor
>
> The scheduler driver only exports the metrics below, but ideally it would 
> export its version and a count of messages by message type.
> {code}
> $ curl -s localhost:20902/metrics/snapshot | python -m json.tool
> {
> "scheduler/event_queue_dispatches": 0,
> "scheduler/event_queue_messages": 0,
> "system/cpus_total": 24,
> "system/load_15min": 0.49,
> "system/load_1min": 0.36,
> "system/load_5min": 0.46,
> "system/mem_free_bytes": 269713408,
> "system/mem_total_bytes": 33529266176
> }
> {code}
> The scheduler driver version could be used during troubleshooting to identify 
> frameworks that are using an old, potentially backwards incompatible, 
> scheduler driver (eg, a framework hasn't been restarted after a Mesos deploy, 
> so it still links against an old incompatible libmesos).
> A count of messages by message type would help identify a problem w/ a 
> specific feature, eg task reconciliation.





[jira] [Commented] (MESOS-3372) Allow mesos agent attributes to be tokenized in taskInfo

2015-09-05 Thread Chad Heuschober (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732085#comment-14732085
 ] 

Chad Heuschober commented on MESOS-3372:


Happy to do so; we have a (Docker) application workload that needs to
understand its resource locations to make the best possible decisions about
where to look for the services or data it consumes. Our networking model is
Calico ip-per-container.

The workload wants to prioritize its service discovery to use a service on the
same node, followed by rack, then rack pair, then across the DC. Without the
ability to inject mesos-agent-specific attributes into the taskInfo, we have to
jump out of the application and either ask the framework to discover or inject
this information for us, or use a service discovery agent (eg, mesos-dns or
mesos-consul) that is, itself, dependent upon parsing what is exposed in
`state.json`. We looked at anycast as well, but it added too much complexity to
the deployment at this time.

While any of the above are options, a framework is a big lift just to give an
application its rack awareness, and service discovery agents are in a difficult
position knowing what should be exposed. Since they don't necessarily know what
each workload needs, it feels like walking down the road of building a
Cartesian product of each application and slave attribute, and it doesn't do
anyone much good to just fork those projects to add the discovery information
we want.

It is, in my mind at least, more elegant to allow slave-attributes to be 
templated into the taskInfo so they can be reused in environment variables or 
even task configuration scenarios.

I've seen (and please forgive me for not finding it right now) a JIRA issue
that essentially requested that some agent host information be auto-discovered,
like CPU type, network interface, etc. While it would absolutely be neat to have
that auto-discovered, even statically defining such attributes in the agent
config and allowing them to be templated into TaskInfo can enable applications
to be smarter about using the resources they're on.
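The templating described here could be sketched as follows (purely illustrative Python, not a proposed implementation; the @MESOS.AGENT.ATTRS.<NAME>@ token syntax comes from the issue's own example, and matching attribute names case-insensitively is an assumption):

```python
import re

# Token syntax taken from the issue's example; everything else here
# (lowercase attribute keys, leaving unknown tokens intact) is assumed.
TOKEN = re.compile(r"@MESOS\.AGENT\.ATTRS\.([A-Z0-9_]+)@")

def expand_agent_tokens(text, attributes):
    """Replace @MESOS.AGENT.ATTRS.<NAME>@ tokens with slave attribute
    values; tokens with no matching attribute are left untouched."""
    def substitute(match):
        name = match.group(1).lower()  # assume attributes are keyed lowercase
        return attributes.get(name, match.group(0))
    return TOKEN.sub(substitute, text)

# e.g. an agent started with an attribute rack_id:DC131R57
env = expand_agent_tokens("RACK_@MESOS.AGENT.ATTRS.RACK_ID@",
                          {"rack_id": "DC131R57"})
```

Leaving unmatched tokens untouched (rather than failing the task) is one possible design choice; the issue itself does not specify the behavior.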

> Allow mesos agent attributes to be tokenized in taskInfo
> 
>
> Key: MESOS-3372
> URL: https://issues.apache.org/jira/browse/MESOS-3372
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chad Heuschober
>
> Some application workloads would benefit from having access to the
> statically defined slave attributes. By processing `taskInfo` on the slave,
> such tokens, as defined in `taskInfo`, could be replaced with the appropriate
> values to achieve objectives such as rack locality.
> Example:
> Before token replacement:
> {code}
> {
>   "discovery": {
> "environment": "RACK_@MESOS.AGENT.ATTRS.RACK_ID@"
>   }
> }
> {code}
> After token replacement:
> {code}
> {
>   "discovery": {
> "environment": "RACK_DC131R57"
>   }
> }
> {code}





[jira] [Assigned] (MESOS-3349) PersistentVolumeTest.AccessPersistentVolume fails when run as root.

2015-09-05 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent reassigned MESOS-3349:
---

Assignee: haosdent

> PersistentVolumeTest.AccessPersistentVolume fails when run as root.
> ---
>
> Key: MESOS-3349
> URL: https://issues.apache.org/jira/browse/MESOS-3349
> Project: Mesos
>  Issue Type: Bug
>  Components: test
> Environment: Ubuntu 14.04, CentOS 5
>Reporter: Benjamin Mahler
>Assignee: haosdent
>  Labels: flaky-test
>
> When running the tests as root:
> {noformat}
> [ RUN  ] PersistentVolumeTest.AccessPersistentVolume
> I0901 02:17:26.435140 39432 exec.cpp:133] Version: 0.25.0
> I0901 02:17:26.442129 39461 exec.cpp:207] Executor registered on slave 
> 20150901-021726-1828659978-52102-32604-S0
> Registered executor on hostname
> Starting task d8ff1f00-e720-4a61-b440-e111009dfdc3
> sh -c 'echo abc > path1/file'
> Forked command at 39484
> Command exited with status 0 (pid: 39484)
> ../../src/tests/persistent_volume_tests.cpp:579: Failure
> Value of: os::exists(path::join(directory, "path1"))
>   Actual: true
> Expected: false
> [  FAILED  ] PersistentVolumeTest.AccessPersistentVolume (777 ms)
> {noformat}
> FYI [~jieyu] [~mcypark]





[jira] [Assigned] (MESOS-2863) Command executor can send TASK_KILLED after TASK_FINISHED

2015-09-05 Thread Vaibhav Khanduja (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-2863?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Vaibhav Khanduja reassigned MESOS-2863:
---

Assignee: Vaibhav Khanduja

> Command executor can send TASK_KILLED after TASK_FINISHED
> -
>
> Key: MESOS-2863
> URL: https://issues.apache.org/jira/browse/MESOS-2863
> Project: Mesos
>  Issue Type: Bug
>Reporter: Vinod Kone
>Assignee: Vaibhav Khanduja
>  Labels: newbie++
>
> Observed this while doing some tests in our test cluster.
> If the command executor gets a shutdown() (e.g., framework unregistered) 
> after sending TASK_FINISHED but before exiting (there is a forced sleep), it 
> could send a TASK_KILLED update to the slave.
> Ideally the command executor should not send multiple terminal updates.





[jira] [Commented] (MESOS-3371) Implement process::subprocess on Windows

2015-09-05 Thread haosdent (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3371?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732009#comment-14732009
 ] 

haosdent commented on MESOS-3371:
-

Hi [~hausdorff], could I assign this task to myself? Or have you already
prepared a patch for this?

> Implement process::subprocess on Windows
> 
>
> Key: MESOS-3371
> URL: https://issues.apache.org/jira/browse/MESOS-3371
> Project: Mesos
>  Issue Type: Task
>  Components: libprocess
>Reporter: Alex Clemmer
>  Labels: libprocess, mesosphere
>
> From a discussion with mpark we (IIRC) concluded that even on Windows we call
> this a couple of times. We need to (1) confirm, and (2) do it.





[jira] [Updated] (MESOS-3136) COMMAND health checks with Marathon 0.10.0 are broken

2015-09-05 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent updated MESOS-3136:

Shepherd: Timothy Chen  (was: Adam B)

> COMMAND health checks with Marathon 0.10.0 are broken
> -
>
> Key: MESOS-3136
> URL: https://issues.apache.org/jira/browse/MESOS-3136
> Project: Mesos
>  Issue Type: Bug
>Affects Versions: 0.23.0
>Reporter: Dr. Stefan Schimanski
>Assignee: haosdent
>Priority: Critical
>
> When deploying Mesos 0.23rc4 with the latest Marathon 0.10.0 RC3, COMMAND
> health checks stop working. Rolling back to Mesos 0.22.1 fixes the problem.
> The containerizer is Docker.
> All packages are from the official Mesosphere Ubuntu 14.04 sources.
> The issue must be analyzed further.





[jira] [Assigned] (MESOS-3367) Mesos fetcher does not extract archives for URI with parameters

2015-09-05 Thread haosdent (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3367?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

haosdent reassigned MESOS-3367:
---

Assignee: haosdent

> Mesos fetcher does not extract archives for URI with parameters
> ---
>
> Key: MESOS-3367
> URL: https://issues.apache.org/jira/browse/MESOS-3367
> Project: Mesos
>  Issue Type: Bug
>  Components: fetcher
>Affects Versions: 0.22.1, 0.23.0
> Environment: DCOS 1.1
>Reporter: Renat Zubairov
>Assignee: haosdent
>Priority: Minor
>
> I'm deploying Marathon applications with sources served from S3. I'm using a
> signed URL to give only temporary access to the S3 resources, so the URL of
> the resource has some query parameters.
> So the URI is 'https://foo.com/file.tgz?hasi', and the fetcher stores it in a
> file with the name 'file.tgz?hasi'. It then thinks that the extension 'hasi'
> is not tgz, so extraction is skipped, despite the fact that the MIME type of
> the HTTP resource is 'application/x-tar'.
> Workaround - add an additional parameter like '=.tgz'
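The failure mode, and one possible fix, can be sketched like this (illustrative Python, not the fetcher's actual C++ code): derive the extension from the parsed URI path rather than from the raw URI, so the query string no longer hides the '.tgz'.

```python
from urllib.parse import urlparse
import posixpath

def archive_extension(uri):
    """Extension of the file named by a URI, ignoring the query string."""
    path = urlparse(uri).path           # '/file.tgz' -- '?hasi' is dropped
    return posixpath.splitext(path)[1]

# Splitting the raw URI reproduces the reported bug: the "extension"
# becomes '.tgz?hasi', so extraction would be skipped.
buggy = posixpath.splitext("https://foo.com/file.tgz?hasi".split("/")[-1])[1]
fixed = archive_extension("https://foo.com/file.tgz?hasi")
```

Honoring the Content-Type header ('application/x-tar'), as the description hints, would be an alternative fix, but inspecting the URI path alone already covers this case.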





[jira] [Commented] (MESOS-3022) export additional metrics from scheduler driver

2015-09-05 Thread Yong Qiao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14731964#comment-14731964
 ] 

Yong Qiao Wang commented on MESOS-3022:
---

Append the related review request: https://reviews.apache.org/r/38145/

> export additional metrics from scheduler driver
> ---
>
> Key: MESOS-3022
> URL: https://issues.apache.org/jira/browse/MESOS-3022
> Project: Mesos
>  Issue Type: Improvement
>Reporter: David Robinson
>Assignee: Yong Qiao Wang
>Priority: Minor
>





[jira] [Commented] (MESOS-3022) export additional metrics from scheduler driver

2015-09-05 Thread Yong Qiao Wang (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3022?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732139#comment-14732139
 ] 

Yong Qiao Wang commented on MESOS-3022:
---

Hi [~benjaminhindman], [~jieyu], and [~vinodkone], could you help review this
patch? Thanks in advance!

> export additional metrics from scheduler driver
> ---
>
> Key: MESOS-3022
> URL: https://issues.apache.org/jira/browse/MESOS-3022
> Project: Mesos
>  Issue Type: Improvement
>Reporter: David Robinson
>Assignee: Yong Qiao Wang
>Priority: Minor
>





[jira] [Assigned] (MESOS-3272) CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_FreezeNonFreezer is flaky.

2015-09-05 Thread Jian Qiu (JIRA)

 [ 
https://issues.apache.org/jira/browse/MESOS-3272?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jian Qiu reassigned MESOS-3272:
---

Assignee: Jian Qiu

> CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_FreezeNonFreezer is flaky.
> 
>
> Key: MESOS-3272
> URL: https://issues.apache.org/jira/browse/MESOS-3272
> Project: Mesos
>  Issue Type: Bug
>  Components: isolation
>Reporter: Paul Brett
>Assignee: Jian Qiu
> Attachments: build.log
>
>
> Test aborts when configured with python, libevent and SSL on Ubuntu12.
> [ RUN  ] 
> CgroupsAnyHierarchyWithCpuMemoryTest.ROOT_CGROUPS_FreezeNonFreezer
> *** Aborted at 1439667937 (unix time) try "date -d @1439667937" if you are 
> using GNU date ***
> PC: @ 0x7feba972a753 (unknown)
> *** SIGSEGV (@0x0) received by PID 4359 (TID 0x7febabf897c0) from PID 0; 
> stack trace: ***
> @ 0x7feba8f7dcb0 (unknown)
> @ 0x7feba972a753 (unknown)
> @ 0x7febaaa69328 process::dispatch<>()
> @ 0x7febaaa5e9a7 cgroups::freezer::thaw()
> @   0xba64ff 
> mesos::internal::tests::CgroupsAnyHierarchyWithCpuMemoryTest_ROOT_CGROUPS_FreezeNonFreezer_Test::TestBody()
> @   0xc199a3 
> testing::internal::HandleExceptionsInMethodIfSupported<>()
> @   0xc0f947 testing::Test::Run()
> @   0xc0f9ee testing::TestInfo::Run()
> @   0xc0faf5 testing::TestCase::Run()
> @   0xc0fda8 testing::internal::UnitTestImpl::RunAllTests()
> @   0xc10064 testing::UnitTest::Run()
> @   0x4b3273 main
> @ 0x7feba8bd176d (unknown)
> @   0x4bf1f1 (unknown)





[jira] [Commented] (MESOS-3157) only perform batch resource allocations

2015-09-05 Thread Guangya Liu (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3157?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732211#comment-14732211
 ] 

Guangya Liu commented on MESOS-3157:


[~bmahler], I have the same comments as [~jamespeach]. I think the current
batch() allocation can achieve almost the same behavior as your proposal, and
it is even simpler. Can you please comment further on the benefits of your
proposal versus the current batch allocation? Thanks.

> only perform batch resource allocations
> ---
>
> Key: MESOS-3157
> URL: https://issues.apache.org/jira/browse/MESOS-3157
> Project: Mesos
>  Issue Type: Bug
>  Components: allocation
>Reporter: James Peach
>Assignee: James Peach
>
> Our deployment environments have a lot of churn, with many short-lived
> frameworks that often revive offers. Running the allocator takes a long time
> (from seconds up to minutes).
> In this situation, event-triggered allocation causes the event queue in the 
> allocator process to get very long, and the allocator effectively becomes 
> unresponsive (eg. a revive offers message takes too long to come to the head 
> of the queue).
> We have been running a patch to remove all the event-triggered allocations 
> and only allocate from the batch task 
> {{HierarchicalAllocatorProcess::batch}}. This works great and really improves 
> responsiveness.





[jira] [Comment Edited] (MESOS-3372) Allow mesos agent attributes to be tokenized in taskInfo

2015-09-05 Thread Chad Heuschober (JIRA)

[ 
https://issues.apache.org/jira/browse/MESOS-3372?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14732085#comment-14732085
 ] 

Chad Heuschober edited comment on MESOS-3372 at 9/5/15 7:48 PM:


Happy to do so; we have a (Docker) application workload that needs to
understand its resource locations to make the best possible decisions about
where to look for the services or data it consumes. Our networking model is
Calico ip-per-container.

The workload wants to prioritize its service discovery to use a service on the
same node, followed by rack, then rack pair, then across the DC. Without the
ability to inject mesos-agent-specific attributes into the taskInfo, we have to
jump out of the application and either ask the framework to discover or inject
this information for us, or use a service discovery agent (eg, mesos-dns or
mesos-consul) that is, itself, dependent upon parsing what is exposed in
`state.json`. We looked at anycast as well, but it added too much complexity to
the deployment at this time.

While any of the above are options, a framework is a big lift just to give an
application its rack awareness, and service discovery agents are in a difficult
position knowing what should be exposed. Since they don't necessarily know what
each workload needs, it feels like walking down the road of building a
Cartesian product of each application and slave attribute, and it doesn't do
anyone much good to just fork those projects to add the discovery information
we want.

It is, in my mind at least, more elegant to allow slave-attributes to be 
templated into the taskInfo so they can be reused in environment variables or 
even task configuration scenarios.

I've seen (and please forgive me for not finding it right now) a JIRA issue
that essentially requested that some agent host information be auto-discovered,
like CPU type, network interface, etc. While it would absolutely be neat to have
that auto-discovered, even statically defining such attributes in the agent
config and allowing them to be templated into TaskInfo can enable applications
to be smarter about using the resources they're on.


was (Author: cheuschober):
Happy to do so; We have a (docker) application workload that needs to 
understand its resource locations to make the best possible decisions about 
where to look for services or data it consumes. Our networking model is calico 
ip-per-container.

The workload wants to prioritize its service discovery to use a service in the 
same node followed by rack, then rack pair, then across the dc. Without the 
ability to inject mesos-agent specific attributes into the taskInfo we have to 
jump out of the application and either ask the framework to discover or inject 
this information for us or use an service discovery agent (eg, mesos-dns or 
mesos-consul) that is, itself, dependent upon parsing what is exposed in 
`state.json`. We looked at anycast as well but it added too much complexity to 
the deployment at this time.

While any of the above are options, a framework is a big lift just to give an 
application its rack awareness and service discovery agents are in a difficult 
position knowing what should be exposed. Since they don't necessarily know what 
each workload needs, it feels like that's walking down the road of building a 
cartesian product of each application and slave attribute and it doesn't do 
anyone much good to just fork those projects to add the discovery information 
we want.

It is, in my mind at least, more elegant to allow slave-attributes to be 
templated into the taskInfo so they can be reused in environment variables or 
even task configuration scenarios.

I've seen (and please forgive me for not finding it right now) a JIRA issue 
that essentially requested some agent host information be auto discovered like 
CPU type, network interface, etc. While it would absolutely be neat to have 
that auto-discovered, even statically defining such attributes at the agent 
config and allowing them to be templated into TaskInfo can enable applications 
to be smarter about using the resources they're on.

> Allow mesos agent attributes to be tokenized in taskInfo
> 
>
> Key: MESOS-3372
> URL: https://issues.apache.org/jira/browse/MESOS-3372
> Project: Mesos
>  Issue Type: Improvement
>Reporter: Chad Heuschober
>