[
https://issues.apache.org/jira/browse/MESOS-3307?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15088554#comment-15088554
]
Kevin Klues commented on MESOS-3307:
------------------------------------
I have submitted a patch for review based on Felix's pull request (with some
modifications):
https://reviews.apache.org/r/42053/
This patch adds configuration flags that set the buffer sizes for completed
frameworks and for completed tasks per framework, as reported by the state.json
(and related) endpoints. Combined with MESOS-2353, which significantly reduces
the time it takes to generate state.json, this *should* resolve this ticket.
However, in the long term, things like mesos-dns *should* use the "Mesos Master
Event Streaming" API that Alexander Rukletsov and others are working on, once
it is completed. That will make band-aid solutions like this one unnecessary.
Also keep in mind that these newly introduced flags will only help if you
control your own master configuration. If you use something like the
Mesosphere DCOS to set up your master/agent configuration automatically, the
flags will likely not be of much help, because their default values will
remain as they were before.
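As a rough illustration of why a configurable cap helps, the sketch below
models a per-framework completed-task history as a bounded buffer: once the
cap is reached, each newly completed task evicts the oldest one, so the
history that has to be serialized never grows past the cap. This is a minimal
Python sketch, not the Mesos implementation; the class name
`CompletedTaskHistory`, its methods, and the cap of 3 are hypothetical and
chosen only for the example.

```python
from collections import deque


class CompletedTaskHistory:
    """Hypothetical bounded history of completed task IDs for one framework."""

    def __init__(self, max_completed_tasks):
        # deque(maxlen=N) drops the oldest entry once N entries are stored.
        self._tasks = deque(maxlen=max_completed_tasks)

    def add(self, task_id):
        self._tasks.append(task_id)

    def snapshot(self):
        # What a state endpoint would have to serialize for this framework.
        return list(self._tasks)


history = CompletedTaskHistory(max_completed_tasks=3)
for i in range(5):
    history.add(f"task-{i}")

print(history.snapshot())  # ['task-2', 'task-3', 'task-4']
```

Lowering the cap shrinks the worst-case size of the history directly, at the
cost of keeping fewer completed tasks visible.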
> Configurable size of completed task / framework history
> -------------------------------------------------------
>
> Key: MESOS-3307
> URL: https://issues.apache.org/jira/browse/MESOS-3307
> Project: Mesos
> Issue Type: Bug
> Reporter: Ian Babrou
> Assignee: Kevin Klues
> Labels: mesosphere
>
> We try to make Mesos work with multiple frameworks and mesos-dns at the same
> time. The goal is to have a set of frameworks per team / project on a single
> Mesos cluster.
> At this point our Mesos state.json is 4 MB and it takes a while to
> assemble. Five mesos-dns instances hit state.json every 5 seconds, effectively
> pushing mesos-master CPU usage through the roof. It's at 100%+ all the time.
> Here's the problem:
> {noformat}
> mesos λ curl -s http://mesos-master:5050/master/state.json | jq .frameworks[].completed_tasks[].framework_id | sort | uniq -c | sort -n
> 1 "20150606-001827-252388362-5050-5982-0003"
> 16 "20150606-001827-252388362-5050-5982-0005"
> 18 "20150606-001827-252388362-5050-5982-0029"
> 73 "20150606-001827-252388362-5050-5982-0007"
> 141 "20150606-001827-252388362-5050-5982-0009"
> 154 "20150820-154817-302720010-5050-15320-0000"
> 289 "20150606-001827-252388362-5050-5982-0004"
> 510 "20150606-001827-252388362-5050-5982-0012"
> 666 "20150606-001827-252388362-5050-5982-0028"
> 923 "20150116-002612-269165578-5050-32204-0003"
> 1000 "20150606-001827-252388362-5050-5982-0001"
> 1000 "20150606-001827-252388362-5050-5982-0006"
> 1000 "20150606-001827-252388362-5050-5982-0010"
> 1000 "20150606-001827-252388362-5050-5982-0011"
> 1000 "20150606-001827-252388362-5050-5982-0027"
> mesos λ fgrep 1000 -r src/master
> src/master/constants.cpp:const size_t MAX_REMOVED_SLAVES = 100000;
> src/master/constants.cpp:const uint32_t MAX_COMPLETED_TASKS_PER_FRAMEWORK = 1000;
> {noformat}
> Active tasks are just 6% of state.json response:
> {noformat}
> mesos λ cat ~/temp/mesos-state.json | jq -c . | wc
> 1 14796 4138942
> mesos λ cat ~/temp/mesos-state.json | jq .frameworks[].tasks | jq -c . | wc
> 16 37 252774
> {noformat}
> I see four options that can improve the situation:
> 1. Add a query string parameter to exclude completed tasks from state.json and
> use it in mesos-dns and similar tools. mesos-dns has no need to know about
> completed tasks; they are just extra load on the master and on mesos-dns.
> 2. Make history size configurable.
> 3. Make JSON serialization faster. With 10000s of tasks even without history
> it would take a lot of time to serialize tasks for mesos-dns. Doing it every
> 60 seconds instead of every 5 seconds isn't really an option.
> 4. Create an event bus for the Mesos master. Marathon has one and it'd be nice
> to have it in Mesos. This way mesos-dns could avoid polling master state and
> switch to listening for events.
> All can be done independently.
> Note to Mesosphere folks: please start distributing debug symbols with your
> distribution. I have been asking for this for a while and it is really helpful:
> https://github.com/mesosphere/marathon/issues/1497#issuecomment-104182501
> Perf report for leading master:
> !http://i.imgur.com/iz7C3o0.png!
> I'm on 0.23.0.
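The figures quoted in the description can be checked with a little arithmetic.
The sketch below uses only numbers from the issue text (the byte counts printed
by `wc`, five mesos-dns instances, a 5-second polling interval). The variable
names are made up for the example, and the throughput figure assumes every
poll receives the full response.

```python
# Byte counts taken from the `wc` output in the issue description.
total_bytes = 4138942        # whole state.json, compacted with `jq -c .`
active_task_bytes = 252774   # only .frameworks[].tasks

active_share = active_task_bytes / total_bytes
print(f"active tasks: {active_share:.1%} of the response")

# Polling load as described: 5 mesos-dns instances, each polling every 5 s.
instances = 5
poll_interval_s = 5
requests_per_second = instances / poll_interval_s
bytes_per_second = requests_per_second * total_bytes
print(f"{requests_per_second:.0f} request/s, "
      f"about {bytes_per_second / 1e6:.1f} MB/s to serialize")
```

Under these assumptions the master serializes roughly the whole 4 MB document
once per second, and only about 6% of it is active-task data.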
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)