Hi folks,

I'm looking at a use case that involves submitting potentially hundreds of jobs a second to our Mesos cluster. My tests show that the aurora client takes 1-2 seconds per job submission, and that I can run about four client processes in parallel before they peg the CPU at 100%. I need more throughput than that!
Squashing jobs down to the Process or Task level doesn't really make sense for our use case. I'm aware that with some shenanigans I can batch jobs together using job instances, but that's a lot of work on my current timeframe (and of questionable utility, given that the jobs almost certainly won't have identical resource requirements). What I really need is (at least) an order-of-magnitude speedup in submitting jobs to the Aurora scheduler (via the client or otherwise). Conceptually, adding a job to a queue doesn't seem like something that should take a couple of seconds, so I'm baffled as to why it's taking so long.

As an experiment, I wrapped the call to client.execute() in client.py:proxy_main in cProfile (rough sketch of the wrapper in the P.S. below) and called aurora job create with a very simple test job. Results of the profile are in the Gist below:

https://gist.github.com/helgridly/b37a0d27f04a37e72bb5

Out of a 0.977s profile time, the two things that stick out to me are:

1. 0.526s spent in Pystachio, for a job that doesn't use any templates
2. 0.564s spent in create_job, presumably talking to the scheduler (and setting up the machinery for doing so)

I imagine I can sidestep #1 with a check for "{{" in the job file and bypass Pystachio entirely. Can I also skip the Aurora client entirely and talk directly to the scheduler (see the P.P.S. for what I have in mind)? If so, what does that entail, and are there any risks associated with doing so?

Thanks,

-Hussein

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard
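P.S. For reference, here's roughly how I wrapped the profiler. proxy_main in my checkout just instantiates AuroraCommandLine and dispatches to execute(), so this is a minimal sketch of the change (relying on client.py's existing imports) rather than the exact diff:

    # client.py: profile one full client invocation end to end.
    # Minimal sketch; proxy_main's exact body may differ in your checkout.
    import cProfile
    import pstats
    import sys

    def proxy_main():
      client = AuroraCommandLine()
      if len(sys.argv) == 1:
        sys.argv.append('-h')
      # Profile the whole command dispatch and dump stats to a file.
      cProfile.runctx('client.execute(sys.argv[1:])',
                      globals(), locals(), 'aurora_create.prof')
      pstats.Stats('aurora_create.prof').sort_stats('cumulative').print_stats(25)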
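P.P.S. To make "talk directly to the scheduler" concrete: the client appears to speak Thrift over HTTP (TJSONProtocol against the scheduler's /api endpoint), so my naive plan would be to keep one connection open and call createJob in a loop, amortizing all the per-invocation setup. A minimal sketch of what I have in mind follows; the scheduler host is made up, and the session handling and createJob argument list are assumptions on my part from reading api.thrift, so please correct me if this is off:

    # Hedged sketch: batch submissions over one persistent Thrift
    # connection instead of paying the client's startup cost per job.
    # Assumes Python bindings generated from Aurora's api.thrift.
    from thrift.protocol import TJSONProtocol
    from thrift.transport import THttpClient

    from gen.apache.aurora.api import AuroraSchedulerManager
    from gen.apache.aurora.api.ttypes import ResponseCode, SessionKey

    # Hypothetical scheduler endpoint; 8081 is the default scheduler port.
    transport = THttpClient.THttpClient('http://scheduler.example.com:8081/api')
    protocol = TJSONProtocol.TJSONProtocol(transport)
    scheduler = AuroraSchedulerManager.Client(protocol)
    transport.open()

    # Session handling here is a guess; we run without auth.
    session = SessionKey(mechanism='UNAUTHENTICATED', data='UNAUTHENTICATED')

    for job_config in job_configs:  # pre-built JobConfiguration structs
      # Argument order (config, lock, session) per my reading of the IDL.
      response = scheduler.createJob(job_config, None, session)
      if response.responseCode != ResponseCode.OK:
        print('createJob failed with code %s' % response.responseCode)

    transport.close()

Is that a sane approach, or is there machinery in the client (auth, hooks, validation) that I'd be dangerously bypassing?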