Hi folks,

I'm looking at a use case that involves submitting potentially hundreds of jobs a second to our Mesos cluster. My tests show that the aurora client takes 1-2 seconds per job submission, and that I can run about four client processes in parallel before they peg the CPU at 100%. I need more throughput than that!
Squashing jobs down to the Process or Task level doesn't really make sense for our use case. I'm aware that with some shenanigans I can batch jobs together using job instances, but that's a lot of work on my current timeframe (and of questionable utility, given that the jobs almost certainly won't have identical resource requirements). What I really need is (at least) an order-of-magnitude speedup in submitting jobs to the Aurora scheduler (via the client or otherwise). Conceptually, adding a job to a queue doesn't seem like something that should take a couple of seconds, so I'm baffled as to why it's taking so long.

As an experiment, I wrapped the call to client.execute() in client.py:proxy_main in cProfile (rough sketch of the wrapper in the P.S. below) and called aurora job create with a very simple test job. Results of the profile are in the Gist below:

https://gist.github.com/helgridly/b37a0d27f04a37e72bb5

Out of a 0.977s profile time, the two things that stick out to me are:

1. 0.526s spent in Pystachio, for a job that doesn't use any templates
2. 0.564s spent in create_job, presumably talking to the scheduler (and setting up the machinery for doing so)

I imagine I can sidestep #1 with a check for "{{" in the job file and bypass Pystachio entirely. Can I also skip the Aurora client entirely and talk directly to the scheduler (see the P.P.S. for what I have in mind)? If so, what does that entail, and are there any risks associated with doing so?

Thanks,

-Hussein

Hussein Elgridly
Senior Software Engineer, DSDE
The Broad Institute of MIT and Harvard
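P.S. For reference, here's roughly how I wrapped the profiler. proxy_main in my checkout just instantiates AuroraCommandLine and dispatches to execute(), so this is a minimal sketch of the change (relying on client.py's existing imports) rather than the exact diff:

    # client.py: profile one full client invocation end to end.
    # Minimal sketch; proxy_main's exact body may differ in your checkout.
    import cProfile
    import pstats
    import sys

    def proxy_main():
      client = AuroraCommandLine()
      if len(sys.argv) == 1:
        sys.argv.append('-h')
      # Profile the whole command dispatch and dump stats to a file.
      cProfile.runctx('client.execute(sys.argv[1:])',
                      globals(), locals(), 'aurora_create.prof')
      pstats.Stats('aurora_create.prof').sort_stats('cumulative').print_stats(25)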
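P.P.S. To make "talk directly to the scheduler" concrete: the client appears to speak Thrift over HTTP (TJSONProtocol against the scheduler's /api endpoint), so my naive plan would be to keep one connection open and call createJob in a loop, amortizing all the per-invocation setup. A minimal sketch of what I have in mind follows; the scheduler host is made up, and the session handling and createJob argument list are assumptions on my part from reading api.thrift, so please correct me if this is off:

    # Hedged sketch: batch submissions over one persistent Thrift
    # connection instead of paying the client's startup cost per job.
    # Assumes Python bindings generated from Aurora's api.thrift.
    from thrift.protocol import TJSONProtocol
    from thrift.transport import THttpClient

    from gen.apache.aurora.api import AuroraSchedulerManager
    from gen.apache.aurora.api.ttypes import ResponseCode, SessionKey

    # Hypothetical scheduler endpoint; 8081 is the default scheduler port.
    transport = THttpClient.THttpClient('http://scheduler.example.com:8081/api')
    protocol = TJSONProtocol.TJSONProtocol(transport)
    scheduler = AuroraSchedulerManager.Client(protocol)
    transport.open()

    # Session handling here is a guess; we run without auth.
    session = SessionKey(mechanism='UNAUTHENTICATED', data='UNAUTHENTICATED')

    for job_config in job_configs:  # pre-built JobConfiguration structs
      # Argument order (config, lock, session) per my reading of the IDL.
      response = scheduler.createJob(job_config, None, session)
      if response.responseCode != ResponseCode.OK:
        print('createJob failed with code %s' % response.responseCode)

    transport.close()

Is that a sane approach, or is there machinery in the client (auth, hooks, validation) that I'd be dangerously bypassing?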