I did run into "double SLA miss alarms" firing, but that was on 1.7.x. I haven't tested whether that is still an issue in 1.8.x.
-s

On Tue, May 23, 2017 at 8:46 AM, Maxime Beauchemin <[email protected]> wrote:

> Awesome. I wasn't aware of DagRun locking, this is even better!
>
> Max
>
> On Mon, May 22, 2017 at 11:39 PM, Bolke de Bruin <[email protected]> wrote:
>
> > Hi Max,
> >
> > We seem to be in quite good order already. We are testing with
> > multi-master MySQL and will also test multi-master Postgres. As we are
> > doing DagRun-level locking already, it does not seem to be required to
> > do DAG-level locking. Also, tasks are being locked, so if multiple
> > schedulers are running everything seems to be quite fine. If one of the
> > schedulers restarts, it starts checking for orphaned tasks by checking
> > the executor queue, which is unique for every scheduler. This will
> > result in some tasks being dequeued and then requeued. So Airflow is
> > robust enough to stay alive then (with my patch for deadlocks applied),
> > but some things are a bit sub-optimal.
> >
> > As mentioned, we are still stress testing this setup and we might find
> > more.
> >
> > Bolke
> >
> > > On 22 May 2017, at 18:19, Maxime Beauchemin <[email protected]> wrote:
> > >
> > > Things that might be needed for a correct multi-scheduler setup:
> > > * DAG-level lock while being evaluated
> > > * DAG-level lock expiration to recover from a potential situation
> > >   where the lock wasn't released
> > > * Accumulation of the list of task instances to run into the database
> > >   (as opposed to cross-process communication to the master process)
> > > * Define a clear master cycle that would read the list of accumulated
> > >   task instances from the DB, dedup, prioritize and schedule. That
> > >   master cycle should have a lock (and lock expiration) as well.
> > >
> > > Max
> > >
> > > On Mon, May 22, 2017 at 12:27 AM, Bolke de Bruin <[email protected]> wrote:
> > >
> > >> Hi Stephen,
> > >>
> > >> We are currently stress testing Airflow for use in a multi-master
> > >> setup. One of my team members is doing a write-up that should show
> > >> up online shortly. TL;DR: in its current state Airflow will need
> > >> some patches in order to run concurrently. One issue is that Airflow
> > >> can have a database deadlock which will stop the scheduler from
> > >> running. I have a patch for that out here
> > >> (https://github.com/apache/incubator-airflow/pull/2267) that works
> > >> fine on Postgres/MySQL (tests don’t pass on SQLite yet due to
> > >> limitations of SQLite).
> > >>
> > >> Your global scheduler lock (e.g. by an active-passive configuration)
> > >> might make most sense for now.
> > >>
> > >> Bolke
> > >>
> > >>> On 22 May 2017, at 07:52, Stephen Rigney <[email protected]> wrote:
> > >>>
> > >>> Hi,
> > >>>
> > >>> We're running airflow in production, but for reliability (n.b. not
> > >>> performance) we'd like to confirm if it is safe to spawn multiple
> > >>> instances of the scheduler overlapping in time (otherwise we may
> > >>> need to put more effort into assuring two copies aren't ever
> > >>> spawned at once in our environment).
> > >>>
> > >>> It seems this officially wasn't a supported configuration back in
> > >>> 2015
> > >>> (https://groups.google.com/d/msg/airbnb_airflow/-1wKa3OcwME/uATa8y3YDAAJ),
> > >>> but has sufficient intra-airflow locking been added that it is now
> > >>> safe to start up two temporally overlapping instances of the
> > >>> scheduler for the same airflow system?
> > >>>
> > >>> Or should we hack in a "global scheduler lock" - we're not looking
> > >>> for increased performance by scheduler parallelism, just that if we
> > >>> ever fire up two instances of the scheduler nothing terrible
> > >>> happens?
> > >>>
> > >>> Stephen
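For reference, the "global scheduler lock" Stephen asks about, combined with the lock expiration Max lists, could look roughly like the sketch below. This is only an illustration under stated assumptions, not anything Airflow ships: the `scheduler_lock` table and the `init_lock_table` / `try_acquire` helpers are hypothetical names, and SQLite from the Python standard library stands in for the MySQL/Postgres databases discussed in the thread so the example stays self-contained.

```python
import sqlite3
import time


def init_lock_table(conn):
    # Hypothetical single-row-per-lock table; not part of Airflow's schema.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS scheduler_lock ("
        " name TEXT PRIMARY KEY,"
        " owner TEXT NOT NULL,"
        " expires_at REAL NOT NULL)"
    )
    conn.commit()


def try_acquire(conn, owner, ttl_seconds, name="scheduler", now=None):
    """Return True if `owner` holds the lock `name` after this call.

    The lock is taken if it is free, renewed if `owner` already holds it,
    and taken over if the previous holder let it expire.
    """
    now = time.time() if now is None else now
    expires_at = now + ttl_seconds
    with conn:  # commit on success, roll back on error
        # Create the lock row if nobody has ever taken the lock.
        conn.execute(
            "INSERT OR IGNORE INTO scheduler_lock (name, owner, expires_at)"
            " VALUES (?, ?, ?)",
            (name, owner, expires_at),
        )
        # Renew our own lock, or take over one that has expired.
        conn.execute(
            "UPDATE scheduler_lock SET owner = ?, expires_at = ?"
            " WHERE name = ? AND (owner = ? OR expires_at < ?)",
            (owner, expires_at, name, owner, now),
        )
        row = conn.execute(
            "SELECT owner FROM scheduler_lock WHERE name = ?", (name,)
        ).fetchone()
    return row[0] == owner
```

A scheduler wrapper would call `try_acquire` periodically with a TTL longer than its renewal interval and only run scheduling work while the call returns True. The expiry is what lets a standby take over if the active instance dies without releasing the lock. It relies on the hosts' clocks being roughly in sync, which is an assumption of this sketch.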
