Hi Max,

We seem to be in quite good order already. We are testing with multi-master
MySQL and will also test multi-master Postgres. Since we already do
dagrun-level locking, DAG-level locking does not seem to be required. Tasks
are locked as well, so running multiple schedulers seems to work fine. When
one of the schedulers restarts, it checks for orphaned tasks by looking at
the executor queue, which is unique to each scheduler. This will result in
some tasks being dequeued and then requeued. So Airflow is robust enough to
stay alive in that case (with my patch for deadlocks applied), but some
things are a bit sub-optimal.
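
To make the "lock with expiration" idea from Max's list (quoted below)
concrete, here is a minimal sketch. It is an illustration only and not
Airflow code: the table name scheduler_lock, the functions init_lock_table,
try_acquire and release, and the lock name used in the example are all
hypothetical, and sqlite3 is used only to keep the example self-contained.

```python
import sqlite3
import time


def init_lock_table(conn):
    # Hypothetical table: one row per held lock, keyed by lock name.
    with conn:
        conn.execute(
            "CREATE TABLE IF NOT EXISTS scheduler_lock ("
            "name TEXT PRIMARY KEY, "
            "owner TEXT NOT NULL, "
            "expires_at REAL NOT NULL)"
        )


def try_acquire(conn, name, owner, ttl, now=None):
    """Try to take the lock `name` for `ttl` seconds; return True on success."""
    now = time.time() if now is None else now
    with conn:  # commit on normal exit, roll back on an uncaught exception
        # Expiration: drop a stale lock that its owner never released.
        conn.execute(
            "DELETE FROM scheduler_lock WHERE name = ? AND expires_at <= ?",
            (name, now),
        )
        try:
            # The primary key makes this insert fail if the lock is still held.
            conn.execute(
                "INSERT INTO scheduler_lock (name, owner, expires_at) "
                "VALUES (?, ?, ?)",
                (name, owner, now + ttl),
            )
        except sqlite3.IntegrityError:
            return False
    return True


def release(conn, name, owner):
    """Release the lock only if `owner` still holds it; return True if removed."""
    with conn:
        cur = conn.execute(
            "DELETE FROM scheduler_lock WHERE name = ? AND owner = ?",
            (name, owner),
        )
    return cur.rowcount == 1
```

For example, if scheduler-a acquires "dag:example" with ttl=10 and then dies,
scheduler-b is refused while the lock is live and can take it over once the
ten seconds have passed. A real setup would also have to consider clock skew
between hosts and the isolation level of the database in use.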

As mentioned, we are still stress testing this setup and might find more issues.

Bolke

> On 22 May 2017, at 18:19, Maxime Beauchemin <[email protected]> 
> wrote:
> 
> Things that might be needed for a correct multi-schedulers setup:
> * DAG-level lock while being evaluated
> * DAG-level lock expiration to recover from a potential situation where the
> lock wasn't released
> * Accumulation of the list of task instances to run into the database (as
> opposed to cross process communication to master process)
> * Define a clear master cycle that would read the list of accumulated task
> instances from the DB, dedup, prioritize and schedule. That master cycle
> should have a lock (and lock expiration) as well.
> 
> Max
> 
> On Mon, May 22, 2017 at 12:27 AM, Bolke de Bruin <[email protected]> wrote:
> 
>> Hi Stephen,
>> 
>> We are currently stress testing Airflow for use in a multi-master setup.
>> One of my team members is doing a write up that should show up online
>> shortly. TL;DR; in its current state Airflow will need some patches in
>> order to run concurrently. One issue is that Airflow can have a database
>> deadlock which will stop the scheduler from running. I have a patch for
>> that out here (https://github.com/apache/incubator-airflow/pull/2267) that
>> works fine on Postgres/MySQL (tests don’t pass on SQLite yet due to
>> limitations of SQLite).
>> 
>> Your global scheduler lock (e.g. via an active/passive configuration)
>> might make most sense for now.
>> 
>> Bolke
>> 
>>> On 22 May 2017, at 07:52, Stephen Rigney <[email protected]> wrote:
>>> 
>>> Hi,
>>> 
>>> We're running airflow in production, but for reliability (n.b. not
>>> performance) we'd like to confirm if it is safe to spawn multiple instances
>>> of the scheduler overlapping in time (otherwise we may need to put more
>>> effort into assuring two copies aren't ever spawned at once in our
>>> environment).
>>> 
>>> 
>>> It seems this officially wasn't a supported configuration back in 2015
>>> (https://groups.google.com/d/msg/airbnb_airflow/-1wKa3OcwME/uATa8y3YDAAJ),
>>> but has sufficient intra-airflow locking been added that it is now safe to
>>> start up two temporally overlapping instances of the scheduler for the same
>>> airflow system?
>>> 
>>> 
>>> Or should we hack in a "global scheduler lock" - we're not looking for
>>> increased performance by scheduler parallelism, just that if we ever fire
>>> up two instances of the scheduler nothing terrible happens?
>>> 
>>> 
>>> Stephen
>> 
>> 
