Re: 1.10.1 Release?

2018-11-05 Thread Bolke de Bruin
The fix is in master and should work across all DST changes. It will be 
included in 1.10.1. 

B. 

Sent from my iPhone

> On 5 Nov 2018, at 19:54, Dave Fisher  wrote:
> 
> 
> 
>> On 2018/10/28 00:09:05, Bolke de Bruin  wrote: 
>> I wonder how to treat this:
>> 
>> This is what I think happens (need to verify more, but I am pretty sure): the 
>> specified DAG should run every 5 minutes. At DST change (3AM -> 2AM)
> 
> FYI - In the US the DST change is 2AM -> 1AM. Yes, TZ is hard stuff.
> 
> we basically hit a schedule that we have already seen. 2AM -> 3AM has already 
> happened. Obviously the intention is to run every 5 minutes. But what do we 
> do with the execution_date? Is this still idempotent? Should we indeed 
> reschedule? 
>> 
>> B.
>> 
>>> On 30 Oct 2018, at 19:01, Ash Berlin-Taylor  wrote:
>>> 
>>> I've done a bit more digging - the issue is in our tz-aware handling inside 
>>> following_schedule (and previous_schedule), causing it to loop.
>>> 
>>> This section of the croniter docs seems relevant 
>>> https://github.com/kiorky/croniter#about-dst
>>> 
>>>   Be sure to init your croniter instance with a TZ aware datetime for this 
>>> to work!:
>>>     local_date = tz.localize(datetime(2017, 3, 26))
>>>     val = croniter('0 0 * * *', local_date).get_next(datetime)
>>> 
>>> I think the problem is that we are _not_ passing a TZ aware dag in and we 
>>> should be.
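For reference, a self-contained, runnable version of the croniter snippet quoted above, with the imports it needs and a TZ-aware start date around the Europe/Zurich change discussed in this thread (the 5-minute cron expression and the exact timestamps are illustrative, not Airflow code):

    from datetime import datetime

    import pytz
    from croniter import croniter

    # TZ-aware start just before the 2018-10-28 CEST -> CET change,
    # where 02:00-03:00 local time occurs twice.
    tz = pytz.timezone("Europe/Zurich")
    local_date = tz.localize(datetime(2018, 10, 28, 1, 55))

    it = croniter("*/5 * * * *", local_date)
    for _ in range(4):
        # With a TZ-aware start, croniter is documented to handle the DST fold.
        print(it.get_next(datetime))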
>>> 
 On 30 Oct 2018, at 17:35, Bolke de Bruin  wrote:
 
 Oh that’s a great environment to start digging. Thanks. I’ll have a look.
 
 B.
 
 Sent from my iPad
 
> On 30 Oct 2018, at 18:25, Ash Berlin-Taylor  wrote:
> 
> This line in airflow.jobs (line 874 in my checkout) is causing the loop:
> 
> last_run = dag.get_last_dagrun(session=session)
> if last_run and next_run_date:
>     while next_run_date <= last_run.execution_date:
>         next_run_date = dag.following_schedule(next_run_date)
> 
> 
> 
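To make the failure mode concrete, here is an illustrative sketch (not the Airflow code; the helper and the guard are hypothetical) of why that while loop can spin forever: it only exits once following_schedule() returns a value strictly greater than the last execution_date, which never happens if the schedule stops advancing across the DST fold.

    def advance_past(next_run_date, last_execution_date, following_schedule):
        # Mirrors the loop above: keep asking for the next schedule until we
        # pass the last execution_date.
        while next_run_date <= last_execution_date:
            candidate = following_schedule(next_run_date)
            # Hypothetical guard, not in the original code: if the schedule is
            # not strictly advancing, bail out instead of looping forever.
            if candidate is None or candidate <= next_run_date:
                raise RuntimeError("schedule is not advancing past the DST fold")
            next_run_date = candidate
        return next_run_date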
>> On 30 Oct 2018, at 17:20, Ash Berlin-Taylor  wrote:
>> 
>> Hi, kaczors on gitter has produced a minimal reproduction case: 
>> https://github.com/kaczors/airflow_1_10_tz_bug
>> 
>> Rough repro steps: In a VM, with time syncing disabled, and configured 
>> with system timezone of Europe/Zurich (or any other CEST one) run 
>> 
>> - `date 10280250.00`
>> - initdb, start scheduler, webserver, enable dag etc.
>> - `date 10280259.00`
>> - wait 5-10 mins for scheduler to catch up
>> - After the on-the-hour task run the scheduler will spin up another 
>> process to parse the dag... and it never returns.
>> 
>> I've only just managed to reproduce it, so haven't dug in to why yet. A 
>> quick hacky debug print shows something is stuck in an infinite loop.
>> 
>> -ash
>> 
>>> On 29 Oct 2018, at 17:59, Bolke de Bruin  wrote:
>>> 
>>> Can this be confirmed? Then I can have a look at it. Preferably with 
>>> dag definition code.
>>> 
>>> On the licensing requirements:
>>> 
>>> 1. Indeed licensing header for markdown documents. It was suggested to 
>>> use html comments. I’m not sure how that renders with others like PDF 
>>> though.
>>> 2. The licensing notifications need to be tied to a specific version as 
>>> licenses might change with versions.
>>> 
>>> Cheers
>>> Bolke
>>> 
>>> Sent from my iPad
>>> 
 On 29 Oct 2018, at 12:39, Ash Berlin-Taylor  wrote:
 
 I was going to make a start on the release, but two people have 
 reported that there might be an issue around non-UTC dags and the 
 scheduler changing over from Summer time.
 
> 08:45 Emmanuel> Hi there, we are currently experiencing a very 
> strange issue : we have hourly DAGs with a start_date in a local 
> timezone (not UTC) and since (Sunday) the last winter time change 
> they don’t run anymore. Any idea ?
> 09:41  it impacted all our DAG that had a run at 3am 
> (Europe/Paris), the exact time of winter time change :(
 
 I am going to take a look at this today and see if I can get to the 
 bottom of it.
 
 Bolke: are there any outstanding tasks/issues that you know of that 
 might slow down the vote for a 1.10.1? (i.e. did we sort out all 
 the licensing issues that were asked of us? I thought I read something 
 about license declarations in markdown files?)
 
 -ash
 
> On 28 Oct 2018, at 14:46, Bolke de Bruin  wrote:
> 
> I agree with that, but I would favor time-based releases instead. We 
> are again at the point that a release takes so much time that the gap 
> is getting really big again. @ash why not start releasing now and 
> move the remainder to 1.10.2? I don't think there are real blockers 
> (although we might find them).

Re: 1.10.1 Release?

2018-11-05 Thread Dave Fisher



On 2018/10/28 00:09:05, Bolke de Bruin  wrote: 
> I wonder how to treat this:
> 
> This is what I think happens (need to verify more, but I am pretty sure): the 
> specified DAG should run every 5 minutes. At DST change (3AM -> 2AM)

FYI - In the US the DST change is 2AM -> 1AM. Yes, TZ is hard stuff.

 we basically hit a schedule that we have already seen. 2AM -> 3AM has already 
happened. Obviously the intention is to run every 5 minutes. But what do we do 
with the execution_date? Is this still idempotent? Should we indeed reschedule? 
> 
> B.
> 
> > On 30 Oct 2018, at 19:01, Ash Berlin-Taylor  wrote:
> > 
> > I've done a bit more digging - the issue is in our tz-aware handling inside 
> > following_schedule (and previous_schedule), causing it to loop.
> > 
> > This section of the croniter docs seems relevant 
> > https://github.com/kiorky/croniter#about-dst
> > 
> >    Be sure to init your croniter instance with a TZ aware datetime for this 
> > to work!:
> >     local_date = tz.localize(datetime(2017, 3, 26))
> >     val = croniter('0 0 * * *', local_date).get_next(datetime)
> > 
> > I think the problem is that we are _not_ passing a TZ aware dag in and we 
> > should be.
> > 
> >> On 30 Oct 2018, at 17:35, Bolke de Bruin  wrote:
> >> 
> >> Oh that’s a great environment to start digging. Thanks. I’ll have a look.
> >> 
> >> B.
> >> 
> >> Sent from my iPad
> >> 
> >>> On 30 Oct 2018, at 18:25, Ash Berlin-Taylor  wrote:
> >>> 
> >>> This line in airflow.jobs (line 874 in my checkout) is causing the loop:
> >>> 
> >>>  last_run = dag.get_last_dagrun(session=session)
> >>>  if last_run and next_run_date:
> >>>      while next_run_date <= last_run.execution_date:
> >>>          next_run_date = dag.following_schedule(next_run_date)
> >>> 
> >>> 
> >>> 
>  On 30 Oct 2018, at 17:20, Ash Berlin-Taylor  wrote:
>  
>  Hi, kaczors on gitter has produced a minimal reproduction case: 
>  https://github.com/kaczors/airflow_1_10_tz_bug
>  
>  Rough repro steps: In a VM, with time syncing disabled, and configured 
>  with system timezone of Europe/Zurich (or any other CEST one) run 
>  
>  - `date 10280250.00`
>  - initdb, start scheduler, webserver, enable dag etc.
>  - `date 10280259.00`
>  - wait 5-10 mins for scheduler to catch up
>  - After the on-the-hour task run the scheduler will spin up another 
>  process to parse the dag... and it never returns.
>  
>  I've only just managed to reproduce it, so haven't dug in to why yet. A 
>  quick hacky debug print shows something is stuck in an infinite loop.
>  
>  -ash
>  
> > On 29 Oct 2018, at 17:59, Bolke de Bruin  wrote:
> > 
> > Can this be confirmed? Then I can have a look at it. Preferably with 
> > dag definition code.
> > 
> > On the licensing requirements:
> > 
> > 1. Indeed licensing header for markdown documents. It was suggested to 
> > use html comments. I’m not sure how that renders with others like PDF 
> > though.
> > 2. The licensing notifications need to be tied to a specific version as 
> > licenses might change with versions.
> > 
> > Cheers
> > Bolke
> > 
> > Sent from my iPad
> > 
> >> On 29 Oct 2018, at 12:39, Ash Berlin-Taylor  wrote:
> >> 
> >> I was going to make a start on the release, but two people have 
> >> reported that there might be an issue around non-UTC dags and the 
> >> scheduler changing over from Summer time.
> >> 
> >>> 08:45 Emmanuel> Hi there, we are currently experiencing a very 
> >>> strange issue : we have hourly DAGs with a start_date in a local 
> >>> timezone (not UTC) and since (Sunday) the last winter time change 
> >>> they don’t run anymore. Any idea ?
> >>> 09:41  it impacted all our DAG that had a run at 3am 
> >>> (Europe/Paris), the exact time of winter time change :(
> >> 
> >> I am going to take a look at this today and see if I can get to the 
> >> bottom of it.
> >> 
> >> Bolke: are there any outstanding tasks/issues that you know of that 
> >> might slow down the vote for a 1.10.1? (i.e. did we sort out all 
> >> the licensing issues that were asked of us? I thought I read something 
> >> about license declarations in markdown files?)
> >> 
> >> -ash
> >> 
> >>> On 28 Oct 2018, at 14:46, Bolke de Bruin  wrote:
> >>> 
> >>> I agree with that, but I would favor time-based releases instead. We 
> >>> are again at the point that a release takes so much time that the gap 
> >>> is getting really big again. @ash why not start releasing now and 
> >>> move the remainder to 1.10.2? I don't think there are real blockers 
> >>> (although we might find them).
> >>> 
> >>> 
>  On 28 Oct 2018, at 15:35, airflowuser 
> 

Re: REST API roadmap/plan?

2018-11-05 Thread Verdan Mahmood
Hi Matthew,
As Fokko mentioned, we don't have any roadmap for the REST API at the moment,
but we can definitely start planning this out: gathering the list of APIs,
reviewing them together, and making a roadmap.
I'd be happy to help with this :)

A few things:
- I started out with a list of possible APIs a couple of months ago:
https://issues.apache.org/jira/browse/AIRFLOW-2628
- The experimental endpoints are plain function-based views, scattered with no
modular approach in place.
We can definitely look at options like the FAB REST API or Flask-RESTful
https://flask-appbuilder.readthedocs.io/en/latest/quickhowto.html#rest-api
(a rough sketch of what such an endpoint could look like follows below).
- Please consider using "www_rbac" for any new endpoint, as we do have plans
to deprecate the older "www" version of the UI soon.

Best,
*Verdan Mahmood*

On Mon, Nov 5, 2018 at 4:34 PM Driesprong, Fokko 
wrote:

> Thanks Matthew,
>
> We're working on the experimental API. It should serve both the GUI and
> external systems that interface with Airflow (triggering DAGs, for example).
> Verdan was working on this, but for now,
> this process is a bit stuck again. It turns out that data engineers aren't
> really good at doing front end.
> There isn't a full roadmap at the moment. There are some tickets, but
> nothing with a full description:
>
> https://issues.apache.org/jira/browse/AIRFLOW-890?jql=project%20%3D%20AIRFLOW%20AND%20status%20%3D%20Open%20AND%20text%20~%20%22REST%22
>
> Please feel free to pick this up and move the logic out of the GUI (web
> interface), or extend the experimental API with sensible endpoints :-)
>
> If there are any questions, let me know.
>
> Cheers, Fokko
>
>
>
> On Wed, 31 Oct 2018 at 21:00, matthew  wrote:
>
> > I've been poking around Jira and confluence but haven't seen any roadmap
> or
> > plans for the REST API.  Did I just miss it or has it stalled out?
> >
> > I'm interested in working on it if it needs some help.
> >
> > Thanks
> > -Matthew
> >
>


Re: A Naive Multi-Scheduler Architecture Experiment of Airflow

2018-11-05 Thread Deng Xiaodong
Thanks Devjyoti for your reply.

To elaborate based on your inputs: 

- *When to add one more shard*:
We have designed some metrics, like "how long the scheduler instance takes to 
parse & schedule all DAGs (in the subdir it’s taking care of)". When the metric 
stays above a given threshold for long enough, we may want to add one 
more shard. 

- *Easy Solution to Balance Shard Load*:
Exactly as you're pointing out, we create the initial set of shards by 
randomly distributing our DAGs across the subdirs. Similar to building a 
mathematical model, there are some assumptions we have to make for convenience, 
like "the complexity of the DAGs is roughly equal".
As for new DAGs: we developed an application that creates DAGs based on metadata; 
it checks the # of files in each subdir and always puts the new DAG into the 
subdir with the fewest DAGs (a sketch of this rule follows below).
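A minimal sketch of that balancing rule (the helper name and directory layout are assumptions; the internal metadata-driven DAG generator itself is not shown):

    import os

    def pick_shard(shard_dirs):
        """Return the shard subdir currently holding the fewest DAG files."""
        def dag_file_count(subdir):
            return sum(1 for name in os.listdir(subdir) if name.endswith(".py"))
        return min(shard_dirs, key=dag_file_count)

    # Example (paths are illustrative): write the newly generated DAG file into
    # pick_shard(["/opt/airflow/dags/shard_%d" % i for i in range(5)]).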


XD

> On 2 Nov 2018, at 12:47 AM, Devjyoti Patra  wrote:
> 
>>> 1. “Shard by # of files may not yield same load”: fully agree with you.
> This concern was also raised by other co-workers in my team. But given this
> is a preliminary trial, we didn’t consider this yet.
> 
> One issue here is: when do you decide to add one more shard? I think if
> you monitor the time it takes to parse each source file and log it, you can
> use this to find the outliers when your scheduling SLA is breached and move
> the outliers to a new shard. Creating the initial set of shards by randomly
> putting an equal number of files in each subdir seems like the easiest way
> to approach this problem.
> 
> On Thu, Nov 1, 2018 at 7:11 PM Deng Xiaodong  wrote:
> 
>> Thanks Kelvin and Max for your inputs!
>> 
>> To Kelvin’s questions:
>> 1. “Shard by # of files may not yield same load”: fully agree with you.
>> This concern was also raised by other co-workers in my team. But given this
>> is a preliminary trial, we didn’t consider this yet.
>> 2. We haven’t started to look into how we can dynamically allocate
>> scheduler resource yet. But I think this preliminary trial would be a good
>> starting point.
>> 3. DB: look forward to your PR on this!
>> 4. “Why do you need to shard the scheduler while the scheduler can scale
>> up pretty high”
>> There are a few reasons:
>> 4.1 we have strict SLA on scheduling. We expect one scheduling loop takes
>> < 3 minutes no matter how many DAGs we have
>> 4.2 we’re containerising the deployment, while our infrastructure team
>> added the restriction that for each pod we can only use up to 2 cores
>> (blocked us from scaling vertically).
>> 4.3 even though this naive architecture doesn't provide HA, it actually
>> partially addresses the availability concern (if one scheduler out of 5
>> fails, at least 80% of DAGs can still be scheduled properly).
>> 
>> To Max’s questions:
>> 1. I haven’t tested pools or queues features with this architecture. So
>> can’t give a very firm answer on this.
>> 2. In the load tests I have done, I haven’t observed such “misfires” yet
>> (I’m running a customised version based on 1.10.0 BTW)
>> 3. This is a very valid point. I haven't checked the implementation of DAG
>> prioritisation in detail yet. For the scenario in our team, we don't
>> prioritise DAGs, so we didn't take this into consideration. On the other
>> hand, this naive architecture didn't change anything in Airflow. It simply
>> makes use of the "--subdir" argument of the scheduler command (sketched
>> below). If we want to have a more serious multi-scheduler setup natively
>> supported by Airflow, I believe we certainly need to make significant
>> changes to the code to ensure all features, like cross-DAG prioritisation,
>> are supported.
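A minimal sketch of the setup described above, under the assumption of five shard subdirectories (paths and shard count are illustrative): one scheduler process per shard, driven only by the existing --subdir argument.

    import subprocess

    # One scheduler per shard subdirectory; nothing inside Airflow is changed.
    shards = ["/opt/airflow/dags/shard_%d" % i for i in range(5)]

    processes = [
        subprocess.Popen(["airflow", "scheduler", "--subdir", shard])
        for shard in shards
    ]
    for process in processes:
        process.wait()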
>> 
>> 
>> Kindly let me know your thoughts. Thanks!
>> 
>> XD
>> 
>> 
>>> On 1 Nov 2018, at 4:25 AM, Maxime Beauchemin 
>> wrote:
>>> 
>>> A few related thoughts:
>>> * there may be hiccups around concurrency (pools, queues), though the
>> worker should double-check that the constraints are still met when firing
>> the task, so in theory this should be ok
>>> * there may be more "misfires" meaning the task gets sent to the worker,
>> but by the time it starts the conditions aren't met anymore because of a
>> race condition with one of the other schedulers. Here I'm assuming recent
>> versions of Airflow will simply eventually re-fire the misfires and heal
>>> * cross DAG prioritization can't really take place anymore as there's
>> not a shared "ready-to-run" list of task instances that can be sorted by
>> priority_weight. Whichever scheduler instance fires first is likely to get
>> the open slots first.
>>> 
>>> Max
>>> 
>>> 
>>> On Wed, Oct 31, 2018 at 1:00 PM Kevin Yang <yrql...@gmail.com> wrote:
>>> Finally we start to talk about this seriously? Yeah! :D
>>> 
>>> For your approach, a few thoughts:
>>> 
>>>   1. Shard by # of files may not yield same load--even very different load,
>>>   since we may have some framework DAG file producing 500 DAGs and taking
>>>   forever to parse.
>>>   2. I think Alex Guziel 

Re: REST API roadmap/plan?

2018-11-05 Thread Driesprong, Fokko
Thanks Matthew,

We're working on the experimental API. It should serve both the GUI and
external systems that interface with Airflow (triggering DAGs, for example).
Verdan was working on this, but for now,
this process is a bit stuck again. It turns out that data engineers aren't
really good at doing front end.
There isn't a full roadmap at the moment. There are some tickets, but
nothing with a full description:
https://issues.apache.org/jira/browse/AIRFLOW-890?jql=project%20%3D%20AIRFLOW%20AND%20status%20%3D%20Open%20AND%20text%20~%20%22REST%22

Please feel free to pick this up and move the logic out of the GUI (web
interface), or extend the experimental API with sensible endpoints :-)

If there are any questions, let me know.

Cheers, Fokko



On Wed, 31 Oct 2018 at 21:00, matthew  wrote:

> I've been poking around Jira and confluence but haven't seen any roadmap or
> plans for the REST API.  Did I just miss it or has it stalled out?
>
> I'm interested in working on it if it needs some help.
>
> Thanks
> -Matthew
>


Feed Back on Remote Dag Fetching Proposal

2018-11-05 Thread Ian Davison
Hey Everyone,

I just created a first draft of an Airflow Improvement Proposal for Remote Dag 
Fetching: https://cwiki.apache.org/confluence/display/AIRFLOW/AIP-5+DagFetcher
The initial PR is located here: 
https://github.com/apache/incubator-airflow/pull/3138

I tried to go through and encapsulate all the various ideas discussed in the 
PR, but I'm sure I missed some. In addition, there were a number of design 
decisions I felt should be discussed further by the community. I would really 
appreciate people's feedback so we can start working towards a fully fleshed-out 
design. I think the ability to fetch DAGs remotely, and to stop crawling a 
file system, is the next big improvement for Airflow.

Thanks!
Ian Davison