Re: Deploy Airflow on Kubernetes using Airflow Operator

2018-08-12 Thread Rob Goretsky
Barni,
Thank you so much for sharing this!  I'm admittedly far from a Kubernetes
guru, but I'm just trying to wrap my head around the reasons why we'd need
a custom Kubernetes controller to manage Airflow's components, as opposed
to the setup here https://github.com/mumoshu/kube-airflow wherein we simply
deploy Airflow to Kubernetes using standard Kubernetes Services and
Deployments.  I assume you're doing this because, as you mention in the
"Design" section of your repo, "In case of some stateful applications, the
declarative models provided by kubernetes are not sufficient to handle
fault remediation, scaling with data integrity and availability".   Can you
be more specific about what kinds of faults / scaling issues your
Kubernetes Controller might handle for Airflow that otherwise would not be
handled by built-in Kubernetes controllers (Services/Deployments)?
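
For reference, here's roughly what I mean by the "plain Deployments" route, as a
sketch using the Python kubernetes client - the names, image tag, and namespace
are placeholders I made up, not anything kube-airflow or your operator actually
ships:

    from kubernetes import client, config

    config.load_kube_config()  # assumes a local kubeconfig is available

    container = client.V1Container(
        name="airflow-scheduler",
        image="example/airflow:1.10",  # placeholder image
        command=["airflow", "scheduler"],
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "airflow-scheduler"}),
        spec=client.V1PodSpec(containers=[container]),
    )
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="airflow-scheduler"),
        spec=client.V1DeploymentSpec(
            replicas=1,
            selector=client.V1LabelSelector(match_labels={"app": "airflow-scheduler"}),
            template=template,
        ),
    )
    client.AppsV1Api().create_namespaced_deployment(
        namespace="default", body=deployment
    )

A Deployment like this will restart a crashed scheduler pod, but as far as I can
tell it has no notion of, say, coordinating a metadata-database migration or
draining workers before scaling down - is that the kind of gap the custom
controller is meant to fill?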

Thanks,
Rob



On Mon, Aug 6, 2018 at 1:55 AM Bolke de Bruin  wrote:

> Really awesome stuff. We are in progress to move over to k8s for Airflow
> (on prem though) and this is really helpful.
>
> B.
>
> Sent from my iPad
>
> > On 3 Aug 2018 at 23:35, Barni Seetharaman wrote:
> >
> > Hi
> >
> > We at Google just open-sourced a Kubernetes custom controller (also
> > called an operator) to make deploying and managing Airflow on Kubernetes
> > simple. The operator pattern is a powerful abstraction in Kubernetes.
> > Please watch this repo (in the process of adding docs) for further
> > updates.
> >
> > https://github.com/GoogleCloudPlatform/airflow-operator
> >
> > Do reach out if you have any questions.
> >
> > Also created a channel in the Kubernetes Slack (#airflow-operator) for any
> > discussions specific to Airflow on Kubernetes (including Daniel's
> > Kubernetes Executor, the Kubernetes operator, and this custom controller,
> > also called the Kubernetes Airflow operator).
> >
> > regards
> > Barni
>


Re: best way to handle version upgrades of libraries used by tasks

2018-02-09 Thread Rob Goretsky
My team has solved this with Docker.  When a developer works on a
single project, they freeze their Python library versions via
pip freeze > requirements.txt
for that project, and then we build one Docker image per project, using
something very similar to the official 'onbuild' version of the Python
Docker image from https://hub.docker.com/_/python/.
We have Jenkins automatically build and push an updated image per project
to ECR whenever code is pushed to GitHub's master branch for that project.

This means we currently have 80 different Docker images (one per project)
stored in ECR, but each one is completely isolated from the others in terms
of its dependencies.  As a result, we never have to worry about the impact
of upgrading a Python library version for anything but the current project
we're working on.  This has opened up some nice opportunities to start
playing more with Python 3.x while keeping all of our older stuff running
smoothly on Python 2.7.

Airflow then simply calls a version of the DockerOperator each time to run
the script/program within the project.  It's working great for us!
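
To give a rough idea of the DAG side (the image name, registry, and command here
are made up, and DockerOperator's exact arguments may vary a bit by Airflow
version):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.docker_operator import DockerOperator

    dag = DAG(
        dag_id="project_foo",  # one DAG (and one image) per project
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    )

    run_job = DockerOperator(
        task_id="run_project_foo",
        # placeholder ECR image that Jenkins builds from the project's Dockerfile
        image="123456789012.dkr.ecr.us-east-1.amazonaws.com/project-foo:latest",
        command="python run.py --date {{ ds }}",  # command is templated
        dag=dag,
    )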

-rob


On Mon, Feb 5, 2018 at 3:11 PM, Dennis O'Brien 
wrote:

> Hi Andrew,
>
> I think the issue is that each worker has a single airflow entry point
> (whatever `which airflow` points to), which has an associated environment
> and list of packages installed, whether those are managed via conda,
> virtualenv, or the available python environment.  So the executor would
> need to know which environment you want to run.  I don't know how this
> would be possible with the LocalExecutor or SequentialExecutor since both
> are tied to the original python environment.  (Someone correct me if I am
> wrong here.  I'm definitely not an expert on the Airflow internals.)
>
> The BashOperator will allow you to run any process you want, including any
> Python environment, but there is some plumbing overhead required if you
> want access to the context, etc.  The CeleryExecutor (and any of the
> executors that support distributed workers) plus a queue gets around the
> issue of the worker environment being tied to the scheduler environment.
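
A minimal sketch of that BashOperator route (the conda env name, script path, and
the way the execution date is passed are purely illustrative):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator

    dag = DAG(
        dag_id="sklearn_scoring",
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    )

    score = BashOperator(
        task_id="score_players",
        # run inside a task-specific conda env; any context the script needs has
        # to be passed explicitly, e.g. via templated command-line arguments
        bash_command=(
            "source activate sklearn-0.19 && "
            "python /opt/jobs/score_players.py --ds {{ ds }}"
        ),
        dag=dag,
    )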
>
> That said, I don't want to discourage you from trying things out.  I am
> sure there are some mysteries of Python that might make this possible.  For
> example, this project from Armin Ronacher allows modules to use
> different versions of available libraries.  (Warning: I wouldn't use this
> in production.  I think it was more proof of concept.)
> https://github.com/mitsuhiko/multiversion
>
> cheers,
> Dennis
>
>
>
> On Mon, Feb 5, 2018 at 5:06 AM Andrew Maguire 
> wrote:
>
> > I am curious about a similar issue. I'm wondering if we could use
> > https://github.com/pypa/pipenv - so each DAG is in its own folder, say, and
> > that folder has a Pipfile.lock that I think could then sort of bundle the
> > required environment into the DAG code folder itself.
> >
> > I've not used this yet or anything, but it seems interesting...
> >
> > On Mon, Feb 5, 2018 at 7:17 AM Dennis O'Brien 
> > wrote:
> >
> > > Thanks for the input!  I'll take a look at using queues for this.
> > >
> > > thanks,
> > > Dennis
> > >
> > > On Tue, Jan 30, 2018 at 4:17 PM Hbw 
> > > wrote:
> > >
> > > > Run them on different workers by using queues?
> > > > That way different workers can have different 3rd-party libs while
> > > > sharing the same Airflow core.
> > > >
> > > > B
> > > >
> > > > Sent from a device with less than stellar autocorrect
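
For what it's worth, a sketch of the queue-based split being suggested (the queue
name and callable are hypothetical, and this only applies to distributed
executors such as the CeleryExecutor):

    from datetime import datetime

    from airflow import DAG
    from airflow.operators.python_operator import PythonOperator

    def score(**context):
        import sklearn  # resolves against whatever is installed on that worker
        print(sklearn.__version__, context["ds"])

    dag = DAG(
        dag_id="queue_example",
        start_date=datetime(2018, 1, 1),
        schedule_interval="@daily",
    )

    task = PythonOperator(
        task_id="score_with_new_sklearn",
        python_callable=score,
        provide_context=True,
        queue="sklearn_new",  # routed only to workers listening on this queue
        dag=dag,
    )

    # On the worker host that has the newer scikit-learn installed:
    #   airflow worker -q sklearn_new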
> > > >
> > > > > On Jan 30, 2018, at 9:13 AM, Dennis O'Brien <den...@dennisobrien.net> wrote:
> > > > >
> > > > > Hi All,
> > > > >
> > > > > I have a number of jobs that use scikit-learn for scoring players.
> > > > > Occasionally I need to upgrade scikit-learn to take advantage of
> some
> > > new
> > > > > features.  We have a single conda environment that specifies all
> the
> > > > > dependencies for Airflow as well as for all of our DAGs.  So
> > currently
> > > > > upgrading scikit-learn means upgrading it for all DAGs that use it,
> > and
> > > > > retraining all models for that version.  It becomes a very involved
> > > task
> > > > > and I'm hoping to find a better way.
> > > > >
> > > > > One option is to use BashOperator (or something that wraps
> > > BashOperator)
> > > > > and have bash use a specific conda environment with that version of
> > > > > scikit-learn.  While simple, I don't like the idea of limiting task
> > > input
> > > > > to the command line.  Still, an option.
> > > > >
> > > > > Another option is the DockerOperator.  But when I asked around at a
> > > > > previous Airflow Meetup, I couldn't find anyone actually using it.
> > It
> > > > also
> > > > > adds some complexity to the build and deploy process in that now I
> > have
> > > > to
> > > > > maintain docker images for all my environments.  Still, not ruling
> it
> > > > 

Re: Making Airflow Timezone aware

2017-11-15 Thread Rob Goretsky
This will be huge for my team at MLB.com!  Really appreciate your work on this, 
Bolke!  We will finally be able to take down the posters we've all hung up at 
our desks that show the current GMT offset!  Let us know how/when we can try it 
out!

-rob 

> On Nov 15, 2017, at 7:33 PM, George Leslie-Waksman 
>  wrote:
> 
> Really happy to hear this moving forward. Thanks Bolke!
> 
>> On Tue, Nov 14, 2017 at 7:44 AM Bolke de Bruin  wrote:
>> 
>> See inline answers below.
>> 
>> Sent from my iPad
>> 
>>> On 14 Nov 2017 at 16:33, Heistermann, Till <till.heisterm...@blue-yonder.com> wrote:
>>> 
>>> Hi Bolke,
>>> 
>>> This looks great.
>>> 
>>> We have had the requirement to run DAGs in different local time zones
>>> for a while; so far we have worked around the limitation at the DAG level
>>> to automate most of our DST switches.
>>> 
>>> How would the approach behave in the DST-Switch corner cases?
>>> 
>>> For the regular case, I understand that if start_date=datetime(2017, 1,
>> 1, 8, 30, 0, tzinfo=“Europe/Amsterdam”)  and the  schedule is “30 8 * * *”,
>> the DST switch would work as expected, and the dag would get scheduled at
>> 7:30 am UTC in European Winter and 6:30 UTC in European Summer.
>> 
>> Actually no. For cron-defined schedules we will always use local time, but
>> naive. This means your 8.30 schedule will always happen at 8.30 local time
>> regardless.
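
A quick way to see what "always 8.30 local time" means for the underlying UTC
instant (a sketch using pendulum 2.x; the dates are arbitrary):

    import pendulum

    ams = pendulum.timezone("Europe/Amsterdam")

    # The same 08:30 wall-clock time maps to different UTC instants across DST:
    winter = pendulum.datetime(2018, 1, 15, 8, 30, tz=ams)
    summer = pendulum.datetime(2018, 7, 15, 8, 30, tz=ams)

    print(winter.in_timezone("UTC"))  # 07:30 UTC (CET, UTC+1)
    print(summer.in_timezone("UTC"))  # 06:30 UTC (CEST, UTC+2)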
>> 
>>> 
>>> However, if start_date=datetime(2017, 1, 1, 2, 30, 0,
>> tzinfo=“Europe/Amsterdam”)  and the schedule is “30 2 * * *”, would we skip
>> a nightly run in March and have two nightly runs in October?
>>> This seems like the correct thing to do from a time zone logic point of
>> view, although I can imagine that there are many operational use cases
>> where the user wants something different.
>> 
>> I have to verify what happens. I think what will happen is that it will
>> run at 3.30, as we convert to naive local time (DST-unaware), add the
>> interval, and convert back to UTC. That UTC time will then translate to 3.30
>> local time, which is, by the way, equal to 2.30 local time.
>> 
>> Execution_date will be in UTC. The DAG will store time zone information so
>> you can decide for yourself what you want to do with that.
>> 
>> 
>>> 
>>> If start_date=datetime(2017, 1, 1, 8, 30, 0, tzinfo=“Europe/Amsterdam”)
>> and the schedule is timedelta(days=14), would a DST switch actually occur?
>>> There is some ambiguity in this case, depending on whether
>>> timedelta(days=14) is understood as “14 days in the local calendar”
>>> or as 14*24*60*60 seconds on the system clock.
>>> I’m not sure what the expected behaviour should be in this case.
>> 
>> For timedeltas DST is in effect. It is assumed here that you want to run X
>> hours later, not at a specific time. Obviously, if you want to keep the old
>> behavior (and this is the default), keep your timezone at UTC.
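
A small worked example of the timedelta case (again pendulum 2.x, arbitrary
dates - this just illustrates the effect, not the scheduler's exact code path):

    from datetime import timedelta

    import pendulum

    ams = pendulum.timezone("Europe/Amsterdam")

    start = pendulum.datetime(2018, 3, 20, 8, 30, tz=ams)     # CET: 07:30 UTC
    next_run = start.in_timezone("UTC") + timedelta(days=14)  # a fixed 14*24h later

    print(next_run.in_timezone(ams))  # 2018-04-03 09:30 local - DST shifted the hour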
>> 
>>> 
>>> Cheers,
>>> Till
>>> 
>>> 
>>> On 13.11.17, 19:47, "Ash Berlin-Taylor" 
>> wrote:
>>> 
>>>   This sounds like an awesome change!
>>> 
>>>   I'm happy to review (will take a look tomorrow) but won't be a
>> suitable tester as all our DAGs operate in UTC.
>>> 
>>>   -ash
>>> 
>>> 
 On 13 Nov 2017, at 18:09, Bolke de Bruin  wrote:
 
 Hi All,
 
 I just want to make you aware that I am creating patches that make
>> Airflow timezone aware. The gist of the idea is that Airflow internally
>> will use and store UTC everywhere. This allows you to have start_date =
>> datetime(2017, 1, 1, tzinfo=“Europe/Amsterdam”) and Airflow will properly
>> take care of daylight saving time. If you are using cron we will make
>> sure to always run at the exact time (end of interval, of course) which you
>> specify even when DST is in effect, e.g. 8.00am is always 8.00am regardless
>> of whether a daylight saving time switch has happened. DAGs that don’t have a
>> timezone associated get a default timezone that is configurable.
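
Concretely, once this lands a timezone-aware DAG definition should look roughly
like this (a sketch, assuming pendulum is used to build the timezone object):

    from datetime import datetime

    import pendulum
    from airflow import DAG

    local_tz = pendulum.timezone("Europe/Amsterdam")

    dag = DAG(
        dag_id="tz_aware_example",
        start_date=datetime(2017, 1, 1, tzinfo=local_tz),
        schedule_interval="0 8 * * *",  # intended to fire at 08:00 Amsterdam time year-round
    )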
 
 In AIRFLOW-288 I am tracking what needs to be done, but I am 80% there.
>> As the patches are invasive particularly in tests (everything needs a
>> timezone, basically) and less so in other areas, I would like to draw special
>> attention to a couple of places where this has impact.
 
 1. All database DateTime fields are converted to timezone-aware
>> Timestamp fields. This impacts MySQL deployments particularly, as MySQL was
>> storing DateTime fields, which cannot be made timezone-aware. Also, to make
>> sure conversion happens properly we set the connection time zone to UTC.
>> This is supported by Postgres and MySQL. However, it is not supported by
>> SQLServer. So if you are running outside of UTC you need to take special
>> care when upgrading.
 
 2. Thou shalt not use datetime.now() and datetime.utcnow() when writing
>> code for core (operators, sensors, scheduler, etc.) Airflow (in DAGs you can
>> still use it). Both create naive datetimes (yes even 
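
For core code, the replacement looks roughly like this (assuming the helper ends
up under the airflow.utils.timezone name used in the patches):

    # Inside operators/sensors/scheduler code, prefer the timezone-aware helper
    from airflow.utils import timezone

    now = timezone.utcnow()  # timezone-aware UTC datetime
    # instead of datetime.now() / datetime.utcnow(), which return naive datetimes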

Re: Adjusting DAG Schedules (For Daylight Savings Time, And In General)

2017-03-09 Thread Rob Goretsky
I realized after sending this that some of this behavior about the updated
'start_date' not taking effect is explained / addressed in the "Proposal to
simplify start/end dates" thread on this mailing list.  Seems like
basically 'start_date' updates are effectively ignored, but
'schedule_interval' updates are taken as a delta from the last
execution_date.  I'd still be curious to hear insight from anyone who has
had to deal with this!
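
For anyone hitting the same thing, the kind of change we're talking about is
roughly the following (the dag_id is made up):

    from datetime import datetime

    from airflow import DAG

    # Edits to start_date on an already-running DAG are effectively ignored;
    # the new cron expression is applied as a delta from the last execution_date.
    dag = DAG(
        dag_id="nightly_eastern_report",  # made-up dag_id
        start_date=datetime(2017, 1, 1),
        schedule_interval="0 4 * * *",    # was "0 5 * * *" before the DST switch
    )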

Thanks,
Rob

On Thu, Mar 9, 2017 at 4:47 PM, Rob Goretsky <robert.goret...@gmail.com>
wrote:

> With Daylight Savings Time upon us, I was wondering if anyone has had to
> address this issue -- While I understand that right now Airflow is not
> timezone-aware, and runs all of its jobs in GMT/UTC time, my team delivers
> reports to stakeholders that want to consistently see all data reported
> through Midnight **Eastern Time**.
>
> Right now we have a DAG that is scheduled to run at 05:00 GMT, which
> corresponds to Midnight Eastern time.  After this weekend, we'll need the DAG
> scheduled to run at 04:00 GMT instead, so that it still corresponds to
> Midnight Eastern.  If we just try to modify the DAG Python definition to
> change the 'start_date', this doesn't seem to take effect - that is, the
> scheduler continues running the DAG at 05:00 GMT. So, a few questions:
>
> (1) Once a DAG has been running, why don't changes to the Python
> 'start_date' seem to take effect?  It seems we always need to create a
> different dag with a different dag_id.   Is this something about the way
> the history is stored in the database, and is it something we could
> possibly tweak in the database directly if we wanted to?
>
> (2) Has anyone else dealt with this issue of needing to adjust a large set
> of DAGs for DST?  Or am I the only unlucky one whose stakeholders don't
> speak GMT?
>
> Thanks for all of the help!
>
> -rob
>
>
>
>


Adjusting DAG Schedules (For Daylight Savings Time, And In General)

2017-03-09 Thread Rob Goretsky
With Daylight Savings Time upon us, I was wondering if anyone has had to
address this issue -- While I understand that right now Airflow is not
timezone-aware, and runs all of its jobs in GMT/UTC time, my team delivers
reports to stakeholders that want to consistently see all data reported
through Midnight **Eastern Time**.

Right now we have a DAG that is scheduled to run at 05:00 GMT, which
corresponds to Midnight Eastern time.  After this weekend, we'll need the DAG
scheduled to run at 04:00 GMT instead, so that it still corresponds to
Midnight Eastern.  If we just try to modify the DAG Python definition to
change the 'start_date', this doesn't seem to take effect - that is, the
scheduler continues running the DAG at 05:00 GMT. So, a few questions:

(1) Once a DAG has been running, why don't changes to the Python
'start_date' seem to take effect?  It seems we always need to create a
different dag with a different dag_id.   Is this something about the way
the history is stored in the database, and is it something we could
possibly tweak in the database directly if we wanted to?

(2) Has anyone else dealt with this issue of needing to adjust a large set
of DAGs for DST?  Or am I the only unlucky one whose stakeholders don't
speak GMT?

Thanks for all of the help!

-rob


Re: Article: The Rise of the Data Engineer

2017-01-24 Thread Rob Goretsky
Maxime,
Just wanted to thank you for writing this article - much like the original
articles by Jeff Hammerbacher and DJ Patil coining the term "Data
Scientist", I feel this article stands as a great explanation of what the
title of "Data Engineer" means today.  As someone who has been working in
this role since before the title existed, many of the points here rang true
about how the technology and tools have evolved.

I started my career working with graphical ETL tools (Informatica) and
could never shake the feeling that I could get a lot more done, with a more
maintainable set of processes, if I could just write reusable functions in
any programming language and then keep them in a shared library.  Instead,
what the GUI tools forced upon us were massive Wiki documents laying out
'the 9 steps you need to follow perfectly in order to build a proper
Informatica workflow' , that developers would painfully need to follow
along with, rather than being able to encapsulate the things that didn't
change in one central 'function' to pass in parameters for the things that
varied from the defaults.

I also spent a lot of time early in my career trying to design data
warehouse tables using the Kimball methodology with star schemas and all
dimensions extracted out to separate dimension tables.  As columnar storage
formats with compression became available (Vertica/Parquet/etc), I started
gravitating more towards the idea that I could just store the raw string
dimension data in the fact table directly, denormalized, but it always felt
like I was breaking the 'purist' rules on how to design data warehouse
schemas 'the right way'.  So in that regard, thanks for validating my
feeling that it's OK to keep denormalized dimension data directly in fact
tables - it definitely makes our queries easier to write, and as you
mentioned, has the added benefit of helping you avoid all of that SCD fun!

We're about to put Airflow into production at my company (MLB.com) for a
handful of DAGs to start, so it will be running alongside our existing
Informatica server running 500+ workflows nightly.  But I can already see
the writing on the wall - it's really hard for us to find talented
engineers with Informatica experience along with more general computer
engineering backgrounds (many seem to have specialized purely in
Informatica) - so our newer engineers come in with strong Python/SQL
backgrounds and have been gravitating towards building newer jobs in
Airflow...

One item that I think deserves a mention in this article is the continuing
prevalence of SQL.  Many technologies have changed, but SQL has persisted
(pun intended?).  We went through a phase for a few years where it looked
like the tide was turning toward MapReduce, Pig, or other languages for
accessing and aggregating data.  But now it seems even the "NoSQL" data
stores have added SQL layers on top, and we have more SQL engines for
Hadoop than I can count.  SQL is easy to learn but tough to master, so to
me the two main languages in any modern Data Engineer's toolbelt are SQL
and a scripting language (Python/Ruby).  I think it's amazing that with so
much change in every aspect of how we do data warehousing, SQL has stood
the test of time...

Anyways, thanks again for writing this up, I'll definitely be sharing it
with my team!

-Rob


On Fri, Jan 20, 2017 at 7:38 PM, Maxime Beauchemin <
maximebeauche...@gmail.com> wrote:

> Hey I just published an article about the "Data Engineer" role in modern
> organizations and thought it could be of interest to this community.
>
> https://medium.com/@maximebeauchemin/the-rise-of-the-data-engineer-91be18f1e603#.5rkm4htnf
>
> Max
>


Re: NYC Meetup?

2016-12-22 Thread Rob Goretsky
We at MLB Advanced Media (MLBAM / MLB.com) are just about to get our first
few Airflow processes into production, so we'd love to join an NYC-based
meetup!

-rob


On Wed, Dec 21, 2016 at 9:49 AM, Jeremiah Lowin  wrote:

> It would be wonderful to have an east coast meetup! I would love to join if
> I can be in NY that day.
>
> Best,
> Jeremiah
>
> On Tue, Dec 20, 2016 at 4:24 PM Patrick D'Souza 
> wrote:
>
> > Having hosted a bunch of meetups in the past, we at SecurityScorecard are
> > very interested in hosting an Airflow meetup as well. We can easily host
> > around 50 people or so in January.
> >
> > On Fri, Dec 16, 2016 at 4:26 PM, Chris Riccomini 
> > wrote:
> >
> > > lol
> > >
> > > On Fri, Dec 16, 2016 at 11:04 AM, Joseph Napolitano <
> > > joseph.napolit...@blueapron.com.invalid> wrote:
> > >
> > > > Auto-correct got me. Metopes = Meetups
> > > >
> > > > Metope - a square space between triglyphs in a Doric frieze.
> > > >
> > > > On Fri, Dec 16, 2016 at 2:03 PM, Joseph Napolitano <
> > > > joseph.napolit...@blueapron.com> wrote:
> > > >
> > > > > We hosted several metopes here at Blue Apron.  I will bring it up
> to
> > > our
> > > > > administrative team and give an update.  Mid-january is probably a
> > good
> > > > > target.
> > > > >
> > > > > - Joe
> > > > >
> > > > > On Thu, Dec 15, 2016 at 5:18 PM, Luke Ptz 
> > > wrote:
> > > > >
> > > > >> Cool to see the interest is there! I unfortunately can't offer a
> > > > >> space for a meetup - can anyone else? If not, it could always be
> > > > >> informal and meet in a public setting.
> > > > >>
> > > > >> On Wed, Dec 14, 2016 at 7:08 PM, Andrew Phillips <
> > andr...@apache.org>
> > > > >> wrote:
> > > > >>
> > > > >> > We at Blue Apron would be very interested.
> > > > >> >>
> > > > >> >
> > > > >> > Same here.
> > > > >> >
> > > > >> > ap
> > > > >> >
> > > > >>
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > *Joe Napolitano *| Sr. Data Engineer
> > > > > www.blueapron.com | 5 Crosby Street, New York, NY 10013
> > > > >
> > > >
> > > >
> > > >
> > > > --
> > > > *Joe Napolitano *| Sr. Data Engineer
> > > > www.blueapron.com | 5 Crosby Street, New York, NY 10013
> > > >
> > >
> >
>