Re: Data lineage and data portal

2017-11-29 Thread Kate-Laurel Agnew
+1

On Wed, Nov 29, 2017 at 12:09 AM, Koen Mevissen 
wrote:

> +1
>
> I'm interested as well!
>
>
>
> On Tue, Nov 28, 2017 at 14:04, Marc Bollinger wrote:
>
> > +1
> >
> > On Mon, Nov 27, 2017 at 6:18 PM, Ruslan Dautkhanov wrote:
> >
> > > '''
> > > I'm now working on sql scanners, extractors and other tools that allow
> > > me to populate the database
> > > '''
> > >
> > > Very cool. Cloudera Navigator (not an open source product) does this
> > > too to some extent - it collects metadata and creates data lineage
> > > automatically (stored as a Solr collection) by parsing SQL queries.
> > >
> > > https://www.cloudera.com/documentation/enterprise/5-12-x/topics/datamgmt_extraction_indexing.html
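The parse-the-SQL approach described above can be sketched very naively in Python. This toy scan uses regexes only (a real extractor, such as Navigator's, needs a full SQL parser to handle CTEs, subqueries, quoting, and comments), and the table names and the sources/targets shape are illustrative assumptions, not any actual tool's output:

```python
import re

def extract_lineage(sql):
    """Naively extract (sources, targets) table-name sets from one SQL
    statement. A toy regex scan -- it ignores CTEs, subqueries, quoting,
    and block comments, which a parser-based extractor must handle."""
    sql = re.sub(r"--[^\n]*", "", sql)  # strip line comments
    targets = set(re.findall(
        r"\b(?:INSERT\s+INTO|CREATE\s+TABLE)\s+([\w.]+)", sql, re.IGNORECASE))
    sources = set(re.findall(
        r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, re.IGNORECASE))
    return sources - targets, targets

sources, targets = extract_lineage(
    "INSERT INTO mart.daily_bookings "
    "SELECT * FROM raw.bookings b JOIN raw.users u ON b.user_id = u.id"
)
print(sorted(sources))  # ['raw.bookings', 'raw.users']
print(sorted(targets))  # ['mart.daily_bookings']
```

Each statement's (sources, targets) pairs then become edges in the lineage graph.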
> > >
> > >
> > >
> > > On Mon, Nov 27, 2017 at 12:38 PM Gerard Toonstra 
> > > wrote:
> > >
> > > > Hi all,
> > > >
> > > > So something that really drew my attention recently is a "data
> > > > portal", as described by a team from Airbnb back in May. The idea is
> > > > basically a "Facebook of data":
> > > >
> > > > https://medium.com/airbnb-engineering/democratizing-data-at-airbnb-852d76c51770
> > > >
> > > >
> > > > Unfortunately it looks like it's not going to be open-sourced due to
> > > > how heavily integrated it is with their specific infrastructure, but
> > > > the idea itself sounds to me like something every organization of a
> > > > certain size should have, to keep track of data and stay informed as
> > > > an organization.
> > > >
> > > > Based on the descriptions, I prototyped a few things and am happy
> > > > with the results and the speed at which something like this can be
> > > > constructed. I'm now working on SQL scanners, extractors and other
> > > > tools that allow me to populate the database and put a PoC together
> > > > on some real data.
> > > >
> > > > If other people have similar concerns in their organization and
> > > > think this would be a great thing to have, reply to me or the list;
> > > > with sufficient interest I may set up a web chat/meet session so
> > > > this can be discussed in more detail and we can find ways to
> > > > progress it.
> > > >
> > > >
> > > > Best regards,
> > > >
> > > > Gerard
> > > >
> > >
> >
> --
> Kind regards,
> Met vriendelijke groet,
>
> *Koen Mevissen*
> Principal BI Developer
>
>
> *Travix Nederland B.V.*
> Piet Heinkade 55
> 1019 GM Amsterdam
> The Netherlands
>
> T. +31 (0)20 203 3241
> E: kmevis...@travix.com
> www.travix.com
>
> *Brands:* CheapTickets | Vliegwinkel | Vayama | BudgetAir | Flugladen
>



-- 

Kate-Laurel Agnew
Data Engineer
m: 503-741-9207
e: kag...@signal.co
signal.co | Cut Through the Noise

This e-mail and any files transmitted with it are for the sole use of the
intended recipient(s) and may contain confidential and privileged
information. Any unauthorized use of this email is strictly prohibited.
©2015 Signal. All rights reserved.


Re: Airflow Deployment tools

2017-11-15 Thread Kate-Laurel Agnew
Our Airflow situation:
• Development happens in two different repos (a repo that holds a lot of
cross-company python tools, for the core app and any plugins we develop on
top of it, and our reporting tools/infra repo, for DAGs and related utility
files).
• The core app & plugins get packaged together with their imports into a
.pex (using the 'pants' build tool)  with some related code and glue, and
can be manually deployed to the relevant box in staging and/or prod any
time with puppet (or automatically during the weekly deploy) once they're
in master.
• Updates to DAGs and utility files only get automatically deployed to prod
from master during the weekly deploy, but we're a little loosey-goosey
about editing DAGs in prod when necessary, since Airflow primarily handles
internal stuff and we're still in the early stages of switching our legacy
junk over to it.
• Right now, everything Airflow-related in prod runs on a single AWS EC2
instance, with an AWS RDS MySQL database backing it.  We haven't yet figured
out what our best options for scaling are going to be (opinions welcome,
please).

On Wed, Nov 15, 2017 at 9:26 AM, Laura Lorenz 
wrote:

> Infrastructure wise we use docker containers, hosted via Kubernetes on
> Google Container Engine  and deployed with Helm. We bake our DAGs and
> custom code into the images - so in the end the deployer does a `helm
> upgrade` command locally, the images are rebuilt with the newest code, and
> then all the containers are recreated with that new image. Our webserver,
> worker, flower, and scheduler containers are derived off of
> https://github.com/puckel/docker-airflow, and we use the official rabbitmq
> image from Docker Hub. Our metadata database is in Cloud SQL for our QA and
> production clusters on GCE, but for local dev we use the official mysql
> image from docker hub. This style of deployment interrupts any running
> tasks since the worker container is also killed to be recreated off the new
> image.
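The bake-the-DAGs-into-the-image step described above might look roughly like the Dockerfile fragment below. This is a hedged sketch: the base-image tag and the COPY source/destination paths are assumptions, not the actual layout used here.

```dockerfile
# Hypothetical image build for a "bake code into the image" deploy flow.
# puckel/docker-airflow is the base mentioned in the thread; the tag and
# the /usr/local/airflow paths are assumptions.
FROM puckel/docker-airflow:1.8.1

# Bake DAGs and custom code in, so `helm upgrade` ships them with the image
COPY dags/ /usr/local/airflow/dags/
COPY plugins/ /usr/local/airflow/plugins/
```

Rebuilding this image and recreating the containers is what interrupts running tasks, as noted above.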
>
> On Wed, Nov 15, 2017 at 7:42 AM, Zsolt Tóth 
> wrote:
>
> > We are also using Ansible for:
> > - Installing/upgrading/configuring Airflow (there are several airflow
> >   roles on git)
> > - Deploying the pipelines
> > - Restarting Airflow webserver/scheduler
> >
> > It would be great to have Airflow manageable from Hadoop cluster managers
> > (Cloudera Manager, Ambari). For this, a parcel (for Cloudera) would need
> > to be created and installed. If anyone has done this before, please share
> > your experience!
> >
> > Zsolt
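A role of the kind Zsolt mentions might contain tasks along these lines. This is a hedged sketch using standard Ansible modules (`pip`, `template`, `service`); the version pin, paths, and service names are illustrative assumptions, not any particular published role:

```yaml
# Hypothetical excerpt from an Ansible role that installs and runs Airflow.
- name: Install Airflow via pip
  pip:
    name: apache-airflow
    version: "1.8.2"

- name: Render airflow.cfg from a template
  template:
    src: airflow.cfg.j2
    dest: /home/airflow/airflow/airflow.cfg
  notify: restart airflow

- name: Ensure webserver and scheduler are running
  service:
    name: "{{ item }}"
    state: started
  with_items:
    - airflow-webserver
    - airflow-scheduler
```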
> >
> >
> > 2017-11-15 13:30 GMT+01:00 Andrew Maguire :
> >
> > > Are there any options at all out there for an Airflow-as-a-service
> > > type approach?
> > >
> > > I'd love to just be able to define my DAGs and load them into some
> > > cloud UI and not have to worry about anything else.
> > >
> > > This looks kinda interesting -
> > > http://docs.qubole.com/en/latest/user-guide/airflow/introduction-airflow.html
> > >
> > > Cheers,
> > > Andy
> > >
> > > On Wed, Nov 15, 2017 at 10:28 AM Driesprong, Fokko wrote:
> > >
> > > > I'm using Ansible to deploy Airflow; the steps are:
> > > > - First install Airflow using pip (or an rc using curl)
> > > > - Do an `airflow version` to trigger the creation of the default
> > > >   config
> > > > - Set the config variables correctly using Ansible
> > > > - Deploy the supervisord files
> > > > - Start everything
> > > >
> > > > A separate role is there to deploy Postgres. But if you are working
> > > > in a cloud environment, you can also get Postgres/MySQL as a service.
> > > > Hope this helps.
> > > >
> > > > Cheers, Fokko
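The supervisord files in the step list above might look something like this fragment. It is a sketch only: the program names, user, and binary path are assumptions, not Fokko's actual configuration:

```ini
; Hypothetical supervisord entries for the Airflow processes above.
[program:airflow-webserver]
command=/usr/local/bin/airflow webserver
user=airflow
autostart=true
autorestart=true

[program:airflow-scheduler]
command=/usr/local/bin/airflow scheduler
user=airflow
autostart=true
autorestart=true
```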
> > > >
> > > > 2017-11-15 3:19 GMT+01:00 Marc Bollinger :
> > > >
> > > > > A Samson deploy runs a script that does a Broadside deploy for
> > > > > ECS, which bounces the web and scheduler workers and updates the
> > > > > DAG directory on the workers. Docker images come from a GitHub ->
> > > > > Travis -> Quay CI setup.
> > > > >
> > > > > On Tue, Nov 14, 2017 at 10:18 AM, Alek Storm wrote:
> > > > >
> > > > > > Our TeamCity server detects that the master branch has changed,
> > > > > > then packages up the repo containing our DAGs as an artifact. We
> > > > > > then use SaltStack to trigger a bash script on the targeted
> > > > > > servers that downloads the artifact, moves the files to the
> > > > > > right place, and restarts the scheduler (on the master).
> > > > > >
> > > > > > This allows us to easily revert changes by redeploying a
> > > > > > particular TeamCity artifact, without touching the git history.
> > > > > >
> > > > > > Alek
> > > > > >
> > > > > > On Nov 14, 2017 11:02 AM, "Andy Hadjigeorgiou" <andyxha...@gmail.com>