Thanks very much for the help.

It seems I had two errors happening here.  First, as Matthias pointed
out, I was using jinja2.PackageLoader incorrectly.  (It's always
embarrassing to email a dev list when the error is somewhere entirely
different.)  I switched to jinja2.FileSystemLoader and it worked.
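
In case anyone else trips on the same thing, the working pattern looks
roughly like this (a minimal sketch; the template directory and file name
are just examples):

```
import os
import jinja2

# PackageLoader requires an importable package, which a DAGs folder
# usually isn't.  FileSystemLoader just takes a directory on disk.
TEMPLATE_DIR = os.path.join(os.path.dirname(__file__), 'templates')
env = jinja2.Environment(loader=jinja2.FileSystemLoader(TEMPLATE_DIR))
template = env.get_template('kpi_email.html')  # hypothetical file name
html = template.render(kpis=[])
```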

My other issue was an out-of-memory problem.  It wasn't obvious from the
task instance log, but I found it when running the job from the command
line.  I dialed down the concurrency in airflow.cfg and that fixed the
problem.  I also deferred some imports so that the DAG file itself no
longer imports so much (the entire pydata stack); the workers now do
those imports when the tasks run.
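
For reference, the deferred-import pattern is just moving the heavy
imports from module scope into the python_callable itself (a sketch; the
body is a placeholder):

```
def send_daily_kpi_email(**kwargs):
    # The heavy pydata imports happen here, when a worker runs the task,
    # not at module scope where the scheduler's DAG parsing would pay
    # the memory cost.
    import pandas as pd
    kpi_df = pd.DataFrame()  # placeholder for the real report logic
    return len(kpi_df)
```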

And thanks for the pointers about template_searchpath and the pitfalls of
sys.path hacks.

I'd still be interested to learn more about how others structure more
complex rollouts of Airflow.  We're moving from the "proof of concept"
phase to the "we're doing this" phase, so learning how others are
configuring and deploying would be really helpful.  Maybe at the next
meetup. :-)

cheers,
Dennis


On Thu, Jun 2, 2016 at 2:24 PM Maxime Beauchemin <[email protected]>
wrote:

> A few related things:
> * You can use the `template_searchpath` param of the DAG constructor to add
> folders to the jinja searchpath for your DAG (see the sketch below).
> Documented here:
> http://pythonhosted.org/airflow/code.html?highlight=template_searchpath#airflow.models.DAG
> * Airflow only adds DAGS_FOLDER to your `sys.path`; beyond that you have to
> manage your PYTHONPATH on your own. Note that in the current version,
> messing with `sys.path` affects the main thread, meaning that DAGs parsed
> after the alteration have a different `sys.path` than the ones before,
> which can create serious, hard-to-debug problems. We're addressing this
> issue in the next version, where DAG parsing will be done in subprocesses.
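>
> A minimal sketch of the first point (the dag_id, date, and path are just
> examples):
>
> ```
> from datetime import datetime
> from airflow import DAG
>
> # Folders listed in template_searchpath are added to the jinja search
> # path used to resolve templated fields such as an operator's sql= arg.
> dag = DAG(
>     dag_id='etl_gsn_daily_kpi_email',
>     start_date=datetime(2016, 6, 1),
>     template_searchpath=['/home/airflow/workspace/verticadw/airflow/dags/sql'],
> )
> ```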
>
> Max
>
> On Thu, Jun 2, 2016 at 1:43 AM, Matthias Huschle <
> [email protected]> wrote:
>
> > Hi Dennis,
> >
> > The first error is thrown by jinja2.PackageLoader. I think you still have
> > to use dot notation in the first argument, as the module itself is under
> > the reports path:
> >
> > In:
> > "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py",
> > line 212, in get_email_html
> > Change:
> > env = jinja2.Environment(loader=jinja2.PackageLoader('gsn_kpi_daily_email',
> >     'templates'))
> > To:
> > env = jinja2.Environment(loader=jinja2.PackageLoader('reports.gsn_kpi_daily_email',
> >     'templates'))
> >
> > For the second error I don't see a cause. You should first check sys.path
> > from within the script to see if etl/lib/ is properly added. It's strange
> > that the first error is thrown during runtime of the same module that
> > fails to import in the second error. Do you modify sys.path from within
> > your scripts?
> >
> > If I understand your setup correctly, an __init__.py is only necessary in
> > reports. I don't think it has any purpose in folders that are directly on
> > sys.path. However, the names "lib" and "db_connect" are quite generic. I'd
> > consider renaming lib (to sth. like etl_lib), adding just etl/ to
> > sys.path, and adding an __init__.py to that folder to avoid namespace
> > pollution. You'd have to use "from etl_lib import db_connect" then, of
> > course.
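> >
> > Roughly, as a sketch (assuming etl/ itself ends up on sys.path):
> >
> > ```
> > # etl/
> > # ├── etl_lib/            <- renamed from lib/
> > # │   ├── __init__.py     <- makes etl_lib an importable package
> > # │   └── db_connect.py
> >
> > # then in reports/gsn_kpi_daily_email.py:
> > from etl_lib import db_connect
> > conn = db_connect.get_db_connection_native()
> > ```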
> >
> > Hope that helps,
> > Matthias
> >
> >
> > 2016-06-01 20:10 GMT+02:00 Dennis O'Brien <[email protected]>:
> >
> > > Hi folks
> > >
> > > I'm looking for some advice here on how others separate their DAGs and
> > > the code those DAGs call, and any PYTHONPATH fixups that may be
> > > necessary.
> > >
> > > I have a project that looks like this:
> > >
> > > .
> > > ├── airflow
> > > │   ├── dags
> > > │   │   ├── reports
> > > │   │   └── sql
> > > │   └── deploy
> > > │       └── templates
> > > ├── etl
> > > │   ├── lib
> > > All the DAGs are in airflow/dags.
> > > The sql used by SqlSensor tasks is in airflow/dags/sql.
> > > The python code used by PythonOperator is in airflow/dags/reports and
> > > etl/lib.
> > > Existing etl code is all in etl.
> > >
> > > In ./airflow/dags/etl_gsn_daily_kpi_email.py
> > > ```
> > > from reports.gsn_kpi_daily_email import send_daily_kpi_email
> > > ```
> > >
> > > I thought I could just import code in airflow/dags/reports from
> > > airflow/dags, since DAGS_FOLDER is added to sys.path, but after
> > > deploying the code I saw an error in the web UI about failing to import
> > > the module `reports.gsn_kpi_daily_email`.  So I added __init__.py files
> > > in dags and dags/reports with no success.  Then I modified my upstart
> > > scripts to fix up the PYTHONPATH.
> > >
> > > ```
> > > env PYTHONPATH=$PYTHONPATH:{{ destination_dir }}/airflow/dags/:{{ destination_dir }}/etl/lib/
> > > export PYTHONPATH
> > > ```
> > >
> > > This fixed the error in the web UI, but on the next run of the job I
> > > got these tracebacks:
> > > ```
> > > [2016-06-01 12:14:38,352] {models.py:1286} ERROR - No module named gsn_kpi_daily_email
> > > Traceback (most recent call last):
> > >   File "/home/airflow/venv/local/lib/python2.7/site-packages/airflow/models.py", line 1245, in run
> > >     result = task_copy.execute(context=context)
> > >   File "/home/airflow/venv/lib/python2.7/site-packages/airflow/operators/python_operator.py", line 66, in execute
> > >     return_value = self.python_callable(*self.op_args, **self.op_kwargs)
> > >   File "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 223, in send_daily_kpi_email
> > >     html = get_email_html(kpi_df)
> > >   File "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 212, in get_email_html
> > >     env = jinja2.Environment(loader=jinja2.PackageLoader('gsn_kpi_daily_email', 'templates'))
> > >   File "/home/airflow/venv/local/lib/python2.7/site-packages/jinja2/loaders.py", line 224, in __init__
> > >     provider = get_provider(package_name)
> > >   File "/home/airflow/venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 419, in get_provider
> > >     __import__(moduleOrReq)
> > > ImportError: No module named gsn_kpi_daily_email
> > >
> > > ...
> > >
> > > [2016-06-01 12:19:42,556] {models.py:250} ERROR - Failed to import:
> > > /home/airflow/workspace/verticadw/airflow/dags/etl_gsn_daily_kpi_email.py
> > > Traceback (most recent call last):
> > >   File "/home/airflow/venv/local/lib/python2.7/site-packages/airflow/models.py", line 247, in process_file
> > >     m = imp.load_source(mod_name, filepath)
> > >   File "/home/airflow/workspace/verticadw/airflow/dags/etl_gsn_daily_kpi_email.py", line 4, in <module>
> > >     from reports.gsn_kpi_daily_email import send_daily_kpi_email
> > >   File "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 8, in <module>
> > >     from db_connect import get_db_connection_native as get_db_connection
> > > ImportError: No module named db_connect
> > > ```
> > >
> > > The first error is strange because the module it can't find,
> > > gsn_kpi_daily_email, is in the stack trace.
> > >
> > > As for the second error, db_connect is in etl/lib, which I added to the
> > > PYTHONPATH.
> > >
> > > If anyone has advice on how to separate DAG code and other Python code,
> > > I'd appreciate any pointers.
> > >
> > > And some configuration info:
> > > airflow[celery,crypto,hive,jdbc,postgres,s3,redis,vertica]==1.7.1.2
> > > celery[redis]==3.1.23
> > > AWS EC2 m4.large with Ubuntu 14.04 AMI
> > > Using CeleryExecutor
> > >
> > > thanks,
> > > Dennis
> > >
> >
>
