Hi folks

I'm looking for some advice on how others separate their DAGs from the
code those DAGs call, and on any PYTHONPATH fixups that may be necessary.

I have a project that looks like this:

```
.
├── airflow
│   ├── dags
│   │   ├── reports
│   │   └── sql
│   └── deploy
│       └── templates
├── etl
│   ├── lib
```

All the DAGs are in airflow/dags.
The SQL used by SqlSensor tasks is in airflow/dags/sql.
The Python code used by PythonOperator tasks is in airflow/dags/reports and etl/lib.
Existing ETL code is all in etl.

In ./airflow/dags/etl_gsn_daily_kpi_email.py:
```
from reports.gsn_kpi_daily_email import send_daily_kpi_email
```

I thought I could just import code in airflow/dags/reports from
airflow/dags, since DAGS_FOLDER is added to sys.path, but after deploying the
code I saw an error in the web UI about failing to import the module
`reports.gsn_kpi_daily_email`. I added __init__.py files in dags and
dags/reports with no success, so I then modified my upstart scripts to fix up
the PYTHONPATH:

```
env PYTHONPATH=$PYTHONPATH:{{ destination_dir }}/airflow/dags/:{{ destination_dir }}/etl/lib/
export PYTHONPATH
```

This fixed the error in the web UI, but on the next run of the job I got
these tracebacks:
```
[2016-06-01 12:14:38,352] {models.py:1286} ERROR - No module named gsn_kpi_daily_email
Traceback (most recent call last):
  File "/home/airflow/venv/local/lib/python2.7/site-packages/airflow/models.py", line 1245, in run
    result = task_copy.execute(context=context)
  File "/home/airflow/venv/lib/python2.7/site-packages/airflow/operators/python_operator.py", line 66, in execute
    return_value = self.python_callable(*self.op_args, **self.op_kwargs)
  File "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 223, in send_daily_kpi_email
    html = get_email_html(kpi_df)
  File "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 212, in get_email_html
    env = jinja2.Environment(loader=jinja2.PackageLoader('gsn_kpi_daily_email', 'templates'))
  File "/home/airflow/venv/local/lib/python2.7/site-packages/jinja2/loaders.py", line 224, in __init__
    provider = get_provider(package_name)
  File "/home/airflow/venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 419, in get_provider
    __import__(moduleOrReq)
ImportError: No module named gsn_kpi_daily_email

...

[2016-06-01 12:19:42,556] {models.py:250} ERROR - Failed to import: /home/airflow/workspace/verticadw/airflow/dags/etl_gsn_daily_kpi_email.py
Traceback (most recent call last):
  File "/home/airflow/venv/local/lib/python2.7/site-packages/airflow/models.py", line 247, in process_file
    m = imp.load_source(mod_name, filepath)
  File "/home/airflow/workspace/verticadw/airflow/dags/etl_gsn_daily_kpi_email.py", line 4, in <module>
    from reports.gsn_kpi_daily_email import send_daily_kpi_email
  File "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 8, in <module>
    from db_connect import get_db_connection_native as get_db_connection
ImportError: No module named db_connect
```

The first error is strange because the module it can't find,
gsn_kpi_daily_email, is right there in the stack trace as the file that's
already executing.
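
Looking at the trace, what actually fails is pkg_resources trying to
__import__ 'gsn_kpi_daily_email' on behalf of jinja2.PackageLoader. One
workaround I'm considering (just an untested sketch; it assumes the templates
directory sits next to gsn_kpi_daily_email.py, which is what the PackageLoader
call implies) is to point Jinja at the directory on disk so nothing has to be
importable by name:

```
# Inside reports/gsn_kpi_daily_email.py -- untested sketch, not what I'm running.
# Load templates from a filesystem path instead of a package, so pkg_resources
# never has to __import__('gsn_kpi_daily_email') inside the worker.
import os

import jinja2

TEMPLATE_DIR = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'templates')

def get_jinja_env():
    # FileSystemLoader only needs a directory on disk, not an importable package.
    return jinja2.Environment(loader=jinja2.FileSystemLoader(TEMPLATE_DIR))
```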

As for the second error, db_connect is in etl/lib, which I added to the
PYTHONPATH.
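
Since the Celery worker may not be picking up the PYTHONPATH from the upstart
script at all, another workaround I've thought about (sketch only, assuming
the layout above, where the DAG file sits two levels below the repo root) is
fixing up sys.path at the top of the DAG file itself:

```
# Top of etl_gsn_daily_kpi_email.py -- hypothetical sketch, not what I'm running.
import os
import sys

# In the layout above, the repo root is two directories up from airflow/dags/.
REPO_ROOT = os.path.abspath(os.path.join(os.path.dirname(__file__), '..', '..'))
ETL_LIB = os.path.join(REPO_ROOT, 'etl', 'lib')
if ETL_LIB not in sys.path:
    sys.path.insert(0, ETL_LIB)

from reports.gsn_kpi_daily_email import send_daily_kpi_email
```

That would at least make the db_connect import independent of whatever
environment the worker process inherits.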

If anyone has advice on how to separate DAG code from other Python code, I'd
appreciate any pointers.
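
One direction I've been wondering about (purely a sketch; the package name and
module list are made up from the layout above) is turning etl into an
installable package and pip-installing it into the airflow virtualenv at
deploy time, so the imports stop depending on PYTHONPATH entirely:

```
# etl/setup.py -- hypothetical sketch, names assumed.
from setuptools import setup, find_packages

setup(
    name='gsn-etl',
    version='0.1.0',
    package_dir={'': 'lib'},        # treat etl/lib as the import root
    packages=find_packages('lib'),  # any packages living under lib/
    py_modules=['db_connect'],      # plus top-level modules such as db_connect.py
)
```

The deploy step would then run something like `pip install -e {{ destination_dir }}/etl`
into the virtualenv instead of exporting PYTHONPATH.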

And some configuration info:
airflow[celery,crypto,hive,jdbc,postgres,s3,redis,vertica]==1.7.1.2
celery[redis]==3.1.23
AWS EC2 m4.large with Ubuntu 14.04 AMI
Using CeleryExecutor

thanks,
Dennis
