About structuring memory use: we have some major chunks of code set up as web services. One of them (a Java-based app) runs on a separate machine and is limited to running 20 instances at once so that it cannot run out of RAM.

Our installation uses a separate Docker container for each Airflow app. Docker supports per-container quotas (via cgroups), but we have not used them yet. This feature lets us allocate a fixed amount of memory to each app, so one unruly app cannot crash the whole Airflow service.
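For what it's worth, the cgroups quota is a one-flag change at container start time. A minimal sketch (the 2g cap and the image name are placeholders, not our actual settings):

```
# Cap the container at 2 GB via the cgroups memory controller. Setting
# --memory-swap to the same value disallows swap beyond the cap, so a
# runaway app gets OOM-killed inside its own container instead of
# starving the host or the other Airflow apps.
docker run --memory=2g --memory-swap=2g my-airflow-app
```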
On Fri, Jun 3, 2016 at 9:41 AM, Dennis O'Brien <[email protected]> wrote:

> Thanks very much for the help.
>
> It seems I had two errors happening here. First, as Matthias pointed out,
> I was using jinja2.PackageLoader incorrectly. (It's always embarrassing
> to email a dev list when the error is somewhere else entirely.) I
> switched to jinja2.FileSystemLoader and it worked.
>
> My other issue was an out-of-memory problem. It wasn't obvious from the
> task instance log, but I found it when running the job from the command
> line. I dialed down the concurrency in airflow.cfg and that fixed the
> problem. I also deferred some imports so that the DAG itself was not
> importing so much (the entire pydata stack); the workers do those imports
> themselves when they run.
>
> And thanks for the pointers about template_searchpath and the pitfalls
> of sys.path hacks.
>
> I'd still be interested to learn more about how others structure more
> complex rollouts of Airflow. We're moving from the "proof of concept"
> phase to the "we're doing this" phase, so learning how others are
> configuring and deploying would be really helpful. Maybe at the next
> meetup. :-)
>
> cheers,
> Dennis
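Neither of those fixes is spelled out in the thread, so here is roughly what they look like (a sketch only; the template name and the DataFrame contents are invented, while the function names come from the tracebacks quoted below):

```
# reports/gsn_kpi_daily_email.py
import os

import jinja2


def get_email_html(kpi_df):
    # FileSystemLoader resolves templates/ relative to this file, so it
    # works regardless of how the reports package lands on sys.path.
    template_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), 'templates')
    env = jinja2.Environment(loader=jinja2.FileSystemLoader(template_dir))
    return env.get_template('kpi_email.html').render(kpi=kpi_df)


def send_daily_kpi_email():
    # Deferred import: the scheduler re-parses DAG files constantly, so
    # keeping the pydata stack out of module scope keeps parsing cheap;
    # the import only happens on a worker when the task actually runs.
    import pandas as pd

    kpi_df = pd.DataFrame({'kpi': [], 'value': []})  # stand-in for the real query
    return get_email_html(kpi_df)
```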
> On Thu, Jun 2, 2016 at 2:24 PM Maxime Beauchemin <[email protected]> wrote:
>
> > A few related things:
> > * You can use the `template_searchpath` param of the DAG constructor
> > to add folders to the jinja search path for your DAG. Documented here:
> > http://pythonhosted.org/airflow/code.html?highlight=template_searchpath#airflow.models.DAG
> > * Airflow only adds DAGS_FOLDER to your `sys.path`; beyond that you
> > have to manage your PYTHONPATH on your own. Note that in the current
> > version, messing with `sys.path` affects the main thread, meaning that
> > DAGs parsed after this alteration have a different `sys.path` than the
> > ones before, which can create serious, hard-to-debug problems. We're
> > addressing this issue in the next version, where DAG parsing will be
> > done in subprocesses.
> >
> > Max
> >
> > On Thu, Jun 2, 2016 at 1:43 AM, Matthias Huschle <[email protected]> wrote:
> >
> > > Hi Dennis,
> > >
> > > The first error is thrown by jinja2.PackageLoader. I think you still
> > > have to use dot notation in the first argument, since the module
> > > itself is under the reports path.
> > >
> > > In
> > > "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py",
> > > line 212, in get_email_html, change:
> > >
> > > env = jinja2.Environment(loader=jinja2.PackageLoader('gsn_kpi_daily_email', 'templates'))
> > >
> > > to:
> > >
> > > env = jinja2.Environment(loader=jinja2.PackageLoader('reports.gsn_kpi_daily_email', 'templates'))
> > >
> > > For the second error I don't see a cause. You should first check
> > > sys.path from within the script to see whether etl/lib/ is properly
> > > added. It's strange that the first error is thrown at runtime of the
> > > same module that fails to import in the second error. Do you modify
> > > sys.path from within your scripts?
> > >
> > > If I understand your setup correctly, an __init__.py is only
> > > necessary in reports. I don't think it has any purpose in folders
> > > that are directly on sys.path. However, the names "lib" and
> > > "db_connect" are quite generic. I'd consider renaming lib (to
> > > something like etl_lib), adding just etl/ to sys.path, and adding an
> > > __init__.py to the lib folder to avoid namespace pollution. You'd
> > > then have to use "from etl_lib import db_connect", of course.
> > >
> > > Hope that helps,
> > > Matthias
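Maxime's first point as a concrete sketch (the dag_id and the sql path follow Dennis' layout quoted below; the start_date is invented):

```
from datetime import datetime

from airflow import DAG

# template_searchpath lets SqlSensor and other operators resolve .sql
# templates that live outside DAGS_FOLDER, without sys.path hacks.
dag = DAG(
    dag_id='etl_gsn_daily_kpi_email',
    start_date=datetime(2016, 6, 1),
    template_searchpath=['/home/airflow/workspace/verticadw/airflow/dags/sql'],
)
```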
> > > 2016-06-01 20:10 GMT+02:00 Dennis O'Brien <[email protected]>:
> > >
> > > > Hi folks,
> > > >
> > > > I'm looking for some advice on how others separate their DAGs from
> > > > the code those DAGs call, and on any PYTHONPATH fixups that may be
> > > > necessary.
> > > >
> > > > I have a project that looks like this:
> > > >
> > > > .
> > > > ├── airflow
> > > > │   ├── dags
> > > > │   │   ├── reports
> > > > │   │   └── sql
> > > > │   └── deploy
> > > > │       └── templates
> > > > ├── etl
> > > > │   ├── lib
> > > >
> > > > All the DAGs are in airflow/dags.
> > > > The SQL used by SqlSensor tasks is in airflow/dags/sql.
> > > > The Python code used by PythonOperator is in airflow/dags/reports
> > > > and etl/lib.
> > > > Existing ETL code is all in etl.
> > > >
> > > > In ./airflow/dags/etl_gsn_daily_kpi_email.py:
> > > > ```
> > > > from reports.gsn_kpi_daily_email import send_daily_kpi_email
> > > > ```
> > > >
> > > > I thought I could just import code in airflow/dags/reports from
> > > > airflow/dags, since DAGS_FOLDER is added to sys.path, but after
> > > > deploying the code I saw an error in the web UI about failing to
> > > > import the module reports.gsn_kpi_daily_email. So I added
> > > > __init__.py files in dags and dags/reports, with no success. Then
> > > > I modified my upstart scripts to fix up the PYTHONPATH:
> > > >
> > > > ```
> > > > env PYTHONPATH=$PYTHONPATH:{{ destination_dir }}/airflow/dags/:{{ destination_dir }}/etl/lib/
> > > > export PYTHONPATH
> > > > ```
> > > >
> > > > This fixed the error in the web UI, but on the next run of the job
> > > > I got these tracebacks:
> > > >
> > > > ```
> > > > [2016-06-01 12:14:38,352] {models.py:1286} ERROR - No module named gsn_kpi_daily_email
> > > > Traceback (most recent call last):
> > > >   File "/home/airflow/venv/local/lib/python2.7/site-packages/airflow/models.py", line 1245, in run
> > > >     result = task_copy.execute(context=context)
> > > >   File "/home/airflow/venv/lib/python2.7/site-packages/airflow/operators/python_operator.py", line 66, in execute
> > > >     return_value = self.python_callable(*self.op_args, **self.op_kwargs)
> > > >   File "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 223, in send_daily_kpi_email
> > > >     html = get_email_html(kpi_df)
> > > >   File "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 212, in get_email_html
> > > >     env = jinja2.Environment(loader=jinja2.PackageLoader('gsn_kpi_daily_email', 'templates'))
> > > >   File "/home/airflow/venv/local/lib/python2.7/site-packages/jinja2/loaders.py", line 224, in __init__
> > > >     provider = get_provider(package_name)
> > > >   File "/home/airflow/venv/local/lib/python2.7/site-packages/pkg_resources/__init__.py", line 419, in get_provider
> > > >     __import__(moduleOrReq)
> > > > ImportError: No module named gsn_kpi_daily_email
> > > >
> > > > ...
> > > >
> > > > [2016-06-01 12:19:42,556] {models.py:250} ERROR - Failed to import: /home/airflow/workspace/verticadw/airflow/dags/etl_gsn_daily_kpi_email.py
> > > > Traceback (most recent call last):
> > > >   File "/home/airflow/venv/local/lib/python2.7/site-packages/airflow/models.py", line 247, in process_file
> > > >     m = imp.load_source(mod_name, filepath)
> > > >   File "/home/airflow/workspace/verticadw/airflow/dags/etl_gsn_daily_kpi_email.py", line 4, in <module>
> > > >     from reports.gsn_kpi_daily_email import send_daily_kpi_email
> > > >   File "/home/airflow/workspace/verticadw/airflow/dags/reports/gsn_kpi_daily_email.py", line 8, in <module>
> > > >     from db_connect import get_db_connection_native as get_db_connection
> > > > ImportError: No module named db_connect
> > > > ```
> > > >
> > > > The first error is strange because the module it can't find,
> > > > gsn_kpi_daily_email, is right there in the stack trace.
> > > >
> > > > With the second error, db_connect is in etl/lib, which I added to
> > > > the PYTHONPATH.
> > > >
> > > > If anyone has advice on how to separate DAG code from other Python
> > > > code, I'd appreciate any pointers.
> > > >
> > > > Some configuration info:
> > > > airflow[celery,crypto,hive,jdbc,postgres,s3,redis,vertica]==1.7.1.2
> > > > celery[redis]==3.1.23
> > > > AWS EC2 m4.large with Ubuntu 14.04 AMI
> > > > Using CeleryExecutor
> > > >
> > > > thanks,
> > > > Dennis
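For the layout question Dennis closes with, Matthias' suggestion works out to something like this (only the rename and the sys.path change come from his reply; the function name is taken from the traceback):

```
# Suggested layout, with etl/ (not etl/lib/) on PYTHONPATH:
#
#   etl/
#   └── etl_lib/          # renamed from lib/ to avoid the generic name
#       ├── __init__.py   # makes etl_lib a proper package
#       └── db_connect.py
#
# Report code then imports through the package, which can no longer
# collide with some other "lib" or "db_connect" that happens to be on
# sys.path:
from etl_lib import db_connect

conn = db_connect.get_db_connection_native()
```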
--
Lance Norskog
[email protected]
Redwood City, CA