[jira] [Commented] (AIRFLOW-584) Airflow Pool does not limit running tasks
[ https://issues.apache.org/jira/browse/AIRFLOW-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16348468#comment-16348468 ]

barak schoster commented on AIRFLOW-584:
----------------------------------------

was this solved in a higher version?

> Airflow Pool does not limit running tasks
> -----------------------------------------
>
>                 Key: AIRFLOW-584
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-584
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: pools
>    Affects Versions: Airflow 1.7.1.3
>         Environment: Ubuntu 14.04
>            Reporter: David Kegley
>            Priority: Major
>         Attachments: img1.png, img2.png
>
> Airflow pools are not limiting the number of running task instances for the following dag in 1.7.1.3.
> Steps to recreate:
> Create a pool of size 5 through the UI.
> The following dag has 52 tasks with increasing priority corresponding to the task number. There should only ever be 5 tasks running at a time, however I observed 29 'used slots' in a pool with 5 slots.
> {code}
> dag_name = 'pools_bug'
> default_args = {
>     'owner': 'airflow',
>     'depends_on_past': False,
>     'start_date': datetime(2016, 10, 20),
>     'email_on_failure': False,
>     'retries': 1
> }
> dag = DAG(dag_name, default_args=default_args, schedule_interval="0 8 * * *")
> start = DummyOperator(task_id='start', dag=dag)
> end = DummyOperator(task_id='end', dag=dag)
> for i in range(50):
>     sleep_command = 'sleep 10'
>     task_name = 'task-{}'.format(i)
>     op = BashOperator(
>         task_id=task_name,
>         bash_command=sleep_command,
>         execution_timeout=timedelta(hours=4),
>         priority_weight=i,
>         pool=dag_name,
>         dag=dag)
>     start.set_downstream(op)
>     end.set_upstream(op)
> {code}
> Relevant configurations from airflow.cfg:
> {code}
> [core]
> # The executor class that airflow should use. Choices include
> # SequentialExecutor, LocalExecutor, CeleryExecutor
> executor = CeleryExecutor
> # The amount of parallelism as a setting to the executor. This defines
> # the max number of task instances that should run simultaneously
> # on this airflow installation
> parallelism = 64
> # The number of task instances allowed to run concurrently by the scheduler
> dag_concurrency = 64
> # The maximum number of active DAG runs per DAG
> max_active_runs_per_dag = 1
> [celery]
> # This section only applies if you are using the CeleryExecutor in
> # [core] section above
> # The app name that will be used by celery
> celery_app_name = airflow.executors.celery_executor
> # The concurrency that will be used when starting workers with the
> # "airflow worker" command. This defines the number of task instances that
> # a worker will take, so size up your workers based on the resources on
> # your worker box and the nature of your tasks
> celeryd_concurrency = 64
> [scheduler]
> # Task instances listen for external kill signal (when you clear tasks
> # from the CLI or the UI), this defines the frequency at which they should
> # listen (in seconds).
> job_heartbeat_sec = 5
> # The scheduler constantly tries to trigger new tasks (look at the
> # scheduler section in the docs for more information). This defines
> # how often the scheduler should run (in seconds).
> scheduler_heartbeat_sec = 5
> {code}
> !img1.png!
> !img2.png!

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
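For reference, the invariant the reporter expects from a pool can be sketched with a toy slot-accounting model (hypothetical Python, not Airflow's scheduler code): with 5 slots, at most 5 of the 50 tasks should ever be started at once.

```python
# Toy slot-accounting model of the expected pool behavior (hypothetical;
# this is NOT Airflow code, only the invariant the reporter describes).

def schedule_with_pool(pending, pool_slots, running):
    """Return the pending tasks that may start now, given the pool size
    and the tasks currently running in the pool."""
    open_slots = pool_slots - len(running)
    return pending[:max(open_slots, 0)]

pending = ['task-{}'.format(i) for i in range(50)]

# With an empty pool of 5 slots, exactly 5 tasks may start.
started = schedule_with_pool(pending, 5, running=[])
assert len(started) == 5

# With the pool full, nothing else may start -- never 29 'used slots'.
assert schedule_with_pool(pending, 5, running=started) == []
```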
[jira] [Created] (AIRFLOW-2055) Variable access documentation in templated task instance is ambiguous
Matthew Bowden created AIRFLOW-2055:
---------------------------------------

             Summary: Variable access documentation in templated task instance is ambiguous
                 Key: AIRFLOW-2055
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2055
             Project: Apache Airflow
          Issue Type: Improvement
    Affects Versions: 1.9.0
            Reporter: Matthew Bowden
            Assignee: Matthew Bowden

Some of the internal documentation in {{airflow/models.py}} (specifically under {{TaskInstance.get_template_context}}) is slightly ambiguous. This makes it misleading for users attempting to write templates which use {{Variable}}s.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
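For context on the ambiguity: templates can reach a {{Variable}} either through the `var` accessor that {{get_template_context}} exposes (e.g. `{{ var.value.my_key }}`) or by calling {{Variable.get}} in operator code. A stdlib-only stand-in (hypothetical class names, not Airflow's implementation) shows how the attribute-style template lookup works:

```python
# Stdlib-only stand-in for the template `var` accessor (hypothetical class;
# the real implementation lives in airflow/models.py and reads the metadata DB).

class VariableAccessor:
    def __init__(self, store):
        self._store = store  # stand-in for the Variable table

    def __getattr__(self, key):
        # `{{ var.value.my_key }}` resolves to an attribute lookup like this
        return self._store[key]

store = {'env': 'prod', 'bucket': 'my-bucket'}
template_context = {'var': {'value': VariableAccessor(store)}}

# In a template, `var.value.env` would render via this lookup chain:
assert template_context['var']['value'].env == 'prod'
```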
[jira] [Commented] (AIRFLOW-1793) DockerOperator doesn't work with docker_conn_id
[ https://issues.apache.org/jira/browse/AIRFLOW-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349149#comment-16349149 ]

Konrad Gołuchowski commented on AIRFLOW-1793:
---------------------------------------------

Submitted fix: https://github.com/apache/incubator-airflow/pull/2998

> DockerOperator doesn't work with docker_conn_id
> -----------------------------------------------
>
>                 Key: AIRFLOW-1793
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1793
>             Project: Apache Airflow
>          Issue Type: Bug
>            Reporter: Cedrik Neumann
>            Assignee: Cedrik Neumann
>            Priority: Major
>
> The implementation of DockerOperator uses `self.base_url` when loading the DockerHook instead of `self.docker_url`:
> https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/operators/docker_operator.py#L150
> {noformat}
> [2017-11-08 16:10:13,082] {base_task_runner.py:98} INFO - Subtask: File "/src/apache-airflow/airflow/operators/docker_operator.py", line 161, in execute
> [2017-11-08 16:10:13,083] {base_task_runner.py:98} INFO - Subtask:     self.cli = self.get_hook().get_conn()
> [2017-11-08 16:10:13,083] {base_task_runner.py:98} INFO - Subtask: File "/src/apache-airflow/airflow/operators/docker_operator.py", line 150, in get_hook
> [2017-11-08 16:10:13,083] {base_task_runner.py:98} INFO - Subtask:     base_url=self.base_url,
> [2017-11-08 16:10:13,083] {base_task_runner.py:98} INFO - Subtask: AttributeError: 'DockerOperator' object has no attribute 'base_url'
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
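The traceback is a plain attribute-name mix-up; a minimal stdlib reproduction (hypothetical classes, not the real DockerOperator) shows why `get_hook` raises and what the one-line fix amounts to:

```python
# Minimal reproduction of the AttributeError (hypothetical classes; the real
# operator passes docker_url into a DockerHook instead of returning it).

class BrokenOperator:
    def __init__(self, docker_url):
        self.docker_url = docker_url  # note: no `base_url` attribute is ever set

    def get_hook(self):
        return self.base_url  # bug: wrong attribute name -> AttributeError

class FixedOperator(BrokenOperator):
    def get_hook(self):
        return self.docker_url  # fix: use the attribute that actually exists

op = BrokenOperator('unix://var/run/docker.sock')
try:
    op.get_hook()
    raised = False
except AttributeError:
    raised = True
assert raised

assert FixedOperator('unix://var/run/docker.sock').get_hook() == 'unix://var/run/docker.sock'
```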
[jira] [Created] (AIRFLOW-2056) Integrate Google Cloud Storage (GCS) operators into 1 file
Kaxil Naik created AIRFLOW-2056:
-----------------------------------

             Summary: Integrate Google Cloud Storage (GCS) operators into 1 file
                 Key: AIRFLOW-2056
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2056
             Project: Apache Airflow
          Issue Type: Improvement
          Components: contrib, gcp
    Affects Versions: Airflow 2.0, 2.0.0
            Reporter: Kaxil Naik
            Assignee: Kaxil Naik
             Fix For: Airflow 2.0, 2.0.0

There are currently 5 operators:
* GoogleCloudStorageCopyOperator
* GoogleCloudStorageDownloadOperator
* GoogleCloudStorageListOperator
* GoogleCloudStorageToBigQueryOperator
* GoogleCloudStorageToGoogleCloudStorageOperator

It would be ideal to have 1 file *gcs_operator.py*, similar to *dataproc_operator.py*, containing all the operators related to Google Cloud Storage.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Assigned] (AIRFLOW-1665) Airflow webserver/scheduler don't handle database disconnects (mysql)
[ https://issues.apache.org/jira/browse/AIRFLOW-1665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zgl reassigned AIRFLOW-1665:
--------------------------------

    Assignee: zgl  (was: Vasanth Kumar)

> Airflow webserver/scheduler don't handle database disconnects (mysql)
> ---------------------------------------------------------------------
>
>                 Key: AIRFLOW-1665
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1665
>             Project: Apache Airflow
>          Issue Type: Bug
>    Affects Versions: Airflow 1.8
>            Reporter: Vasanth Kumar
>            Assignee: zgl
>            Priority: Major
>              Labels: database, reconnect
>             Fix For: 1.9.1
>
> The Airflow webserver & scheduler don't handle database disconnects. The processes appear to error out and either exit or are left in an off state. This was observed when using mysql.
> I don't see any database reconnect configuration or code.
> Stack trace for scheduler:
> {noformat}
> File "./MySQLdb/connections.py", line 204, in __init__
>     super(Connection, self).__init__(*args, **kwargs2)
> sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (2002, "Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)")
> {noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
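Until reconnect handling lands, the usual mitigations are SQLAlchemy-level (e.g. `pool_recycle`, or `pool_pre_ping` in SQLAlchemy 1.2+) or a retry wrapper around session calls. A generic stdlib sketch of the retry idea (hypothetical helper, using ConnectionError as a stand-in for OperationalError):

```python
import time

def with_retries(fn, retries=3, delay=0.01, transient=(ConnectionError,)):
    """Re-run `fn` on transient connection errors (a generic sketch; a real
    fix would wrap Airflow's SQLAlchemy session calls, not arbitrary fns)."""
    for attempt in range(retries):
        try:
            return fn()
        except transient:
            if attempt == retries - 1:
                raise  # exhausted retries: surface the original error
            time.sleep(delay * (2 ** attempt))  # simple exponential backoff

calls = {'n': 0}
def flaky_query():
    # Fails twice, then succeeds -- mimics a dropped MySQL connection (2002/2006).
    calls['n'] += 1
    if calls['n'] < 3:
        raise ConnectionError("Can't connect to local MySQL server")
    return 'ok'

assert with_retries(flaky_query) == 'ok'
assert calls['n'] == 3
```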
[jira] [Assigned] (AIRFLOW-1012) Add run_as_script option so jinja templating can be used for sql parameter
[ https://issues.apache.org/jira/browse/AIRFLOW-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zgl reassigned AIRFLOW-1012:
--------------------------------

    Assignee: zgl  (was: Ruslan Dautkhanov)

> Add run_as_script option so jinja templating can be used for sql parameter
> --------------------------------------------------------------------------
>
>                 Key: AIRFLOW-1012
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-1012
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: core, db
>    Affects Versions: Airflow 1.8
>            Reporter: Ruslan Dautkhanov
>            Assignee: zgl
>            Priority: Major
>              Labels: database, improvement, operators, sql
>
> It would be great to extend jinja templating to the sql parameter for SQL Operators.
> With this improvement, it's possible to have an extended Jinja template like the one below, which generates multiple SQL statements that can be passed as a single 'sql' parameter, separated by the ';' separator:
> {noformat}
> )
> >> OracleOperator(
>        task_id='give_owner_grants',
>        oracle_conn_id=ora_conn1,
>        run_as_script=True,
>        sql='''
>            {% for role in ['CONNECT', 'RESOURCE'] %}
>                GRANT {{ role }} TO {{ schema }};
>            {% endfor %}
>            {% for create_grant in ['PROCEDURE', 'SEQUENCE', 'SESSION', 'TABLE', 'VIEW'] %}
>                GRANT CREATE {{ create_grant }} TO {{ schema }};
>            {% endfor %}
>            {% for tbsp in ['DISCOVER_MART_IDX01', 'DISCOVER_MART_TBS01', 'STAGING_NOLOG'] %}
>                ALTER USER {{ schema }} QUOTA UNLIMITED ON {{ tbsp }};
>            {% endfor %}
>            GRANT SELECT ANY TABLE TO {{ schema }};
>            GRANT EXECUTE ON SYS.DBMS_SESSION TO {{ schema }};
>        '''
>    )
> >> DummyOperator(task_id='stop')
> {noformat}
> Notice there are three Jinja 'for' loops that generate multiple SQL DDL statements.
> Without this change, sql has to be passed as a Python array, and Jinja templating can't be used.
> I've tested this change with OracleOperator and it works as expected.
> Notice the `run_as_script=True` parameter. run_as_script defaults to False, so this is a backward-compatible change.
> Most of the change is in airflow/hooks/dbapi_hook.py (very straightforward, as run() already supports running an array of statements) plus a light change to airflow/operators/oracle_operator.py - so this change can be easily applied to other sql operators.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
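The proposed run_as_script behavior reduces to splitting the rendered template on ';' and handing the resulting list to DbApiHook.run(), which already accepts an array of statements. A stdlib sketch of the splitting step (hypothetical helper name, not the actual patch):

```python
def split_sql_script(sql):
    """Split a rendered SQL script on ';' into individual statements,
    dropping empty fragments (a sketch of the run_as_script behavior;
    note a real splitter must also handle ';' inside string literals)."""
    return [s.strip() for s in sql.split(';') if s.strip()]

# A rendered template (loops already expanded) becomes a statement list:
script = """
GRANT CONNECT TO app_schema;
GRANT RESOURCE TO app_schema;
GRANT SELECT ANY TABLE TO app_schema;
"""
statements = split_sql_script(script)
assert statements == [
    'GRANT CONNECT TO app_schema',
    'GRANT RESOURCE TO app_schema',
    'GRANT SELECT ANY TABLE TO app_schema',
]
# hook.run(statements) would then execute them one by one.
```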
[jira] [Assigned] (AIRFLOW-18) Alembic's constraints and indexes are unnamed thus hard to drop or change
[ https://issues.apache.org/jira/browse/AIRFLOW-18?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zgl reassigned AIRFLOW-18:
------------------------------

    Assignee: zgl

> Alembic's constraints and indexes are unnamed thus hard to drop or change
> -------------------------------------------------------------------------
>
>                 Key: AIRFLOW-18
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-18
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: db
>            Reporter: Bolke de Bruin
>            Assignee: zgl
>            Priority: Major
>              Labels: database
>
> E.g. in XXX_add_dagrun.py the constraint is added without a name:
> {code}
> sa.UniqueConstraint('dag_id', 'execution_date'),
> {code}
> This makes constraint naming database specific, i.e. postgres' name for the constraint will be different from mysql's and sqlite's.
> Best practice per http://alembic.readthedocs.io/en/latest/naming.html is to have naming conventions that are applied.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
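The naming-convention approach from the linked Alembic page boils down to templated constraint names. Below is a stdlib-only sketch of how such a convention renders names; in real code the dict is passed to SQLAlchemy's `MetaData(naming_convention=...)` rather than formatted by hand:

```python
# Alembic/SQLAlchemy-style naming convention, rendered here with plain
# %-formatting for illustration (real code hands this dict to MetaData).
NAMING_CONVENTION = {
    "ix": "ix_%(column_0_label)s",
    "uq": "uq_%(table_name)s_%(column_0_name)s",
    "ck": "ck_%(table_name)s_%(constraint_name)s",
    "fk": "fk_%(table_name)s_%(column_0_name)s_%(referred_table_name)s",
    "pk": "pk_%(table_name)s",
}

def constraint_name(kind, **parts):
    """Render a deterministic constraint name from the convention template."""
    return NAMING_CONVENTION[kind] % parts

# A named version of the unnamed UniqueConstraint from XXX_add_dagrun.py
# would come out the same on postgres, mysql, and sqlite:
assert constraint_name("uq", table_name="dag_run", column_0_name="dag_id") == "uq_dag_run_dag_id"
```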
[jira] [Assigned] (AIRFLOW-20) Improving the scheduler by making dag runs more coherent
[ https://issues.apache.org/jira/browse/AIRFLOW-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zgl reassigned AIRFLOW-20:
------------------------------

    Assignee: zgl  (was: Jin Mingjian)

> Improving the scheduler by making dag runs more coherent
> --------------------------------------------------------
>
>                 Key: AIRFLOW-20
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-20
>             Project: Apache Airflow
>          Issue Type: Improvement
>          Components: scheduler
>            Reporter: Bolke de Bruin
>            Assignee: zgl
>            Priority: Major
>              Labels: backfill, database, scheduler
>
> The need to align the start_date with the interval is counter-intuitive and leads to a lot of questions and issue creation, although it is in the documentation. If we are able to fix this with no or few consequences for current setups, that should be preferred, I think. The dependency explainer is really great work, but it doesn't address the core issue.
> If you consider a DAG a description of cohesion between work items (in OOP Java terms, a class), then a DagRun is the instantiation of a DAG in time (in OOP Java terms, an instance). Tasks are then the description of a work item and a TaskInstance the instantiation of the Task in time.
> In my opinion, issues pop up due to the current paradigm of considering the TaskInstance the smallest unit of work and asking it to maintain its own state in relation to other TaskInstances in a DagRun and in a previous DagRun, of which it has no (real) perception. Tasks are instantiated by a cartesian product with the dates of DagRuns instead of the DagRuns themselves.
> The very loose coupling between DagRuns and TaskInstances can be improved while maintaining the flexibility to run tasks without a DagRun. This would help with a couple of things:
> 1. start_date can be used as an 'execution_date' or a point in time when to start looking
> 2. a new interval for a dag will maintain depends_on_past
> 3. paused dags do not give trouble
> 4. tasks will be executed in order
> 5. the ignore_first_depends_on_past could be removed, as a task will now know if it is really the first
> In PR-1431 a lot of this work has been done by:
> 1. Adding a "previous" field to a DagRun allowing it to connect to its predecessor
> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about the DagRun if needed
> 3. Using start_date + interval as the first run date, unless start_date is on the interval, in which case start_date is the first run date

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
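Point 3 of the PR-1431 summary can be made concrete. The sketch below assumes a fixed epoch anchor to decide whether start_date is "on the interval"; the actual logic in the PR may use a different alignment rule:

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)  # assumed anchor for interval alignment

def first_run_date(start_date, interval):
    """start_date if it lies on an interval boundary, else start_date + interval
    (a sketch of rule 3; the PR's real alignment check may differ)."""
    if (start_date - EPOCH) % interval == timedelta(0):
        return start_date
    return start_date + interval

# Midnight is aligned to a daily interval -> the first run is start_date itself.
assert first_run_date(datetime(2016, 10, 20), timedelta(days=1)) == datetime(2016, 10, 20)

# 05:00 is off the daily boundary -> the first run is pushed one interval later.
assert first_run_date(datetime(2016, 10, 20, 5), timedelta(days=1)) == datetime(2016, 10, 21, 5)
```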
[jira] [Created] (AIRFLOW-2057) Add Overstock to the list of Airflow users
Joy Gao created AIRFLOW-2057:
--------------------------------

             Summary: Add Overstock to the list of Airflow users
                 Key: AIRFLOW-2057
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2057
             Project: Apache Airflow
          Issue Type: Task
            Reporter: Joy Gao

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Created] (AIRFLOW-2058) Scheduler uses MainThread for DAG file processing
Yang Pan created AIRFLOW-2058:
---------------------------------

             Summary: Scheduler uses MainThread for DAG file processing
                 Key: AIRFLOW-2058
                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2058
             Project: Apache Airflow
          Issue Type: Bug
          Components: DAG
    Affects Versions: 1.9.0
         Environment: Ubuntu, Airflow 1.9, Sequential executor
            Reporter: Yang Pan

By reading the [source code|https://github.com/apache/incubator-airflow/blob/61ff29e578d1121ab4606fe122fb4e2db8f075b9/airflow/utils/dag_processing.py#L538] it appears the scheduler will process each DAG file, either a .py or a .zip, using a new process.

If I understand correctly, what should happen in theory when processing a .zip file is that the dedicated process adds the .zip file to the PYTHONPATH and loads the file's module and dependencies. When the DAG read is done, the process is destroyed, and since the PYTHONPATH is process-scoped, it won't pollute other processes.

However, by printing out the thread and process ids, it looks like the Airflow scheduler can sometimes accidentally pick up the main process instead of creating a new one, and that's when collisions happen.

Here is a snippet of the PYTHONPATH while advanced_dag_dependency-1.zip is being processed. As you can see, when it's executed by MainThread it contains other .zip files; when it's using a dedicated thread, only the required .zip is added.

{noformat}
sys.path: ['/root/airflow/dags/yang_subdag_2.zip', '/root/airflow/dags/yang_subdag_2.zip', '/root/airflow/dags/yang_subdag_1.zip', '/root/airflow/dags/yang_subdag_1.zip', '/root/airflow/dags/advanced_dag_dependency-2.zip', '/root/airflow/dags/advanced_dag_dependency-2.zip', '/root/airflow/dags/advanced_dag_dependency-1.zip', '/root/airflow/dags/advanced_dag_dependency-1.zip', '/root/airflow/dags/yang_subdag_1', '/usr/local/bin', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages/PILcompat', '/root/airflow/config', '/root/airflow/dags', '/root/airflow/plugins']
Print from MyFirstOperator in Dag 1
process id: 5059
thread id: <_MainThread(*MainThread*, started 140339858560768)>

sys.path: [u'/root/airflow/dags/advanced_dag_dependency-1.zip', '/usr/local/bin', '/usr/lib/python2.7', '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages/PILcompat', '/root/airflow/config', '/root/airflow/dags', '/root/airflow/plugins']
Print from MyFirstOperator in Dag 1
process id: 5076
thread id: <_MainThread(*DagFileProcessor283*, started 140137838294784)>
{noformat}

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
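The process-scoped sys.path behavior the report relies on can be demonstrated with a stdlib-only child-interpreter check (the .zip path below reuses the reporter's example purely for illustration):

```python
import subprocess
import sys

# A child interpreter prepends a .zip to its own sys.path (as a per-file
# DAG processor is supposed to); the parent's sys.path stays untouched.
before = list(sys.path)
child_code = (
    "import sys; "
    "sys.path.insert(0, '/root/airflow/dags/advanced_dag_dependency-1.zip'); "
    "print(sys.path[0])"
)
out = subprocess.run([sys.executable, '-c', child_code],
                     capture_output=True, text=True, check=True)

assert out.stdout.strip() == '/root/airflow/dags/advanced_dag_dependency-1.zip'
assert sys.path == before  # parent untouched: the path change was process-scoped
```

The bug report describes the opposite: when the scheduler processes a file on the main process, the path mutation leaks across DAG files, which is what produces the duplicated .zip entries in the first dump above.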
incubator-airflow git commit: [AIRFLOW-2057] Add Overstock to list of companies
Repository: incubator-airflow
Updated Branches:
  refs/heads/master 6d88744be -> ba0b1978d

[AIRFLOW-2057] Add Overstock to list of companies

Closes #3001 from mhousley/add-overstock-to-list

Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/ba0b1978
Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/ba0b1978
Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/ba0b1978

Branch: refs/heads/master
Commit: ba0b1978d3c30d4f582115d18953969ecf6ba1ee
Parents: 6d88744
Author: Matthew Housley
Authored: Thu Feb 1 17:17:18 2018 -0800
Committer: Siddharth Anand
Committed: Thu Feb 1 17:17:24 2018 -0800

----------------------------------------------------------------------
 README.md | 1 +
 1 file changed, 1 insertion(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/ba0b1978/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index 3584d34..c338c17 100644
--- a/README.md
+++ b/README.md
@@ -169,6 +169,7 @@ Currently **officially** using Airflow:
 1. [OfferUp](https://offerupnow.com)
 1. [OneFineStay](https://www.onefinestay.com) [[@slangwald](https://github.com/slangwald)]
 1. [Open Knowledge International](https://okfn.org) [@vitorbaptista](https://github.com/vitorbaptista)
+1. [Overstock](https://www.github.com/overstock) [[@mhousley](https://github.com/mhousley) & [@mct0006](https://github.com/mct0006)]
 1. [Pandora Media](https://www.pandora.com/) [[@Acehaidrey](https://github.com/Acehaidrey)]
 1. [PAYMILL](https://www.paymill.com/) [[@paymill](https://github.com/paymill) & [@matthiashuschle](https://github.com/matthiashuschle)]
 1. [PayPal](https://www.paypal.com/) [[@r39132](https://github.com/r39132) & [@jhsenjaliya](https://github.com/jhsenjaliya)]
[jira] [Commented] (AIRFLOW-2057) Add Overstock to the list of Airflow users
[ https://issues.apache.org/jira/browse/AIRFLOW-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349624#comment-16349624 ]

ASF subversion and git services commented on AIRFLOW-2057:
----------------------------------------------------------

Commit ba0b1978d3c30d4f582115d18953969ecf6ba1ee in incubator-airflow's branch refs/heads/master from [~mhousley]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=ba0b197 ]

[AIRFLOW-2057] Add Overstock to list of companies

Closes #3001 from mhousley/add-overstock-to-list

> Add Overstock to the list of Airflow users
> ------------------------------------------
>
>                 Key: AIRFLOW-2057
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2057
>             Project: Apache Airflow
>          Issue Type: Task
>            Reporter: Joy Gao
>            Priority: Trivial
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (AIRFLOW-2058) Scheduler uses MainThread for DAG file processing
[ https://issues.apache.org/jira/browse/AIRFLOW-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yang Pan updated AIRFLOW-2058:
------------------------------

    Priority: Blocker  (was: Major)

> Scheduler uses MainThread for DAG file processing
> -------------------------------------------------
>
>                 Key: AIRFLOW-2058
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2058
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: DAG
>    Affects Versions: 1.9.0
>         Environment: Ubuntu, Airflow 1.9, Sequential executor
>            Reporter: Yang Pan
>            Priority: Blocker
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Commented] (AIRFLOW-2058) Scheduler uses MainThread for DAG file processing
[ https://issues.apache.org/jira/browse/AIRFLOW-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16349694#comment-16349694 ]

Yang Pan commented on AIRFLOW-2058:
-----------------------------------

Impact wise, this causes dependency collisions when the DAG is being loaded.

> Scheduler uses MainThread for DAG file processing
> -------------------------------------------------
>
>                 Key: AIRFLOW-2058
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2058
>             Project: Apache Airflow
>          Issue Type: Bug
>          Components: DAG
>    Affects Versions: 1.9.0
>         Environment: Ubuntu, Airflow 1.9, Sequential executor
>            Reporter: Yang Pan
>            Priority: Major
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Updated] (AIRFLOW-2057) Add Overstock to the list of Airflow users
[ https://issues.apache.org/jira/browse/AIRFLOW-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Anand updated AIRFLOW-2057:
-------------------------------------

    External issue URL: https://issues.apache.org/jira/browse/AIRFLOW-2057

> Add Overstock to the list of Airflow users
> ------------------------------------------
>
>                 Key: AIRFLOW-2057
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2057
>             Project: Apache Airflow
>          Issue Type: Task
>            Reporter: Joy Gao
>            Priority: Trivial
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
[jira] [Assigned] (AIRFLOW-2057) Add Overstock to the list of Airflow users
[ https://issues.apache.org/jira/browse/AIRFLOW-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Siddharth Anand reassigned AIRFLOW-2057:
----------------------------------------

    Assignee: Siddharth Anand

> Add Overstock to the list of Airflow users
> ------------------------------------------
>
>                 Key: AIRFLOW-2057
>                 URL: https://issues.apache.org/jira/browse/AIRFLOW-2057
>             Project: Apache Airflow
>          Issue Type: Task
>            Reporter: Joy Gao
>            Assignee: Siddharth Anand
>            Priority: Trivial
>

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
incubator-airflow git commit: [AIRFLOW-XXX] Add Plaid to Airflow users
Repository: incubator-airflow
Updated Branches:
  refs/heads/master 6ee4bbd4b -> 6d88744be

[AIRFLOW-XXX] Add Plaid to Airflow users

Closes #2995 from AustinBGibbons/master

Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/6d88744b
Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/6d88744b
Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/6d88744b

Branch: refs/heads/master
Commit: 6d88744bee498fdb42417e3b82365e126ece76b5
Parents: 6ee4bbd
Author: Austin Gibbons
Authored: Thu Feb 1 10:04:07 2018 +0100
Committer: Fokko Driesprong
Committed: Thu Feb 1 10:04:14 2018 +0100

----------------------------------------------------------------------
 README.md | 1 +
 1 file changed, 1 insertion(+)
----------------------------------------------------------------------

http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/6d88744b/README.md
----------------------------------------------------------------------
diff --git a/README.md b/README.md
index 56c9cea..3584d34 100644
--- a/README.md
+++ b/README.md
@@ -174,6 +174,7 @@ Currently **officially** using Airflow:
 1. [PayPal](https://www.paypal.com/) [[@r39132](https://github.com/r39132) & [@jhsenjaliya](https://github.com/jhsenjaliya)]
 1. [Pernod-Ricard](https://www.pernod-ricard.com/) [[@romain-nio](https://github.com/romain-nio)
 1. [Playbuzz](https://www.playbuzz.com/) [[@clintonboys](https://github.com/clintonboys) & [@dbn](https://github.com/dbn)]
+1. [Plaid](https://www.plaid.com/) [[@plaid](https://github.com/plaid), [@AustinBGibbons](https://github.com/AustinBGibbons) & [@jeeyoungk](https://github.com/jeeyoungk)]
 1. [Postmates](http://www.postmates.com) [[@syeoryn](https://github.com/syeoryn)]
 1. [Pronto Tools](http://www.prontotools.io/) [[@zkan](https://github.com/zkan) & [@mesodiar](https://github.com/mesodiar)]
 1. [Qubole](https://qubole.com) [[@msumit](https://github.com/msumit)]