[jira] [Commented] (AIRFLOW-584) Airflow Pool does not limit running tasks

2018-02-01 Thread barak schoster (JIRA)

[ https://issues.apache.org/jira/browse/AIRFLOW-584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16348468#comment-16348468 ]

barak schoster commented on AIRFLOW-584:


Was this solved in a later version?

> Airflow Pool does not limit running tasks
> -
>
> Key: AIRFLOW-584
> URL: https://issues.apache.org/jira/browse/AIRFLOW-584
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: pools
>Affects Versions: Airflow 1.7.1.3
> Environment: Ubuntu 14.04
>Reporter: David Kegley
>Priority: Major
> Attachments: img1.png, img2.png
>
>
> Airflow pools are not limiting the number of running task instances for the 
> following DAG in 1.7.1.3.
> Steps to recreate:
> Create a pool of size 5 through the UI.
> The following DAG has 52 tasks, with increasing priority corresponding to the 
> task number. There should only ever be 5 tasks running at a time; however, I 
> observed 29 'used slots' in a pool with 5 slots.
> {code}
> from datetime import datetime, timedelta
>
> from airflow import DAG
> from airflow.operators.bash_operator import BashOperator
> from airflow.operators.dummy_operator import DummyOperator
>
> dag_name = 'pools_bug'
> default_args = {
>     'owner': 'airflow',
>     'depends_on_past': False,
>     'start_date': datetime(2016, 10, 20),
>     'email_on_failure': False,
>     'retries': 1
> }
> dag = DAG(dag_name, default_args=default_args, schedule_interval="0 8 * * *")
> start = DummyOperator(task_id='start', dag=dag)
> end = DummyOperator(task_id='end', dag=dag)
> for i in range(50):
>     sleep_command = 'sleep 10'
>     task_name = 'task-{}'.format(i)
>     op = BashOperator(
>         task_id=task_name,
>         bash_command=sleep_command,
>         execution_timeout=timedelta(hours=4),
>         priority_weight=i,
>         pool=dag_name,  # assumes the UI pool is also named 'pools_bug'
>         dag=dag)
>     start.set_downstream(op)
>     end.set_upstream(op)
> {code}
> Relevant configurations from airflow.cfg:
> {code}
> [core]
> # The executor class that airflow should use. Choices include
> # SequentialExecutor, LocalExecutor, CeleryExecutor
> executor = CeleryExecutor
> # The amount of parallelism as a setting to the executor. This defines
> # the max number of task instances that should run simultaneously
> # on this airflow installation
> parallelism = 64
> # The number of task instances allowed to run concurrently by the scheduler
> dag_concurrency = 64
> # The maximum number of active DAG runs per DAG
> max_active_runs_per_dag = 1
> [celery]
> # This section only applies if you are using the CeleryExecutor in
> # [core] section above
> # The app name that will be used by celery
> celery_app_name = airflow.executors.celery_executor
> # The concurrency that will be used when starting workers with the
> # "airflow worker" command. This defines the number of task instances that
> # a worker will take, so size up your workers based on the resources on
> # your worker box and the nature of your tasks
> celeryd_concurrency = 64
> [scheduler]
> # Task instances listen for external kill signal (when you clear tasks
> # from the CLI or the UI), this defines the frequency at which they should
> # listen (in seconds).
> job_heartbeat_sec = 5
> # The scheduler constantly tries to trigger new tasks (look at the
> # scheduler section in the docs for more information). This defines
> # how often the scheduler should run (in seconds).
> scheduler_heartbeat_sec = 5
> {code}
> !img1.png!
> !img2.png!
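As a companion to the report above, here is a minimal, self-contained sketch of the invariant the reporter expects a pool to enforce (not Airflow code; `Pool`, `try_start`, and `finish` are illustrative names only):

```python
class Pool:
    """Toy model of a slot-limited pool: at most `slots` tasks run at once."""

    def __init__(self, slots):
        self.slots = slots
        self.running = set()

    def open_slots(self):
        return self.slots - len(self.running)

    def occupied_slots(self):
        return len(self.running)

    def try_start(self, task_id):
        # Gate each task on a free slot; this is the check the reporter
        # observed being violated (29 used slots in a 5-slot pool).
        if self.open_slots() <= 0:
            return False
        self.running.add(task_id)
        return True

    def finish(self, task_id):
        self.running.discard(task_id)
```

Under this model, queuing the 52 tasks of the DAG above against a 5-slot pool starts exactly 5 of them; the remainder must wait for a `finish`.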



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRFLOW-2055) Variable access documentation in templated task instance is ambiguous

2018-02-01 Thread Matthew Bowden (JIRA)
Matthew Bowden created AIRFLOW-2055:
---

 Summary: Variable access documentation in templated task instance 
is ambiguous
 Key: AIRFLOW-2055
 URL: https://issues.apache.org/jira/browse/AIRFLOW-2055
 Project: Apache Airflow
  Issue Type: Improvement
Affects Versions: 1.9.0
Reporter: Matthew Bowden
Assignee: Matthew Bowden


Some of the internal documentation in {{airflow/models.py}} (specifically under 
{{TaskInstance.get_template_context}}) is slightly ambiguous. This makes it 
misleading for users attempting to write templates which use {{Variable}}s. 
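To illustrate the pattern in question: the template context exposes an accessor object so templates can read variables via attribute-style lookups, and the ambiguity is whether to use that accessor or call {{Variable.get}} directly. The sketch below is illustrative only (the dict is a stand-in for the metadata database; this is not Airflow's actual code):

```python
_FAKE_STORE = {'my_key': 'my_value'}  # stand-in for the Variable table

class VariableAccessor:
    """Resolves attribute access to a backing-store lookup, in the style
    of the 'var' object that get_template_context puts in the context."""

    def __getattr__(self, name):
        try:
            return _FAKE_STORE[name]
        except KeyError:
            raise AttributeError(name)

var_value = VariableAccessor()
```

A template rendered against this context could then write the equivalent of `{{ var.value.my_key }}` rather than fetching the variable imperatively.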





[jira] [Commented] (AIRFLOW-1793) DockerOperator doesn't work with docker_conn_id

2018-02-01 Thread Konrad Gołuchowski (JIRA)

[ https://issues.apache.org/jira/browse/AIRFLOW-1793?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349149#comment-16349149 ]

Konrad Gołuchowski commented on AIRFLOW-1793:
-

Submitted fix: https://github.com/apache/incubator-airflow/pull/2998

> DockerOperator doesn't work with docker_conn_id
> ---
>
> Key: AIRFLOW-1793
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1793
> Project: Apache Airflow
>  Issue Type: Bug
>Reporter: Cedrik Neumann
>Assignee: Cedrik Neumann
>Priority: Major
>
> The implementation of DockerOperator uses `self.base_url` when loading the 
> DockerHook instead of `self.docker_url`:
> https://github.com/apache/incubator-airflow/blob/v1-9-stable/airflow/operators/docker_operator.py#L150
> {noformat}
> [2017-11-08 16:10:13,082] {base_task_runner.py:98} INFO - Subtask:   File "/src/apache-airflow/airflow/operators/docker_operator.py", line 161, in execute
> [2017-11-08 16:10:13,083] {base_task_runner.py:98} INFO - Subtask:     self.cli = self.get_hook().get_conn()
> [2017-11-08 16:10:13,083] {base_task_runner.py:98} INFO - Subtask:   File "/src/apache-airflow/airflow/operators/docker_operator.py", line 150, in get_hook
> [2017-11-08 16:10:13,083] {base_task_runner.py:98} INFO - Subtask:     base_url=self.base_url,
> [2017-11-08 16:10:13,083] {base_task_runner.py:98} INFO - Subtask: AttributeError: 'DockerOperator' object has no attribute 'base_url'
> {noformat}
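The bug and its fix can be reproduced without Airflow in a few lines. The class names below are simplified stand-ins for the real operator and hook; the point is that `get_hook` referenced `self.base_url`, an attribute the operator never sets, instead of the `self.docker_url` it does set:

```python
class DockerHookStub:
    """Stand-in for DockerHook: just records the daemon URL it was given."""

    def __init__(self, base_url):
        self.base_url = base_url

class DockerOperatorSketch:
    """Stand-in for DockerOperator, storing the daemon URL as docker_url."""

    def __init__(self, docker_url='unix://var/run/docker.sock'):
        self.docker_url = docker_url  # the attribute that actually exists

    def get_hook_buggy(self):
        # As released in 1.9: raises AttributeError at execution time,
        # because no 'base_url' attribute was ever assigned.
        return DockerHookStub(base_url=self.base_url)

    def get_hook_fixed(self):
        # The shape of the submitted fix: pass the existing attribute.
        return DockerHookStub(base_url=self.docker_url)
```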





[jira] [Created] (AIRFLOW-2056) Integrate Google Cloud Storage (GCS) operators into 1 file

2018-02-01 Thread Kaxil Naik (JIRA)
Kaxil Naik created AIRFLOW-2056:
---

 Summary: Integrate Google Cloud Storage (GCS) operators into 1 file
 Key: AIRFLOW-2056
 URL: https://issues.apache.org/jira/browse/AIRFLOW-2056
 Project: Apache Airflow
  Issue Type: Improvement
  Components: contrib, gcp
Affects Versions: Airflow 2.0, 2.0.0
Reporter: Kaxil Naik
Assignee: Kaxil Naik
 Fix For: Airflow 2.0, 2.0.0


There are currently 5 operators:
* GoogleCloudStorageCopyOperator
* GoogleCloudStorageDownloadOperator
* GoogleCloudStorageListOperator
* GoogleCloudStorageToBigQueryOperator
* GoogleCloudStorageToGoogleCloudStorageOperator

It would be ideal to have 1 file *gcs_operator.py* similar to 
*dataproc_operator.py* containing all the operators related to Google Cloud 
Storage.





[jira] [Assigned] (AIRFLOW-1665) Airflow webserver/scheduler don't handle database disconnects (mysql)

2018-02-01 Thread zgl (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-1665?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zgl reassigned AIRFLOW-1665:


Assignee: zgl  (was: Vasanth Kumar)

> Airflow webserver/scheduler don't handle database disconnects (mysql)
> -
>
> Key: AIRFLOW-1665
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1665
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: Airflow 1.8
>Reporter: Vasanth Kumar
>Assignee: zgl
>Priority: Major
>  Labels: database, reconnect
> Fix For: 1.9.1
>
>
> Airflow webserver & scheduler don't handle database disconnects. The processes 
> appear to error out and either exit or are left in an off state. This was 
> observed when using MySQL.
> I don't see any database reconnect configuration or code.
> Stack trace for the scheduler:
>   File "./MySQLdb/connections.py", line 204, in __init__
> super(Connection, self).__init__(*args, **kwargs2)
> sqlalchemy.exc.OperationalError: (_mysql_exceptions.OperationalError) (2002, 
> "Can't connect to local MySQL server through socket '/tmp/mysql.sock' (2)")
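One common shape for the missing behavior is to retry the database operation after re-establishing the connection, rather than letting the process die on the first {{OperationalError}}. The sketch below is generic and self-contained (the exception class is a stand-in for {{sqlalchemy.exc.OperationalError}}; none of the names are Airflow's):

```python
import time

class OperationalError(Exception):
    """Stand-in for sqlalchemy.exc.OperationalError."""

def run_with_reconnect(fn, retries=3, delay=0.0, reconnect=lambda: None):
    # Retry a DB operation, invoking a reconnect hook between attempts.
    last_exc = None
    for _attempt in range(retries):
        try:
            return fn()
        except OperationalError as exc:
            last_exc = exc
            reconnect()        # e.g. dispose the engine's connection pool
            time.sleep(delay)  # optional backoff between attempts
    raise last_exc
```

In a real deployment the `reconnect` hook would dispose and rebuild the SQLAlchemy connection pool; this sketch only demonstrates the retry loop.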





[jira] [Assigned] (AIRFLOW-1012) Add run_as_script option so jinja templating can be used for sql parameter

2018-02-01 Thread zgl (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-1012?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zgl reassigned AIRFLOW-1012:


Assignee: zgl  (was: Ruslan Dautkhanov)

> Add run_as_script option so jinja templating can be used for sql parameter
> --
>
> Key: AIRFLOW-1012
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1012
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: core, db
>Affects Versions: Airflow 1.8
>Reporter: Ruslan Dautkhanov
>Assignee: zgl
>Priority: Major
>  Labels: database, improvement, operators, sql
>
> It would be great to extend Jinja templating to the sql parameter for SQL 
> operators.
> With this improvement, it's possible to have an extended Jinja template like 
> the one below that generates multiple SQL statements, passed as a single 
> 'sql' parameter and separated by ';':
> {noformat}
> )
> >> OracleOperator( task_id='give_owner_grants', oracle_conn_id=ora_conn1, 
> run_as_script=True,
> sql='''
>   {% for role in ['CONNECT', 'RESOURCE'] %}
>   GRANT {{ role }} TO {{ schema }};
>   {% endfor %}
>   {% for create_grant in ['PROCEDURE', 'SEQUENCE', 'SESSION', 
> 'TABLE', 'VIEW'] %}
>   GRANT CREATE {{ create_grant }} TO {{ schema }};
>   {% endfor %}
>   {% for tbsp in ['DISCOVER_MART_IDX01', 'DISCOVER_MART_TBS01', 
> 'STAGING_NOLOG'] %}
>   ALTER USER {{ schema }} QUOTA UNLIMITED ON {{ tbsp }};
>   {% endfor %}
>   GRANT SELECT ANY TABLE TO {{ schema }};
>   GRANT EXECUTE ON SYS.DBMS_SESSION TO {{ schema }};
> '''
> )
> >> DummyOperator(task_id='stop')
> {noformat}
> Notice there are three Jinja 'for' loops that generate multiple SQL DDL 
> statements. 
> Without this change, sql has to be passed as a Python list, and Jinja 
> templating can't be used.
> I've tested this change with OracleOperator and it works as expected. 
> Notice the `run_as_script=True` parameter. run_as_script defaults to False, so 
> this is a backward-compatible change.
> Most of the change is in airflow/hooks/dbapi_hook.py (very straightforward, as 
> run() already supports running a list of statements) with a light change to 
> airflow/operators/oracle_operator.py - so this change can easily be applied 
> to other SQL operators.
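The core of what {{run_as_script=True}} would do after Jinja rendering can be sketched as a small helper (hypothetical name; the real change lives in dbapi_hook's run()): split the rendered text on ';' and drop empty fragments, so the existing run-a-list-of-statements path can execute them in order.

```python
def split_sql_script(rendered_sql):
    """Split a rendered SQL script on ';' into individual statements,
    discarding whitespace-only fragments (e.g. after the final ';')."""
    return [stmt.strip() for stmt in rendered_sql.split(';') if stmt.strip()]
```

Note this naive split would break on ';' inside string literals or PL/SQL blocks, so a real implementation would need a more careful tokenizer or an opt-in flag, which is presumably why run_as_script defaults to False.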





[jira] [Assigned] (AIRFLOW-18) Alembic's constraints and indexes are unnamed thus hard to drop or change

2018-02-01 Thread zgl (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-18?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zgl reassigned AIRFLOW-18:
--

Assignee: zgl

> Alembic's constraints and indexes are unnamed thus hard to drop or change
> -
>
> Key: AIRFLOW-18
> URL: https://issues.apache.org/jira/browse/AIRFLOW-18
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: db
>Reporter: Bolke de Bruin
>Assignee: zgl
>Priority: Major
>  Labels: database
>
> E.g. in XXX_add_dagrun.py the constraint is added without a name:
> sa.UniqueConstraint('dag_id', 'execution_date'),
> This makes constraint naming database specific, i.e. Postgres's name for the 
> constraint will differ from MySQL's and SQLite's.
> Best practice per http://alembic.readthedocs.io/en/latest/naming.html is to 
> define naming conventions and have them applied consistently. 
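The best practice referenced above amounts to a naming-convention dict whose templates make constraint names deterministic across Postgres, MySQL, and SQLite. The dict below follows the token names documented by SQLAlchemy/Alembic; the MetaData hookup is shown as a comment because it requires SQLAlchemy, and the helper function is only an illustration of how the 'uq' template would name the dag_run constraint:

```python
NAMING_CONVENTION = {
    "ix": "ix_%(column_0_label)s",
    "uq": "uq_%(table_name)s_%(column_0_name)s",
    "fk": "fk_%(table_name)s_%(column_0_name)s_%(referred_table_name)s",
    "pk": "pk_%(table_name)s",
}

# With SQLAlchemy installed, the convention is attached to the metadata:
#   metadata = sqlalchemy.MetaData(naming_convention=NAMING_CONVENTION)

def unique_constraint_name(table_name, first_column):
    """Expand the 'uq' template the way SQLAlchemy would for a
    UniqueConstraint whose first column is `first_column`."""
    return NAMING_CONVENTION["uq"] % {
        "table_name": table_name,
        "column_0_name": first_column,
    }
```

With such a convention in place, dropping or altering the dag_run unique constraint in a later migration can reference one known name on every backend.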





[jira] [Assigned] (AIRFLOW-20) Improving the scheduler by making dag runs more coherent

2018-02-01 Thread zgl (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-20?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

zgl reassigned AIRFLOW-20:
--

Assignee: zgl  (was: Jin Mingjian)

> Improving the scheduler by making dag runs more coherent
> 
>
> Key: AIRFLOW-20
> URL: https://issues.apache.org/jira/browse/AIRFLOW-20
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: scheduler
>Reporter: Bolke de Bruin
>Assignee: zgl
>Priority: Major
>  Labels: backfill, database, scheduler
>
> The need to align the start_date with the interval is counterintuitive
> and leads to a lot of questions and issue creation, although it is in the 
> documentation. If we are
> able to fix this with little or no consequence for current setups, that 
> should be preferred, I think.
> The dependency explainer is really great work, but it doesn't address the 
> core issue.
> If you consider a DAG a description of cohesion between work items (in OOP 
> java terms
> a class), then a DagRun is the instantiation of a DAG in time (in OOP java 
> terms an instance). 
> Tasks are then the description of a work item and a TaskInstance the 
> instantiation of the Task in time.
> In my opinion, issues pop up due to the current paradigm of considering the 
> TaskInstance
> the smallest unit of work and asking it to maintain its own state in relation 
> to other TaskInstances
> in a DagRun and in a previous DagRun of which it has no (real) perception. 
> Tasks are instantiated
> by a Cartesian product with the dates of DagRuns instead of the DagRuns 
> themselves. 
> The very loose coupling between DagRuns and TaskInstances can be improved 
> while maintaining
> flexibility to run tasks without a DagRun. This would help with a couple of 
> things:
> 1. start_date can be used as a ‘execution_date’ or a point in time when to 
> start looking
> 2. a new interval for a dag will maintain depends_on_past
> 3. paused dags do not give trouble
> 4. tasks will be executed in order 
> 5. the ignore_first_depend_on_past could be removed as a task will now know 
> if it is really the first
> In PR-1431 a lot of this work has been done by:
> 1. Adding a “previous” field to a DagRun allowing it to connect to its 
> predecessor
> 2. Adding a dag_run_id to TaskInstances so a TaskInstance knows about the 
> DagRun if needed
> 3. Using start_date + interval as the first run date unless start_date is on 
> the interval then start_date is the first run date
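Rule 3 above can be sketched as a small function. This is an illustrative reading only, under the simplifying assumption that "on the interval" means midnight-aligned for a daily interval:

```python
from datetime import datetime, timedelta

def first_run_date(start_date, interval=timedelta(days=1)):
    """If start_date falls on the interval grid, it is the first run date;
    otherwise the first run is start_date + interval.

    Assumption (for this sketch only): a daily interval whose grid is
    aligned to midnight."""
    aligned = start_date.replace(hour=0, minute=0, second=0, microsecond=0)
    if start_date == aligned:
        return start_date           # start_date is on the interval
    return start_date + interval    # otherwise skip ahead one interval
```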





[jira] [Created] (AIRFLOW-2057) Add Overstock to the list of Airflow users

2018-02-01 Thread Joy Gao (JIRA)
Joy Gao created AIRFLOW-2057:


 Summary: Add Overstock to the list of Airflow users
 Key: AIRFLOW-2057
 URL: https://issues.apache.org/jira/browse/AIRFLOW-2057
 Project: Apache Airflow
  Issue Type: Task
Reporter: Joy Gao








[jira] [Created] (AIRFLOW-2058) Scheduler uses MainThread for DAG file processing

2018-02-01 Thread Yang Pan (JIRA)
Yang Pan created AIRFLOW-2058:
-

 Summary: Scheduler uses MainThread for DAG file processing
 Key: AIRFLOW-2058
 URL: https://issues.apache.org/jira/browse/AIRFLOW-2058
 Project: Apache Airflow
  Issue Type: Bug
  Components: DAG
Affects Versions: 1.9.0
 Environment: Ubuntu, Airflow 1.9, Sequential executor
Reporter: Yang Pan


By reading the [source code 
|https://github.com/apache/incubator-airflow/blob/61ff29e578d1121ab4606fe122fb4e2db8f075b9/airflow/utils/dag_processing.py#L538]
 it appears the scheduler will process each DAG file, either a .py or .zip, 
using a new process. 
 
If I understand correctly, in theory what should happen when processing a .zip 
file is that the dedicated process will add the .zip file to the PYTHONPATH 
and load the file's module and dependencies. When the DAG read is done, the 
process gets destroyed. And since the PYTHONPATH is process scoped, it won't 
pollute other processes.
 
However, by printing out the thread and process ids, it looks like the Airflow 
scheduler can sometimes accidentally pick up the main process instead of 
creating a new one, and that's when the collision happens.
 
Here is a snippet of the PYTHONPATH when advanced_dag_dependency-1.zip is being 
processed. As you can see, when it's executed by the MainThread it contains 
other .zip files; when a dedicated process is used, only the required .zip is 
added.
 
sys.path :['/root/airflow/dags/yang_subdag_2.zip', 
'/root/airflow/dags/yang_subdag_2.zip', '/root/airflow/dags/yang_subdag_1.zip', 
'/root/airflow/dags/yang_subdag_1.zip', 
'/root/airflow/dags/advanced_dag_dependency-2.zip', 
'/root/airflow/dags/advanced_dag_dependency-2.zip', 
'/root/airflow/dags/advanced_dag_dependency-1.zip', 
'/root/airflow/dags/advanced_dag_dependency-1.zip', 
'/root/airflow/dags/yang_subdag_1', '/usr/local/bin', '/usr/lib/python2.7', 
'/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', 
'/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', 
'/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages', 
'/usr/lib/python2.7/dist-packages/PILcompat', '/root/airflow/config', 
'/root/airflow/dags', '/root/airflow/plugins'] 
Print from MyFirstOperator in Dag 1 
process id: 5059 
thread id: <_MainThread(*MainThread*, started 140339858560768)> 
 
sys.path :[u'/root/airflow/dags/advanced_dag_dependency-1.zip', 
'/usr/local/bin', '/usr/lib/python2.7', 
'/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', 
'/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', 
'/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages', 
'/usr/lib/python2.7/dist-packages/PILcompat', '/root/airflow/config', 
'/root/airflow/dags', '/root/airflow/plugins'] 
Print from MyFirstOperator in Dag 1 
process id: 5076 
thread id: <_MainThread(*DagFileProcessor283*, started 140137838294784)> 
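The isolation the scheduler is expected to provide can be demonstrated with a fresh interpreter per file: prepending the .zip to sys.path in a child process cannot leak into other files' parses or into the parent. This is a stand-alone sketch (the function name and the use of subprocess are illustrative, not Airflow's implementation):

```python
import json
import subprocess
import sys

def parse_in_subprocess(zip_path):
    """Mutate sys.path in a fresh interpreter and return the child's
    resulting sys.path, leaving the parent's sys.path untouched."""
    code = (
        "import sys, json; "
        "sys.path.insert(0, {!r}); "
        "print(json.dumps(sys.path))".format(zip_path)
    )
    out = subprocess.check_output([sys.executable, "-c", code])
    return json.loads(out)
```

When the scheduler instead reuses the main process, the sys.path mutations from every previously parsed .zip accumulate, which matches the first listing above.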





incubator-airflow git commit: [AIRFLOW-2057] Add Overstock to list of companies

2018-02-01 Thread sanand
Repository: incubator-airflow
Updated Branches:
  refs/heads/master 6d88744be -> ba0b1978d


[AIRFLOW-2057] Add Overstock to list of companies

Closes #3001 from mhousley/add-overstock-to-list


Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/ba0b1978
Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/ba0b1978
Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/ba0b1978

Branch: refs/heads/master
Commit: ba0b1978d3c30d4f582115d18953969ecf6ba1ee
Parents: 6d88744
Author: Matthew Housley 
Authored: Thu Feb 1 17:17:18 2018 -0800
Committer: Siddharth Anand 
Committed: Thu Feb 1 17:17:24 2018 -0800

--
 README.md | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/ba0b1978/README.md
--
diff --git a/README.md b/README.md
index 3584d34..c338c17 100644
--- a/README.md
+++ b/README.md
@@ -169,6 +169,7 @@ Currently **officially** using Airflow:
 1. [OfferUp](https://offerupnow.com)
 1. [OneFineStay](https://www.onefinestay.com) 
[[@slangwald](https://github.com/slangwald)]
 1. [Open Knowledge International](https://okfn.org) 
[@vitorbaptista](https://github.com/vitorbaptista)
+1. [Overstock](https://www.github.com/overstock) 
[[@mhousley](https://github.com/mhousley) & 
[@mct0006](https://github.com/mct0006)]
 1. [Pandora Media](https://www.pandora.com/) 
[[@Acehaidrey](https://github.com/Acehaidrey)]
 1. [PAYMILL](https://www.paymill.com/) [[@paymill](https://github.com/paymill) 
& [@matthiashuschle](https://github.com/matthiashuschle)]
 1. [PayPal](https://www.paypal.com/) [[@r39132](https://github.com/r39132) & 
[@jhsenjaliya](https://github.com/jhsenjaliya)]



[jira] [Commented] (AIRFLOW-2057) Add Overstock to the list of Airflow users

2018-02-01 Thread ASF subversion and git services (JIRA)

[ https://issues.apache.org/jira/browse/AIRFLOW-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349624#comment-16349624 ]

ASF subversion and git services commented on AIRFLOW-2057:
--

Commit ba0b1978d3c30d4f582115d18953969ecf6ba1ee in incubator-airflow's branch 
refs/heads/master from [~mhousley]
[ https://git-wip-us.apache.org/repos/asf?p=incubator-airflow.git;h=ba0b197 ]

[AIRFLOW-2057] Add Overstock to list of companies

Closes #3001 from mhousley/add-overstock-to-list


> Add Overstock to the list of Airflow users
> --
>
> Key: AIRFLOW-2057
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2057
> Project: Apache Airflow
>  Issue Type: Task
>Reporter: Joy Gao
>Priority: Trivial
>






[jira] [Updated] (AIRFLOW-2058) Scheduler uses MainThread for DAG file processing

2018-02-01 Thread Yang Pan (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Yang Pan updated AIRFLOW-2058:
--
Priority: Blocker  (was: Major)

> Scheduler uses MainThread for DAG file processing
> -
>
> Key: AIRFLOW-2058
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2058
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG
>Affects Versions: 1.9.0
> Environment: Ubuntu, Airflow 1.9, Sequential executor
>Reporter: Yang Pan
>Priority: Blocker
>
> By reading the [source code 
> |https://github.com/apache/incubator-airflow/blob/61ff29e578d1121ab4606fe122fb4e2db8f075b9/airflow/utils/dag_processing.py#L538]
>  it appears the scheduler will process each DAG file, either a .py or .zip, 
> using a new process. 
>  
> If I understand correctly, in theory what should happen in terms of 
> processing a .zip file is that the dedicated process will add the .zip file 
> to the PYTHONPATH, and load the file's module and dependency. When the DAG 
> read is done, the process gets destroyed. And since the PYTHONPATH is process 
> scoped, it won't pollute other processes.
>  
> However by printing out the threads and process id, it looks like Airflow 
> scheduler can sometimes accidentally pick up the main process instead of 
> creating a new one, and that's when collision happens.
>  
> Here is snippet of the PYTHONPATH when advanced_dag_dependency-1.zip is being 
> processed. As you can see when it's executed by MainThread, it contains other 
> .zip files. When it's using dedicated thread, only required .zip is added.
>  
> sys.path :['/root/airflow/dags/yang_subdag_2.zip', 
> '/root/airflow/dags/yang_subdag_2.zip', 
> '/root/airflow/dags/yang_subdag_1.zip', 
> '/root/airflow/dags/yang_subdag_1.zip', 
> '/root/airflow/dags/advanced_dag_dependency-2.zip', 
> '/root/airflow/dags/advanced_dag_dependency-2.zip', 
> '/root/airflow/dags/advanced_dag_dependency-1.zip', 
> '/root/airflow/dags/advanced_dag_dependency-1.zip', 
> '/root/airflow/dags/yang_subdag_1', '/usr/local/bin', '/usr/lib/python2.7', 
> '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', 
> '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', 
> '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages', 
> '/usr/lib/python2.7/dist-packages/PILcompat', '/root/airflow/config', 
> '/root/airflow/dags', '/root/airflow/plugins'] 
> Print from MyFirstOperator in Dag 1 
> process id: 5059 
> thread id: <_MainThread(*MainThread*, started 140339858560768)> 
>  
> sys.path :[u'/root/airflow/dags/advanced_dag_dependency-1.zip', 
> '/usr/local/bin', '/usr/lib/python2.7', 
> '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', 
> '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', 
> '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages', 
> '/usr/lib/python2.7/dist-packages/PILcompat', '/root/airflow/config', 
> '/root/airflow/dags', '/root/airflow/plugins'] 
> Print from MyFirstOperator in Dag 1 
> process id: 5076 
> thread id: <_MainThread(*DagFileProcessor283*, started 140137838294784)> 





[jira] [Commented] (AIRFLOW-2058) Scheduler uses MainThread for DAG file processing

2018-02-01 Thread Yang Pan (JIRA)

[ https://issues.apache.org/jira/browse/AIRFLOW-2058?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16349694#comment-16349694 ]

Yang Pan commented on AIRFLOW-2058:
---

Impact-wise, this causes dependency collisions when a DAG is being loaded. 

> Scheduler uses MainThread for DAG file processing
> -
>
> Key: AIRFLOW-2058
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2058
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG
>Affects Versions: 1.9.0
> Environment: Ubuntu, Airflow 1.9, Sequential executor
>Reporter: Yang Pan
>Priority: Major
>
> By reading the [source code 
> |https://github.com/apache/incubator-airflow/blob/61ff29e578d1121ab4606fe122fb4e2db8f075b9/airflow/utils/dag_processing.py#L538]
>  it appears the scheduler will process each DAG file, either a .py or .zip, 
> using a new process. 
>  
> If I understand correctly, in theory what should happen in terms of 
> processing a .zip file is that the dedicated process will add the .zip file 
> to the PYTHONPATH, and load the file's module and dependency. When the DAG 
> read is done, the process gets destroyed. And since the PYTHONPATH is process 
> scoped, it won't pollute other processes.
>  
> However by printing out the threads and process id, it looks like Airflow 
> scheduler can sometimes accidentally pick up the main process instead of 
> creating a new one, and that's when collision happens.
>  
> Here is snippet of the PYTHONPATH when advanced_dag_dependency-1.zip is being 
> processed. As you can see when it's executed by MainThread, it contains other 
> .zip files. When it's using dedicated thread, only required .zip is added.
>  
> sys.path :['/root/airflow/dags/yang_subdag_2.zip', 
> '/root/airflow/dags/yang_subdag_2.zip', 
> '/root/airflow/dags/yang_subdag_1.zip', 
> '/root/airflow/dags/yang_subdag_1.zip', 
> '/root/airflow/dags/advanced_dag_dependency-2.zip', 
> '/root/airflow/dags/advanced_dag_dependency-2.zip', 
> '/root/airflow/dags/advanced_dag_dependency-1.zip', 
> '/root/airflow/dags/advanced_dag_dependency-1.zip', 
> '/root/airflow/dags/yang_subdag_1', '/usr/local/bin', '/usr/lib/python2.7', 
> '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', 
> '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', 
> '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages', 
> '/usr/lib/python2.7/dist-packages/PILcompat', '/root/airflow/config', 
> '/root/airflow/dags', '/root/airflow/plugins'] 
> Print from MyFirstOperator in Dag 1 
> process id: 5059 
> thread id: <_MainThread(*MainThread*, started 140339858560768)> 
>  
> sys.path :[u'/root/airflow/dags/advanced_dag_dependency-1.zip', 
> '/usr/local/bin', '/usr/lib/python2.7', 
> '/usr/lib/python2.7/plat-x86_64-linux-gnu', '/usr/lib/python2.7/lib-tk', 
> '/usr/lib/python2.7/lib-old', '/usr/lib/python2.7/lib-dynload', 
> '/usr/local/lib/python2.7/dist-packages', '/usr/lib/python2.7/dist-packages', 
> '/usr/lib/python2.7/dist-packages/PILcompat', '/root/airflow/config', 
> '/root/airflow/dags', '/root/airflow/plugins'] 
> Print from MyFirstOperator in Dag 1 
> process id: 5076 
> thread id: <_MainThread(*DagFileProcessor283*, started 140137838294784)> 





[jira] [Updated] (AIRFLOW-2057) Add Overstock to the list of Airflow users

2018-02-01 Thread Siddharth Anand (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Anand updated AIRFLOW-2057:
-
External issue URL: https://issues.apache.org/jira/browse/AIRFLOW-2057

> Add Overstock to the list of Airflow users
> --
>
> Key: AIRFLOW-2057
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2057
> Project: Apache Airflow
>  Issue Type: Task
>Reporter: Joy Gao
>Priority: Trivial
>






[jira] [Assigned] (AIRFLOW-2057) Add Overstock to the list of Airflow users

2018-02-01 Thread Siddharth Anand (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Siddharth Anand reassigned AIRFLOW-2057:


Assignee: Siddharth Anand

> Add Overstock to the list of Airflow users
> --
>
> Key: AIRFLOW-2057
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2057
> Project: Apache Airflow
>  Issue Type: Task
>Reporter: Joy Gao
>Assignee: Siddharth Anand
>Priority: Trivial
>






incubator-airflow git commit: [AIRFLOW-XXX] Add Plaid to Airflow users

2018-02-01 Thread fokko
Repository: incubator-airflow
Updated Branches:
  refs/heads/master 6ee4bbd4b -> 6d88744be


[AIRFLOW-XXX] Add Plaid to Airflow users

Closes #2995 from AustinBGibbons/master


Project: http://git-wip-us.apache.org/repos/asf/incubator-airflow/repo
Commit: http://git-wip-us.apache.org/repos/asf/incubator-airflow/commit/6d88744b
Tree: http://git-wip-us.apache.org/repos/asf/incubator-airflow/tree/6d88744b
Diff: http://git-wip-us.apache.org/repos/asf/incubator-airflow/diff/6d88744b

Branch: refs/heads/master
Commit: 6d88744bee498fdb42417e3b82365e126ece76b5
Parents: 6ee4bbd
Author: Austin Gibbons 
Authored: Thu Feb 1 10:04:07 2018 +0100
Committer: Fokko Driesprong 
Committed: Thu Feb 1 10:04:14 2018 +0100

--
 README.md | 1 +
 1 file changed, 1 insertion(+)
--


http://git-wip-us.apache.org/repos/asf/incubator-airflow/blob/6d88744b/README.md
--
diff --git a/README.md b/README.md
index 56c9cea..3584d34 100644
--- a/README.md
+++ b/README.md
@@ -174,6 +174,7 @@ Currently **officially** using Airflow:
 1. [PayPal](https://www.paypal.com/) [[@r39132](https://github.com/r39132) & 
[@jhsenjaliya](https://github.com/jhsenjaliya)]
 1. [Pernod-Ricard](https://www.pernod-ricard.com/) 
[[@romain-nio](https://github.com/romain-nio) 
 1. [Playbuzz](https://www.playbuzz.com/) 
[[@clintonboys](https://github.com/clintonboys) & 
[@dbn](https://github.com/dbn)]
+1. [Plaid](https://www.plaid.com/) [[@plaid](https://github.com/plaid), 
[@AustinBGibbons](https://github.com/AustinBGibbons) & 
[@jeeyoungk](https://github.com/jeeyoungk)]
 1. [Postmates](http://www.postmates.com) 
[[@syeoryn](https://github.com/syeoryn)]
 1. [Pronto Tools](http://www.prontotools.io/) 
[[@zkan](https://github.com/zkan) & [@mesodiar](https://github.com/mesodiar)]
 1. [Qubole](https://qubole.com) [[@msumit](https://github.com/msumit)]