[jira] [Commented] (AIRFLOW-3347) Unable to configure Kubernetes secrets through environment

2018-11-14 Thread Chris Bandy (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16687005#comment-16687005
 ] 

Chris Bandy commented on AIRFLOW-3347:
--

I was able to track this down. The following line in kubernetes_executor was 
always returning an empty OrderedDict():

{code:python}
self.kube_secrets = configuration_dict.get('kubernetes_secrets', {})
{code}

The following line in AirflowConfigParser was not prepared for double 
underscores within the key portion of a variable beneath a configuration 
section. Limiting the split seemed to do the trick.

{code:python}
diff --git a/airflow/configuration.py b/airflow/configuration.py
index 2e05fde0..4c923b80 100644
--- a/airflow/configuration.py
+++ b/airflow/configuration.py
@@ -358,7 +358,7 @@ class AirflowConfigParser(ConfigParser):
         # add env vars and overwrite because they have priority
         for ev in [ev for ev in os.environ if ev.startswith('AIRFLOW__')]:
             try:
-                _, section, key = ev.split('__')
+                _, section, key = ev.split('__', 2)
                 opt = self._get_env_var_option(section, key)
             except ValueError:
                 opt = None
{code}
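
To see why limiting the split matters, here's a minimal REPL sketch:

{code:python}
>>> ev = 'AIRFLOW__KUBERNETES_SECRETS__AIRFLOW__CORE__SQL_ALCHEMY_CONN'
>>> ev.split('__')      # five pieces; unpacking into three names raises ValueError
['AIRFLOW', 'KUBERNETES_SECRETS', 'AIRFLOW', 'CORE', 'SQL_ALCHEMY_CONN']
>>> ev.split('__', 2)   # at most two splits; the section and the full key survive
['AIRFLOW', 'KUBERNETES_SECRETS', 'AIRFLOW__CORE__SQL_ALCHEMY_CONN']
{code}

Because the ValueError is swallowed ({{opt = None}}), the kubernetes_secrets 
section silently came back empty.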

> Unable to configure Kubernetes secrets through environment
> --
>
> Key: AIRFLOW-3347
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3347
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: configuration, kubernetes
>Affects Versions: 1.10.0
>Reporter: Chris Bandy
>Priority: Major
>
> We configure Airflow through environment variables. While setting up the 
> Kubernetes Executor, we wanted to pass the SQL Alchemy connection string to 
> workers by including it in the {{kubernetes_secrets}} section of the config.
> Unfortunately, even with 
> {{AIRFLOW_\_KUBERNETES_SECRETS_\_AIRFLOW_\_CORE_\_SQL_ALCHEMY_CONN}} set in 
> the scheduler environment, the worker gets no secret environment variables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (AIRFLOW-3347) Unable to configure Kubernetes secrets through environment

2018-11-14 Thread Chris Bandy (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-3347?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bandy updated AIRFLOW-3347:
-
Description: 
We configure Airflow through environment variables. While setting up the 
Kubernetes Executor, we wanted to pass the SQL Alchemy connection string to 
workers by including it in the {{kubernetes_secrets}} section of the config.

Unfortunately, even with 
{{AIRFLOW_\_KUBERNETES_SECRETS_\_AIRFLOW_\_CORE_\_SQL_ALCHEMY_CONN}} set in the 
scheduler environment, the worker gets no secret environment variables.

  was:
We configure Airflow through environment variables. While setting up the 
Kubernetes Executor, we wanted to pass the SQL Alchemy connection string to 
workers by including it in the {{kubernetes_secrets}} section of the config.

Unfortunately, even with 
{{AIRFLOW__KUBERNETES_SECRETS__AIRFLOW__CORE__SQL_ALCHEMY_CONN}} set in the 
scheduler environment, the worker gets no secret environment variables.


> Unable to configure Kubernetes secrets through environment
> --
>
> Key: AIRFLOW-3347
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3347
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: configuration, kubernetes
>Affects Versions: 1.10.0
>Reporter: Chris Bandy
>Priority: Major
>
> We configure Airflow through environment variables. While setting up the 
> Kubernetes Executor, we wanted to pass the SQL Alchemy connection string to 
> workers by including it in the {{kubernetes_secrets}} section of the config.
> Unfortunately, even with 
> {{AIRFLOW_\_KUBERNETES_SECRETS_\_AIRFLOW_\_CORE_\_SQL_ALCHEMY_CONN}} set in 
> the scheduler environment, the worker gets no secret environment variables.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRFLOW-3347) Unable to configure Kubernetes secrets through environment

2018-11-14 Thread Chris Bandy (JIRA)
Chris Bandy created AIRFLOW-3347:


 Summary: Unable to configure Kubernetes secrets through environment
 Key: AIRFLOW-3347
 URL: https://issues.apache.org/jira/browse/AIRFLOW-3347
 Project: Apache Airflow
  Issue Type: Bug
  Components: configuration, kubernetes
Affects Versions: 1.10.0
Reporter: Chris Bandy


We configure Airflow through environment variables. While setting up the 
Kubernetes Executor, we wanted to pass the SQL Alchemy connection string to 
workers by including it in the {{kubernetes_secrets}} section of the config.

Unfortunately, even with 
{{AIRFLOW__KUBERNETES_SECRETS__AIRFLOW__CORE__SQL_ALCHEMY_CONN}} set in the 
scheduler environment, the worker gets no secret environment variables.
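
For context, the intent was roughly the following (a sketch; the 
{{airflow-secrets}} name and the {{<secret-name>=<secret-key>}} value format 
are illustrative assumptions on my part):

{code:python}
# Sketch: this variable should land in the [kubernetes_secrets] section under
# the key AIRFLOW__CORE__SQL_ALCHEMY_CONN, telling workers to mount the
# 'sql_alchemy_conn' key of a (hypothetical) 'airflow-secrets' Kubernetes secret.
import os

os.environ['AIRFLOW__KUBERNETES_SECRETS__AIRFLOW__CORE__SQL_ALCHEMY_CONN'] = \
    'airflow-secrets=sql_alchemy_conn'
{code}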



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2143) Try number displays incorrect values in the web UI

2018-11-08 Thread Chris Bandy (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679843#comment-16679843
 ] 

Chris Bandy commented on AIRFLOW-2143:
--

Affects 1.10.0 as well.

> Try number displays incorrect values in the web UI
> --
>
> Key: AIRFLOW-2143
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2143
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: James Davidheiser
>Priority: Minor
> Attachments: adhoc_query.png, task_instance_page.png
>
>
> This was confusing us a lot in our task runs - in the database, a task that 
> ran is marked as 1 try.  However, when we view it in the UI, it shows as 2 
> tries in several places.  These include:
>  * Task Instance Details (i.e. 
> [https://airflow/task?execution_date=xxx&dag_id=xxx&task_id=xxx 
> )|https://airflow/task?execution_date=xxx&dag_id=xxx&task_id=xxx]
>  * Task instance browser (/admin/taskinstance/)
>  * Task Tries graph (/admin/airflow/tries)
> Notably, it is correctly shown as 1 try in the log filenames, on the log 
> viewer page (admin/airflow/log?execution_date=), and some other places.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRFLOW-3312) No log output from BashOperator under test

2018-11-08 Thread Chris Bandy (JIRA)
Chris Bandy created AIRFLOW-3312:


 Summary: No log output from BashOperator under test
 Key: AIRFLOW-3312
 URL: https://issues.apache.org/jira/browse/AIRFLOW-3312
 Project: Apache Airflow
  Issue Type: Bug
  Components: logging, operators
Affects Versions: 1.10.0
Reporter: Chris Bandy


The BashOperator logs some messages as well as the stdout of its command at the 
info level, but none of these appear when running {{airflow test}} with the 
default configuration.

For example, this DAG emits the following in Airflow 1.10.0:
{code:python}
from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from datetime import datetime

dag = DAG('please', start_date=datetime(year=2018, month=11, day=1))

BashOperator(dag=dag, task_id='mine', bash_command='echo thank you')
{code}

{noformat}
$ airflow test please mine '2018-11-01'
[2018-11-08 00:06:54,098] {__init__.py:51} INFO - Using executor SequentialExecutor
[2018-11-08 00:06:54,246] {models.py:258} INFO - Filling up the DagBag from /usr/local/airflow/dags
{noformat}

When executed by the scheduler, logs go to a file:

{noformat}
$ airflow scheduler -n 1
...
[2018-11-08 00:41:02,674] {dag_processing.py:582} INFO - Started a process (PID: 9) to generate tasks for /usr/local/airflow/dags/please.py
[2018-11-08 00:41:03,185] {dag_processing.py:495} INFO - Processor for /usr/local/airflow/dags/please.py finished
[2018-11-08 00:41:03,525] {jobs.py:1114} INFO - Tasks up for execution:

[2018-11-08 00:41:03,536] {jobs.py:1147} INFO - Figuring out tasks to run in Pool(name=None) with 128 open slots and 1 task instances in queue
[2018-11-08 00:41:03,539] {jobs.py:1184} INFO - DAG please has 0/16 running and queued tasks
[2018-11-08 00:41:03,540] {jobs.py:1216} INFO - Setting the follow tasks to queued state:

[2018-11-08 00:41:03,573] {jobs.py:1297} INFO - Setting the follow tasks to queued state:

[2018-11-08 00:41:03,576] {jobs.py:1339} INFO - Sending ('please', 'mine', datetime.datetime(2018, 11, 1, 0, 0, tzinfo=)) to executor with priority 1 and queue default
[2018-11-08 00:41:03,578] {base_executor.py:56} INFO - Adding to queue: airflow run please mine 2018-11-01T00:00:00+00:00 --local -sd /usr/local/airflow/dags/please.py
[2018-11-08 00:41:03,593] {sequential_executor.py:45} INFO - Executing command: airflow run please mine 2018-11-01T00:00:00+00:00 --local -sd /usr/local/airflow/dags/please.py
[2018-11-08 00:41:04,262] {__init__.py:51} INFO - Using executor SequentialExecutor
[2018-11-08 00:41:04,406] {models.py:258} INFO - Filling up the DagBag from /usr/local/airflow/dags/please.py
[2018-11-08 00:41:04,458] {cli.py:492} INFO - Running  on host e2e08cf4dfaa
[2018-11-08 00:41:09,684] {jobs.py:1443} INFO - Executor reports please.mine execution_date=2018-11-01 00:00:00+00:00 as success

$ cat logs/please/mine/2018-11-01T00\:00\:00+00\:00/1.log
[2018-11-08 00:41:04,554] {models.py:1335} INFO - Dependencies all met for 

[2018-11-08 00:41:04,564] {models.py:1335} INFO - Dependencies all met for 

[2018-11-08 00:41:04,565] {models.py:1547} INFO -

Starting attempt 1 of 1


[2018-11-08 00:41:04,605] {models.py:1569} INFO - Executing  on 2018-11-01T00:00:00+00:00
[2018-11-08 00:41:04,605] {base_task_runner.py:124} INFO - Running: ['bash', '-c', 'airflow run please mine 2018-11-01T00:00:00+00:00 --job_id 142 --raw -sd DAGS_FOLDER/please.py --cfg_path /tmp/tmp9prq7knr']
[2018-11-08 00:41:05,214] {base_task_runner.py:107} INFO - Job 142: Subtask mine [2018-11-08 00:41:05,213] {__init__.py:51} INFO - Using executor SequentialExecutor
[2018-11-08 00:41:05,334] {base_task_runner.py:107} INFO - Job 142: Subtask mine [2018-11-08 00:41:05,333] {models.py:258} INFO - Filling up the DagBag from /usr/local/airflow/dags/please.py
[2018-11-08 00:41:05,368] {base_task_runner.py:107} INFO - Job 142: Subtask mine [2018-11-08 00:41:05,367] {cli.py:492} INFO - Running  on host e2e08cf4dfaa
[2018-11-08 00:41:05,398] {bash_operator.py:74} INFO - Tmp dir root location:
 /tmp
[2018-11-08 00:41:05,398] {bash_operator.py:87} INFO - Temporary script location: /tmp/airflowtmp0is6wwxi/mine8tmew5y4
[2018-11-08 00:41:05,399] {bash_operator.py:97} INFO - Running command: echo thank you
[2018-11-08 00:41:05,402] {bash_operator.py:106} INFO - Output:
[2018-11-08 00:41:05,404] {bash_operator.py:110} INFO - thank you
[2018-11-08 00:41:05,404] {bash_operator.py:114} INFO - Command exited with return code 0
[2018-11-08 00:41:09,504] {logging_mixin.py:95} INFO - [2018-11-08 00:41:09,503] {jobs.py:2612} INFO - Task exited with return code 0
{noformat}

 


This appears to be a regression. In Airflow 1.9.0, the same DAG with the 
default configuration emits the operator's log messages under {{airflow test}}.
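
Until this is fixed, one possible workaround (an assumption on my part, not a 
project-sanctioned fix) is to attach a console handler to the {{airflow.task}} 
logger, e.g. at the top of the DAG file, so task-level records reach stdout 
under {{airflow test}}:

{code:python}
# Hypothetical workaround sketch using only the standard library.
import logging
import sys

handler = logging.StreamHandler(sys.stdout)
handler.setLevel(logging.INFO)
logging.getLogger('airflow.task').addHandler(handler)
{code}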

[jira] [Commented] (AIRFLOW-3299) Logs for currently running sensors not visible in the UI

2018-11-08 Thread Chris Bandy (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-3299?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16679846#comment-16679846
 ] 

Chris Bandy commented on AIRFLOW-3299:
--

Possibly related to AIRFLOW-2143?

> Logs for currently running sensors not visible in the UI
> 
>
> Key: AIRFLOW-3299
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3299
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: ui
>Reporter: Brad Holmes
>Priority: Major
>
> When a task is actively running, the logs are not appearing.  I have tracked 
> this down to the {{next_try_number}} logic of task-instances.
> In [the view at line 
> 836|https://github.com/apache/incubator-airflow/blame/master/airflow/www/views.py#L836],
>  we have
> {code:java}
> logs = [''] * (ti.next_try_number - 1 if ti is not None else 0)
> {code}
> The length of the {{logs}} array tells the frontend how many {{attempts}} 
> exist, and thus how many AJAX calls to make to load the logs.
> Here is the current logic I have observed:
> ||Task State||Current length of 'logs'||Needed length of 'logs'||
> |Successfully completed in 1 attempt|1|1|
> |Successfully completed in 2 attempts|2|2|
> |Not yet attempted|0|0|
> |Actively running task, first time|0|1|
> That last case is the bug.  Perhaps task-instance needs a method like 
> {{most_recent_try_number}}?  I don't see how to make use of {{try_number()}} 
> or {{next_try_number()}} to meet the need here.
> ||Task State||try_number()||next_try_number()||Number of Attempts _Should_ Display||
> |Successfully completed in 1 attempt|2|2|1|
> |Successfully completed in 2 attempts|3|3|2|
> |Not yet attempted|1|1|0|
> |Actively running task, first time|0|1|1|
> [~ashb] : You implemented this portion of task-instance 11 months ago.  Any 
> suggestions?  Or perhaps the problem is elsewhere?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-593) Tasks do not get backfilled sequentially

2018-07-20 Thread Chris Bandy (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16551139#comment-16551139
 ] 

Chris Bandy commented on AIRFLOW-593:
-

I can see that every instance of `load_transactions` believes it is the first 
instance:

{noformat}
[2018-07-20 18:23:53,724] {models.py:1216} DEBUG -  dependency 'Previous Dagrun State' PASSED: True, This task instance was the first task instance for its task.

[2018-07-20 18:23:53,774] {models.py:1216} DEBUG -  dependency 'Previous Dagrun State' PASSED: True, This task instance was the first task instance for its task.

[2018-07-20 18:24:33,619] {models.py:1216} DEBUG -  dependency 'Previous Dagrun State' PASSED: True, This task instance was the first task instance for its task.

[2018-07-20 18:24:33,689] {models.py:1216} DEBUG -  dependency 'Previous Dagrun State' PASSED: True, This task instance was the first task instance for its task.

[2018-07-20 18:24:59,968] {models.py:1216} DEBUG -  dependency 'Previous Dagrun State' PASSED: True, This task instance was the first task instance for its task.
{noformat}


> Tasks do not get backfilled sequentially
> 
>
> Key: AIRFLOW-593
> URL: https://issues.apache.org/jira/browse/AIRFLOW-593
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DagRun, scheduler
>Affects Versions: Airflow 1.7.1.3
>Reporter: Jong Kim
>Priority: Minor
> Attachments: Screen Shot 2018-07-20 at 10.04.24 AM.png
>
>
> I need to have the tasks within a DAG complete in order when running 
> backfills. I am running on my Mac locally using SequentialExecutor.
> Let's say I have a DAG running daily at 11AM UTC (0 11 * * *) with a 
> start_date: datetime(2016, 10, 20, 11, 0, 0). The DAG consists of 3 tasks, 
> which must complete in order. task0 -> task1 -> task2. This dependency is set 
> using .set_downstream().
> Today (2016/10/22) I reset the database, turn on the DAG run using the on/off 
> toggle in the webserver, and issue "airflow scheduler", which will 
> automatically backfill starting from start_date.
> It will backfill for 2016/10/20 and 2016/10/21.  I expect backfill to run 
> like the following sequentially:
> datetime(2016, 10, 20, 11, 0, 0) task0
> datetime(2016, 10, 20, 11, 0, 0) task1
> datetime(2016, 10, 20, 11, 0, 0) task2
> datetime(2016, 10, 21, 11, 0, 0) task0
> datetime(2016, 10, 21, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task2
> With 'depends_on_past': False, I see Airflow running tasks grouped by 
> sequence number something like this, which is not what I want:
> datetime(2016, 10, 20, 11, 0, 0) task0
> datetime(2016, 10, 21, 11, 0, 0) task0
> datetime(2016, 10, 20, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task1
> datetime(2016, 10, 20, 11, 0, 0) task2
> datetime(2016, 10, 21, 11, 0, 0) task2
> With 'depends_on_past': True and 'wait_for_downstream': True, I expect it to 
> run in the order I need, but instead it runs some tasks out of order, like 
> this:
> datetime(2016, 10, 20, 11, 0, 0) task0
> datetime(2016, 10, 20, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task0   <- out of order!
> datetime(2016, 10, 20, 11, 0, 0) task2   <- out of order!
> datetime(2016, 10, 21, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task2
> Is this a bug? If not, am I understanding 'depends_on_past' and 
> 'wait_for_downstream' correctly? What do I need to do?
> The only remedy I can think of is to backfill each date manually.
> Public gist of DAG: 
> https://gist.github.com/jong-eatsa/cba1bf3c182b38e966696da47164faf1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-593) Tasks do not get backfilled sequentially

2018-07-20 Thread Chris Bandy (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16550894#comment-16550894
 ] 

Chris Bandy commented on AIRFLOW-593:
-

https://lists.apache.org/thread.html/ef9ab995d019590eb7b072a74efca2a160b9a4916b6c1618c2ab762b@%3Cdev.airflow.apache.org%3E

> Tasks do not get backfilled sequentially
> 
>
> Key: AIRFLOW-593
> URL: https://issues.apache.org/jira/browse/AIRFLOW-593
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DagRun, scheduler
>Affects Versions: Airflow 1.7.1.3
>Reporter: Jong Kim
>Priority: Minor
> Attachments: Screen Shot 2018-07-20 at 10.04.24 AM.png
>
>
> I need to have the tasks within a DAG complete in order when running 
> backfills. I am running on my Mac locally using SequentialExecutor.
> Let's say I have a DAG running daily at 11AM UTC (0 11 * * *) with a 
> start_date: datetime(2016, 10, 20, 11, 0, 0). The DAG consists of 3 tasks, 
> which must complete in order. task0 -> task1 -> task2. This dependency is set 
> using .set_downstream().
> Today (2016/10/22) I reset the database, turn on the DAG run using the on/off 
> toggle in the webserver, and issue "airflow scheduler", which will 
> automatically backfill starting from start_date.
> It will backfill for 2016/10/20 and 2016/10/21.  I expect backfill to run 
> like the following sequentially:
> datetime(2016, 10, 20, 11, 0, 0) task0
> datetime(2016, 10, 20, 11, 0, 0) task1
> datetime(2016, 10, 20, 11, 0, 0) task2
> datetime(2016, 10, 21, 11, 0, 0) task0
> datetime(2016, 10, 21, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task2
> With 'depends_on_past': False, I see Airflow running tasks grouped by 
> sequence number something like this, which is not what I want:
> datetime(2016, 10, 20, 11, 0, 0) task0
> datetime(2016, 10, 21, 11, 0, 0) task0
> datetime(2016, 10, 20, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task1
> datetime(2016, 10, 20, 11, 0, 0) task2
> datetime(2016, 10, 21, 11, 0, 0) task2
> With 'depends_on_past': True and 'wait_for_downstream': True, I expect it to 
> run in the order I need, but instead it runs some tasks out of order, like 
> this:
> datetime(2016, 10, 20, 11, 0, 0) task0
> datetime(2016, 10, 20, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task0   <- out of order!
> datetime(2016, 10, 20, 11, 0, 0) task2   <- out of order!
> datetime(2016, 10, 21, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task2
> Is this a bug? If not, am I understanding 'depends_on_past' and 
> 'wait_for_downstream' correctly? What do I need to do?
> The only remedy I can think of is to backfill each date manually.
> Public gist of DAG: 
> https://gist.github.com/jong-eatsa/cba1bf3c182b38e966696da47164faf1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-593) Tasks do not get backfilled sequentially

2018-07-20 Thread Chris Bandy (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16550869#comment-16550869
 ] 

Chris Bandy commented on AIRFLOW-593:
-

I'm seeing this same behavior in Airflow 1.9.0.

The `load_transactions` task has `depends_on_past=True`, but earlier instances 
are getting queued/executed after later ones during backfill. (The errors 
occurred when I killed the backfill command.)

 

!Screen Shot 2018-07-20 at 10.04.24 AM.png!
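
For reference, a minimal sketch of the kind of DAG involved (mirroring the 
task0 -> task1 -> task2 layout from the issue description; this is not the 
reporter's gist, and the bash commands are placeholders):

{code:python}
from datetime import datetime

from airflow import DAG
from airflow.operators.bash_operator import BashOperator

dag = DAG('backfill_order', schedule_interval='0 11 * * *',
          start_date=datetime(2016, 10, 20, 11, 0, 0))

# Three tasks that must complete in order for each execution date.
task0 = BashOperator(dag=dag, task_id='task0', bash_command='sleep 1',
                     depends_on_past=True, wait_for_downstream=True)
task1 = BashOperator(dag=dag, task_id='task1', bash_command='sleep 1',
                     depends_on_past=True, wait_for_downstream=True)
task2 = BashOperator(dag=dag, task_id='task2', bash_command='sleep 1',
                     depends_on_past=True, wait_for_downstream=True)

task0.set_downstream(task1)
task1.set_downstream(task2)
{code}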

> Tasks do not get backfilled sequentially
> 
>
> Key: AIRFLOW-593
> URL: https://issues.apache.org/jira/browse/AIRFLOW-593
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DagRun, scheduler
>Affects Versions: Airflow 1.7.1.3
>Reporter: Jong Kim
>Priority: Minor
> Attachments: Screen Shot 2018-07-20 at 10.04.24 AM.png
>
>
> I need to have the tasks within a DAG complete in order when running 
> backfills. I am running on my Mac locally using SequentialExecutor.
> Let's say I have a DAG running daily at 11AM UTC (0 11 * * *) with a 
> start_date: datetime(2016, 10, 20, 11, 0, 0). The DAG consists of 3 tasks, 
> which must complete in order. task0 -> task1 -> task2. This dependency is set 
> using .set_downstream().
> Today (2016/10/22) I reset the database, turn on the DAG run using the on/off 
> toggle in the webserver, and issue "airflow scheduler", which will 
> automatically backfill starting from start_date.
> It will backfill for 2016/10/20 and 2016/10/21.  I expect backfill to run 
> like the following sequentially:
> datetime(2016, 10, 20, 11, 0, 0) task0
> datetime(2016, 10, 20, 11, 0, 0) task1
> datetime(2016, 10, 20, 11, 0, 0) task2
> datetime(2016, 10, 21, 11, 0, 0) task0
> datetime(2016, 10, 21, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task2
> With 'depends_on_past': False, I see Airflow running tasks grouped by 
> sequence number something like this, which is not what I want:
> datetime(2016, 10, 20, 11, 0, 0) task0
> datetime(2016, 10, 21, 11, 0, 0) task0
> datetime(2016, 10, 20, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task1
> datetime(2016, 10, 20, 11, 0, 0) task2
> datetime(2016, 10, 21, 11, 0, 0) task2
> With 'depends_on_past': True and 'wait_for_downstream': True, I expect it to 
> run in the order I need, but instead it runs some tasks out of order, like 
> this:
> datetime(2016, 10, 20, 11, 0, 0) task0
> datetime(2016, 10, 20, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task0   <- out of order!
> datetime(2016, 10, 20, 11, 0, 0) task2   <- out of order!
> datetime(2016, 10, 21, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task2
> Is this a bug? If not, am I understanding 'depends_on_past' and 
> 'wait_for_downstream' correctly? What do I need to do?
> The only remedy I can think of is to backfill each date manually.
> Public gist of DAG: 
> https://gist.github.com/jong-eatsa/cba1bf3c182b38e966696da47164faf1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (AIRFLOW-593) Tasks do not get backfilled sequentially

2018-07-20 Thread Chris Bandy (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bandy updated AIRFLOW-593:

Attachment: Screen Shot 2018-07-20 at 10.04.24 AM.png

> Tasks do not get backfilled sequentially
> 
>
> Key: AIRFLOW-593
> URL: https://issues.apache.org/jira/browse/AIRFLOW-593
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DagRun, scheduler
>Affects Versions: Airflow 1.7.1.3
>Reporter: Jong Kim
>Priority: Minor
> Attachments: Screen Shot 2018-07-20 at 10.04.24 AM.png
>
>
> I need to have the tasks within a DAG complete in order when running 
> backfills. I am running on my Mac locally using SequentialExecutor.
> Let's say I have a DAG running daily at 11AM UTC (0 11 * * *) with a 
> start_date: datetime(2016, 10, 20, 11, 0, 0). The DAG consists of 3 tasks, 
> which must complete in order. task0 -> task1 -> task2. This dependency is set 
> using .set_downstream().
> Today (2016/10/22) I reset the database, turn on the DAG run using the on/off 
> toggle in the webserver, and issue "airflow scheduler", which will 
> automatically backfill starting from start_date.
> It will backfill for 2016/10/20 and 2016/10/21.  I expect backfill to run 
> like the following sequentially:
> datetime(2016, 10, 20, 11, 0, 0) task0
> datetime(2016, 10, 20, 11, 0, 0) task1
> datetime(2016, 10, 20, 11, 0, 0) task2
> datetime(2016, 10, 21, 11, 0, 0) task0
> datetime(2016, 10, 21, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task2
> With 'depends_on_past': False, I see Airflow running tasks grouped by 
> sequence number something like this, which is not what I want:
> datetime(2016, 10, 20, 11, 0, 0) task0
> datetime(2016, 10, 21, 11, 0, 0) task0
> datetime(2016, 10, 20, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task1
> datetime(2016, 10, 20, 11, 0, 0) task2
> datetime(2016, 10, 21, 11, 0, 0) task2
> With 'depends_on_past': True and 'wait_for_downstream': True, I expect it to 
> run in the order I need, but instead it runs some tasks out of order, like 
> this:
> datetime(2016, 10, 20, 11, 0, 0) task0
> datetime(2016, 10, 20, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task0   <- out of order!
> datetime(2016, 10, 20, 11, 0, 0) task2   <- out of order!
> datetime(2016, 10, 21, 11, 0, 0) task1
> datetime(2016, 10, 21, 11, 0, 0) task2
> Is this a bug? If not, am I understanding 'depends_on_past' and 
> 'wait_for_downstream' correctly? What do I need to do?
> The only remedy I can think of is to backfill each date manually.
> Public gist of DAG: 
> https://gist.github.com/jong-eatsa/cba1bf3c182b38e966696da47164faf1



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2143) Try number displays incorrect values in the web UI

2018-05-23 Thread Chris Bandy (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2143?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16487329#comment-16487329
 ] 

Chris Bandy commented on AIRFLOW-2143:
--

Introduced by AIRFLOW-1873, I expect.

https://github.com/apache/incubator-airflow/commit/f205fae9abdba271c1eaecdf1c9db950154a8199

> Try number displays incorrect values in the web UI
> --
>
> Key: AIRFLOW-2143
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2143
> Project: Apache Airflow
>  Issue Type: Bug
>Affects Versions: 1.9.0
>Reporter: James Davidheiser
>Priority: Minor
> Attachments: adhoc_query.png, task_instance_page.png
>
>
> This was confusing us a lot in our task runs - in the database, a task that 
> ran is marked as 1 try.  However, when we view it in the UI, it shows as 2 
> tries in several places.  These include:
>  * Task Instance Details (i.e. 
> [https://airflow/task?execution_date=xxx&dag_id=xxx&task_id=xxx 
> )|https://airflow/task?execution_date=xxx&dag_id=xxx&task_id=xxx]
>  * Task instance browser (/admin/taskinstance/)
>  * Task Tries graph (/admin/airflow/tries)
> Notably, it is correctly shown as 1 try in the log filenames, on the log 
> viewer page (admin/airflow/log?execution_date=), and some other places.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2390) FlaskWTFDeprecationWarning: "flask_wtf.Form" has been renamed

2018-04-28 Thread Chris Bandy (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16457664#comment-16457664
 ] 

Chris Bandy commented on AIRFLOW-2390:
--

If I understand correctly, this can be remedied by renaming one import here:

https://github.com/apache/incubator-airflow/blob/1.9.0/airflow/www/forms.py#L23
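
Assuming the import there is {{from flask_wtf import Form}}, the change would 
be something like:

{code:python}
-from flask_wtf import Form
+from flask_wtf import FlaskForm as Form
{code}

Aliasing it as {{Form}} keeps the rest of the module untouched.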

> FlaskWTFDeprecationWarning: "flask_wtf.Form" has been renamed
> -
>
> Key: AIRFLOW-2390
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2390
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: webserver
>Affects Versions: 1.9.0
>Reporter: Chris Bandy
>Priority: Trivial
>
> Webserver complains about Flask deprecation:
> {noformat}
> /usr/local/lib/python3.5/dist-packages/airflow/www/views.py:661: FlaskWTFDeprecationWarning: "flask_wtf.Form" has been renamed to "FlaskForm" and will be removed in 1.0.
> form = DateTimeForm(data={'execution_date': dttm}){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (AIRFLOW-2390) FlaskWTFDeprecationWarning: "flask_wtf.Form" has been renamed

2018-04-28 Thread Chris Bandy (JIRA)
Chris Bandy created AIRFLOW-2390:


 Summary: FlaskWTFDeprecationWarning: "flask_wtf.Form" has been 
renamed
 Key: AIRFLOW-2390
 URL: https://issues.apache.org/jira/browse/AIRFLOW-2390
 Project: Apache Airflow
  Issue Type: Improvement
  Components: webserver
Affects Versions: 1.9.0
Reporter: Chris Bandy


Webserver complains about Flask deprecation:
{noformat}
/usr/local/lib/python3.5/dist-packages/airflow/www/views.py:661: FlaskWTFDeprecationWarning: "flask_wtf.Form" has been renamed to "FlaskForm" and will be removed in 1.0.
form = DateTimeForm(data={'execution_date': dttm}){noformat}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (AIRFLOW-2128) 'Tall' DAGs scale worse than 'wide' DAGs

2018-04-07 Thread Chris Bandy (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429355#comment-16429355
 ] 

Chris Bandy edited comment on AIRFLOW-2128 at 4/7/18 1:21 PM:
--

[~szmate1618] what is your {{scheduler.min_file_process_interval}} (or the 
{{AIRFLOW_\_SCHEDULER_\_MIN_FILE_PROCESS_INTERVAL}} environment variable) set to?


was (Author: cbandy):
[~szmate1618] what is your {{scheduler.min_file_process_interval}} (or 
{{AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL}} environment) set to?

> 'Tall' DAGs scale worse than 'wide' DAGs
> 
>
> Key: AIRFLOW-2128
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2128
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG, DagRun, scheduler
>Affects Versions: 1.9.0
>Reporter: Máté Szabó
>Priority: Major
>  Labels: performance, usability
> Attachments: tall_dag.py, wide_dag.py
>
>
> Tall DAG = a DAG with long chains of dependencies, e.g.: 0 -> 1 -> 2 -> ... 
> -> 998 -> 999
>  Wide DAG = a DAG with many short, parallel dependencies e.g. 0 -> 1; 0 -> 2; 
> ... 0 -> 999
> Take a super simple case where both graphs consist of 1000 tasks, and all the 
> tasks are just "sleep 0.03" bash commands (see the attached files).
>  With the default SequentialExecutor (without parallelism), I would expect my 
> 2 example DAGs to take (approximately) the same time to run, but apparently 
> this is not the case.
> For the wide DAG it was about 80 successfully executed tasks in 10 minutes; 
> for the tall one it was 0.
> This anomaly also seems to affect the web UI. Opening up the graph view or the 
> tree view for the wide DAG takes about 6 seconds on my machine, but for the 
> tall one it takes significantly longer; in fact, currently it does not load at 
> all.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-2128) 'Tall' DAGs scale worse than 'wide' DAGs

2018-04-07 Thread Chris Bandy (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2128?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16429355#comment-16429355
 ] 

Chris Bandy commented on AIRFLOW-2128:
--

[~szmate1618] what is your {{scheduler.min_file_process_interval}} (or 
{{AIRFLOW__SCHEDULER__MIN_FILE_PROCESS_INTERVAL}} environment) set to?

> 'Tall' DAGs scale worse than 'wide' DAGs
> 
>
> Key: AIRFLOW-2128
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2128
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG, DagRun, scheduler
>Affects Versions: 1.9.0
>Reporter: Máté Szabó
>Priority: Major
>  Labels: performance, usability
> Attachments: tall_dag.py, wide_dag.py
>
>
> Tall DAG = a DAG with long chains of dependencies, e.g.: 0 -> 1 -> 2 -> ... 
> -> 998 -> 999
>  Wide DAG = a DAG with many short, parallel dependencies e.g. 0 -> 1; 0 -> 2; 
> ... 0 -> 999
> Take a super simple case where both graphs consist of 1000 tasks, and all the 
> tasks are just "sleep 0.03" bash commands (see the attached files).
>  With the default SequentialExecutor (without parallelism), I would expect my 
> 2 example DAGs to take (approximately) the same time to run, but apparently 
> this is not the case.
> For the wide DAG it was about 80 successfully executed tasks in 10 minutes; 
> for the tall one it was 0.
> This anomaly also seems to affect the web UI. Opening up the graph view or the 
> tree view for the wide DAG takes about 6 seconds on my machine, but for the 
> tall one it takes significantly longer; in fact, currently it does not load at 
> all.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-978) subdags concurrency setting is not working

2018-02-20 Thread Chris Bandy (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-978?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16370935#comment-16370935
 ] 

Chris Bandy commented on AIRFLOW-978:
-

[~jeffliujing] what is your {{core.parallelism}} set to?

> subdags concurrency setting is not working
> --
>
> Key: AIRFLOW-978
> URL: https://issues.apache.org/jira/browse/AIRFLOW-978
> Project: Apache Airflow
>  Issue Type: Bug
>Reporter: Jeff Liu
>Priority: Major
>
> I have a dag with one subdag; inside this one subdag (level2), there are 
> more than 100 subdags. 
> It seems that the concurrency setting on the level2 subdag doesn't work as 
> expected. With a concurrency setting of 12 on the level2 subdag, the subdag 
> only runs 4 concurrent jobs.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (AIRFLOW-2108) BashOperator discards process indentation

2018-02-15 Thread Chris Bandy (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bandy updated AIRFLOW-2108:
-
Description: 
When the BashOperator logs every line of output from the executing process, it 
strips leading whitespace, which makes it difficult to interpret output that was 
formatted with indentation.

For example, I'm executing [PGLoader|http://pgloader.readthedocs.io/] through 
this operator. When it finishes, it prints a summary, which appears in the logs 
like so:
{noformat}
[2018-02-14 07:31:44,524] {bash_operator.py:101} INFO - 2018-02-14T07:31:44.115000Z LOG report summary reset
[2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - table name errors   read   imported  bytes  total time   read  write
[2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - fetch meta data  0524524 1.438s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Schemas  0  0  0 0.161s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create SQL Types  0 19 1920.413s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create tables  0310310   3m2.316s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Set Table OIDs  0155155 0.458s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Index Build Completion  0353353  1m37.323s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Indexes  0353353  3m25.929s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Reset Sequences  0  0  0 2.677s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Primary Keys  0    147147  1m21.091s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Foreign Keys  0 16 16 8.283s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Triggers  0  0  0 0.339s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Install Comments  0  0  0 0.000s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Total import time      ✓  0  0  6m35.642s
{noformat}
Ideally, the leading whitespace would be retained, so the logs look like this:
{noformat}
[2018-02-14 07:31:44,524] {bash_operator.py:101} INFO - 2018-02-14T07:31:44.115000Z LOG report summary reset
[2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - table name     errors   read   imported  bytes  total time   read  write
[2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO -fetch meta data  0524524 1.438s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Schemas  0  0  0 0.161s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO -   Create SQL Types  0 19 1920.413s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO -  Create tables  0310310   3m2.316s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Set Table OIDs  0155155 0.458s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Index Build Completion  0353353  1m37.323s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create 

[jira] [Commented] (AIRFLOW-2108) BashOperator discards process indentation

2018-02-14 Thread Chris Bandy (JIRA)

[ 
https://issues.apache.org/jira/browse/AIRFLOW-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16364816#comment-16364816
 ] 

Chris Bandy commented on AIRFLOW-2108:
--

If I understand correctly, this could be fixed by replacing {{line.strip()}} 
with {{line.rstrip()}}.
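
For example (a quick REPL sketch with an illustrative line):

{code:python}
>>> line = '     Create Schemas      0      0      0     0.161s\n'
>>> line.strip()     # current behaviour: the indentation is lost
'Create Schemas      0      0      0     0.161s'
>>> line.rstrip()    # proposed: only trailing whitespace/newline is removed
'     Create Schemas      0      0      0     0.161s'
{code}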

> BashOperator discards process indentation
> -
>
> Key: AIRFLOW-2108
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2108
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: operators
>Affects Versions: 1.9.0
>Reporter: Chris Bandy
>Priority: Minor
>
> When the BashOperator logs every line of output from the executing process, 
> it strips leading whitespace, which makes it difficult to interpret output 
> that was formatted with indentation.
> For example, I'm executing [PGLoader|http://pgloader.readthedocs.io/] through 
> this operator. When it finishes, it prints a summary, which appears in the 
> logs like so:
> {noformat}
> [2018-02-14 07:31:44,524] {bash_operator.py:101} INFO - 2018-02-14T07:31:44.115000Z LOG report summary reset
> [2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - table name errors   read   imported  bytes  total time   read  write
> [2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - fetch meta data   0524524 1.438s
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Schemas   0  0  0 0.161s
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create SQL Types  0 19 1920.413s
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create tables  0310310   3m2.316s
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Set Table OIDs   0155155 0.458s
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Index Build Completion  0353353  1m37.323s
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Indexes   0353353  3m25.929s
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Reset Sequences   0  0  0 2.677s
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Primary Keys  0147147  1m21.091s
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Foreign Keys   0 16 16 8.283s
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Triggers   0  0  0 0.339s
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Install Comments  0  0  0 0.000s
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Total import time  ∞  0  0  6m35.642s
> {noformat}
> Ideally, the leading whitespace would be retained, so the logs look like this:
> {noformat}
> [2018-02-14 07:31:44,524] {bash_operator.py:101} INFO - 2018-02-14T07:31:44.115000Z LOG report summary reset
> [2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - table name errors   read   imported  bytes  total time   read   write
> [2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO -fetch meta data  0524524 1.438s
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Schemas  0  0  0 0.161s
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO -   Create SQL Types  0 19 1920.413s
> [2018-02-14 07:31:44,567] {bash_operator.py:101} INFO -  Create tables  0310310   3m2.316

[jira] [Updated] (AIRFLOW-2108) BashOperator discards process indentation

2018-02-14 Thread Chris Bandy (JIRA)

 [ 
https://issues.apache.org/jira/browse/AIRFLOW-2108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chris Bandy updated AIRFLOW-2108:
-
Description: 
When the BashOperator logs every line of output from the executing process, it 
strips leading whitespace, which makes it difficult to interpret output that was 
formatted with indentation.

For example, I'm executing [PGLoader|http://pgloader.readthedocs.io/] through 
this operator. When it finishes, it prints a summary, which appears in the logs 
like so:
{noformat}
[2018-02-14 07:31:44,524] {bash_operator.py:101} INFO - 2018-02-14T07:31:44.115000Z LOG report summary reset
[2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - table name errors   read   imported  bytes  total time   read  write
[2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - fetch meta data  0524524 1.438s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Schemas  0  0  0 0.161s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create SQL Types  0 19 1920.413s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create tables  0310310   3m2.316s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Set Table OIDs  0155155 0.458s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Index Build Completion  0353353  1m37.323s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Indexes  0353353  3m25.929s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Reset Sequences  0  0  0 2.677s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Primary Keys  0    147147  1m21.091s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Foreign Keys  0 16 16 8.283s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Triggers  0  0  0 0.339s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Install Comments  0  0  0 0.000s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Total import time      ∞  0  0  6m35.642s
{noformat}
Ideally, the leading whitespace would be retained, so the logs look like this:
{noformat}
[2018-02-14 07:31:44,524] {bash_operator.py:101} INFO - 2018-02-14T07:31:44.115000Z LOG report summary reset
[2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - table name     errors   read   imported  bytes  total time   read  write
[2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO -fetch meta data  0524524 1.438s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Schemas  0  0  0 0.161s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO -   Create SQL Types  0 19 1920.413s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO -  Create tables  0310310   3m2.316s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Set Table OIDs  0155155 0.458s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Index Build Completion  0353353  1m37.323s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create 

[jira] [Created] (AIRFLOW-2108) BashOperator discards process indentation

2018-02-14 Thread Chris Bandy (JIRA)
Chris Bandy created AIRFLOW-2108:


 Summary: BashOperator discards process indentation
 Key: AIRFLOW-2108
 URL: https://issues.apache.org/jira/browse/AIRFLOW-2108
 Project: Apache Airflow
  Issue Type: Bug
  Components: operators
Affects Versions: 1.9.0
Reporter: Chris Bandy


When the BashOperator logs every line of output from the executing process, it 
strips leading whitespace, which makes it difficult to interpret output that was 
formatted with indentation.

For example, I'm executing [PGLoader|http://pgloader.readthedocs.io/] through 
this operator. When it finishes, it prints a summary, which appears in the logs 
like so:

{noformat}
[2018-02-14 07:31:44,524] {bash_operator.py:101} INFO - 2018-02-14T07:31:44.115000Z LOG report summary reset
[2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - table name errors   read   imported  bytes  total time   read  write
[2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - fetch meta data  0524524 1.438s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Schemas  0  0  0 0.161s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create SQL Types  0 19 1920.413s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create tables  0310310   3m2.316s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Set Table OIDs  0155155 0.458s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Index Build Completion  0353353  1m37.323s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Indexes  0353353  3m25.929s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Reset Sequences  0  0  0 2.677s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Primary Keys  0    147147  1m21.091s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Foreign Keys  0 16 16 8.283s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Triggers  0  0  0 0.339s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Install Comments  0  0  0 0.000s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Total import time      ∞  0  0  6m35.642s
{noformat}

Ideally, the leading whitespace would be retained, so the output looks like 
this:

{noformat}
[2018-02-14 07:31:44,524] {bash_operator.py:101} INFO - 2018-02-14T07:31:44.115000Z LOG report summary reset
[2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - table name     errors   read   imported  bytes  total time   read  write
[2018-02-14 07:31:44,564] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO -fetch meta data  0524524 1.438s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Create Schemas  0  0  0 0.161s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO -   Create SQL Types  0 19 1920.413s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO -  Create tables  0310310   3m2.316s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - Set Table OIDs  0155155 0.458s
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31:44,567] {bash_operator.py:101} INFO - --  -  -  -  -  --  -  -
[2018-02-14 07:31