[GitHub] [airflow] potiuk commented on issue #5808: [AIRFLOW-5205] Check xml files with xmllint + Licenses

2019-08-14 Thread GitBox
potiuk commented on issue #5808:  [AIRFLOW-5205] Check xml files with xmllint + 
Licenses
URL: https://github.com/apache/airflow/pull/5808#issuecomment-521523483
 
 
   Made the PR standalone (not depending on a series of PRs).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] potiuk commented on a change in pull request #5808: [AIRFLOW-5205] Check xml files with xmllint + Licenses

2019-08-14 Thread GitBox
potiuk commented on a change in pull request #5808:  [AIRFLOW-5205] Check xml 
files with xmllint + Licenses
URL: https://github.com/apache/airflow/pull/5808#discussion_r314181813
 
 

 ##
 File path: airflow/_vendor/slugify/slugify.py
 ##
 @@ -1,3 +1,6 @@
+# -*- coding: utf-8 -*-
+# pylint: skip-file
+"""Slugify !"""
 
 Review comment:
   Removed in the first commit.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] potiuk commented on issue #5807: [AIRFLOW-5204] Shellcheck + common licences in shell files

2019-08-14 Thread GitBox
potiuk commented on issue #5807:  [AIRFLOW-5204] Shellcheck + common licences 
in shell files
URL: https://github.com/apache/airflow/pull/5807#issuecomment-521519887
 
 
   Again - another set of checks, this time for shell files (shellcheck + 
shebang/executable-flag checks + licenses).


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] potiuk commented on a change in pull request #5807: [AIRFLOW-5204] Shellcheck + common licence in shell files

2019-08-14 Thread GitBox
potiuk commented on a change in pull request #5807:  [AIRFLOW-5204] Shellcheck 
+ common licence in shell files
URL: https://github.com/apache/airflow/pull/5807#discussion_r314178750
 
 

 ##
 File path: airflow/example_dags/entrypoint.sh
 ##
 @@ -1,20 +1,20 @@
-# -*- coding: utf-8 -*-
+#!/usr/bin/env bash
+#  Licensed to the Apache Software Foundation (ASF) under one
+#  or more contributor license agreements.  See the NOTICE file
+#  distributed with this work for additional information
+#  regarding copyright ownership.  The ASF licenses this file
+#  to you under the Apache License, Version 2.0 (the
+#  "License"); you may not use this file except in compliance
+#  with the License.  You may obtain a copy of the License at
 #
-# Licensed to the Apache Software Foundation (ASF) under one
-# or more contributor license agreements.  See the NOTICE file
-# distributed with this work for additional information
-# regarding copyright ownership.  The ASF licenses this file
-# to you under the Apache License, Version 2.0 (the
-# "License"); you may not use this file except in compliance
-# with the License.  You may obtain a copy of the License at
+#http://www.apache.org/licenses/LICENSE-2.0
 #
-#   http://www.apache.org/licenses/LICENSE-2.0
-#
-# Unless required by applicable law or agreed to in writing,
-# software distributed under the License is distributed on an
-# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
-# KIND, either express or implied.  See the License for the
-# specific language governing permissions and limitations
-# under the License.
+#  Unless required by applicable law or agreed to in writing,
+#  software distributed under the License is distributed on an
+#  "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+#  KIND, either express or implied.  See the License for the
+#  specific language governing permissions and limitations
+#  under the License.
 
-["/bin/bash", "-c", "/bin/sleep 30; /bin/mv {{params.source_location}}/{{ ti.xcom_pull('view_file') }} {{params.target_location}}; /bin/echo '{{params.target_location}}/{{ ti.xcom_pull('view_file') }}';"]
+# TODO: Uncomment this code when we start using it
+#[ "/bin/bash", "-c", "/bin/sleep 30; /bin/mv {{params.source_location}}/{{ ti.xcom_pull('view_file') }} {{params.target_location}}; /bin/echo '{{params.target_location}}/{{ ti.xcom_pull('view_file') }}';" ]  # shellcheck disable=SC1073,SC1072,SC1035
 
 Review comment:
   This is a problematic implementation of DockerOperator with regard to `command`. 
The command can be either a string or an array. It can be templated, and it can 
also be a file with a .bash or .sh extension. In this case a Python array was 
stored in a file with a .sh extension - that was valid from the DockerOperator 
point of view (see docker_copy_data.py), but it makes little sense to store an 
array in a .sh file. Those tests in docker_copy_data.py were commented out 
anyhow, with a suggestion to uncomment them if you want to run your own testing.
   
   Rather than commenting it out, I simply moved the array to docker_copy_data.py 
and removed entrypoint.sh.
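   
   For illustration, a minimal sketch (hypothetical DAG id, image, and paths - 
not the actual docker_copy_data.py code) of passing the templated command array 
straight to DockerOperator instead of keeping it in an entrypoint.sh:
   
   ```python
   from datetime import datetime
   
   from airflow import DAG
   from airflow.operators.docker_operator import DockerOperator
   
   with DAG("docker_copy_example", start_date=datetime(2019, 8, 1),
            schedule_interval=None) as dag:
       move_file = DockerOperator(
           task_id="move_file",
           image="busybox",  # hypothetical image
           # command accepts a list and is templated, so the array can live in
           # the DAG file rather than in a .sh file
           command=[
               "/bin/bash", "-c",
               "/bin/sleep 30; "
               "/bin/mv {{params.source_location}}/{{ ti.xcom_pull('view_file') }} "
               "{{params.target_location}}; "
               "/bin/echo '{{params.target_location}}/{{ ti.xcom_pull('view_file') }}';",
           ],
           params={"source_location": "/tmp/in",    # hypothetical paths
                   "target_location": "/tmp/out"},
       )
   ```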


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-08-14 Thread Darren Weber (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darren Weber updated AIRFLOW-5218:
--
Description: 
The AWS Batch Operator attempts to use a boto3 feature that is not available 
and has not been merged in years, see
 - [https://github.com/boto/botocore/pull/1307]
 - see also [https://github.com/broadinstitute/cromwell/issues/4303]

This is a curious case of premature optimization. So, in the meantime, this 
means that the fallback is the exponential backoff routine for the status 
checks on the batch job. Unfortunately, when the concurrency of Airflow jobs is 
very high (100's of tasks), this fallback polling hits the AWS Batch API too 
hard and the AWS API throttle throws an error, which fails the Airflow task, 
simply because the status is polled too frequently.

Check the output from the retry algorithm, e.g. within the first 10 retries, 
the status of an AWS batch job is checked about 10 times at a rate that is 
approx 1 retry/sec. When an Airflow instance is running 10's or 100's of 
concurrent batch jobs, this hits the API too frequently and crashes the Airflow 
task (plus it occupies a worker in too much busy work).
{code:java}
In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
 Out[4]: 
 [1.0,
 1.01,
 1.04,
 1.09,
 1.1601,
 1.25,
 1.36,
 1.4902,
 1.6401,
 1.81,
 2.0,
 2.21,
 2.4404,
 2.6904,
 2.9604,
 3.25,
 3.5605,
 3.8906,
 4.24,
 4.61]{code}
Possible solutions are to introduce an initial sleep (say 60 sec?) right after 
issuing the request, so that the batch job has some time to spin up. The job 
progresses through a few phases before it gets to the RUNNING state, and 
polling for each phase of that sequence might help. Since batch jobs tend to be 
long-running jobs (rather than near-real-time jobs), it might help to issue 
less frequent polls when the job is in the RUNNING state. Something on the 
order of tens of seconds might be reasonable for batch jobs? Maybe the class 
could expose a parameter for the rate of polling (or a callable)?
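
A minimal sketch of that kind of polling schedule (hypothetical numbers and 
names; get_job_status stands in for whatever wraps the boto3 describe_jobs 
call):
{code}
import time


def wait_for_batch_job(get_job_status, initial_delay=60, running_poll=30,
                       max_polls=120):
    """Poll with an initial spin-up delay and a slow, fixed RUNNING-state
    interval instead of a fast exponential ramp."""
    time.sleep(initial_delay)  # give the batch job time to spin up
    for _ in range(max_polls):
        status = get_job_status()  # e.g. wraps batch_client.describe_jobs(...)
        if status in ("SUCCEEDED", "FAILED"):
            return status
        time.sleep(running_poll)  # tens of seconds is fine for batch jobs
    raise RuntimeError("batch job did not finish within the polling budget")
{code}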

 

Another option is to use something like the sensor-poke approach, with 
rescheduling, e.g.

- 
[https://github.com/apache/airflow/blob/master/airflow/sensors/base_sensor_operator.py#L117]
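
For reference, a sketch of that sensor-style alternative (BatchJobSensor and 
get_batch_job_status are hypothetical, not existing Airflow classes); with 
mode="reschedule" the worker slot is released between pokes instead of being 
occupied with busy work:
{code}
from airflow.exceptions import AirflowException
from airflow.sensors.base_sensor_operator import BaseSensorOperator


def get_batch_job_status(job_id):
    """Hypothetical helper; would wrap batch_client.describe_jobs(jobs=[job_id])."""
    raise NotImplementedError


class BatchJobSensor(BaseSensorOperator):
    """Sketch: poke the AWS Batch job status until it reaches a terminal state."""

    def __init__(self, job_id, *args, **kwargs):
        super(BatchJobSensor, self).__init__(*args, **kwargs)
        self.job_id = job_id

    def poke(self, context):
        status = get_batch_job_status(self.job_id)
        if status == "FAILED":
            raise AirflowException("AWS Batch job failed: %s" % self.job_id)
        return status == "SUCCEEDED"


# wait = BatchJobSensor(task_id="wait_batch", job_id="...",
#                       poke_interval=30, mode="reschedule")
{code}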

 

  was:
The AWS Batch Operator attempts to use a boto3 feature that is not available 
and has not been merged in years, see
 - [https://github.com/boto/botocore/pull/1307]
 - see also [https://github.com/broadinstitute/cromwell/issues/4303]

This is a curious case of premature optimization. So, in the meantime, this 
means that the fallback is the exponential backoff routine for the status 
checks on the batch job. Unfortunately, when the concurrency of Airflow jobs is 
very high (100's of tasks), this fallback polling hits the AWS Batch API too 
hard and the AWS API throttle throws an error, which fails the Airflow task, 
simply because the status is polled too frequently.

Check the output from the retry algorithm, e.g. within the first 10 retries, 
the status of an AWS batch job is checked about 10 times at a rate that is 
approx 1 retry/sec. When an Airflow instance is running 10's or 100's of 
concurrent batch jobs, this hits the API too frequently and crashes the Airflow 
task (plus it occupies a worker in too much busy work).
{code:java}
In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
 Out[4]: 
 [1.0,
 1.01,
 1.04,
 1.09,
 1.1601,
 1.25,
 1.36,
 1.4902,
 1.6401,
 1.81,
 2.0,
 2.21,
 2.4404,
 2.6904,
 2.9604,
 3.25,
 3.5605,
 3.8906,
 4.24,
 4.61]{code}
Possible solutions are to introduce an initial sleep (say 60 sec?) right after 
issuing the request, so that the batch job has some time to spin up. The job 
progresses through a few phases before it gets to the RUNNING state, and 
polling for each phase of that sequence might help. Since batch jobs tend to be 
long-running jobs (rather than near-real-time jobs), it might help to issue 
less frequent polls when the job is in the RUNNING state. Something on the 
order of tens of seconds might be reasonable for batch jobs? Maybe the class 
could expose a parameter for the rate of polling (or a callable)?


> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Assignee: Darren Weber
>Priority: Major
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged 

[GitHub] [airflow] potiuk commented on issue #5790: [AIRFLOW-5180] Added static checks (yamllint) + auto-licences for yaml

2019-08-14 Thread GitBox
potiuk commented on issue #5790:  [AIRFLOW-5180] Added static checks (yamllint) 
+ auto-licences for yaml
URL: https://github.com/apache/airflow/pull/5790#issuecomment-521513998
 
 
   Part of the static checks dealing with yaml (yamllint + consistent licenses). 
Removed the chain of depending commits.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Resolved] (AIRFLOW-5161) Add pre-commit hooks to run static checks for only changed files

2019-08-14 Thread Jarek Potiuk (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jarek Potiuk resolved AIRFLOW-5161.
---
   Resolution: Fixed
Fix Version/s: 1.10.5

> Add pre-commit hooks to run static checks for only changed files
> 
>
> Key: AIRFLOW-5161
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5161
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: ci
>Affects Versions: 2.0.0
>Reporter: Jarek Potiuk
>Priority: Major
> Fix For: 1.10.5
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5161) Add pre-commit hooks to run static checks for only changed files

2019-08-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907824#comment-16907824
 ] 

ASF subversion and git services commented on AIRFLOW-5161:
--

Commit df4dc31ea109b4a6b832a9d6b3a4d54e1efd6e5a in airflow's branch 
refs/heads/v1-10-test from Jarek Potiuk
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=df4dc31 ]

[AIRFLOW-5161] Static checks are run automatically in pre-commit hooks (#5777)


> Add pre-commit hooks to run static checks for only changed files
> 
>
> Key: AIRFLOW-5161
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5161
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: ci
>Affects Versions: 2.0.0
>Reporter: Jarek Potiuk
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] potiuk commented on issue #5786: [AIRFLOW-5170] Fix encoding pragmas, consistent licences for python files and related pylint fixes

2019-08-14 Thread GitBox
potiuk commented on issue #5786:  [AIRFLOW-5170] Fix encoding pragmas, 
consistent licences for python files and related pylint fixes
URL: https://github.com/apache/airflow/pull/5786#issuecomment-521509287
 
 
   @ashb @dimberman @Fokko  -> this is the first additional set of checks (for 
python files) added after merging the pylint/mypy/flake checks in pre-commit. 
It will make our python code much more consistent (and fixes/disables a lot of 
pylint errors). We also have a script that can refresh pylint_todo.txt


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] derrick-mink-sp opened a new pull request #5826: Sailpoint internal/pod aliases

2019-08-14 Thread GitBox
derrick-mink-sp opened a new pull request #5826: Sailpoint internal/pod aliases
URL: https://github.com/apache/airflow/pull/5826
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [X] My PR addresses the following [Airflow Jira]
 - https://issues.apache.org/jira/browse/AIRFLOW-5221
   
   ### Description
   
   - [X] Here are some details about my PR, including screenshots of any UI 
changes:
 - This PR will give users the ability to add DNS entries to their 
Kubernetes pods via hostAliases 
   ### Tests
   
   - [X] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
 tests/minikube/test_kubernetes_pod_operator.py
- test_host_aliases
   ### Commits
   
   - [] My commits all reference Jira issues in their subject lines, and I have 
squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [X] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards-incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   ### Code Quality
   
   - [ ] Passes `flake8`
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (AIRFLOW-5221) Add host alias support to the KubernetesPodOperator

2019-08-14 Thread Derrick Mink (JIRA)
Derrick Mink created AIRFLOW-5221:
-

 Summary: Add host alias support to the KubernetesPodOperator
 Key: AIRFLOW-5221
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5221
 Project: Apache Airflow
  Issue Type: Improvement
  Components: operators
Affects Versions: 1.10.4
Reporter: Derrick Mink
Assignee: Derrick Mink


[https://kubernetes.io/docs/concepts/services-networking/add-entries-to-pod-etc-hosts-with-host-aliases/]

The only way to manage DNS entries for Kubernetes pods is through host 
aliases.
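
Presumably (judging from the linked Kubernetes docs and PR #5826) usage would 
look something like the sketch below; the host_aliases parameter is this 
issue's proposal, not an existing API, and V1HostAlias from the official 
Kubernetes Python client is one plausible shape for its items:
{code}
from kubernetes.client import V1HostAlias

from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

task = KubernetesPodOperator(
    task_id="with-host-aliases",
    name="with-host-aliases",
    namespace="default",
    image="busybox",
    cmds=["cat", "/etc/hosts"],
    # host_aliases is the new parameter proposed by AIRFLOW-5221
    host_aliases=[V1HostAlias(ip="127.0.0.1",
                              hostnames=["foo.local", "bar.local"])],
)
{code}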



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (AIRFLOW-5220) Easy form to create airflow dags

2019-08-14 Thread huangyan (JIRA)
huangyan created AIRFLOW-5220:
-

 Summary: Easy form to create airflow dags
 Key: AIRFLOW-5220
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5220
 Project: Apache Airflow
  Issue Type: New Feature
  Components: DAG, database
Affects Versions: 1.10.5
Reporter: huangyan
Assignee: huangyan


Airflow has a high barrier to entry: the user must write a Python DAG file. 
However, many users don't write Python; they want to create DAGs directly 
from forms.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (AIRFLOW-5219) Alarm if the task is not executed within the expected time range.

2019-08-14 Thread huangyan (JIRA)
huangyan created AIRFLOW-5219:
-

 Summary: Alarm if the task is not executed within the expected 
time range.
 Key: AIRFLOW-5219
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5219
 Project: Apache Airflow
  Issue Type: New Feature
  Components: DAG
Affects Versions: 1.10.4
Reporter: huangyan
Assignee: huangyan
 Fix For: 1.10.4


When using Airflow, a user may have an expected time range for a task. Outside 
this range, the user expects to get an alert instead of having the task run 
directly. 

They may not want the task to be executed automatically, preferring to run the 
task manually after analyzing the cause.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-08-14 Thread Darren Weber (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907774#comment-16907774
 ] 

Darren Weber commented on AIRFLOW-5218:
---

There is something weird in the polling logs. The timestamps in the logs 
indicate that the retry polling interval is not what the log line claims: it 
reports the retry attempt count as the number of seconds (which it is not).
{noformat}
[2019-08-15 02:33:57,163] {awsbatch_operator.py:103} INFO - AWS Batch Job started: ...
[2019-08-15 02:33:57,166] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 0 seconds
[2019-08-15 02:33:58,284] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 1 seconds
[2019-08-15 02:33:59,412] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 2 seconds
[2019-08-15 02:34:00,568] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 3 seconds
[2019-08-15 02:34:01,866] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 4 seconds
[2019-08-15 02:34:03,140] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 5 seconds
[2019-08-15 02:34:04,695] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 6 seconds
[2019-08-15 02:34:06,165] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 7 seconds
[2019-08-15 02:34:07,764] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 8 seconds
[2019-08-15 02:34:09,514] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 9 seconds
[2019-08-15 02:34:11,440] {awsbatch_operator.py:137} INFO - AWS Batch retry in the next 10 seconds
{noformat}
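
A reconstruction of what appears to be happening (based on the timestamps above 
and the backoff formula quoted in the description; this is an illustration, not 
the actual awsbatch_operator.py code): the log line reports the attempt counter, 
while the sleep uses the computed pause, which stays near one second for the 
first ten attempts.
{code}
import time

for retries in range(11):
    pause = 1 + pow(retries * 0.1, 2)  # actual sleep: 1.0 .. 2.0 seconds
    print("AWS Batch retry in the next %d seconds" % retries)  # reports the counter
    time.sleep(pause)  # matches the ~1-2 s gaps between the log timestamps
{code}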

> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Assignee: Darren Weber
>Priority: Major
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. So, in the meantime, this 
> means that the fallback is the exponential backoff routine for the status 
> checks on the batch job. Unfortunately, when the concurrency of Airflow jobs 
> is very high (100's of tasks), this fallback polling hits the AWS Batch API 
> too hard and the AWS API throttle throws an error, which fails the Airflow 
> task, simply because the status is polled too frequently.
> Check the output from the retry algorithm, e.g. within the first 10 retries, 
> the status of an AWS batch job is checked about 10 times at a rate that is 
> approx 1 retry/sec. When an Airflow instance is running 10's or 100's of 
> concurrent batch jobs, this hits the API too frequently and crashes the 
> Airflow task (plus it occupies a worker in too much busy work).
> {code:java}
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
>  Out[4]: 
>  [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1601,
>  1.25,
>  1.36,
>  1.4902,
>  1.6401,
>  1.81,
>  2.0,
>  2.21,
>  2.4404,
>  2.6904,
>  2.9604,
>  3.25,
>  3.5605,
>  3.8906,
>  4.24,
>  4.61]{code}
> Possible solutions are to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through a few phases before it gets to the RUNNING state, 
> and polling for each phase of that sequence might help. Since batch jobs tend 
> to be long-running jobs (rather than near-real-time jobs), it might help to 
> issue less frequent polls when the job is in the RUNNING state. Something on 
> the order of tens of seconds might be reasonable for batch jobs? Maybe the 
> class could expose a parameter for the rate of polling (or a callable)?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-08-14 Thread Darren Weber (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darren Weber reassigned AIRFLOW-5218:
-

Assignee: Darren Weber

> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Assignee: Darren Weber
>Priority: Major
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. So, in the meantime, this 
> means that the fallback is the exponential backoff routine for the status 
> checks on the batch job. Unfortunately, when the concurrency of Airflow jobs 
> is very high (100's of tasks), this fallback polling hits the AWS Batch API 
> too hard and the AWS API throttle throws an error, which fails the Airflow 
> task, simply because the status is polled too frequently.
> Check the output from the retry algorithm, e.g. within the first 10 retries, 
> the status of an AWS batch job is checked about 10 times at a rate that is 
> approx 1 retry/sec. When an Airflow instance is running 10's or 100's of 
> concurrent batch jobs, this hits the API too frequently and crashes the 
> Airflow task (plus it occupies a worker in too much busy work).
> {code:java}
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
>  Out[4]: 
>  [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1601,
>  1.25,
>  1.36,
>  1.4902,
>  1.6401,
>  1.81,
>  2.0,
>  2.21,
>  2.4404,
>  2.6904,
>  2.9604,
>  3.25,
>  3.5605,
>  3.8906,
>  4.24,
>  4.61]{code}
> Possible solutions are to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through a few phases before it gets to the RUNNING state, 
> and polling for each phase of that sequence might help. Since batch jobs tend 
> to be long-running jobs (rather than near-real-time jobs), it might help to 
> issue less frequent polls when the job is in the RUNNING state. Something on 
> the order of tens of seconds might be reasonable for batch jobs? Maybe the 
> class could expose a parameter for the rate of polling (or a callable)?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Comment Edited] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-08-14 Thread Darren Weber (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907749#comment-16907749
 ] 

Darren Weber edited comment on AIRFLOW-5218 at 8/15/19 2:15 AM:


PR at [https://github.com/apache/airflow/pull/5825] applies the following 
suggestion.

Even bumping the backoff factor from `0.1` to `0.3` might help, e.g.
{code:java}
from datetime import datetime
from time import sleep

for retries in range(10):
    pause = 1 + pow(retries * 0.3, 2)
    print(f"{datetime.now()}: retry ({retries:04d}) sleeping for {pause:6.2f} sec")
    sleep(pause)

2019-08-14 19:02:58.745923: retry (0000) sleeping for 1.00 sec
2019-08-14 19:02:59.747635: retry (0001) sleeping for 1.09 sec
2019-08-14 19:03:00.840129: retry (0002) sleeping for 1.36 sec
2019-08-14 19:03:02.202734: retry (0003) sleeping for 1.81 sec
2019-08-14 19:03:04.015686: retry (0004) sleeping for 2.44 sec
2019-08-14 19:03:06.458972: retry (0005) sleeping for 3.25 sec
2019-08-14 19:03:09.713452: retry (0006) sleeping for 4.24 sec
2019-08-14 19:03:13.954253: retry (0007) sleeping for 5.41 sec
2019-08-14 19:03:19.368445: retry (0008) sleeping for 6.76 sec
2019-08-14 19:03:26.135600: retry (0009) sleeping for 8.29 sec

{code}


was (Author: dazza):
Even bumping the backoff factor from `0.1` to `0.3` might help, e.g.
{code:java}
from datetime import datetime
from time import sleep

for retries in range(10):
    pause = 1 + pow(retries * 0.3, 2)
    print(f"{datetime.now()}: retry ({retries:04d}) sleeping for {pause:6.2f} sec")
    sleep(pause)

2019-08-14 19:02:58.745923: retry (0000) sleeping for 1.00 sec
2019-08-14 19:02:59.747635: retry (0001) sleeping for 1.09 sec
2019-08-14 19:03:00.840129: retry (0002) sleeping for 1.36 sec
2019-08-14 19:03:02.202734: retry (0003) sleeping for 1.81 sec
2019-08-14 19:03:04.015686: retry (0004) sleeping for 2.44 sec
2019-08-14 19:03:06.458972: retry (0005) sleeping for 3.25 sec
2019-08-14 19:03:09.713452: retry (0006) sleeping for 4.24 sec
2019-08-14 19:03:13.954253: retry (0007) sleeping for 5.41 sec
2019-08-14 19:03:19.368445: retry (0008) sleeping for 6.76 sec
2019-08-14 19:03:26.135600: retry (0009) sleeping for 8.29 sec

{code}

> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Priority: Major
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. So, in the meantime, this 
> means that the fallback is the exponential backoff routine for the status 
> checks on the batch job. Unfortunately, when the concurrency of Airflow jobs 
> is very high (100's of tasks), this fallback polling hits the AWS Batch API 
> too hard and the AWS API throttle throws an error, which fails the Airflow 
> task, simply because the status is polled too frequently.
> Check the output from the retry algorithm, e.g. within the first 10 retries, 
> the status of an AWS batch job is checked about 10 times at a rate that is 
> approx 1 retry/sec. When an Airflow instance is running 10's or 100's of 
> concurrent batch jobs, this hits the API too frequently and crashes the 
> Airflow task (plus it occupies a worker in too much busy work).
> {code:java}
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
>  Out[4]: 
>  [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1601,
>  1.25,
>  1.36,
>  1.4902,
>  1.6401,
>  1.81,
>  2.0,
>  2.21,
>  2.4404,
>  2.6904,
>  2.9604,
>  3.25,
>  3.5605,
>  3.8906,
>  4.24,
>  4.61]{code}
> Possible solutions are to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through a few phases before it gets to the RUNNING state, 
> and polling for each phase of that sequence might help. Since batch jobs tend 
> to be long-running jobs (rather than near-real-time jobs), it might help to 
> issue less frequent polls when the job is in the RUNNING state. Something on 
> the order of tens of seconds might be reasonable for batch jobs? Maybe the 
> class could expose a parameter for the rate of polling (or a callable)?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-08-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907762#comment-16907762
 ] 

ASF GitHub Bot commented on AIRFLOW-5218:
-

darrenleeweber commented on pull request #5825: [AIRFLOW-5218] less polling for 
AWS Batch status
URL: https://github.com/apache/airflow/pull/5825
 
 
   ### Jira
   
   - [x] My PR addresses the following [Airflow Jira]
   - https://issues.apache.org/jira/browse/AIRFLOW-5218
   
   ### Description
   
   - [x] Here are some details about my PR, including screenshots of any UI 
changes:
   - a small increase in the backoff factor could avoid excessive polling
   - avoid the AWS API throttle limits for highly concurrent tasks
   
   ### Tests
   
   - [ ] My PR does not need testing for this extremely good reason:
   - it's the smallest possible change that might address the issue
   - the change does not impact any public API
   - if there are tests on the polling interval (or should be), LMK
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines
   - it's just one commit
   - the commit message is succinct, LMK if you want it amended
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - no changes required to documentation
   
   ### Code Quality
   
   - [ ] Passes `flake8`
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Priority: Major
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. So, in the meantime, this 
> means that the fallback is the exponential backoff routine for the status 
> checks on the batch job. Unfortunately, when the concurrency of Airflow jobs 
> is very high (100's of tasks), this fallback polling hits the AWS Batch API 
> too hard and the AWS API throttle throws an error, which fails the Airflow 
> task, simply because the status is polled too frequently.
> Check the output from the retry algorithm, e.g. within the first 10 retries, 
> the status of an AWS batch job is checked about 10 times at a rate that is 
> approx 1 retry/sec. When an Airflow instance is running 10's or 100's of 
> concurrent batch jobs, this hits the API too frequently and crashes the 
> Airflow task (plus it occupies a worker in too much busy work).
> {code:java}
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
>  Out[4]: 
>  [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1601,
>  1.25,
>  1.36,
>  1.4902,
>  1.6401,
>  1.81,
>  2.0,
>  2.21,
>  2.4404,
>  2.6904,
>  2.9604,
>  3.25,
>  3.5605,
>  3.8906,
>  4.24,
>  4.61]{code}
> Possible solutions are to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through a few phases before it gets to the RUNNING state, 
> and polling for each phase of that sequence might help. Since batch jobs tend 
> to be long-running jobs (rather than near-real-time jobs), it might help to 
> issue less frequent polls when the job is in the RUNNING state. Something on 
> the order of tens of seconds might be reasonable for batch jobs? Maybe the 
> class could expose a parameter for the rate of polling (or a callable)?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] darrenleeweber opened a new pull request #5825: [AIRFLOW-5218] less polling for AWS Batch status

2019-08-14 Thread GitBox
darrenleeweber opened a new pull request #5825: [AIRFLOW-5218] less polling for 
AWS Batch status
URL: https://github.com/apache/airflow/pull/5825
 
 
   ### Jira
   
   - [x] My PR addresses the following [Airflow Jira]
   - https://issues.apache.org/jira/browse/AIRFLOW-5218
   
   ### Description
   
   - [x] Here are some details about my PR, including screenshots of any UI 
changes:
   - a small increase in the backoff factor could avoid excessive polling
   - avoid the AWS API throttle limits for highly concurrent tasks
   
   ### Tests
   
   - [ ] My PR does not need testing for this extremely good reason:
   - it's the smallest possible change that might address the issue
   - the change does not impact any public API
   - if there are tests on the polling interval (or should be), LMK
   
   ### Commits
   
   - [x] My commits all reference Jira issues in their subject lines
   - it's just one commit
   - the commit message is succinct, LMK if you want it amended
   
   ### Documentation
   
   - [x] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - no changes required to documentation
   
   ### Code Quality
   
   - [ ] Passes `flake8`
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Comment Edited] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-08-14 Thread Darren Weber (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907749#comment-16907749
 ] 

Darren Weber edited comment on AIRFLOW-5218 at 8/15/19 2:04 AM:


Even bumping the backoff factor from `0.1` to `0.3` might help, e.g.
{code:java}
from datetime import datetime
from time import sleep

for retries in range(10):
    pause = 1 + pow(retries * 0.3, 2)
    print(f"{datetime.now()}: retry ({retries:04d}) sleeping for {pause:6.2f} sec")
    sleep(pause)

2019-08-14 19:02:58.745923: retry (0000) sleeping for 1.00 sec
2019-08-14 19:02:59.747635: retry (0001) sleeping for 1.09 sec
2019-08-14 19:03:00.840129: retry (0002) sleeping for 1.36 sec
2019-08-14 19:03:02.202734: retry (0003) sleeping for 1.81 sec
2019-08-14 19:03:04.015686: retry (0004) sleeping for 2.44 sec
2019-08-14 19:03:06.458972: retry (0005) sleeping for 3.25 sec
2019-08-14 19:03:09.713452: retry (0006) sleeping for 4.24 sec
2019-08-14 19:03:13.954253: retry (0007) sleeping for 5.41 sec
2019-08-14 19:03:19.368445: retry (0008) sleeping for 6.76 sec
2019-08-14 19:03:26.135600: retry (0009) sleeping for 8.29 sec

{code}


was (Author: dazza):
Even bumping the backoff factor from `0.1` to `0.3` might help, e.g.
{code}
from datetime import datetime
from time import sleep

In [18]: for i in [1 + pow(retries * 0.3, 2) for retries in range(10)]:
    ...:     print(f"{datetime.now()}: sleeping for {i}")
    ...:     sleep(i)
    ...:

2019-08-14 18:52:01.688705: sleeping for 1.0
2019-08-14 18:52:02.690385: sleeping for 1.09
2019-08-14 18:52:03.781384: sleeping for 1.3599
2019-08-14 18:52:05.144492: sleeping for 1.8098
2019-08-14 18:52:06.956547: sleeping for 2.44
2019-08-14 18:52:09.401454: sleeping for 3.25
2019-08-14 18:52:12.652212: sleeping for 4.239
2019-08-14 18:52:16.897060: sleeping for 5.41
2019-08-14 18:52:22.313692: sleeping for 6.76
2019-08-14 18:52:29.082087: sleeping for 8.29
{code}

> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Priority: Major
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also [https://github.com/broadinstitute/cromwell/issues/4303]
> This is a curious case of premature optimization. So, in the meantime, this 
> means that the fallback is the exponential backoff routine for the status 
> checks on the batch job. Unfortunately, when the concurrency of Airflow jobs 
> is very high (100's of tasks), this fallback polling hits the AWS Batch API 
> too hard and the AWS API throttle throws an error, which fails the Airflow 
> task, simply because the status is polled too frequently.
> Check the output from the retry algorithm, e.g. within the first 10 retries, 
> the status of an AWS batch job is checked about 10 times at a rate that is 
> approx 1 retry/sec. When an Airflow instance is running 10's or 100's of 
> concurrent batch jobs, this hits the API too frequently and crashes the 
> Airflow task (plus it occupies a worker in too much busy work).
> {code:java}
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
>  Out[4]: 
>  [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1601,
>  1.25,
>  1.36,
>  1.4902,
>  1.6401,
>  1.81,
>  2.0,
>  2.21,
>  2.4404,
>  2.6904,
>  2.9604,
>  3.25,
>  3.5605,
>  3.8906,
>  4.24,
>  4.61]{code}
> Possible solutions are to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through a few phases before it gets to the RUNNING state, 
> and polling for each phase of that sequence might help. Since batch jobs tend 
> to be long-running jobs (rather than near-real-time jobs), it might help to 
> issue less frequent polls when the job is in the RUNNING state. Something on 
> the order of tens of seconds might be reasonable for batch jobs? Maybe the 
> class could expose a parameter for the rate of polling (or a callable)?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-08-14 Thread Darren Weber (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907749#comment-16907749
 ] 

Darren Weber commented on AIRFLOW-5218:
---

Even bumping the backoff factor from `0.1` to `0.3` might help, e.g.
{code}
from datetime import datetime
from time import sleep

In [18]: for i in [1 + pow(retries * 0.3, 2) for retries in range(10)]:
    ...:     print(f"{datetime.now()}: sleeping for {i}")
    ...:     sleep(i)
    ...:

2019-08-14 18:52:01.688705: sleeping for 1.0
2019-08-14 18:52:02.690385: sleeping for 1.09
2019-08-14 18:52:03.781384: sleeping for 1.3599
2019-08-14 18:52:05.144492: sleeping for 1.8098
2019-08-14 18:52:06.956547: sleeping for 2.44
2019-08-14 18:52:09.401454: sleeping for 3.25
2019-08-14 18:52:12.652212: sleeping for 4.239
2019-08-14 18:52:16.897060: sleeping for 5.41
2019-08-14 18:52:22.313692: sleeping for 6.76
2019-08-14 18:52:29.082087: sleeping for 8.29
{code}

> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Priority: Major
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
> - https://github.com/boto/botocore/pull/1307
> - see also https://github.com/broadinstitute/cromwell/issues/4303
> This is a curious case of premature optimization.  So, in the meantime, this 
> means that the fallback is the exponential backoff routine for the status 
> checks on the batch job.  Unfortunately, when the concurrency of Airflow jobs 
> is very high (100's of tasks), this fallback polling hits the AWS Batch API 
> too hard and the AWS API throttle throws an error, which fails the Airflow 
> task, simply because the status is polled too frequently.
> Check the output from the retry algorithm, e.g. within the first 10 retries, 
> the status of an AWS batch job is checked about 10 times at a rate that is 
> approx 1 retry/sec.  When an Airflow instance is running 10's or 100's of 
> concurrent batch jobs, this hits the API too frequently and crashes the 
> Airflow task (plus it occupies a worker in too much busy work).
> In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)]
> Out[4]: 
> [1.0,
>  1.01,
>  1.04,
>  1.09,
>  1.1601,
>  1.25,
>  1.36,
>  1.4902,
>  1.6401,
>  1.81,
>  2.0,
>  2.21,
>  2.4404,
>  2.6904,
>  2.9604,
>  3.25,
>  3.5605,
>  3.8906,
>  4.24,
>  4.61]
> Possible solutions are to introduce an initial sleep (say 60 sec?) right 
> after issuing the request, so that the batch job has some time to spin up. 
> The job progresses through a few phases before it gets to the RUNNING state, 
> and polling for each phase of that sequence might help. Since batch jobs tend 
> to be long-running jobs (rather than near-real-time jobs), it might help to 
> issue less frequent polls when the job is in the RUNNING state. Something on 
> the order of tens of seconds might be reasonable for batch jobs? Maybe the 
> class could expose a parameter for the rate of polling (or a callable)?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-08-14 Thread Darren Weber (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5218?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Darren Weber updated AIRFLOW-5218:
--
Description: 
The AWS Batch Operator attempts to use a boto3 feature that is not available 
and has not been merged in years, see
 - [https://github.com/boto/botocore/pull/1307]
 - see also [https://github.com/broadinstitute/cromwell/issues/4303]

This is a curious case of premature optimization. So, in the meantime, this 
means that the fallback is the exponential backoff routine for the status 
checks on the batch job. Unfortunately, when the concurrency of Airflow jobs is 
very high (100's of tasks), this fallback polling hits the AWS Batch API too 
hard and the AWS API throttle throws an error, which fails the Airflow task, 
simply because the status is polled too frequently.

Check the output from the retry algorithm, e.g. within the first 10 retries, 
the status of an AWS batch job is checked about 10 times at a rate that is 
approx 1 retry/sec. When an Airflow instance is running 10's or 100's of 
concurrent batch jobs, this hits the API too frequently and crashes the Airflow 
task (plus it occupies a worker in too much busy work).
{code:java}
In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)] 
 Out[4]: 
 [1.0,
 1.01,
 1.04,
 1.09,
 1.1601,
 1.25,
 1.36,
 1.4902,
 1.6401,
 1.81,
 2.0,
 2.21,
 2.4404,
 2.6904,
 2.9604,
 3.25,
 3.5605,
 3.8906,
 4.24,
 4.61]{code}
Possible solutions are to introduce an initial sleep (say 60 sec?) right after 
issuing the request, so that the batch job has some time to spin up. The job 
progresses through a few phases before it gets to the RUNNING state, and 
polling for each phase of that sequence might help. Since batch jobs tend to be 
long-running jobs (rather than near-real-time jobs), it might help to issue 
less frequent polls when the job is in the RUNNING state. Something on the 
order of tens of seconds might be reasonable for batch jobs? Maybe the class 
could expose a parameter for the rate of polling (or a callable)?

  was:
The AWS Batch Operator attempts to use a boto3 feature that is not available 
and has not been merged in years, see

- https://github.com/boto/botocore/pull/1307
- see also https://github.com/broadinstitute/cromwell/issues/4303

This is a curious case of premature optimization.  So, in the meantime, this 
means that the fallback is the exponential backoff routine for the status 
checks on the batch job.  Unfortunately, when the concurrency of Airflow jobs 
is very high (100's of tasks), this fallback polling hits the AWS Batch API too 
hard and the AWS API throttle throws an error, which fails the Airflow task, 
simply because the status is polled too frequently.

Check the output from the retry algorithm, e.g. within the first 10 retries, 
the status of an AWS batch job is checked about 10 times at a rate that is 
approx 1 retry/sec.  When an Airflow instance is running 10's or 100's of 
concurrent batch jobs, this hits the API too frequently and crashes the Airflow 
task (plus it occupies a worker in too much busy work).

In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)]
Out[4]: 
[1.0,
 1.01,
 1.04,
 1.09,
 1.1601,
 1.25,
 1.36,
 1.4902,
 1.6401,
 1.81,
 2.0,
 2.21,
 2.4404,
 2.6904,
 2.9604,
 3.25,
 3.5605,
 3.8906,
 4.24,
 4.61]


Possible solutions are to introduce an initial sleep (say 60 sec?) right after 
issuing the request, so that the batch job has some time to spin up. The job 
progresses through a few phases before it gets to the RUNNING state, and 
polling for each phase of that sequence might help. Since batch jobs tend to be 
long-running jobs (rather than near-real-time jobs), it might help to issue 
less frequent polls when the job is in the RUNNING state. Something on the 
order of tens of seconds might be reasonable for batch jobs? Maybe the class 
could expose a parameter for the rate of polling (or a callable)?



> AWS Batch Operator - status polling too often, esp. for high concurrency
> 
>
> Key: AIRFLOW-5218
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: aws, contrib
>Affects Versions: 1.10.4
>Reporter: Darren Weber
>Priority: Major
>
> The AWS Batch Operator attempts to use a boto3 feature that is not available 
> and has not been merged in years, see
>  - [https://github.com/boto/botocore/pull/1307]
>  - see also 

[jira] [Updated] (AIRFLOW-5170) Add static checks for encoding pragma, consistent licences for python files and related pylint fixes

2019-08-14 Thread Jarek Potiuk (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5170?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jarek Potiuk updated AIRFLOW-5170:
--
Description: 
Automated checks for encoding pragmas and consistent licence headers can be 
added for python files.

Since we have pylint checks added in pre-commit, we should also make sure to 
fix all pylint-related issues in all the changed python files.

  was: Automated check for encoding pragma can be easily added. Since we have 
pylint checks added in pre-commit, we should also make sure to fix all 
pylint-related issues.

Summary: Add static checks for encoding pragma, consistent licences for 
python files and related pylint fixes  (was: Add static checks for encoding 
pragma (and related pylint fixes))

> Add static checks for encoding pragma, consistent licences for python files 
> and related pylint fixes
> 
>
> Key: AIRFLOW-5170
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5170
> Project: Apache Airflow
>  Issue Type: Sub-task
>  Components: ci
>Affects Versions: 2.0.0
>Reporter: Jarek Potiuk
>Assignee: Jarek Potiuk
>Priority: Major
>
> Automated checks for encoding pragmas and consistent licence headers can be 
> added for python files.
> Since we have pylint checks added in pre-commit, we should also make sure to 
> fix all pylint-related issues in all the changed python files.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (AIRFLOW-5218) AWS Batch Operator - status polling too often, esp. for high concurrency

2019-08-14 Thread Darren Weber (JIRA)
Darren Weber created AIRFLOW-5218:
-

 Summary: AWS Batch Operator - status polling too often, esp. for 
high concurrency
 Key: AIRFLOW-5218
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5218
 Project: Apache Airflow
  Issue Type: Improvement
  Components: aws, contrib
Affects Versions: 1.10.4
Reporter: Darren Weber


The AWS Batch Operator attempts to use a boto3 feature that is not available 
and has not been merged in years, see

- https://github.com/boto/botocore/pull/1307
- see also https://github.com/broadinstitute/cromwell/issues/4303

This is a curious case of premature optimization.  So, in the meantime, this 
means that the fallback is the exponential backoff routine for the status 
checks on the batch job.  Unfortunately, when the concurrency of Airflow jobs 
is very high (100's of tasks), this fallback polling hits the AWS Batch API too 
hard and the AWS API throttle throws an error, which fails the Airflow task, 
simply because the status is polled too frequently.

Check the output from the retry algorithm, e.g. within the first 10 retries, 
the status of an AWS batch job is checked about 10 times at a rate that is 
approx 1 retry/sec.  When an Airflow instance is running 10's or 100's of 
concurrent batch jobs, this hits the API too frequently and crashes the Airflow 
task (plus it occupies a worker in too much busy work).

In [4]: [1 + pow(retries * 0.1, 2) for retries in range(20)]
Out[4]: 
[1.0,
 1.01,
 1.04,
 1.09,
 1.1601,
 1.25,
 1.36,
 1.4902,
 1.6401,
 1.81,
 2.0,
 2.21,
 2.4404,
 2.6904,
 2.9604,
 3.25,
 3.5605,
 3.8906,
 4.24,
 4.61]


Possible solutions are to introduce an initial sleep (say 60 sec?) right after 
issuing the request, so that the batch job has some time to spin up. The job 
progresses through a few phases before it gets to the RUNNING state, and 
polling for each phase of that sequence might help. Since batch jobs tend to be 
long-running jobs (rather than near-real-time jobs), it might help to issue 
less frequent polls when the job is in the RUNNING state. Something on the 
order of tens of seconds might be reasonable for batch jobs? Maybe the class 
could expose a parameter for the rate of polling (or a callable)?




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Closed] (AIRFLOW-5207) Mark Success and Mark Failed views error out due to DAG reassignment

2019-08-14 Thread Marcus Levine (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Marcus Levine closed AIRFLOW-5207.
--
Resolution: Not A Problem

This turned out to be an issue with one of our plugins

> Mark Success and Mark Failed views error out due to DAG reassignment
> 
>
> Key: AIRFLOW-5207
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5207
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: ui
>Affects Versions: 1.10.4
>Reporter: Marcus Levine
>Assignee: Marcus Levine
>Priority: Major
> Fix For: 1.10.5
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When trying to clear a task after upgrading to 1.10.4, I get the following 
> traceback:
> {code:java}
> File "/usr/local/lib/python3.7/site-packages/airflow/www/views.py", line 
> 1451, in failed future, past, State.FAILED) File 
> "/usr/local/lib/python3.7/site-packages/airflow/www/views.py", line 1396, in 
> _mark_task_instance_state task.dag = dag File 
> "/usr/local/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 
> 509, in dag "The DAG assigned to {} can not be changed.".format(self)) 
> airflow.exceptions.AirflowException: The DAG assigned to 
>  can not be changed.{code}
> This should be a simple fix by either dropping the offending line, or if it 
> is required to keep things working, just set the private attribute instead:
> {code:java}
> task._dag = dag
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5207) Mark Success and Mark Failed views error out due to DAG reassignment

2019-08-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907716#comment-16907716
 ] 

ASF GitHub Bot commented on AIRFLOW-5207:
-

marcusianlevine commented on pull request #5811: [AIRFLOW-5207] Fix Mark 
Success and Failure views
URL: https://github.com/apache/airflow/pull/5811
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Mark Success and Mark Failed views error out due to DAG reassignment
> 
>
> Key: AIRFLOW-5207
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5207
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: ui
>Affects Versions: 1.10.4
>Reporter: Marcus Levine
>Assignee: Marcus Levine
>Priority: Major
> Fix For: 1.10.5
>
>   Original Estimate: 1h
>  Remaining Estimate: 1h
>
> When trying to clear a task after upgrading to 1.10.4, I get the following 
> traceback:
> {code:java}
> File "/usr/local/lib/python3.7/site-packages/airflow/www/views.py", line 
> 1451, in failed future, past, State.FAILED) File 
> "/usr/local/lib/python3.7/site-packages/airflow/www/views.py", line 1396, in 
> _mark_task_instance_state task.dag = dag File 
> "/usr/local/lib/python3.7/site-packages/airflow/models/baseoperator.py", line 
> 509, in dag "The DAG assigned to {} can not be changed.".format(self)) 
> airflow.exceptions.AirflowException: The DAG assigned to 
>  can not be changed.{code}
> This should be a simple fix by either dropping the offending line, or if it 
> is required to keep things working, just set the private attribute instead:
> {code:java}
> task._dag = dag
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] marcusianlevine commented on issue #5811: [AIRFLOW-5207] Fix Mark Success and Failure views

2019-08-14 Thread GitBox
marcusianlevine commented on issue #5811: [AIRFLOW-5207] Fix Mark Success and 
Failure views
URL: https://github.com/apache/airflow/pull/5811#issuecomment-521479241
 
 
   Nevermind, this turned out to be an issue with one of our dynamic DAG plugins


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] marcusianlevine closed pull request #5811: [AIRFLOW-5207] Fix Mark Success and Failure views

2019-08-14 Thread GitBox
marcusianlevine closed pull request #5811: [AIRFLOW-5207] Fix Mark Success and 
Failure views
URL: https://github.com/apache/airflow/pull/5811
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] dossett commented on issue #5419: [AIRFLOW-XXXX] Update pydoc of mlengine_operator

2019-08-14 Thread GitBox
dossett commented on issue #5419: [AIRFLOW-XXXX] Update pydoc of 
mlengine_operator
URL: https://github.com/apache/airflow/pull/5419#issuecomment-521474769
 
 
   Thanks @mik-laj, comment updated


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] potiuk merged pull request #5777: [AIRFLOW-5161] Static checks are run automatically in pre-commit hooks

2019-08-14 Thread GitBox
potiuk merged pull request #5777: [AIRFLOW-5161] Static checks are run 
automatically in pre-commit hooks
URL: https://github.com/apache/airflow/pull/5777
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-5161) Add pre-commit hooks to run static checks for only changed files

2019-08-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907710#comment-16907710
 ] 

ASF subversion and git services commented on AIRFLOW-5161:
--

Commit 70e937a8d8ff308a9fb9055ceb7ef2c034200b36 in airflow's branch 
refs/heads/master from Jarek Potiuk
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=70e937a ]

[AIRFLOW-5161] Static checks are run automatically in pre-commit hooks (#5777)



> Add pre-commit hooks to run static checks for only changed files
> 
>
> Key: AIRFLOW-5161
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5161
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: ci
>Affects Versions: 2.0.0
>Reporter: Jarek Potiuk
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5161) Add pre-commit hooks to run static checks for only changed files

2019-08-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5161?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907709#comment-16907709
 ] 

ASF GitHub Bot commented on AIRFLOW-5161:
-

potiuk commented on pull request #5777: [AIRFLOW-5161] Static checks are run 
automatically in pre-commit hooks
URL: https://github.com/apache/airflow/pull/5777
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add pre-commit hooks to run static checks for only changed files
> 
>
> Key: AIRFLOW-5161
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5161
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: ci
>Affects Versions: 2.0.0
>Reporter: Jarek Potiuk
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] pgagnon commented on issue #5824: [AIRFLOW-5215] Add sidecar containers support to Pod class

2019-08-14 Thread GitBox
pgagnon commented on issue #5824: [AIRFLOW-5215] Add sidecar containers support 
to Pod class
URL: https://github.com/apache/airflow/pull/5824#issuecomment-521466853
 
 
   Test failure seems unrelated.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (AIRFLOW-5217) Fix Pod docstring

2019-08-14 Thread Philippe Gagnon (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5217?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Philippe Gagnon updated AIRFLOW-5217:
-
Description: {{Pod}} class docstring is currently out of date with regards 
to its {{__init__}} method's arguments.  (was: {{Pod}} class docstring is 
currently out of date with regards to its {{__init__}} method's docstring.)

> Fix Pod docstring
> -
>
> Key: AIRFLOW-5217
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5217
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: executors
>Affects Versions: 2.0.0
>Reporter: Philippe Gagnon
>Assignee: Philippe Gagnon
>Priority: Minor
>
> {{Pod}} class docstring is currently out of date with regards to its 
> {{__init__}} method's arguments.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (AIRFLOW-5217) Fix Pod docstring

2019-08-14 Thread Philippe Gagnon (JIRA)
Philippe Gagnon created AIRFLOW-5217:


 Summary: Fix Pod docstring
 Key: AIRFLOW-5217
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5217
 Project: Apache Airflow
  Issue Type: Improvement
  Components: executors
Affects Versions: 2.0.0
Reporter: Philippe Gagnon
Assignee: Philippe Gagnon


{{Pod}} class docstring is currently out of date with regards to its 
{{__init__}} method's docstring.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] pgagnon commented on issue #5824: [AIRFLOW-5215] Add sidecar containers support to Pod class

2019-08-14 Thread GitBox
pgagnon commented on issue #5824: [AIRFLOW-5215] Add sidecar containers support 
to Pod class
URL: https://github.com/apache/airflow/pull/5824#issuecomment-521460427
 
 
   @ashb @dimberman 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-5215) Add sidecar container support to Pod object

2019-08-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5215?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907695#comment-16907695
 ] 

ASF GitHub Bot commented on AIRFLOW-5215:
-

pgagnon commented on pull request #5824: [AIRFLOW-5215] Add sidecar containers 
support to Pod class
URL: https://github.com/apache/airflow/pull/5824
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [X] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-XXX
 - In case you are fixing a typo in the documentation you can prepend your 
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
 - In case you are proposing a fundamental code change, you need to create 
an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)).
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Description
   
   - [X] Here are some details about my PR, including screenshots of any UI 
changes:
   
   Adds a `sidecar_containers` argument to `Pod`, allowing users to pass a list 
of sidecar container definitions to add to the Pod. This is notably useful with 
the pod mutation hook.
   
   ### Tests
   
   - [X] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   - `test_extract_sidecar_containers`.
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [X] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   `Pod`'s docstring is currently not up to date. Will address in a subsequent 
PR.
   
   ### Code Quality
   
   - [X] Passes `flake8`
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add sidecar container support to Pod object
> ---
>
> Key: AIRFLOW-5215
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5215
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: scheduler
>Affects Versions: 2.0.0
>Reporter: Philippe Gagnon
>Assignee: Philippe Gagnon
>Priority: Major
>
> Add sidecar container support to Pod object.
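
As a usage sketch of the proposed argument (the sidecar definition layout below is an assumption for illustration, not taken from the PR):

{code:python}
from airflow.contrib.kubernetes.pod import Pod

pod = Pod(
    image='python:3.7-slim',
    envs={},
    cmds=['python', '-c', 'print("hello")'],
    # sidecar_containers is the new argument added by this PR;
    # the dict layout shown here is illustrative only
    sidecar_containers=[{
        'name': 'log-shipper',
        'image': 'fluent/fluent-bit:1.2',
    }],
)
{code}

Combined with the pod mutation hook mentioned in the PR description, this would let a deployment attach, for example, a log-shipping sidecar to every pod.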



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] pgagnon opened a new pull request #5824: [AIRFLOW-5215] Add sidecar containers support to Pod class

2019-08-14 Thread GitBox
pgagnon opened a new pull request #5824: [AIRFLOW-5215] Add sidecar containers 
support to Pod class
URL: https://github.com/apache/airflow/pull/5824
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [X] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-XXX
 - In case you are fixing a typo in the documentation you can prepend your 
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
 - In case you are proposing a fundamental code change, you need to create 
an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)).
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Description
   
   - [X] Here are some details about my PR, including screenshots of any UI 
changes:
   
   Adds a `sidecar_containers` argument to `Pod`, allowing users to pass a list 
of sidecar container definitions to add to the Pod. This is notably useful with 
the pod mutation hook.
   
   ### Tests
   
   - [X] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   - `test_extract_sidecar_containers`.
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [X] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   `Pod`'s docstring is currently not up to date. Will address in a subsequent 
PR.
   
   ### Code Quality
   
   - [X] Passes `flake8`
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] mik-laj opened a new pull request #5823: [AIRFLOW-XXX] Create "Using the CLI" page

2019-08-14 Thread GitBox
mik-laj opened a new pull request #5823: [AIRFLOW-XXX] Create "Using the CLI" 
page
URL: https://github.com/apache/airflow/pull/5823
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-XXX
 - In case you are fixing a typo in the documentation you can prepend your 
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
 - In case you are proposing a fundamental code change, you need to create 
an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)).
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   ### Code Quality
   
   - [ ] Passes `flake8`
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-08-14 Thread GitBox
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r314102570
 
 

 ##
 File path: airflow/models/serialized_dag.py
 ##
 @@ -0,0 +1,155 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""Serialzed DAG table in database."""
+
+import hashlib
+from typing import Any, Dict, List, Optional, TYPE_CHECKING
+from sqlalchemy import Column, Index, Integer, String, Text, and_
+from sqlalchemy.sql import exists
+
+from airflow.models.base import Base, ID_LEN
+from airflow.utils import db, timezone
+from airflow.utils.sqlalchemy import UtcDateTime
+
+
+if TYPE_CHECKING:
+from airflow.dag.serialization.serialized_dag import SerializedDAG  # 
noqa: F401, E501; # pylint: disable=cyclic-import
+from airflow.models import DAG  # noqa: F401; # pylint: 
disable=cyclic-import
+
+
+class SerializedDagModel(Base):
+"""A table for serialized DAGs.
+
+serialized_dag table is a snapshot of DAG files synchronized by scheduler.
+This feature is controlled by:
+[core] dagcached = False: enable this feature
+[core] dagcached_min_update_interval = 30 (s):
+serialized DAGs are updated in DB when a file gets processed by 
scheduler,
+to reduce DB write rate, there is a minimal interval of updating 
serialized DAGs.
+[scheduler] dag_dir_list_interval = 300 (s):
+interval of deleting serialized DAGs in DB when the files are 
deleted, suggest
+to use a smaller interval such as 60
+
+It is used by webserver to load dagbags when dagcached=True. Because 
reading from
+database is lightweight compared to importing from files, it solves the 
webserver
+scalability issue.
+"""
+__tablename__ = 'serialized_dag'
+
+dag_id = Column(String(ID_LEN), primary_key=True)
+fileloc = Column(String(2000))
+# The max length of fileloc exceeds the limit of indexing.
+fileloc_hash = Column(Integer)
+data = Column(Text)
+last_updated = Column(UtcDateTime)
+
+__table_args__ = (
+Index('idx_fileloc_hash', fileloc_hash, unique=False),
+)
+
+def __init__(self, dag):
+from airflow.dag.serialization import Serialization
+
+self.dag_id = dag.dag_id
+self.fileloc = dag.full_filepath
+self.fileloc_hash = SerializedDagModel.dag_fileloc_hash(self.fileloc)
+self.data = Serialization.to_json(dag)
 
 Review comment:
   > Either here, or inside to_json we should ensure that the JSON blob is 
valid - I want to minimize the chance of writing "odd"/invalid data into our 
DB.
   
   Done in 
https://github.com/apache/airflow/pull/5743/commits/977a2fe3fd244bc3f366a1228324f8b3c58f30ac
 . WDYT - is that OK?
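
For readers following along, the idea under discussion is a validate-before-write guard; a minimal sketch (the placement at construction time is an assumption, and the actual commit may differ):

{code:python}
import json

def validate_json(blob: str) -> str:
    """Fail fast: raise ValueError before an invalid blob reaches the database."""
    json.loads(blob)
    return blob

# hypothetical use inside SerializedDagModel.__init__:
# self.data = validate_json(Serialization.to_json(dag))
{code}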


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] mik-laj merged pull request #5776: [AIRFLOW-XXX] Group references in one section

2019-08-14 Thread GitBox
mik-laj merged pull request #5776: [AIRFLOW-XXX] Group references in one section
URL: https://github.com/apache/airflow/pull/5776
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-3333) New features enable transferring of files or data from GCS to a SFTP remote path and SFTP to GCS path.

2019-08-14 Thread Kamil Bregula (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-3333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907620#comment-16907620
 ] 

Kamil Bregula commented on AIRFLOW-3333:


[~pulinpathneja] Any progress? Maybe I can help in some way.

> New features enable transferring of files or data from GCS to a SFTP remote 
> path and SFTP to GCS path. 
> ---
>
> Key: AIRFLOW-3333
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3333
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: contrib, gcp
>Reporter: Pulin Pathneja
>Assignee: Pulin Pathneja
>Priority: Major
>
> New features enable transferring of files or data from GCS (Google Cloud 
> Storage) to an SFTP remote path, and from SFTP to a GCS (Google Cloud Storage) path. 
>   



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-4758) Add GoogleCloudStorageToGoogleDrive Operator

2019-08-14 Thread Kamil Bregula (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907598#comment-16907598
 ] 

Kamil Bregula commented on AIRFLOW-4758:


I have created an operator that copies data from GCS to GDrive. Writing an 
operator that copies directories between GDrive and GCS will not be easy, 
because GDrive stores files in graphs. The directory structure may contain 
cycles. It is possible to write an operator that copies one file from GDrive, 
but its usability will be very limited. 

What do you think?
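
One way to cope with such cycles, sketched here for illustration (`list_children` is a hypothetical callable wrapping the Drive listing API; this is not code from the PR):

{code:python}
from collections import deque

def walk_drive_folder(list_children, root_id):
    """Breadth-first walk over the Drive folder graph, safe against cycles.

    list_children(folder_id) must return (file_ids, subfolder_ids) for one
    folder; the visited set guarantees each folder is processed only once,
    even if folders reference each other in a cycle.
    """
    seen = {root_id}
    queue = deque([root_id])
    while queue:
        folder_id = queue.popleft()
        file_ids, subfolder_ids = list_children(folder_id)
        for file_id in file_ids:
            yield file_id
        for sub_id in subfolder_ids:
            if sub_id not in seen:
                seen.add(sub_id)
                queue.append(sub_id)
{code}

Even with a cycle-safe traversal, a single Drive file could historically live under several parent folders, so a faithful directory copy has no unique answer; that supports the single-file-operator conclusion above.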

> Add GoogleCloudStorageToGoogleDrive Operator
> 
>
> Key: AIRFLOW-4758
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4758
> Project: Apache Airflow
>  Issue Type: Wish
>  Components: gcp, operators
>Affects Versions: 1.10.3
>Reporter: jack
>Priority: Major
>
> Add Operators:
> GoogleCloudStorageToGoogleDrive
> GoogleDriveToGoogleCloudStorage
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Assigned] (AIRFLOW-5176) Add integration with Azure Data Explorer

2019-08-14 Thread Michael Spector (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5176?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Michael Spector reassigned AIRFLOW-5176:


Assignee: (was: Michael Spector)

> Add integration with Azure Data Explorer
> 
>
> Key: AIRFLOW-5176
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5176
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: hooks, operators
>Affects Versions: 1.10.4, 2.0.0
>Reporter: Michael Spector
>Priority: Major
>
> Add a hook and an operator that allow sending queries to Azure Data Explorer 
> (Kusto) cluster.
> ADX (Azure Data Explorer) is relatively new but very promising analytics data 
> store / data processing offering in Azure.
> PR: https://github.com/apache/airflow/pull/5785
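
For readers unfamiliar with ADX, querying a Kusto cluster from Python looks roughly like this sketch (written against the azure-kusto-data package as of 2019; the cluster URL, database, and table are placeholders, and the device-code auth flow shown is only one of several options):

{code:python}
from azure.kusto.data.request import KustoClient, KustoConnectionStringBuilder

kcsb = KustoConnectionStringBuilder.with_aad_device_authentication(
    'https://mycluster.kusto.windows.net')  # cluster URL is a placeholder
client = KustoClient(kcsb)

# run a simple KQL query and iterate over the primary result table
response = client.execute('MyDatabase', 'MyTable | take 10')
for row in response.primary_results[0]:
    print(row)
{code}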



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-4758) Add GoogleCloudStorageToGoogleDrive Operator

2019-08-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-4758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907597#comment-16907597
 ] 

ASF GitHub Bot commented on AIRFLOW-4758:
-

mik-laj commented on pull request #5822: [AIRFLOW-4758] Add GcsToGDriveOperator
URL: https://github.com/apache/airflow/pull/5822
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-4758
 - In case you are fixing a typo in the documentation you can prepend your 
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
 - In case you are proposing a fundamental code change, you need to create 
an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)).
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   ### Code Quality
   
   - [ ] Passes `flake8`
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Add GoogleCloudStorageToGoogleDrive Operator
> 
>
> Key: AIRFLOW-4758
> URL: https://issues.apache.org/jira/browse/AIRFLOW-4758
> Project: Apache Airflow
>  Issue Type: Wish
>  Components: gcp, operators
>Affects Versions: 1.10.3
>Reporter: jack
>Priority: Major
>
> Add Operators:
> GoogleCloudStorageToGoogleDrive
> GoogleDriveToGoogleCloudStorage
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] matwerber1 commented on issue #4068: [AIRFLOW-2310]: Add AWS Glue Job Compatibility to Airflow

2019-08-14 Thread GitBox
matwerber1 commented on issue #4068: [AIRFLOW-2310]: Add AWS Glue Job 
Compatibility to Airflow
URL: https://github.com/apache/airflow/pull/4068#issuecomment-521402915
 
 
   I see the merge failed from what is (hopefully?) a small conflict - can we 
get eyes on this? Can I help? 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] mik-laj opened a new pull request #5822: [AIRFLOW-4758] Add GcsToGDriveOperator

2019-08-14 Thread GitBox
mik-laj opened a new pull request #5822: [AIRFLOW-4758] Add GcsToGDriveOperator
URL: https://github.com/apache/airflow/pull/5822
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title. For example, "\[AIRFLOW-XXX\] My Airflow PR"
 - https://issues.apache.org/jira/browse/AIRFLOW-4758
 - In case you are fixing a typo in the documentation you can prepend your 
commit with \[AIRFLOW-XXX\], code changes always need a Jira issue.
 - In case you are proposing a fundamental code change, you need to create 
an Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals)).
 - In case you are adding a dependency, check if the license complies with 
the [ASF 3rd Party License 
Policy](https://www.apache.org/legal/resolved.html#category-x).
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   ### Code Quality
   
   - [ ] Passes `flake8`
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-08-14 Thread GitBox
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r314046484
 
 

 ##
 File path: airflow/models/serialized_dag.py
 ##
 @@ -0,0 +1,155 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""Serialzed DAG table in database."""
+
+import hashlib
+from typing import Any, Dict, List, Optional, TYPE_CHECKING
+from sqlalchemy import Column, Index, Integer, String, Text, and_
+from sqlalchemy.sql import exists
+
+from airflow.models.base import Base, ID_LEN
+from airflow.utils import db, timezone
+from airflow.utils.sqlalchemy import UtcDateTime
+
+
+if TYPE_CHECKING:
+from airflow.dag.serialization.serialized_dag import SerializedDAG  # 
noqa: F401, E501; # pylint: disable=cyclic-import
+from airflow.models import DAG  # noqa: F401; # pylint: 
disable=cyclic-import
+
+
+class SerializedDagModel(Base):
+"""A table for serialized DAGs.
+
+serialized_dag table is a snapshot of DAG files synchronized by scheduler.
+This feature is controlled by:
+[core] dagcached = False: enable this feature
+[core] dagcached_min_update_interval = 30 (s):
+serialized DAGs are updated in DB when a file gets processed by 
scheduler,
+to reduce DB write rate, there is a minimal interval of updating 
serialized DAGs.
+[scheduler] dag_dir_list_interval = 300 (s):
+interval of deleting serialized DAGs in DB when the files are 
deleted, suggest
+to use a smaller interval such as 60
+
+It is used by webserver to load dagbags when dagcached=True. Because 
reading from
+database is lightweight compared to importing from files, it solves the 
webserver
+scalability issue.
+"""
+__tablename__ = 'serialized_dag'
+
+dag_id = Column(String(ID_LEN), primary_key=True)
+fileloc = Column(String(2000))
+# The max length of fileloc exceeds the limit of indexing.
+fileloc_hash = Column(Integer)
+data = Column(Text)
+last_updated = Column(UtcDateTime)
+
+__table_args__ = (
+Index('idx_fileloc_hash', fileloc_hash, unique=False),
+)
+
+def __init__(self, dag):
+from airflow.dag.serialization import Serialization
+
+self.dag_id = dag.dag_id
+self.fileloc = dag.full_filepath
+self.fileloc_hash = SerializedDagModel.dag_fileloc_hash(self.fileloc)
+self.data = Serialization.to_json(dag)
+self.last_updated = timezone.utcnow()
+
+@staticmethod
+def dag_fileloc_hash(full_filepath: str) -> int:
+"""Hashing file location for indexing.
+
+:param full_filepath: full filepath of DAG file
+:return: hashed full_filepath
+"""
+# hashing is needed because the length of fileloc is 2000 as an 
Airflow convention,
+# which is over the limit of indexing. If we can reduce the length of 
fileloc, then
+# hashing is not needed.
+return int(0xFFFFFFFF & int(
+hashlib.sha1(full_filepath.encode('utf-8')).hexdigest(), 16))
+
+@classmethod
+def write_dag(cls, dag: 'DAG', min_update_interval: Optional[int] = None):
+"""Serializes a DAG and writes it into database.
+
+:param dag: a DAG to be written into database
+:param min_update_interval: minimal interval in seconds to update 
serialized DAG
+"""
+with db.create_session() as session:
+if min_update_interval is not None:
+result = session.query(cls.last_updated).filter(
+cls.dag_id == dag.dag_id).first()
+if result is not None and (
+timezone.utcnow() - 
result.last_updated).total_seconds() < min_update_interval:
+return
+session.merge(cls(dag))
+
+@classmethod
+def read_all_dags(cls) -> Dict[str, 'SerializedDAG']:
+"""Reads all DAGs in serialized_dag table.
+
+:returns: a dict of DAGs read from database
+"""
+from airflow.dag.serialization import Serialization
+
+with db.create_session() as session:
+

[GitHub] [airflow] kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-08-14 Thread GitBox
kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting 
serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#issuecomment-520101071
 
 
   Pending Issues:
   
   - ~Add Timezone support to `serialized_dag` table~
   - ~We still have the issue of `SerializedBaseOperator` being displayed in 
Graph View.~
   
![image](https://user-images.githubusercontent.com/8811558/62814712-56b0b880-bb0a-11e9-9ef0-0dd9090b624b.png)
   - ~Issue displaying SubDags~:
   
![image](https://user-images.githubusercontent.com/8811558/62814991-66c99780-bb0c-11e9-9a36-f692b2ec3db5.png)
   
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] JCoder01 commented on issue #5672: [AIRFLOW-5056] Add argument to filter mails in ImapHook and related operators

2019-08-14 Thread GitBox
JCoder01 commented on issue #5672: [AIRFLOW-5056] Add argument to filter mails 
in ImapHook and related operators
URL: https://github.com/apache/airflow/pull/5672#issuecomment-521384333
 
 
   I'm actually not using IMAP anymore after the powers that be made me switch 
to Office 365 and disabled IMAP access, but looking it over, it looks good. 


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-08-14 Thread GitBox
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r314042957
 
 

 ##
 File path: airflow/models/dag.py
 ##
 @@ -1509,8 +1518,19 @@ def get_last_dagrun(self, session=None, 
include_externally_triggered=False):
 def safe_dag_id(self):
 return self.dag_id.replace('.', '__dot__')
 
-def get_dag(self):
-return DagBag(dag_folder=self.fileloc).get_dag(self.dag_id)
+def get_dag(self, dagcached_enabled=False):
+"""Creates a dagbag to load and return a DAG.
+
+Calling it from UI should set dagcached_enabled = DAGCACHED_ENABLED.
+There may be a delay for scheduler to write serialized DAG into 
database,
+loads from file in this case.
+FIXME: removes it when webserver does not access to DAG folder in 
future.
+"""
+dag = DagBag(
+dag_folder=self.fileloc, 
dagcached_enabled=dagcached_enabled).get_dag(self.dag_id)
+if dagcached_enabled and dag is None:
 
 Review comment:
   >There may be a delay for scheduler to write serialized DAG into database, 
loads from file in this case.
   
   I guess the idea is if for any reason (connectivity probably or some other 
DB issue) this method will load the DAG from file, hence it uses recursion to 
reload DagBag without cache_enabled.
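
Filling in the branch that the quoted diff truncates, the fallback reads roughly like this sketch (reconstructed from the discussion above, not the merged code):

{code:python}
def get_dag(self, dagcached_enabled=False):
    # method on DagModel; sketch of the fallback under discussion
    dag = DagBag(dag_folder=self.fileloc,
                 dagcached_enabled=dagcached_enabled).get_dag(self.dag_id)
    if dagcached_enabled and dag is None:
        # serialized DAG not written to the DB yet (or DB unreachable):
        # retry once, loading straight from the DAG file
        dag = self.get_dag(dagcached_enabled=False)
    return dag
{code}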


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-08-14 Thread GitBox
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r314041450
 
 

 ##
 File path: airflow/models/serialized_dag.py
 ##
 @@ -0,0 +1,155 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""Serialzed DAG table in database."""
+
+import hashlib
+from typing import Any, Dict, List, Optional, TYPE_CHECKING
+from sqlalchemy import Column, Index, Integer, String, Text, and_
+from sqlalchemy.sql import exists
+
+from airflow.models.base import Base, ID_LEN
+from airflow.utils import db, timezone
+from airflow.utils.sqlalchemy import UtcDateTime
+
+
+if TYPE_CHECKING:
+from airflow.dag.serialization.serialized_dag import SerializedDAG  # 
noqa: F401, E501; # pylint: disable=cyclic-import
+from airflow.models import DAG  # noqa: F401; # pylint: 
disable=cyclic-import
+
+
+class SerializedDagModel(Base):
+"""A table for serialized DAGs.
+
+serialized_dag table is a snapshot of DAG files synchronized by scheduler.
+This feature is controlled by:
+[core] dagcached = False: enable this feature
+[core] dagcached_min_update_interval = 30 (s):
+serialized DAGs are updated in DB when a file gets processed by 
scheduler,
+to reduce DB write rate, there is a minimal interval of updating 
serialized DAGs.
+[scheduler] dag_dir_list_interval = 300 (s):
+interval of deleting serialized DAGs in DB when the files are 
deleted, suggest
+to use a smaller interval such as 60
 
 Review comment:
   Sorry didn't understand this one, what do you mean?


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-08-14 Thread GitBox
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r314040957
 
 

 ##
 File path: airflow/models/serialized_dag.py
 ##
 @@ -0,0 +1,155 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""Serialzed DAG table in database."""
+
+import hashlib
+from typing import Any, Dict, List, Optional, TYPE_CHECKING
+from sqlalchemy import Column, Index, Integer, String, Text, and_
+from sqlalchemy.sql import exists
+
+from airflow.models.base import Base, ID_LEN
+from airflow.utils import db, timezone
+from airflow.utils.sqlalchemy import UtcDateTime
+
+
+if TYPE_CHECKING:
+from airflow.dag.serialization.serialized_dag import SerializedDAG  # 
noqa: F401, E501; # pylint: disable=cyclic-import
+from airflow.models import DAG  # noqa: F401; # pylint: 
disable=cyclic-import
+
+
+class SerializedDagModel(Base):
+"""A table for serialized DAGs.
+
+serialized_dag table is a snapshot of DAG files synchronized by scheduler.
+This feature is controlled by:
+[core] dagcached = False: enable this feature
+[core] dagcached_min_update_interval = 30 (s):
+serialized DAGs are updated in DB when a file gets processed by 
scheduler,
+to reduce DB write rate, there is a minimal interval of updating 
serialized DAGs.
+[scheduler] dag_dir_list_interval = 300 (s):
+interval of deleting serialized DAGs in DB when the files are 
deleted, suggest
+to use a smaller interval such as 60
+
+It is used by webserver to load dagbags when dagcached=True. Because 
reading from
+database is lightweight compared to importing from files, it solves the 
webserver
+scalability issue.
+"""
+__tablename__ = 'serialized_dag'
+
+dag_id = Column(String(ID_LEN), primary_key=True)
+fileloc = Column(String(2000))
+# The max length of fileloc exceeds the limit of indexing.
+fileloc_hash = Column(Integer)
+data = Column(Text)
+last_updated = Column(UtcDateTime)
+
+__table_args__ = (
+Index('idx_fileloc_hash', fileloc_hash, unique=False),
+)
+
+def __init__(self, dag):
+from airflow.dag.serialization import Serialization
+
+self.dag_id = dag.dag_id
+self.fileloc = dag.full_filepath
+self.fileloc_hash = SerializedDagModel.dag_fileloc_hash(self.fileloc)
+self.data = Serialization.to_json(dag)
+self.last_updated = timezone.utcnow()
+
+@staticmethod
+def dag_fileloc_hash(full_filepath: str) -> int:
+"""Hashing file location for indexing.
+
+:param full_filepath: full filepath of DAG file
+:return: hashed full_filepath
+"""
+# hashing is needed because the length of fileloc is 2000 as an 
Airflow convention,
+# which is over the limit of indexing. If we can reduce the length of 
fileloc, then
+# hashing is not needed.
+return int(0xFFFFFFFF & int(
+hashlib.sha1(full_filepath.encode('utf-8')).hexdigest(), 16))
+
+@classmethod
+def write_dag(cls, dag: 'DAG', min_update_interval: Optional[int] = None):
+"""Serializes a DAG and writes it into database.
+
+:param dag: a DAG to be written into database
+:param min_update_interval: minimal interval in seconds to update 
serialized DAG
+"""
+with db.create_session() as session:
+if min_update_interval is not None:
+result = session.query(cls.last_updated).filter(
+cls.dag_id == dag.dag_id).first()
+if result is not None and (
+timezone.utcnow() - 
result.last_updated).total_seconds() < min_update_interval:
+return
+session.merge(cls(dag))
+
+@classmethod
+def read_all_dags(cls) -> Dict[str, 'SerializedDAG']:
+"""Reads all DAGs in serialized_dag table.
+
+:returns: a dict of DAGs read from database
+"""
+from airflow.dag.serialization import Serialization
+
+with db.create_session() as session:
+

[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-08-14 Thread GitBox
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r314041221
 
 

 ##
 File path: airflow/models/serialized_dag.py
 ##
 @@ -0,0 +1,155 @@
+# -*- coding: utf-8 -*-
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+
+"""Serialzed DAG table in database."""
+
+import hashlib
+from typing import Any, Dict, List, Optional, TYPE_CHECKING
+from sqlalchemy import Column, Index, Integer, String, Text, and_
+from sqlalchemy.sql import exists
+
+from airflow.models.base import Base, ID_LEN
+from airflow.utils import db, timezone
+from airflow.utils.sqlalchemy import UtcDateTime
+
+
+if TYPE_CHECKING:
+from airflow.dag.serialization.serialized_dag import SerializedDAG  # 
noqa: F401, E501; # pylint: disable=cyclic-import
+from airflow.models import DAG  # noqa: F401; # pylint: 
disable=cyclic-import
+
+
+class SerializedDagModel(Base):
+"""A table for serialized DAGs.
+
+serialized_dag table is a snapshot of DAG files synchronized by scheduler.
+This feature is controlled by:
+[core] dagcached = False: enable this feature
+[core] dagcached_min_update_interval = 30 (s):
+serialized DAGs are updated in DB when a file gets processed by 
scheduler,
+to reduce DB write rate, there is a minimal interval of updating 
serialized DAGs.
+[scheduler] dag_dir_list_interval = 300 (s):
+interval of deleting serialized DAGs in DB when the files are 
deleted, suggest
+to use a smaller interval such as 60
+
+It is used by webserver to load dagbags when dagcached=True. Because 
reading from
+database is lightweight compared to importing from files, it solves the 
webserver
+scalability issue.
+"""
+__tablename__ = 'serialized_dag'
+
+dag_id = Column(String(ID_LEN), primary_key=True)
+fileloc = Column(String(2000))
+# The max length of fileloc exceeds the limit of indexing.
+fileloc_hash = Column(Integer)
+data = Column(Text)
+last_updated = Column(UtcDateTime)
+
+__table_args__ = (
+Index('idx_fileloc_hash', fileloc_hash, unique=False),
+)
+
+def __init__(self, dag):
+from airflow.dag.serialization import Serialization
+
+self.dag_id = dag.dag_id
+self.fileloc = dag.full_filepath
+self.fileloc_hash = SerializedDagModel.dag_fileloc_hash(self.fileloc)
+self.data = Serialization.to_json(dag)
+self.last_updated = timezone.utcnow()
+
+@staticmethod
+def dag_fileloc_hash(full_filepath: str) -> int:
+"""Hashing file location for indexing.
+
+:param full_filepath: full filepath of DAG file
+:return: hashed full_filepath
+"""
+# hashing is needed because the length of fileloc is 2000 as an 
Airflow convention,
+# which is over the limit of indexing. If we can reduce the length of 
fileloc, then
+# hashing is not needed.
+return int(0xFFFFFFFF & int(
+hashlib.sha1(full_filepath.encode('utf-8')).hexdigest(), 16))
 
 Review comment:
   Will let @coufon answer this one.
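
For reference, the technique in the quoted hunk is a SHA-1 digest of the file path folded down to an integer narrow enough for an indexed column. A standalone sketch (the 32-bit mask width is an assumption inferred from the `Integer` column type):

{code:python}
import hashlib

def dag_fileloc_hash(full_filepath: str) -> int:
    # digest the (up to 2000 character) path, then keep only the low 32 bits
    # so the value fits the indexed Integer column
    digest = hashlib.sha1(full_filepath.encode('utf-8')).hexdigest()
    return int(digest, 16) & 0xFFFFFFFF

# deterministic across processes and runs, e.g.:
# dag_fileloc_hash('/home/airflow/dags/my_dag.py') -> same int every time
{code}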


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-08-14 Thread GitBox
kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting 
serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#issuecomment-521358954
 
 
   @zhongjiajie 
   > > Pending Issues:
   > > 
   > > * We still have the issue of `SerializedBaseOperator` being displayed in 
Graph View.
   > 
   > This is because graph.html and tree.html use `op.__class__.__name__`. 
Replaced that by op.task_type to fix it.
   
   Found this issue with that fix: We have a `BashOperator` label for each task 
instead of unique labels. 
   
![image](https://user-images.githubusercontent.com/8811558/63044726-bd492400-bec6-11e9-9d02-a10198b72d46.png)
   
   This is caused because we are making a dict of unique TaskInstances and not 
Operator Classes in **L1335**:
   
   
https://github.com/apache/airflow/blob/b814f8dfd9448ee3ceef2722c7f0291d8a680700/airflow/www/views.py#L1333-L1336
   
   Previously it was comparing Classes directly, hence it would remove 
duplicates. 
   
   
https://github.com/apache/airflow/blob/42bf5cb6782994610c722fb56adfe7b837dfeabb/airflow/www/views.py#L1332-L1338
   
   Fixing this now.
   
   **Fixed** it with 
https://github.com/apache/airflow/pull/5743/commits/7859d787ca70225a32ce9dbc21d87facb59a3143
   
![image](https://user-images.githubusercontent.com/8811558/63048079-a8bc5a00-becd-11e9-8c05-02449f30fffc.png)
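
The fix boils down to de-duplicating operators by their `task_type` (the operator class name) rather than by TaskInstance identity; roughly (a sketch of the idea, not the committed diff):

{code:python}
def unique_task_types(tasks):
    """Keep one representative operator per task_type, restoring the old
    class-based de-duplication for the Graph/Tree view legend."""
    by_type = {}
    for op in tasks:
        by_type.setdefault(op.task_type, op)  # task_type is the operator class name
    return list(by_type.values())
{code}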
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (AIRFLOW-5216) Add Azure File Share Sensor

2019-08-14 Thread Albert Yau (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Albert Yau updated AIRFLOW-5216:

Description: Add sensor to check if a file exists on Azure File Share.  
(was: Add sensor to c**heck if a file exists on Azure File Share.)

> Add Azure File Share Sensor
> ---
>
> Key: AIRFLOW-5216
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5216
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: contrib
>Affects Versions: 1.10.4
>Reporter: Albert Yau
>Assignee: Albert Yau
>Priority: Minor
>
> Add sensor to check if a file exists on Azure File Share.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (AIRFLOW-5216) Add Azure File Share Sensor

2019-08-14 Thread Albert Yau (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Albert Yau updated AIRFLOW-5216:

Description: Add sensor to check if a file exists on Azure File Share using 
the existing AzureFileShareHook.  (was: Add sensor to check if a file exists on 
Azure File Share.)

> Add Azure File Share Sensor
> ---
>
> Key: AIRFLOW-5216
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5216
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: contrib
>Affects Versions: 1.10.4
>Reporter: Albert Yau
>Assignee: Albert Yau
>Priority: Minor
>
> Add sensor to check if a file exists on Azure File Share using the existing 
> AzureFileShareHook.
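
A sensor along the lines described could look like this sketch (the class name and argument names are assumptions, as is the hook's `check_for_file` method; none of this is taken from a merged PR):

{code:python}
from airflow.contrib.hooks.azure_fileshare_hook import AzureFileShareHook
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


class AzureFileShareFileSensor(BaseSensorOperator):
    """Waits for a file to appear on an Azure File Share."""

    @apply_defaults
    def __init__(self, share_name, directory_name, file_name,
                 azure_fileshare_conn_id='azure_fileshare_default',
                 *args, **kwargs):
        super(AzureFileShareFileSensor, self).__init__(*args, **kwargs)
        self.share_name = share_name
        self.directory_name = directory_name
        self.file_name = file_name
        self.azure_fileshare_conn_id = azure_fileshare_conn_id

    def poke(self, context):
        # conn id passed positionally; the hook's keyword name may differ
        hook = AzureFileShareHook(self.azure_fileshare_conn_id)
        return hook.check_for_file(self.share_name, self.directory_name,
                                   self.file_name)
{code}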



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5118) Airflow DataprocClusterCreateOperator does not currently support setting optional components

2019-08-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907535#comment-16907535
 ] 

ASF GitHub Bot commented on AIRFLOW-5118:
-

idralyuk commented on pull request #5821: [AIRFLOW-5118] Add ability to specify 
optional components in Dataproc…
URL: https://github.com/apache/airflow/pull/5821
 
 
   …ClusterCreateOperator
   
   ### Jira
   
   https://issues.apache.org/jira/browse/AIRFLOW-5118
   
   ### Description
   
   This PR adds the ability to specify optional components in 
DataprocClusterCreateOperator
   
   For more info see 
https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#Component
   
   ### Tests
   
   One test added: it checks whether the optional components were set correctly
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [X] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what it does
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to a appropriate release
   
   ### Code Quality
   
   - [X] Passes `flake8`
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Airflow DataprocClusterCreateOperator does not currently support setting 
> optional components
> 
>
> Key: AIRFLOW-5118
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5118
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: operators
>Affects Versions: 1.10.3
>Reporter: Omid Vahdaty
>Assignee: Igor
>Priority: Minor
>
> There needs to be an option to install optional components, such as Zeppelin, 
> via DataprocClusterCreateOperator.
> From the source code of the DataprocClusterCreateOperator[1], the only 
> software configs that can be set are the imageVersion and the properties. As 
> the Zeppelin component needs to be set through softwareConfig 
> optionalComponents[2], the DataprocClusterCreateOperator does not currently 
> support setting optional components. 
>  
> As a workaround for the time being, you could create your clusters by 
> directly using the gcloud command rather than the 
> DataprocClusterCreateOperator. Using the Airflow BashOperator[3], you can 
> execute gcloud commands that create your Dataproc cluster with the required 
> optional components. 
> [1] 
> [https://github.com/apache/airflow/blob/master/airflow/contrib/operators/dataproc_operator.py]
>  
>  [2] 
> [https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#softwareconfig]
>  
> [3] [https://airflow.apache.org/howto/operator/bash.html] 
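
A sketch of that BashOperator workaround (the cluster name, region, and component list are placeholders; depending on the gcloud release in use, the flag may require the beta track, as shown):

{code:python}
from airflow.operators.bash_operator import BashOperator

create_cluster = BashOperator(
    task_id='create_dataproc_cluster',
    bash_command=(
        'gcloud beta dataproc clusters create my-cluster '
        '--region=us-central1 '
        '--optional-components=ZEPPELIN'
    ),
)
{code}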



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] idralyuk opened a new pull request #5821: [AIRFLOW-5118] Add ability to specify optional components in Dataproc…

2019-08-14 Thread GitBox
idralyuk opened a new pull request #5821: [AIRFLOW-5118] Add ability to specify 
optional components in Dataproc…
URL: https://github.com/apache/airflow/pull/5821
 
 
   …ClusterCreateOperator
   
   ### Jira
   
   https://issues.apache.org/jira/browse/AIRFLOW-5118
   
   ### Description
   
   This PR adds the ability to specify optional components in 
DataprocClusterCreateOperator.
   
   For more info see 
https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#Component
   
   ### Tests
   
   One test added: it checks whether the optional components were set correctly
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [X] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what they do
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   ### Code Quality
   
   - [X] Passes `flake8`
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (AIRFLOW-5216) Add Azure File Share Sensor

2019-08-14 Thread Albert Yau (JIRA)
Albert Yau created AIRFLOW-5216:
---

 Summary: Add Azure File Share Sensor
 Key: AIRFLOW-5216
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5216
 Project: Apache Airflow
  Issue Type: New Feature
  Components: contrib
Affects Versions: 1.10.4
Reporter: Albert Yau
Assignee: Albert Yau


Add sensor to check if a file exists on Azure File Share.
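
A minimal sketch of what such a sensor might look like, reusing the existing
contrib AzureFileShareHook and its check_for_file method; the class name,
argument names, and connection id are illustrative, not the final
implementation:

{code:python}
from airflow.contrib.hooks.azure_fileshare_hook import AzureFileShareHook
from airflow.sensors.base_sensor_operator import BaseSensorOperator
from airflow.utils.decorators import apply_defaults


class AzureFileShareFileSensor(BaseSensorOperator):
    """Waits for a file to appear on an Azure File Share (sketch)."""

    @apply_defaults
    def __init__(self, share_name, directory_name, file_name,
                 fileshare_conn_id='azure_fileshare_default', *args, **kwargs):
        super(AzureFileShareFileSensor, self).__init__(*args, **kwargs)
        self.share_name = share_name
        self.directory_name = directory_name
        self.file_name = file_name
        self.fileshare_conn_id = fileshare_conn_id

    def poke(self, context):
        # check_for_file returns True once the file exists on the share.
        hook = AzureFileShareHook(self.fileshare_conn_id)
        return hook.check_for_file(self.share_name, self.directory_name,
                                   self.file_name)
{code}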



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5118) Airflow DataprocClusterCreateOperator does not currently support setting optional components

2019-08-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907528#comment-16907528
 ] 

ASF GitHub Bot commented on AIRFLOW-5118:
-

idralyuk commented on pull request #5820: [AIRFLOW-5118] Add ability to specify 
optional components in Dataproc…
URL: https://github.com/apache/airflow/pull/5820
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Airflow DataprocClusterCreateOperator does not currently support setting 
> optional components
> 
>
> Key: AIRFLOW-5118
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5118
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: operators
>Affects Versions: 1.10.3
>Reporter: Omid Vahdaty
>Assignee: Igor
>Priority: Minor
>
> There needs to be an option to install optional components, such as Zeppelin, 
> via DataprocClusterCreateOperator.
> From the source code of the DataprocClusterCreateOperator[1], the only 
> software configs that can be set are the imageVersion and the properties. As 
> the Zeppelin component needs to be set through softwareConfig 
> optionalComponents[2], the DataprocClusterCreateOperator does not currently 
> support setting optional components. 
>  
> As a workaround for the time being, you could create your clusters by 
> directly using the gcloud command rather than the 
> DataprocClusterCreateOperator. Using the Airflow BashOperator[3], you can 
> execute gcloud commands that create your Dataproc cluster with the required 
> optional components. 
> [1] 
> [https://github.com/apache/airflow/blob/master/airflow/contrib/operators/dataproc_operator.py]
>  
>  [2] 
> [https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#softwareconfig]
>  
> [3] [https://airflow.apache.org/howto/operator/bash.html] 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] idralyuk closed pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc…

2019-08-14 Thread GitBox
idralyuk closed pull request #5820: [AIRFLOW-5118] Add ability to specify 
optional components in Dataproc…
URL: https://github.com/apache/airflow/pull/5820
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] feluelle commented on issue #5672: [AIRFLOW-5056] Add argument to filter mails in ImapHook and related operators

2019-08-14 Thread GitBox
feluelle commented on issue #5672: [AIRFLOW-5056] Add argument to filter mails 
in ImapHook and related operators
URL: https://github.com/apache/airflow/pull/5672#issuecomment-521363251
 
 
   @JCoder01 @kurtqq aren't you using the IMAP thingy?  ..and want to have a 
final look at it? :)


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-5118) Airflow DataprocClusterCreateOperator does not currently support setting optional components

2019-08-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907520#comment-16907520
 ] 

ASF GitHub Bot commented on AIRFLOW-5118:
-

idralyuk commented on pull request #5820: [AIRFLOW-5118] Add ability to specify 
optional components in Dataproc…
URL: https://github.com/apache/airflow/pull/5820
 
 
   …ClusterCreateOperator
   
   ### Jira
   
   https://issues.apache.org/jira/browse/AIRFLOW-5118
   
   ### Description
   
   This PR adds the ability to specify optional components in 
DataprocClusterCreateOperator.
   
   For more info see 
https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#Component
   
   ### Tests
   
   One test added: it checks whether the optional components were set correctly
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [X] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what they do
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   ### Code Quality
   
   - [X] Passes `flake8`
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Airflow DataprocClusterCreateOperator does not currently support setting 
> optional components
> 
>
> Key: AIRFLOW-5118
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5118
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: operators
>Affects Versions: 1.10.3
>Reporter: Omid Vahdaty
>Assignee: Igor
>Priority: Minor
>
> There needs to be an option to install optional components, such as Zeppelin, 
> via DataprocClusterCreateOperator.
> From the source code of the DataprocClusterCreateOperator[1], the only 
> software configs that can be set are the imageVersion and the properties. As 
> the Zeppelin component needs to be set through softwareConfig 
> optionalComponents[2], the DataprocClusterCreateOperator does not currently 
> support setting optional components. 
>  
> As a workaround for the time being, you could create your clusters by 
> directly using the gcloud command rather than the 
> DataprocClusterCreateOperator. Using the Airflow BashOperator[3], you can 
> execute gcloud commands that create your Dataproc cluster with the required 
> optional components. 
> [1] 
> [https://github.com/apache/airflow/blob/master/airflow/contrib/operators/dataproc_operator.py]
>  
>  [2] 
> [https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#softwareconfig]
>  
> [3] [https://airflow.apache.org/howto/operator/bash.html] 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] idralyuk opened a new pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc…

2019-08-14 Thread GitBox
idralyuk opened a new pull request #5820: [AIRFLOW-5118] Add ability to specify 
optional components in Dataproc…
URL: https://github.com/apache/airflow/pull/5820
 
 
   …ClusterCreateOperator
   
   ### Jira
   
   https://issues.apache.org/jira/browse/AIRFLOW-5118
   
   ### Description
   
   This PR adds the ability to specify optional components in 
DataprocClusterCreateOperator.
   
   For more info see 
https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#Component
   
   ### Tests
   
   One test added: it checks whether the optional components were set correctly
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [X] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what they do
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   ### Code Quality
   
   - [X] Passes `flake8`
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-08-14 Thread GitBox
kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting 
serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#issuecomment-521358954
 
 
   @zhongjiajie 
   > > Pending Issues:
   > > 
   > > * We still have the issue of `SerializedBaseOperator` being displayed in 
Graph View.
   > 
   > This is because graph.html and tree.html use `op.__class__.__name__`. 
Replaced that by op.task_type to fix it.
   
   Found this issue with that fix: We have a `BashOperator` label for each task 
instead of unique labels. 
   
![image](https://user-images.githubusercontent.com/8811558/63044726-bd492400-bec6-11e9-9d02-a10198b72d46.png)
   
   This is caused because we are making a dict of unique TaskInstances rather than 
Operator Classes in **L1335**:
   
   
https://github.com/apache/airflow/blob/b814f8dfd9448ee3ceef2722c7f0291d8a680700/airflow/www/views.py#L1333-L1336
   
   Previously it was comparing Classes directly, hence it would remove 
duplicates. 
   
   
https://github.com/apache/airflow/blob/42bf5cb6782994610c722fb56adfe7b837dfeabb/airflow/www/views.py#L1322-L1328
   
   Fixing this now
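
   For reference, a minimal sketch of the dedup change being described, assuming
   `tasks` holds the DAG's operators (names are illustrative, not the exact
   views.py code):

   ```python
   # Key the mapping by operator *type* rather than by task object, so that
   # many BashOperator tasks collapse into one legend entry. task_type also
   # reports the original class for a SerializedBaseOperator, unlike
   # op.__class__.__name__.
   op_by_type = {}
   for task in tasks:
       op_by_type[task.task_type] = task
   unique_operators = sorted(op_by_type.values(), key=lambda op: op.task_type)
   ```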


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-08-14 Thread GitBox
kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting 
serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#issuecomment-521358954
 
 
   @zhongjiajie 
   > > Pending Issues:
   > > 
   > > * We still have the issue of `SerializedBaseOperator` being displayed in 
Graph View.
   > 
   > This is because graph.html and tree.html use `op.__class__.__name__`. 
Replaced that by op.task_type to fix it.
   
   Found this issue with that fix: We have a `BashOperator` label for each task 
instead of unique labels. 
   
![image](https://user-images.githubusercontent.com/8811558/63044726-bd492400-bec6-11e9-9d02-a10198b72d46.png)
   
   This is caused because we are making a dict of unique TaskInstances rather than 
Operator Classes in **L1335**:
   
   
https://github.com/apache/airflow/blob/b814f8dfd9448ee3ceef2722c7f0291d8a680700/airflow/www/views.py#L1333-L1336
   
   Previously it was comparing Classes directly, hence it would remove 
duplicates. 
   
   
https://github.com/apache/airflow/blob/42bf5cb6782994610c722fb56adfe7b837dfeabb/airflow/www/views.py#L1332-L1338
   
   Fixing this now


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] kaxil commented on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-08-14 Thread GitBox
kaxil commented on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized 
DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#issuecomment-521358954
 
 
   > > Pending Issues:
   > > 
   > > * We still have the issue of `SerializedBaseOperator` being displayed in 
Graph View.
   > 
   > This is because graph.html and tree.html use `op.__class__.__name__`. 
Replaced that by op.task_type to fix it.
   
   Found this issue with that fix: We have a `BashOperator` label for each task 
instead of unique labels. 
   
![image](https://user-images.githubusercontent.com/8811558/63044726-bd492400-bec6-11e9-9d02-a10198b72d46.png)
   
   This is caused because we are making a dict of unique TaskInstances rather than 
Operator Classes in **L1335**:
   
   
https://github.com/apache/airflow/blob/b814f8dfd9448ee3ceef2722c7f0291d8a680700/airflow/www/views.py#L1333-L1336
   
   Previously it was comparing Classes directly, hence it would remove 
duplicates. 
   
   
https://github.com/apache/airflow/blob/42bf5cb6782994610c722fb56adfe7b837dfeabb/airflow/www/views.py#L1422-L1436
   
   Fixing this now


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-08-14 Thread GitBox
kaxil edited a comment on issue #5743: [AIRFLOW-5088][AIP-24] Persisting 
serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#issuecomment-521358954
 
 
   @zhongjiajie 
   > > Pending Issues:
   > > 
   > > * We still have the issue of `SerializedBaseOperator` being displayed in 
Graph View.
   > 
   > This is because graph.html and tree.html use `op.__class__.__name__`. 
Replaced that by op.task_type to fix it.
   
   Found this issue with that fix: We have a `BashOperator` label for each task 
instead of unique labels. 
   
![image](https://user-images.githubusercontent.com/8811558/63044726-bd492400-bec6-11e9-9d02-a10198b72d46.png)
   
   This is caused because we are making a dict of unique TaskInstances rather than 
Operator Classes in **L1335**:
   
   
https://github.com/apache/airflow/blob/b814f8dfd9448ee3ceef2722c7f0291d8a680700/airflow/www/views.py#L1333-L1336
   
   Previously it was comparing Classes directly, hence it would remove 
duplicates. 
   
   
https://github.com/apache/airflow/blob/42bf5cb6782994610c722fb56adfe7b837dfeabb/airflow/www/views.py#L1422-L1436
   
   Fixing this now


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-5118) Airflow DataprocClusterCreateOperator does not currently support setting optional components

2019-08-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907510#comment-16907510
 ] 

ASF GitHub Bot commented on AIRFLOW-5118:
-

idralyuk commented on pull request #5820: [AIRFLOW-5118] Add ability to specify 
optional components in Dataproc…
URL: https://github.com/apache/airflow/pull/5820
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Airflow DataprocClusterCreateOperator does not currently support setting 
> optional components
> 
>
> Key: AIRFLOW-5118
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5118
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: operators
>Affects Versions: 1.10.3
>Reporter: Omid Vahdaty
>Assignee: Igor
>Priority: Minor
>
> There needs to be an option to install optional components, such as Zeppelin, 
> via DataprocClusterCreateOperator.
> From the source code of the DataprocClusterCreateOperator[1], the only 
> software configs that can be set are the imageVersion and the properties. As 
> the Zeppelin component needs to be set through softwareConfig 
> optionalComponents[2], the DataprocClusterCreateOperator does not currently 
> support setting optional components. 
>  
> As a workaround for the time being, you could create your clusters by 
> directly using the gcloud command rather than the 
> DataprocClusterCreateOperator. Using the Airflow BashOperator[3], you can 
> execute gcloud commands that create your Dataproc cluster with the required 
> optional components. 
> [1] 
> [https://github.com/apache/airflow/blob/master/airflow/contrib/operators/dataproc_operator.py]
>  
>  [2] 
> [https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#softwareconfig]
>  
> [3] [https://airflow.apache.org/howto/operator/bash.html] 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] rasmi commented on issue #5720: [AIRFLOW-5099][WIP-DONT-MERGE] Implement Google Cloud AutoML operators

2019-08-14 Thread GitBox
rasmi commented on issue #5720: [AIRFLOW-5099][WIP-DONT-MERGE] Implement Google 
Cloud AutoML operators
URL: https://github.com/apache/airflow/pull/5720#issuecomment-521356636
 
 
   No review comments here, I'm just excited for this to be merged -- thank you 
all for your work!


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] idralyuk closed pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc…

2019-08-14 Thread GitBox
idralyuk closed pull request #5820: [AIRFLOW-5118] Add ability to specify 
optional components in Dataproc…
URL: https://github.com/apache/airflow/pull/5820
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (AIRFLOW-5215) Add sidecar container support to Pod object

2019-08-14 Thread Philippe Gagnon (JIRA)
Philippe Gagnon created AIRFLOW-5215:


 Summary: Add sidecar container support to Pod object
 Key: AIRFLOW-5215
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5215
 Project: Apache Airflow
  Issue Type: New Feature
  Components: scheduler
Affects Versions: 2.0.0
Reporter: Philippe Gagnon
Assignee: Philippe Gagnon


Add sidecar container support to Pod object.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5118) Airflow DataprocClusterCreateOperator does not currently support setting optional components

2019-08-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907469#comment-16907469
 ] 

ASF GitHub Bot commented on AIRFLOW-5118:
-

idralyuk commented on pull request #5820: [AIRFLOW-5118] Add ability to specify 
optional components in Dataproc…
URL: https://github.com/apache/airflow/pull/5820
 
 
   …ClusterCreateOperator
   
   ### Jira
   
   https://issues.apache.org/jira/browse/AIRFLOW-5118
   
   ### Description
   
   This PR adds the ability to specify optional components in 
DataprocClusterCreateOperator.
   
   For more info see 
https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#Component
   
   ### Tests
   
   One test added: it checks whether the optional components were set correctly
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [X] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what they do
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   ### Code Quality
   
   - [X] Passes `flake8`
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Airflow DataprocClusterCreateOperator does not currently support setting 
> optional components
> 
>
> Key: AIRFLOW-5118
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5118
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: operators
>Affects Versions: 1.10.3
>Reporter: Omid Vahdaty
>Assignee: Igor
>Priority: Minor
>
> There needs to be an option to install optional components, such as Zeppelin, 
> via DataprocClusterCreateOperator.
> From the source code of the DataprocClusterCreateOperator[1], the only 
> software configs that can be set are the imageVersion and the properties. As 
> the Zeppelin component needs to be set through softwareConfig 
> optionalComponents[2], the DataprocClusterCreateOperator does not currently 
> support setting optional components. 
>  
> As a workaround for the time being, you could create your clusters by 
> directly using the gcloud command rather than the 
> DataprocClusterCreateOperator. Using the Airflow BashOperator[3], you can 
> execute gcloud commands that create your Dataproc cluster with the required 
> optional components. 
> [1] 
> [https://github.com/apache/airflow/blob/master/airflow/contrib/operators/dataproc_operator.py]
>  
>  [2] 
> [https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#softwareconfig]
>  
> [3] [https://airflow.apache.org/howto/operator/bash.html] 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] idralyuk opened a new pull request #5820: [AIRFLOW-5118] Add ability to specify optional components in Dataproc…

2019-08-14 Thread GitBox
idralyuk opened a new pull request #5820: [AIRFLOW-5118] Add ability to specify 
optional components in Dataproc…
URL: https://github.com/apache/airflow/pull/5820
 
 
   …ClusterCreateOperator
   
   ### Jira
   
   https://issues.apache.org/jira/browse/AIRFLOW-5118
   
   ### Description
   
   This PR adds the ability to specify optional components in 
DataprocClusterCreateOperator.
   
   For more info see 
https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#Component
   
   ### Tests
   
   One test added: it checks whether the optional components were set correctly
   
   ### Commits
   
   - [X] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [X] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what they do
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   ### Code Quality
   
   - [X] Passes `flake8`
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Closed] (AIRFLOW-5183) Prepare documentation for new GCP import paths

2019-08-14 Thread Kamil Bregula (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kamil Bregula closed AIRFLOW-5183.
--
   Resolution: Fixed
Fix Version/s: 2.0.0

> Prepare documentation for new GCP import paths
> ---
>
> Key: AIRFLOW-5183
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5183
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: gcp
>Affects Versions: 2.0.0
>Reporter: Tomasz Urbaszek
>Priority: Major
> Fix For: 2.0.0
>
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5183) Prepare documentation for new GCP import paths

2019-08-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907456#comment-16907456
 ] 

ASF subversion and git services commented on AIRFLOW-5183:
--

Commit 40745aa225ae14ad700e2da1f421cd5d0df8e292 in airflow's branch 
refs/heads/master from Tomek
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=40745aa ]

[AIRFLOW-5183] Prepare documentation for new GCP import paths (#5791)



> Prepare documentation for new GCP import paths
> ---
>
> Key: AIRFLOW-5183
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5183
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: gcp
>Affects Versions: 2.0.0
>Reporter: Tomasz Urbaszek
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5183) Prepare documentation for new GCP import paths

2019-08-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5183?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907455#comment-16907455
 ] 

ASF GitHub Bot commented on AIRFLOW-5183:
-

mik-laj commented on pull request #5791: [AIRFLOW-5183] Prepare documentation 
for new GCP import paths
URL: https://github.com/apache/airflow/pull/5791
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Prepare documentation for new GCP import paths
> ---
>
> Key: AIRFLOW-5183
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5183
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: gcp
>Affects Versions: 2.0.0
>Reporter: Tomasz Urbaszek
>Priority: Major
>




--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] mik-laj merged pull request #5791: [AIRFLOW-5183] Prepare documentation for new GCP import paths

2019-08-14 Thread GitBox
mik-laj merged pull request #5791: [AIRFLOW-5183] Prepare documentation for 
new GCP import paths
URL: https://github.com/apache/airflow/pull/5791
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Assigned] (AIRFLOW-5118) Airflow DataprocClusterCreateOperator does not currently support setting optional components

2019-08-14 Thread Igor (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5118?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Igor reassigned AIRFLOW-5118:
-

Assignee: Igor  (was: Kaxil Naik)

> Airflow DataprocClusterCreateOperator does not currently support setting 
> optional components
> 
>
> Key: AIRFLOW-5118
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5118
> Project: Apache Airflow
>  Issue Type: New Feature
>  Components: operators
>Affects Versions: 1.10.3
>Reporter: Omid Vahdaty
>Assignee: Igor
>Priority: Minor
>
> There needs to be an option to install optional components, such as Zeppelin, 
> via DataprocClusterCreateOperator.
> From the source code of the DataprocClusterCreateOperator[1], the only 
> software configs that can be set are the imageVersion and the properties. As 
> the Zeppelin component needs to be set through softwareConfig 
> optionalComponents[2], the DataprocClusterCreateOperator does not currently 
> support setting optional components. 
>  
> As a workaround for the time being, you could create your clusters by 
> directly using the gcloud command rather than the 
> DataprocClusterCreateOperator. Using the Airflow BashOperator[3], you can 
> execute gcloud commands that create your Dataproc cluster with the required 
> optional components. 
> [1] 
> [https://github.com/apache/airflow/blob/master/airflow/contrib/operators/dataproc_operator.py]
>  
>  [2] 
> [https://cloud.google.com/dataproc/docs/reference/rest/v1/ClusterConfig#softwareconfig]
>  
> [3] [https://airflow.apache.org/howto/operator/bash.html] 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-08-14 Thread GitBox
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r313965300
 
 

 ##
 File path: airflow/api/common/experimental/delete_dag.py
 ##
 @@ -45,6 +49,11 @@ def delete_dag(dag_id: str, keep_records_in_log: bool = True, session=None) -> int:
         raise DagFileExists("Dag id {} is still in DagBag. "
                             "Remove the DAG file first: {}".format(dag_id, dag.fileloc))
 
+    # Scheduler removes DAGs without files from serialized_dag table every dag_dir_list_interval.
+    # There may be a lag, so explicitly removes serialized DAG here.
+    if DAGCACHED_ENABLED and SerializedDagModel.has_dag(dag_id):
+        SerializedDagModel.remove_dag(dag_id)
 
 Review comment:
   Updated in 
https://github.com/apache/airflow/pull/5743/commits/b814f8dfd9448ee3ceef2722c7f0291d8a680700


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Commented] (AIRFLOW-5147) Annotations for k8s executors should support extended alphabet (like '/'))

2019-08-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5147?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907384#comment-16907384
 ] 

ASF GitHub Bot commented on AIRFLOW-5147:
-

andrei-l commented on pull request #5819: [AIRFLOW-5147] extended character set 
for k8s worker pods annotations
URL: https://github.com/apache/airflow/pull/5819
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title.
 - https://issues.apache.org/jira/browse/AIRFLOW-5147
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   This PR fixes the previous solution 
(https://github.com/apache/airflow/pull/4589) of providing k8s annotations to 
workers created by the k8s executor. Previously each annotation key had to be 
declared as part of the airflow config key, which imposed some 
limitations on it (for example, it could not contain the `/` character). 
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
```
   executors.TestKubeConfig.test_kube_config_worker_annotations_properly_parsed
   executors.TestKubeConfig.test_kube_config_no_worker_annotations
   ```
   and updates
   ```
   
executors.TestKubernetesWorkerConfiguration.test_make_pod_with_empty_executor_config
   ```
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what they do
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   ### Code Quality
   
   - [ ] Passes `flake8`
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Annotations for k8s executors should support extended alphabet (like '/')) 
> ---
>
> Key: AIRFLOW-5147
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5147
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: executor-kubernetes, executors
>Affects Versions: 1.10.3, 1.10.4
>Reporter: Andrei Loginov
>Assignee: Daniel Imberman
>Priority: Major
>
> The fix that introduced k8s annotations for executors 
> ([https://github.com/apache/airflow/pull/4589] for 
> https://issues.apache.org/jira/browse/AIRFLOW-3766) limited the character set 
> allowed for the annotation key to [-._a-zA-Z0-9]. However, many 
> annotations contain `/`, for example: 
> {code:java}
> injector.tumblr.com/request{code}
>  or
> {code:java}
> iam.amazonaws.com/role{code}
> Neither would be allowed by the current solution.
>  
> I believe the original solution should be completely revisited. Instead of 
> using a separate *kubernetes_annotations* section, there should be a single key 
> that contains a set of key:value annotations in some format, e.g. JSON:
> {code:java}
> [kubernetes]
> annotations = { "iam.amazonaws.com/role": 
> "arn:aws:iam:::role/some-role-CKU5HL9BIPXG", "some-other-anno-key": 
> "some/value" }
> {code}
>  
> Supported character set for annotations:
> https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/#syntax-and-character-set
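
A minimal sketch of how the proposed key could be consumed, assuming a
hypothetical `annotations` option under `[kubernetes]` that holds JSON (the
section and option names come from the proposal above, not from a released
Airflow config):

{code:python}
import json

from airflow.configuration import conf

# Read the raw JSON string from airflow.cfg, defaulting to an empty mapping.
raw = conf.get('kubernetes', 'annotations', fallback='{}')

# Keys such as "iam.amazonaws.com/role" survive intact, because JSON puts
# no restriction on the '/' character.
worker_annotations = json.loads(raw)
{code}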



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] andrei-l opened a new pull request #5819: [AIRFLOW-5147] extended character set for k8s worker pods annotations

2019-08-14 Thread GitBox
andrei-l opened a new pull request #5819: [AIRFLOW-5147] extended character set 
for k8s worker pods annotations
URL: https://github.com/apache/airflow/pull/5819
 
 
   Make sure you have checked _all_ steps below.
   
   ### Jira
   
   - [ ] My PR addresses the following [Airflow 
Jira](https://issues.apache.org/jira/browse/AIRFLOW/) issues and references 
them in the PR title.
 - https://issues.apache.org/jira/browse/AIRFLOW-5147
   
   ### Description
   
   - [ ] Here are some details about my PR, including screenshots of any UI 
changes:
   This PR fixes the previous solution 
(https://github.com/apache/airflow/pull/4589) of providing k8s annotations to 
workers created by the k8s executor. Previously each annotation key had to be 
declared as part of the airflow config key, which imposed some 
limitations on it (for example, it could not contain the `/` character). 
   
   ### Tests
   
   - [ ] My PR adds the following unit tests __OR__ does not need testing for 
this extremely good reason:
```
   executors.TestKubeConfig.test_kube_config_worker_annotations_properly_parsed
   executors.TestKubeConfig.test_kube_config_no_worker_annotations
   ```
   and updates
   ```
   
executors.TestKubernetesWorkerConfiguration.test_make_pod_with_empty_executor_config
   ```
   
   ### Commits
   
   - [ ] My commits all reference Jira issues in their subject lines, and I 
have squashed multiple commits if they address the same issue. In addition, my 
commits follow the guidelines from "[How to write a good git commit 
message](http://chris.beams.io/posts/git-commit/)":
 1. Subject is separated from body by a blank line
 1. Subject is limited to 50 characters (not including Jira issue reference)
 1. Subject does not end with a period
 1. Subject uses the imperative mood ("add", not "adding")
 1. Body wraps at 72 characters
 1. Body explains "what" and "why", not "how"
   
   ### Documentation
   
   - [ ] In case of new functionality, my PR adds documentation that describes 
how to use it.
 - All the public functions and the classes in the PR contain docstrings 
that explain what they do
 - If you implement backwards incompatible changes, please leave a note in 
the [Updating.md](https://github.com/apache/airflow/blob/master/UPDATING.md) so 
we can assign it to an appropriate release
   
   ### Code Quality
   
   - [ ] Passes `flake8`
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[GitHub] [airflow] danfrankj commented on issue #5815: [AIRFLOW-5210] Make finding template files more efficient

2019-08-14 Thread GitBox
danfrankj commented on issue #5815: [AIRFLOW-5210] Make finding template files 
more efficient
URL: https://github.com/apache/airflow/pull/5815#issuecomment-521300533
 
 
   @BasPH was something wrong with this PR? - I'm seeing a message above about 
a revert


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Assigned] (AIRFLOW-5213) DockerOperator failing when the docker default logging drivers are other than 'journald','json-file'

2019-08-14 Thread venkata Bonu (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venkata Bonu reassigned AIRFLOW-5213:
-

Assignee: (was: venkata Bonu)

> DockerOperator failing when the docker default logging drivers are other than 
> 'journald','json-file'
> 
>
> Key: AIRFLOW-5213
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5213
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG, operators
>Affects Versions: 1.10.4
>Reporter: venkata Bonu
>Priority: Major
>  Labels: easyfix
>
> Background:
> Docker can be configured with multiple logging drivers:
>  * syslog
>  * local
>  * json-file
>  * journald
>  * gelf
>  * fluentd
>  * awslogs
>  * splunk
>  * etwlogs
>  * gcplogs
>  * logentries
> But reading Docker logs is supported only with the drivers local, json-file, 
> and journald.
> Docker documentation: 
> [https://docs.docker.com/config/containers/logging/configure/]
>  
> Description:
> When Docker is configured with a logging driver other than local, json-file, 
> or journald, Airflow tasks using DockerOperator fail with the error
> _docker.errors.APIError: 501 Server Error: Not Implemented ("configured 
> logging driver does not support reading")_
> The issue is in the lines of code below, where the operator tries to read the 
> logs by attaching to the container.
> {code:python}
> line = ''
> for line in self.cli.attach(container=self.container['Id'], stdout=True,
>                             stderr=True, stream=True):
>     line = line.strip()
>     if hasattr(line, 'decode'):
>         line = line.decode('utf-8')
> self.log.info(line)
> {code}
>  
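
A minimal sketch of one possible guard around the quoted loop, tolerating
drivers that cannot be read back (an illustration only, not the actual patch):

{code:python}
from docker.errors import APIError

try:
    line = ''
    for line in self.cli.attach(container=self.container['Id'], stdout=True,
                                stderr=True, stream=True):
        line = line.strip()
        if hasattr(line, 'decode'):
            line = line.decode('utf-8')
        self.log.info(line)
except APIError:
    # Drivers such as awslogs or splunk refuse attach-based reads with
    # HTTP 501; skip log streaming instead of failing the whole task.
    self.log.warning('Logging driver does not support reading; '
                     'skipping container log streaming.')
{code}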
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5210) Resolving Template Files for large DAGs hurts performance

2019-08-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5210?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907336#comment-16907336
 ] 

ASF subversion and git services commented on AIRFLOW-5210:
--

Commit 577970c210c9160be9e2382ecfd3ae79b01e4d88 in airflow's branch 
refs/heads/revert-5815-df_resolve_template_files from Bas Harenslak
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=577970c ]

Revert "[AIRFLOW-5210] Make finding template files more efficient (#5815)"

This reverts commit eeac82318a6440b2d65f9a35b3437b91813945f4.


> Resolving Template Files for large DAGs hurts performance 
> --
>
> Key: AIRFLOW-5210
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5210
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG
>Affects Versions: 1.10.4
>Reporter: Daniel Frank
>Priority: Major
> Fix For: 1.10.5
>
>
> During task execution, "resolve_template_files" runs for all tasks in a 
> given DAG. For large DAGs this takes a long time and is not necessary for 
> tasks that do not use the template_ext field.
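
A minimal sketch of the kind of short-circuit the issue describes, assuming the
check lives in BaseOperator.resolve_template_files (illustrative, not the
reverted patch itself):

{code:python}
class BaseOperator(object):  # sketch: only the relevant method shown
    def resolve_template_files(self):
        # Operators that declare no templated file extensions cannot
        # reference template files, so skip the per-field scan entirely.
        if not self.template_ext:
            return
        # ... existing resolution logic for *.sql / *.sh style fields ...
{code}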



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (AIRFLOW-5211) Add pass_value to template_fields -- BigQueryValueCheckOperator

2019-08-14 Thread Damon Liao (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5211?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Damon Liao resolved AIRFLOW-5211.
-
Resolution: Fixed

> Add pass_value to template_fields -- BigQueryValueCheckOperator
> ---
>
> Key: AIRFLOW-5211
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5211
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: contrib
>Affects Versions: 1.10.4
>Reporter: Damon Liao
>Assignee: Damon Liao
>Priority: Minor
> Fix For: 1.10.5, 1.10.4
>
>
> There are use cases for filling *pass_value* from *XCom* when using 
> *BigQueryValueCheckOperator*, so add pass_value to template_fields.
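
A minimal sketch of the usage this enables, assuming `pass_value` accepts a
Jinja template once listed in template_fields (the SQL and task ids are
illustrative):

{code:python}
from airflow.contrib.operators.bigquery_check_operator import BigQueryValueCheckOperator

check = BigQueryValueCheckOperator(
    task_id='check_row_count',
    sql='SELECT COUNT(*) FROM dataset.table',
    # With pass_value templated, the expected value can come from XCom:
    pass_value="{{ ti.xcom_pull(task_ids='compute_expected_count') }}",
)
{code}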



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5182) "KubernetesOperator" isn't implemented

2019-08-14 Thread Ash Berlin-Taylor (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5182?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907329#comment-16907329
 ] 

Ash Berlin-Taylor commented on AIRFLOW-5182:


There never has been a KubernetesOperator, and the import is otherwise unused 
in the doc, so that line should just be removed from the docs.

> "KubernetesOperator" isn't implemented
> --
>
> Key: AIRFLOW-5182
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5182
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: documentation
>Affects Versions: 1.10.3, 1.10.4
>Reporter: Esfahan
>Priority: Minor
>
> h2. Problem
> I encountered the following error with `KubernetesOperator`.
> {code:java}
> Broken DAG: [/root/airflow/dags/sample_k8s.py] cannot import name 
> KubernetesOperator
> {code}
> h2. Investigation
> The following document contains sample code that describes how to use the 
> Kubernetes Executor.
>  [https://airflow.apache.org/kubernetes.html#kubernetes-operator]
> There is a line `import KubernetesOperator`, but I think it isn't implemented 
> in Airflow, and it isn't used in this script.
> {code:java}
> from airflow.contrib.operators import KubernetesOperator
> {code}
> I couldn't find `KubernetesOperator` in the following dirs.
>  * [https://github.com/apache/airflow/tree/1.10.4/airflow/contrib/operators]
>  * [https://github.com/apache/airflow/tree/1.10.4/airflow/operators]
> Could you check it?



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Created] (AIRFLOW-5214) Airflow leaves too many TIME_WAIT TCP connections

2019-08-14 Thread Oliver Ricken (JIRA)
Oliver Ricken created AIRFLOW-5214:
--

 Summary: Airflow leaves too many TIME_WAIT TCP connections
 Key: AIRFLOW-5214
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5214
 Project: Apache Airflow
  Issue Type: Bug
  Components: DagRun, database
Affects Versions: 1.10.4, 1.10.2
 Environment: CentOS 7, Airflow 1.10.4, Maria DB
Reporter: Oliver Ricken


Dear experts,

in Airflow versions 1.10.2 and 1.10.4, we experience a severe problem that 
limits the number of concurrent tasks.

We observe that when more than 8 tasks are started and executed in parallel, 
the majority of those tasks fail with the error "Can't connect to MySQL 
server" and error code 2006(99). This error code boils down to "Cannot bind 
socket to resource", which is why we started looking into the TCP connections 
of our Airflow host (a single node that hosts the webserver, scheduler and 
worker).

When the 8 tasks are running simultaneously, we observe more than 15,000 
TIME_WAIT connections while fewer than 50 are established. Given that the 
number of available ports is somewhat smaller than 30,000, this large number of 
blocked but unused TCP connections would explain the failure of further task 
executions.
Can anyone explain how so many open connections blocking ports/sockets come 
about? Given that we have connection pooling enabled, we do not see any 
explanation yet.

Your help is very much appreciated; this issue strongly limits our current 
performance!

Cheers

Oliver



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] Persisting serialized DAG in DB for webserver scalability

2019-08-14 Thread GitBox
kaxil commented on a change in pull request #5743: [AIRFLOW-5088][AIP-24] 
Persisting serialized DAG in DB for webserver scalability
URL: https://github.com/apache/airflow/pull/5743#discussion_r313909969
 
 

 ##
 File path: airflow/dag/serialization/serialized_baseoperator.py
 ##
 @@ -45,6 +45,8 @@ def __init__(self, *args, **kwargs):
         self.ui_color = BaseOperator.ui_color
         self.ui_fgcolor = BaseOperator.ui_fgcolor
         self.template_fields = BaseOperator.template_fields
+        # Not None for SubDagOperator.
 
 Review comment:
   Added in tests 
https://github.com/apache/airflow/pull/5743/commits/c68ee581e0534b91e41f5e696394a9b1d6e12baa


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Updated] (AIRFLOW-5213) DockerOperator failing when the docker default logging drivers are other than 'journald','json-file'

2019-08-14 Thread venkata Bonu (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venkata Bonu updated AIRFLOW-5213:
--
Attachment: (was: Screen Shot 2019-08-14 at 7.08.44 AM.png)

> DockerOperator failing when the docker default logging drivers are other than 
> 'journald','json-file'
> 
>
> Key: AIRFLOW-5213
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5213
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG, operators
>Affects Versions: 1.10.4
>Reporter: venkata Bonu
>Assignee: venkata Bonu
>Priority: Major
>  Labels: easyfix
>
> Background:
> Docker can be configured with multiple logging drivers:
>  * syslog
>  * local
>  * json-file
>  * journald
>  * gelf
>  * fluentd
>  * awslogs
>  * splunk
>  * etwlogs
>  * gcplogs
>  * logentries
> But reading Docker logs is supported only with the drivers local, json-file, 
> and journald.
> Docker documentation: 
> [https://docs.docker.com/config/containers/logging/configure/]
>  
> Description:
> When Docker is configured with a logging driver other than local, json-file, 
> or journald, Airflow tasks using DockerOperator fail with the error
> _docker.errors.APIError: 501 Server Error: Not Implemented ("configured 
> logging driver does not support reading")_
> The issue is in the lines of code below, where the operator tries to read the 
> logs by attaching to the container.
> {code:python}
> line = ''
> for line in self.cli.attach(container=self.container['Id'], stdout=True,
>                             stderr=True, stream=True):
>     line = line.strip()
>     if hasattr(line, 'decode'):
>         line = line.decode('utf-8')
> self.log.info(line)
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (AIRFLOW-5213) DockerOperator failing when the docker default logging drivers are other than 'journald','json-file'

2019-08-14 Thread venkata Bonu (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venkata Bonu updated AIRFLOW-5213:
--
Attachment: (was: Screen Shot 2019-08-14 at 7.10.01 AM.png)

> DockerOperator failing when the docker default logging drivers are other than 
> 'journald','json-file'
> 
>
> Key: AIRFLOW-5213
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5213
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG, operators
>Affects Versions: 1.10.4
>Reporter: venkata Bonu
>Assignee: venkata Bonu
>Priority: Major
>  Labels: easyfix
>
> Background:
> Docker can be configured with multiple logging drivers:
>  * syslog
>  * local
>  * json-file
>  * journald
>  * gelf
>  * fluentd
>  * awslogs
>  * splunk
>  * etwlogs
>  * gcplogs
>  * logentries
> But reading Docker logs is supported only with the drivers local, json-file, 
> and journald.
> Docker documentation: 
> [https://docs.docker.com/config/containers/logging/configure/]
>  
> Description:
> When Docker is configured with a logging driver other than local, json-file, 
> or journald, Airflow tasks using DockerOperator fail with the error
> _docker.errors.APIError: 501 Server Error: Not Implemented ("configured 
> logging driver does not support reading")_
> The issue is in the lines of code below, where the operator tries to read the 
> logs by attaching to the container.
> {code:python}
> line = ''
> for line in self.cli.attach(container=self.container['Id'], stdout=True,
>                             stderr=True, stream=True):
>     line = line.strip()
>     if hasattr(line, 'decode'):
>         line = line.decode('utf-8')
> self.log.info(line)
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (AIRFLOW-5213) DockerOperator failing when the docker default logging drivers are other than 'journald','json-file'

2019-08-14 Thread venkata Bonu (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venkata Bonu updated AIRFLOW-5213:
--
Attachment: Screen Shot 2019-08-14 at 7.08.44 AM.png

> DockerOperator failing when the docker default logging drivers are other than 
> 'journald','json-file'
> 
>
> Key: AIRFLOW-5213
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5213
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG, operators
>Affects Versions: 1.10.4
>Reporter: venkata Bonu
>Assignee: venkata Bonu
>Priority: Major
>  Labels: easyfix
> Attachments: Screen Shot 2019-08-14 at 7.08.44 AM.png, Screen Shot 
> 2019-08-14 at 7.10.01 AM.png
>
>
> Background:
> Docker can be configured with multiple logging drivers:
>  * syslog
>  * local
>  * json-file
>  * journald
>  * gelf
>  * fluentd
>  * awslogs
>  * splunk
>  * etwlogs
>  * gcplogs
>  * logentries
> But reading Docker logs is supported only with the drivers local, json-file, 
> and journald.
> Docker documentation: 
> [https://docs.docker.com/config/containers/logging/configure/]
>  
> Description:
> When Docker is configured with a logging driver other than local, json-file, 
> or journald, Airflow tasks using DockerOperator fail with the error
> _docker.errors.APIError: 501 Server Error: Not Implemented ("configured 
> logging driver does not support reading")_
> The issue is in the lines of code below, where the operator tries to read the 
> logs by attaching to the container.
> {code:python}
> line = ''
> for line in self.cli.attach(container=self.container['Id'], stdout=True,
>                             stderr=True, stream=True):
>     line = line.strip()
>     if hasattr(line, 'decode'):
>         line = line.decode('utf-8')
> self.log.info(line)
> {code}
>  
>  



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (AIRFLOW-5213) DockerOperator failing when the docker default logging drivers are other than 'journald','json-file'

2019-08-14 Thread venkata Bonu (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venkata Bonu updated AIRFLOW-5213:
--
Attachment: Screen Shot 2019-08-14 at 7.09.27 AM.png

> DockerOperator failing when the docker default logging drivers are other than 
> 'journald','json-file'
> 
>
> Key: AIRFLOW-5213
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5213
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG, operators
>Affects Versions: 1.10.4
>Reporter: venkata Bonu
>Assignee: venkata Bonu
>Priority: Major
>  Labels: easyfix
> Attachments: Screen Shot 2019-08-14 at 7.08.44 AM.png, Screen Shot 
> 2019-08-14 at 7.10.01 AM.png
>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Updated] (AIRFLOW-5213) DockerOperator failing when the docker default logging drivers are other than 'journald','json-file'

2019-08-14 Thread venkata Bonu (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5213?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

venkata Bonu updated AIRFLOW-5213:
--
Attachment: (was: Screen Shot 2019-08-14 at 7.09.27 AM.png)

> DockerOperator failing when the docker default logging drivers are other than 
> 'journald','json-file'
> 
>
> Key: AIRFLOW-5213
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5213
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DAG, operators
>Affects Versions: 1.10.4
>Reporter: venkata Bonu
>Assignee: venkata Bonu
>Priority: Major
>  Labels: easyfix
> Attachments: Screen Shot 2019-08-14 at 7.08.44 AM.png, Screen Shot 
> 2019-08-14 at 7.10.01 AM.png
>



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] potiuk commented on a change in pull request #5808: [AIRFLOW-5205] Check xml files depends on AIRFLOW-5161, AIRFLOW-5170, AIRFLOW-5180, AIRFLOW-5204,

2019-08-14 Thread GitBox
potiuk commented on a change in pull request #5808:  [AIRFLOW-5205] Check xml 
files depends on  AIRFLOW-5161,  AIRFLOW-5170,  AIRFLOW-5180,  AIRFLOW-5204, 
URL: https://github.com/apache/airflow/pull/5808#discussion_r313897477
 
 

 ##
 File path: airflow/_vendor/slugify/slugify.py
 ##
 @@ -1,3 +1,6 @@
+# -*- coding: utf-8 -*-
+# pylint: skip-file
+"""Slugify !"""
 
 Review comment:
   Yeah. I will split those and exclude the vendored code from the original 
change. I thought I did that everywhere, but I might have touched the vendored 
files accidentally.


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services


[jira] [Created] (AIRFLOW-5213) DockerOperator failing when the docker default logging drivers are other than 'journald','json-file'

2019-08-14 Thread venkata Bonu (JIRA)
venkata Bonu created AIRFLOW-5213:
-

 Summary: DockerOperator failing when the docker default logging 
drivers are other than 'journald','json-file'
 Key: AIRFLOW-5213
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5213
 Project: Apache Airflow
  Issue Type: Bug
  Components: DAG, operators
Affects Versions: 1.10.4
Reporter: venkata Bonu
Assignee: venkata Bonu
 Attachments: Screen Shot 2019-08-14 at 7.10.01 AM.png

Background:

Docker can be configured with multiple logging drivers:
 * syslog
 * local
 * json-file
 * journald
 * gelf
 * fluentd
 * awslogs
 * splunk
 * etwlogs
 * gcplogs
 * Logentries

However, reading container logs back is supported only with the local, 
json-file, and journald drivers.

Docker documentation: 
[https://docs.docker.com/config/containers/logging/configure/]

Description:

When Docker is configured with a logging driver other than local, json-file, 
or journald, Airflow tasks that use DockerOperator fail with the error

_docker.errors.APIError: 501 Server Error: Not Implemented ("configured logging 
driver does not support reading")_

The issue is in the lines below, where the operator tries to read the logs by 
attaching to the container:

{code:python}
line = ''
for line in self.cli.attach(container=self.container['Id'], stdout=True,
                            stderr=True, stream=True):
    line = line.strip()
    if hasattr(line, 'decode'):
        line = line.decode('utf-8')
    self.log.info(line)
{code}

 

 



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5179) Top level __init__.py breaks imports

2019-08-14 Thread ASF subversion and git services (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907291#comment-16907291
 ] 

ASF subversion and git services commented on AIRFLOW-5179:
--

Commit 4e03d2390fc77e6a911fb97d8585fad482c589a6 in airflow's branch 
refs/heads/master from Ash Berlin-Taylor
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=4e03d23 ]

[AIRFLOW-5179] Remove top level __init__.py (#5818)

The recent commit 3724c2aa to master introduced a __init__.py file in
the project root folder, which basically breaks all imports in local
development (`pip install -e .`) as it turns the project root into a
package.

[ci skip]

> Top level __init__.py breaks imports
> 
>
> Key: AIRFLOW-5179
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5179
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.0.0
>Reporter: Cedrik Neumann
>Assignee: Ash Berlin-Taylor
>Priority: Blocker
>
> The recent commit 
> [3724c2aaf4cfee4a60f6c7231777bfb256090c7c|https://github.com/apache/airflow/commit/3724c2aaf4cfee4a60f6c7231777bfb256090c7c]
>  to master introduced a {{__init__.py}} file in the project root folder, 
> which basically breaks all imports in local development ({{pip install -e 
> .}}) as it turns the project root into a package.
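As a side note, a small self-contained sketch of the failure mode described
above, using a hypothetical `myproject` layout rather than the Airflow tree
itself: once the repository root carries an __init__.py and is reachable from
sys.path (as it is after an editable install), the root imports as a package
and its subdirectories resolve as subpackages of it instead of as top-level
packages.

{code:python}
import os
import sys
import tempfile

# Build a throwaway project: myproject/ containing an 'airflow' package.
parent = tempfile.mkdtemp()
root = os.path.join(parent, 'myproject')
os.makedirs(os.path.join(root, 'airflow'))
open(os.path.join(root, 'airflow', '__init__.py'), 'w').close()

# The offending file: an __init__.py in the project root itself.
open(os.path.join(root, '__init__.py'), 'w').close()

sys.path.insert(0, parent)
import myproject                   # the repo root is now itself a package...
from myproject import airflow     # ...and 'airflow' is merely its subpackage
print(airflow.__name__)            # -> 'myproject.airflow'
{code}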



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Resolved] (AIRFLOW-5179) Top level __init__.py breaks imports

2019-08-14 Thread Ash Berlin-Taylor (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ash Berlin-Taylor resolved AIRFLOW-5179.

Resolution: Fixed

> Top level __init__.py breaks imports
> 
>
> Key: AIRFLOW-5179
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5179
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.0.0
>Reporter: Cedrik Neumann
>Assignee: Ash Berlin-Taylor
>Priority: Blocker
>
> The recent commit 
> [3724c2aaf4cfee4a60f6c7231777bfb256090c7c|https://github.com/apache/airflow/commit/3724c2aaf4cfee4a60f6c7231777bfb256090c7c]
>  to master introduced a {{__init__.py}} file in the project root folder, 
> which basically breaks all imports in local development ({{pip install -e 
> .}}) as it turns the project root into a package.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (AIRFLOW-5179) Top level __init__.py breaks imports

2019-08-14 Thread ASF GitHub Bot (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-5179?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16907290#comment-16907290
 ] 

ASF GitHub Bot commented on AIRFLOW-5179:
-

ashb commented on pull request #5818: [AIRFLOW-5179] Remove top level 
__init__.py
URL: https://github.com/apache/airflow/pull/5818
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Top level __init__.py breaks imports
> 
>
> Key: AIRFLOW-5179
> URL: https://issues.apache.org/jira/browse/AIRFLOW-5179
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: build
>Affects Versions: 2.0.0
>Reporter: Cedrik Neumann
>Assignee: Ash Berlin-Taylor
>Priority: Blocker
>
> The recent commit 
> [3724c2aaf4cfee4a60f6c7231777bfb256090c7c|https://github.com/apache/airflow/commit/3724c2aaf4cfee4a60f6c7231777bfb256090c7c]
>  to master introduced a {{__init__.py}} file in the project root folder, 
> which basically breaks all imports in local development ({{pip install -e 
> .}}) as it turns the project root into a package.



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[GitHub] [airflow] ashb merged pull request #5818: [AIRFLOW-5179] Remove top level __init__.py

2019-08-14 Thread GitBox
ashb merged pull request #5818: [AIRFLOW-5179] Remove top level __init__.py
URL: https://github.com/apache/airflow/pull/5818
 
 
   


This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


With regards,
Apache Git Services

