[GitHub] [airflow] boring-cyborg[bot] commented on pull request #10820: Update qubole_hook do not remove pool as an arg for qubole operator

2020-09-08 Thread GitBox


boring-cyborg[bot] commented on pull request #10820:
URL: https://github.com/apache/airflow/pull/10820#issuecomment-689319075


   Congratulations on your first Pull Request and welcome to the Apache Airflow 
community! If you have any issues or are unsure about any anything please check 
our Contribution Guide 
(https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst)
   Here are some useful points:
   - Pay attention to the quality of your code (flake8, pylint and type 
annotations). Our [pre-commits]( 
https://github.com/apache/airflow/blob/master/STATIC_CODE_CHECKS.rst#prerequisites-for-pre-commit-hooks)
 will help you with that.
   - In case of a new feature add useful documentation (in docstrings or in 
`docs/` directory). Adding a new operator? Check this short 
[guide](https://github.com/apache/airflow/blob/master/docs/howto/custom-operator.rst)
 Consider adding an example DAG that shows how users should use it.
   - Consider using [Breeze 
environment](https://github.com/apache/airflow/blob/master/BREEZE.rst) for 
testing locally, it’s a heavy docker but it ships with a working Airflow and a 
lot of integrations.
   - Be patient and persistent. It might take some time to get a review or get 
the final approval from Committers.
   - Please follow [ASF Code of 
Conduct](https://www.apache.org/foundation/policies/conduct) for all 
communication including (but not limited to) comments on Pull Requests, Mailing 
list and Slack.
   - Be sure to read the [Airflow Coding style]( 
https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#coding-style-and-best-practices).
   Apache Airflow is a community-driven project and together we are making it 
better .
   In case of doubts contact the developers at:
   Mailing List: d...@airflow.apache.org
   Slack: https://apache-airflow-slack.herokuapp.com/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] anmol-dhingra opened a new pull request #10820: Update qubole_hook do not remove pool as an arg for qubole operator

2020-09-08 Thread GitBox


anmol-dhingra opened a new pull request #10820:
URL: https://github.com/apache/airflow/pull/10820


   
   Parameter pool was not getting passed to the Base Operator it was using 
default_pool. Added it to the `options_to_remove` list so that it gets passed 
as it is.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] Shivarp1 edited a comment on issue #10722: KubernetesPodOperator with XCom enabled errors: 403 forbidden

2020-09-08 Thread GitBox


Shivarp1 edited a comment on issue #10722:
URL: https://github.com/apache/airflow/issues/10722#issuecomment-689279465


   @kaxil I think PodDefaults in pod_generator was introduced in 1.10.12?  I 
have not checked the 1.10.11 version
   In 1.10.10 we were using the kubernetes_request_factory, but I was getting 
different 403 error.  I didn't investigate further with that version. 
   However with this above fix , it is working with version 1.10.12; 
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] alexbegg commented on a change in pull request #10637: Fix Start Date tooltip on DAGs page not showing actual start_date

2020-09-08 Thread GitBox


alexbegg commented on a change in pull request #10637:
URL: https://github.com/apache/airflow/pull/10637#discussion_r485319633



##
File path: airflow/www/views.py
##
@@ -563,7 +563,9 @@ def last_dagruns(self, session=None):
 return wwwutils.json_response({})
 
 query = session.query(
-DagRun.dag_id, 
sqla.func.max(DagRun.execution_date).label('last_run')
+DagRun.dag_id,
+sqla.func.max(DagRun.execution_date).label('execution_date'),
+sqla.func.max(DagRun.start_date).label('start_date'),

Review comment:
   I can test for that, thanks for the heads up. However I would think the 
value above it, the execution_date, would be wrong to get the max value in the 
query, not the run_date. I don't recall that being the case as that query has 
been the same for a long time now (I just renamed the column alais) but if it 
is wrong, I'll try and fix it.





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] Shivarp1 commented on issue #10722: KubernetesPodOperator with XCom enabled errors: 403 forbidden

2020-09-08 Thread GitBox


Shivarp1 commented on issue #10722:
URL: https://github.com/apache/airflow/issues/10722#issuecomment-689279465


   @kaxil I think PodDefaults in pod_generator was introduced in 1.10.12?  I 
have not checked the 1.10.11 version
   In 1.10.10 we were using the kubernetes_request_factory. 
   
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] dimon222 edited a comment on issue #10722: KubernetesPodOperator with XCom enabled errors: 403 forbidden

2020-09-08 Thread GitBox


dimon222 edited a comment on issue #10722:
URL: https://github.com/apache/airflow/issues/10722#issuecomment-689229438


   @kaxil I can confirm that it didn't. It might be old outstanding issue back 
from 1.10.9 or even pre.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] dimon222 edited a comment on issue #10722: KubernetesPodOperator with XCom enabled errors: 403 forbidden

2020-09-08 Thread GitBox


dimon222 edited a comment on issue #10722:
URL: https://github.com/apache/airflow/issues/10722#issuecomment-689229438


   @kaxil I can confirm that it didn't. It might old outstanding issue back 
from 1.10.9



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] dimon222 commented on issue #10722: KubernetesPodOperator with XCom enabled errors: 403 forbidden

2020-09-08 Thread GitBox


dimon222 commented on issue #10722:
URL: https://github.com/apache/airflow/issues/10722#issuecomment-689229438


   @kaxil I can confirm that it didn't. It might old outstanding bug back from 
1.10.9



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] kaxil opened a new pull request #10819: Update log exception to reflect rename of execute_helper

2020-09-08 Thread GitBox


kaxil opened a new pull request #10819:
URL: https://github.com/apache/airflow/pull/10819


   `SchedulerJob.execute_helper` was renamed to 
`SchedulerJob._run_scheduler_loop`
   
   
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)**
 for more information.
   In case of fundamental code change, Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals))
 is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party 
License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in 
[UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] atsalolikhin-spokeo commented on pull request #10413: Add documentation for preparing database for Airflow

2020-09-08 Thread GitBox


atsalolikhin-spokeo commented on pull request #10413:
URL: https://github.com/apache/airflow/pull/10413#issuecomment-689200729


   @mik-laj Thank you for your kind comment and direction.  I've fixed the code 
so the CI checks are happy, and rebased to tip of master.  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] kaxil commented on issue #10793: Mounting DAGS from an externally populated PVC doesn't work in K8 Executor

2020-09-08 Thread GitBox


kaxil commented on issue #10793:
URL: https://github.com/apache/airflow/issues/10793#issuecomment-689195627


   @gardnerdev  Can you confirm if this worked before Airflow 1.10.12?
   
   cc @dimberman 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] kaxil commented on issue #10722: KubernetesPodOperator with XCom enabled errors: 403 forbidden

2020-09-08 Thread GitBox


kaxil commented on issue #10722:
URL: https://github.com/apache/airflow/issues/10722#issuecomment-689195510


   @Shivarp1 Can you confirm if this worked before Airflow 1.10.12?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] kaxil opened a new pull request #10818: Upgrade black to 20.8b1

2020-09-08 Thread GitBox


kaxil opened a new pull request #10818:
URL: https://github.com/apache/airflow/pull/10818


   We got clarification in https://github.com/psf/black/issues/1667 that the 
new changes related to trailing commas are feature instead of a bug
   
   
   
   
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)**
 for more information.
   In case of fundamental code change, Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals))
 is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party 
License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in 
[UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (AIRFLOW-3964) Consolidate and de-duplicate sensor tasks

2020-09-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192520#comment-17192520
 ] 

ASF GitHub Bot commented on AIRFLOW-3964:
-

kaxil edited a comment on pull request #5499:
URL: https://github.com/apache/airflow/pull/5499#issuecomment-689173053


   This is an awesome feature so thank you, @KevinYang21 and other folks from 
Airbnb team who worked on it.
   
   The follow-up PRs (that I can think of right now) based on Airflow 2.0 dev 
call are:
   
   - Clearly mark "Smart Sensor" as an early-access feature with a clear note 
that this feature might potentially change in future Airflow version with 
breaking changes. (https://github.com/apache/airflow/issues/10815)
   - Docs around different execution ways for Sensor: poke mode, reschedule 
mode vs Smart Sensor (https://github.com/apache/airflow/issues/10816)
   - [Enhancement] Explore if "smart sensor" can be used with a new mode 
(similar to reschedule) instead. 
(https://github.com/apache/airflow/issues/10817)
   
   The first would be needed before we release 2.0, while the last one can wait 
:)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Consolidate and de-duplicate sensor tasks 
> --
>
> Key: AIRFLOW-3964
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3964
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: dependencies, operators, scheduler
>Affects Versions: 1.10.0
>Reporter: Yingbo Wang
>Assignee: Yingbo Wang
>Priority: Critical
>
> h2. Problem
> h3. Airflow Sensor:
> Sensors are a certain type of operator that will keep running until a certain 
> criterion is met. Examples include a specific file landing in HDFS or S3, a 
> partition appearing in Hive, or a specific time of the day. Sensors are 
> derived from BaseSensorOperator and run a poke method at a specified 
> poke_interval until it returns True.
> Airflow Sensor duplication is a normal problem for large scale airflow 
> project. There are duplicated partitions needing to be detected from 
> same/different DAG. In Airbnb there are 88 boxes running four different types 
> of sensors everyday. The number of running sensor tasks ranges from 8k to 
> 16k, which takes great amount of resources. Although Airflow team had 
> redirected all sensors to a specific queue to allocate relatively minor 
> resource, there is still large room to reduce the number of workers and 
> relief DB pressure by optimizing the sensor mechanism.
> Existing sensor implementation creates an identical task for any sensor task 
> with specific dag_id, task_id and execution_date. This task is responsible of 
> keeping querying DB until the specified partitions exists. Even if two tasks 
> are waiting for same partition in DB, they are creating two connections with 
> the DB and checking the status in two separate processes. In one hand, DB 
> need to run duplicate jobs in multiple processes which will take both cpu and 
> memory resources. At the same time, Airflow need to maintain a process for 
> each sensor to query and wait for the partition/table to be created.
> h1. ***Design*
> There are several issues need to be resolved for our smart sensor. 
> h2. Persist sensor infor in DB and avoid file parsing before running
> Current Airflow implementation need to parse the DAG python file before 
> running a task. Parsing multiple python file in a smart sensor make the case 
> low efficiency and overload. Since sensor tasks need relatively more “light 
> weight” executing information -- less number of properties with simple 
> structure (most are built in type instead of function or object). We propose 
> to skip the parsing for smart sensor. The easiest way is to persist all task 
> instance information in airflow metaDB. 
> h3. Solution:
> It will be hard to dump the whole task instance object dictionary. And we do 
> not really need that much information. 
> We add two sets to the base sensor class as “persist_fields” and 
> “execute_fields”. 
> h4. “persist_fields”  dump to airflow.task_instance column “attr_dict”
> saves the attribute names that should be used to accomplish a sensor poking 
> job. For example:
>  #  the “NamedHivePartitionSensor” define its persist_fields as  
> ('partition_names', 'metastore_conn_id', 'hook') since these properties are 
> enough for its poking function. 
>  # While the HivePartitionSensor can be slightly different use persist_fields 
> as ('schema', 'table', 'partition', 'metastore_conn_id')
> If we have two tasks that have same 

[GitHub] [airflow] kaxil edited a comment on pull request #5499: [AIRFLOW-3964][AIP-17] Build smart sensor

2020-09-08 Thread GitBox


kaxil edited a comment on pull request #5499:
URL: https://github.com/apache/airflow/pull/5499#issuecomment-689173053


   This is an awesome feature so thank you, @KevinYang21 and other folks from 
Airbnb team who worked on it.
   
   The follow-up PRs (that I can think of right now) based on Airflow 2.0 dev 
call are:
   
   - Clearly mark "Smart Sensor" as an early-access feature with a clear note 
that this feature might potentially change in future Airflow version with 
breaking changes. (https://github.com/apache/airflow/issues/10815)
   - Docs around different execution ways for Sensor: poke mode, reschedule 
mode vs Smart Sensor (https://github.com/apache/airflow/issues/10816)
   - [Enhancement] Explore if "smart sensor" can be used with a new mode 
(similar to reschedule) instead. 
(https://github.com/apache/airflow/issues/10817)
   
   The first would be needed before we release 2.0, while the last one can wait 
:)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[airflow] branch master updated: Add pod_override setting for KubernetesExecutor (#10756)

2020-09-08 Thread dimberman
This is an automated email from the ASF dual-hosted git repository.

dimberman pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/airflow.git


The following commit(s) were added to refs/heads/master by this push:
 new 20481c3  Add pod_override setting for KubernetesExecutor (#10756)
20481c3 is described below

commit 20481c3cafb89bcf9b8a1b3a5f7c5470b11838c3
Author: Daniel Imberman 
AuthorDate: Tue Sep 8 15:56:59 2020 -0700

Add pod_override setting for KubernetesExecutor (#10756)

* Add podOverride setting for KubernetesExecutor

Users of the KubernetesExecutor will now have a "podOverride"
option in the executor_config. This option will allow users to
modify the pod launched by the KubernetesExecutor using a
`kubernetes.client.models.V1Pod` class. This is the first step
in deprecating the tradition executor_config.

* Fix k8s tests

* fix docs
---
 .../example_kubernetes_executor_config.py  | 100 +
 airflow/kubernetes/pod_generator.py|  56 +++-
 airflow/serialization/enums.py |   1 +
 airflow/serialization/serialized_objects.py|   8 ++
 docs/concepts.rst  |   9 ++
 docs/executor/kubernetes.rst   |  29 ++
 tests/kubernetes/test_pod_generator.py |  40 +
 tests/serialization/test_dag_serialization.py  |  27 +-
 8 files changed, 248 insertions(+), 22 deletions(-)

diff --git a/airflow/example_dags/example_kubernetes_executor_config.py 
b/airflow/example_dags/example_kubernetes_executor_config.py
index 5fef135..db64b31 100644
--- a/airflow/example_dags/example_kubernetes_executor_config.py
+++ b/airflow/example_dags/example_kubernetes_executor_config.py
@@ -20,6 +20,8 @@ This is an example dag for using a Kubernetes Executor 
Configuration.
 """
 import os
 
+from kubernetes.client import models as k8s
+
 from airflow import DAG
 from airflow.example_dags.libs.helper import print_stuff
 from airflow.operators.python import PythonOperator
@@ -37,6 +39,19 @@ with DAG(
 tags=['example'],
 ) as dag:
 
+def test_sharedvolume_mount():
+"""
+Tests whether the volume has been mounted.
+"""
+for i in range(5):
+try:
+return_code = os.system("cat /shared/test.txt")
+if return_code != 0:
+raise ValueError(f"Error when checking volume mount. 
Return code {return_code}")
+except ValueError as e:
+if i > 4:
+raise e
+
 def test_volume_mount():
 """
 Tests whether the volume has been mounted.
@@ -59,27 +74,75 @@ with DAG(
 }
 )
 
-# You can mount volume or secret to the worker pod
-second_task = PythonOperator(
-task_id="four_task",
+# [START task_with_volume]
+volume_task = PythonOperator(
+task_id="task_with_volume",
 python_callable=test_volume_mount,
 executor_config={
-"KubernetesExecutor": {
-"volumes": [
-{
-"name": "example-kubernetes-test-volume",
-"hostPath": {"path": "/tmp/"},
-},
-],
-"volume_mounts": [
-{
-"mountPath": "/foo/",
-"name": "example-kubernetes-test-volume",
-},
-]
-}
+"pod_override": k8s.V1Pod(
+spec=k8s.V1PodSpec(
+containers=[
+k8s.V1Container(
+name="base",
+volume_mounts=[
+k8s.V1VolumeMount(
+mount_path="/foo/",
+name="example-kubernetes-test-volume"
+)
+]
+)
+],
+volumes=[
+k8s.V1Volume(
+name="example-kubernetes-test-volume",
+host_path=k8s.V1HostPathVolumeSource(
+path="/tmp/"
+)
+)
+]
+)
+),
+}
+)
+# [END task_with_volume]
+
+# [START task_with_sidecar]
+sidecar_task = PythonOperator(
+task_id="task_with_sidecar",
+python_callable=test_sharedvolume_mount,
+executor_config={
+"pod_override": k8s.V1Pod(
+spec=k8s.V1PodSpec(
+containers=[
+k8s.V1Container(
+name="base",
+volume_mounts=[k8s.V1VolumeMount(
+

[GitHub] [airflow] kaxil commented on issue #10817: Explore if "smart sensor" can be used with a new mode (similar to reschedule) instead.

2020-09-08 Thread GitBox


kaxil commented on issue #10817:
URL: https://github.com/apache/airflow/issues/10817#issuecomment-689179160


   cc @KevinYang21 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] dimberman merged pull request #10756: Add pod_override setting for KubernetesExecutor

2020-09-08 Thread GitBox


dimberman merged pull request #10756:
URL: https://github.com/apache/airflow/pull/10756


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] kaxil opened a new issue #10817: Explore if "smart sensor" can be used with a new mode (similar to reschedule) instead.

2020-09-08 Thread GitBox


kaxil opened a new issue #10817:
URL: https://github.com/apache/airflow/issues/10817


   An enhancement to Smart Sensor feature introduced in 
https://github.com/apache/airflow/pull/5499 might be to have a separate “mode” 
like "reschedule mode".
   
   This would also simplify making Sensors Smart Sensor compatible and unify 
methods with 'reschedule' mode.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (AIRFLOW-3964) Consolidate and de-duplicate sensor tasks

2020-09-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192519#comment-17192519
 ] 

ASF GitHub Bot commented on AIRFLOW-3964:
-

kaxil edited a comment on pull request #5499:
URL: https://github.com/apache/airflow/pull/5499#issuecomment-689173053


   This is an awesome feature so thank you, @KevinYang21 and other folks from 
Airbnb team who worked on it.
   
   The follow-up PRs (that I can think of right now) based on Airflow 2.0 dev 
call are:
   
   - Clearly mark "Smart Sensor" as an early-access feature with a clear note 
that this feature might potentially change in future Airflow version with 
breaking changes. (https://github.com/apache/airflow/issues/10815)
   - Docs around different execution ways for Sensor: poke mode, reschedule 
mode vs Smart Sensor (https://github.com/apache/airflow/issues/10816)
   - [Enhancement] Explore if "smart sensor" can be used with a new mode 
(similar to reschedule) instead.
   
   The first would be needed before we release 2.0, while the last one can wait 
:)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Consolidate and de-duplicate sensor tasks 
> --
>
> Key: AIRFLOW-3964
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3964
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: dependencies, operators, scheduler
>Affects Versions: 1.10.0
>Reporter: Yingbo Wang
>Assignee: Yingbo Wang
>Priority: Critical
>
> h2. Problem
> h3. Airflow Sensor:
> Sensors are a certain type of operator that will keep running until a certain 
> criterion is met. Examples include a specific file landing in HDFS or S3, a 
> partition appearing in Hive, or a specific time of the day. Sensors are 
> derived from BaseSensorOperator and run a poke method at a specified 
> poke_interval until it returns True.
> Airflow Sensor duplication is a normal problem for large scale airflow 
> project. There are duplicated partitions needing to be detected from 
> same/different DAG. In Airbnb there are 88 boxes running four different types 
> of sensors everyday. The number of running sensor tasks ranges from 8k to 
> 16k, which takes great amount of resources. Although Airflow team had 
> redirected all sensors to a specific queue to allocate relatively minor 
> resource, there is still large room to reduce the number of workers and 
> relief DB pressure by optimizing the sensor mechanism.
> Existing sensor implementation creates an identical task for any sensor task 
> with specific dag_id, task_id and execution_date. This task is responsible of 
> keeping querying DB until the specified partitions exists. Even if two tasks 
> are waiting for same partition in DB, they are creating two connections with 
> the DB and checking the status in two separate processes. In one hand, DB 
> need to run duplicate jobs in multiple processes which will take both cpu and 
> memory resources. At the same time, Airflow need to maintain a process for 
> each sensor to query and wait for the partition/table to be created.
> h1. ***Design*
> There are several issues need to be resolved for our smart sensor. 
> h2. Persist sensor infor in DB and avoid file parsing before running
> Current Airflow implementation need to parse the DAG python file before 
> running a task. Parsing multiple python file in a smart sensor make the case 
> low efficiency and overload. Since sensor tasks need relatively more “light 
> weight” executing information -- less number of properties with simple 
> structure (most are built in type instead of function or object). We propose 
> to skip the parsing for smart sensor. The easiest way is to persist all task 
> instance information in airflow metaDB. 
> h3. Solution:
> It will be hard to dump the whole task instance object dictionary. And we do 
> not really need that much information. 
> We add two sets to the base sensor class as “persist_fields” and 
> “execute_fields”. 
> h4. “persist_fields”  dump to airflow.task_instance column “attr_dict”
> saves the attribute names that should be used to accomplish a sensor poking 
> job. For example:
>  #  the “NamedHivePartitionSensor” define its persist_fields as  
> ('partition_names', 'metastore_conn_id', 'hook') since these properties are 
> enough for its poking function. 
>  # While the HivePartitionSensor can be slightly different use persist_fields 
> as ('schema', 'table', 'partition', 'metastore_conn_id')
> If we have two tasks that have same property value for all field in 
> 

[GitHub] [airflow] kaxil commented on issue #10816: Docs around different execution modes for Sensor

2020-09-08 Thread GitBox


kaxil commented on issue #10816:
URL: https://github.com/apache/airflow/issues/10816#issuecomment-689177720


   cc @YingboWang @KevinYang21  



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] kaxil opened a new issue #10816: Docs around different execution modes for Sensor

2020-09-08 Thread GitBox


kaxil opened a new issue #10816:
URL: https://github.com/apache/airflow/issues/10816


   **Description**
   
   It would be good to add docs to explain different modes for Sensors:
   
   1. Poke mode
   1. Reschedule mode
   1. Smart Sensor
   
   and to explain the advantages of one over the other



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] kaxil edited a comment on pull request #5499: [AIRFLOW-3964][AIP-17] Build smart sensor

2020-09-08 Thread GitBox


kaxil edited a comment on pull request #5499:
URL: https://github.com/apache/airflow/pull/5499#issuecomment-689173053


   This is an awesome feature so thank you, @KevinYang21 and other folks from 
Airbnb team who worked on it.
   
   The follow-up PRs (that I can think of right now) based on Airflow 2.0 dev 
call are:
   
   - Clearly mark "Smart Sensor" as an early-access feature with a clear note 
that this feature might potentially change in future Airflow version with 
breaking changes. (https://github.com/apache/airflow/issues/10815)
   - Docs around different execution ways for Sensor: poke mode, reschedule 
mode vs Smart Sensor (https://github.com/apache/airflow/issues/10816)
   - [Enhancement] Explore if "smart sensor" can be used with a new mode 
(similar to reschedule) instead.
   
   The first would be needed before we release 2.0, while the last one can wait 
:)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] kaxil edited a comment on issue #10815: Mark "Smart Sensor" as an early-access feature

2020-09-08 Thread GitBox


kaxil edited a comment on issue #10815:
URL: https://github.com/apache/airflow/issues/10815#issuecomment-689174895


   @YingboWang @KevinYang21  Let me if you would like to do it. If not, we 
could mark this as a "good first issue" :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] kaxil commented on issue #10815: Mark "Smart Sensor" as an early-access feature

2020-09-08 Thread GitBox


kaxil commented on issue #10815:
URL: https://github.com/apache/airflow/issues/10815#issuecomment-689174895


   @YingboWang Let me if you would love to do it



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (AIRFLOW-3964) Consolidate and de-duplicate sensor tasks

2020-09-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192518#comment-17192518
 ] 

ASF GitHub Bot commented on AIRFLOW-3964:
-

kaxil edited a comment on pull request #5499:
URL: https://github.com/apache/airflow/pull/5499#issuecomment-689173053


   This is an awesome feature so thank you, @KevinYang21 and other folks from 
Airbnb team who worked on it.
   
   The follow-up PRs (that I can think of right now) based on Airflow 2.0 dev 
call are:
   
   - Clearly mark "Smart Sensor" as an early-access feature with a clear note 
that this feature might potentially change in future Airflow version with 
breaking changes. (https://github.com/apache/airflow/issues/10815)
   - Docs around different execution ways for Sensor: poke mode, reschedule 
mode vs Smart Sensor
   - [Enhancement] Explore if "smart sensor" can be used with a new mode 
(similar to reschedule) instead.
   
   The first would be needed before we release 2.0, while the last one can wait 
:)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Consolidate and de-duplicate sensor tasks 
> --
>
> Key: AIRFLOW-3964
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3964
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: dependencies, operators, scheduler
>Affects Versions: 1.10.0
>Reporter: Yingbo Wang
>Assignee: Yingbo Wang
>Priority: Critical
>
> h2. Problem
> h3. Airflow Sensor:
> Sensors are a certain type of operator that will keep running until a certain 
> criterion is met. Examples include a specific file landing in HDFS or S3, a 
> partition appearing in Hive, or a specific time of the day. Sensors are 
> derived from BaseSensorOperator and run a poke method at a specified 
> poke_interval until it returns True.
> Airflow Sensor duplication is a normal problem for large scale airflow 
> project. There are duplicated partitions needing to be detected from 
> same/different DAG. In Airbnb there are 88 boxes running four different types 
> of sensors everyday. The number of running sensor tasks ranges from 8k to 
> 16k, which takes great amount of resources. Although Airflow team had 
> redirected all sensors to a specific queue to allocate relatively minor 
> resource, there is still large room to reduce the number of workers and 
> relief DB pressure by optimizing the sensor mechanism.
> Existing sensor implementation creates an identical task for any sensor task 
> with specific dag_id, task_id and execution_date. This task is responsible of 
> keeping querying DB until the specified partitions exists. Even if two tasks 
> are waiting for same partition in DB, they are creating two connections with 
> the DB and checking the status in two separate processes. In one hand, DB 
> need to run duplicate jobs in multiple processes which will take both cpu and 
> memory resources. At the same time, Airflow need to maintain a process for 
> each sensor to query and wait for the partition/table to be created.
> h1. ***Design*
> There are several issues need to be resolved for our smart sensor. 
> h2. Persist sensor infor in DB and avoid file parsing before running
> Current Airflow implementation need to parse the DAG python file before 
> running a task. Parsing multiple python file in a smart sensor make the case 
> low efficiency and overload. Since sensor tasks need relatively more “light 
> weight” executing information -- less number of properties with simple 
> structure (most are built in type instead of function or object). We propose 
> to skip the parsing for smart sensor. The easiest way is to persist all task 
> instance information in airflow metaDB. 
> h3. Solution:
> It will be hard to dump the whole task instance object dictionary. And we do 
> not really need that much information. 
> We add two sets to the base sensor class as “persist_fields” and 
> “execute_fields”. 
> h4. “persist_fields”  dump to airflow.task_instance column “attr_dict”
> saves the attribute names that should be used to accomplish a sensor poking 
> job. For example:
>  #  the “NamedHivePartitionSensor” define its persist_fields as  
> ('partition_names', 'metastore_conn_id', 'hook') since these properties are 
> enough for its poking function. 
>  # While the HivePartitionSensor can be slightly different use persist_fields 
> as ('schema', 'table', 'partition', 'metastore_conn_id')
> If we have two tasks that have same property value for all field in 
> persist_fields. That means these two tasks are poking the same 

[GitHub] [airflow] kaxil edited a comment on pull request #5499: [AIRFLOW-3964][AIP-17] Build smart sensor

2020-09-08 Thread GitBox


kaxil edited a comment on pull request #5499:
URL: https://github.com/apache/airflow/pull/5499#issuecomment-689173053


   This is an awesome feature so thank you, @KevinYang21 and other folks from 
Airbnb team who worked on it.
   
   The follow-up PRs (that I can think of right now) based on Airflow 2.0 dev 
call are:
   
   - Clearly mark "Smart Sensor" as an early-access feature with a clear note 
that this feature might potentially change in future Airflow version with 
breaking changes. (https://github.com/apache/airflow/issues/10815)
   - Docs around different execution ways for Sensor: poke mode, reschedule 
mode vs Smart Sensor
   - [Enhancement] Explore if "smart sensor" can be used with a new mode 
(similar to reschedule) instead.
   
   The first would be needed before we release 2.0, while the last one can wait 
:)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] kaxil opened a new issue #10815: Mark "Smart Sensor" as an early-access feature

2020-09-08 Thread GitBox


kaxil opened a new issue #10815:
URL: https://github.com/apache/airflow/issues/10815


   **Description**
   
   Based on our discussion during Airflow 2.0 dev call: Smart Sensors Will be 
included in 2.0 as an early-access feature with a clear note that this feature 
might potentially change in future Airflow version with breaking changes.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] kaxil commented on pull request #5499: [AIRFLOW-3964][AIP-17] Build smart sensor

2020-09-08 Thread GitBox


kaxil commented on pull request #5499:
URL: https://github.com/apache/airflow/pull/5499#issuecomment-689173053


   This is an awesome feature so thank you, @KevinYang21 and other folks from 
Airbnb team who worked on it.
   
   The follow-up PRs (that I can think of right now) based on Airflow 2.0 dev 
call are:
   
   - Clearly mark "Smart Sensor" as an early-access feature with a clear note 
that this feature might potentially change in future Airflow version with 
breaking changes. 
   - Docs around different execution ways for Sensor: poke mode, reschedule 
mode vs Smart Sensor
   - [Enhancement] Explore if "smart sensor" can be used with a new mode 
(similar to reschedule) instead.
   
   The first would be needed before we release 2.0, while the last one can wait 
:)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (AIRFLOW-3964) Consolidate and de-duplicate sensor tasks

2020-09-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192517#comment-17192517
 ] 

ASF GitHub Bot commented on AIRFLOW-3964:
-

kaxil commented on pull request #5499:
URL: https://github.com/apache/airflow/pull/5499#issuecomment-689173053


   This is an awesome feature so thank you, @KevinYang21 and other folks from 
Airbnb team who worked on it.
   
   The follow-up PRs (that I can think of right now) based on Airflow 2.0 dev 
call are:
   
   - Clearly mark "Smart Sensor" as an early-access feature with a clear note 
that this feature might potentially change in future Airflow version with 
breaking changes. 
   - Docs around different execution ways for Sensor: poke mode, reschedule 
mode vs Smart Sensor
   - [Enhancement] Explore if "smart sensor" can be used with a new mode 
(similar to reschedule) instead.
   
   The first would be needed before we release 2.0, while the last one can wait 
:)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Consolidate and de-duplicate sensor tasks 
> --
>
> Key: AIRFLOW-3964
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3964
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: dependencies, operators, scheduler
>Affects Versions: 1.10.0
>Reporter: Yingbo Wang
>Assignee: Yingbo Wang
>Priority: Critical
>
> h2. Problem
> h3. Airflow Sensor:
> Sensors are a certain type of operator that will keep running until a certain 
> criterion is met. Examples include a specific file landing in HDFS or S3, a 
> partition appearing in Hive, or a specific time of the day. Sensors are 
> derived from BaseSensorOperator and run a poke method at a specified 
> poke_interval until it returns True.
> Airflow Sensor duplication is a normal problem for large scale airflow 
> project. There are duplicated partitions needing to be detected from 
> same/different DAG. In Airbnb there are 88 boxes running four different types 
> of sensors everyday. The number of running sensor tasks ranges from 8k to 
> 16k, which takes great amount of resources. Although Airflow team had 
> redirected all sensors to a specific queue to allocate relatively minor 
> resource, there is still large room to reduce the number of workers and 
> relief DB pressure by optimizing the sensor mechanism.
> Existing sensor implementation creates an identical task for any sensor task 
> with specific dag_id, task_id and execution_date. This task is responsible of 
> keeping querying DB until the specified partitions exists. Even if two tasks 
> are waiting for same partition in DB, they are creating two connections with 
> the DB and checking the status in two separate processes. In one hand, DB 
> need to run duplicate jobs in multiple processes which will take both cpu and 
> memory resources. At the same time, Airflow need to maintain a process for 
> each sensor to query and wait for the partition/table to be created.
> h1. ***Design*
> There are several issues need to be resolved for our smart sensor. 
> h2. Persist sensor infor in DB and avoid file parsing before running
> Current Airflow implementation need to parse the DAG python file before 
> running a task. Parsing multiple python file in a smart sensor make the case 
> low efficiency and overload. Since sensor tasks need relatively more “light 
> weight” executing information -- less number of properties with simple 
> structure (most are built in type instead of function or object). We propose 
> to skip the parsing for smart sensor. The easiest way is to persist all task 
> instance information in airflow metaDB. 
> h3. Solution:
> It will be hard to dump the whole task instance object dictionary. And we do 
> not really need that much information. 
> We add two sets to the base sensor class as “persist_fields” and 
> “execute_fields”. 
> h4. “persist_fields”  dump to airflow.task_instance column “attr_dict”
> saves the attribute names that should be used to accomplish a sensor poking 
> job. For example:
>  #  the “NamedHivePartitionSensor” define its persist_fields as  
> ('partition_names', 'metastore_conn_id', 'hook') since these properties are 
> enough for its poking function. 
>  # While the HivePartitionSensor can be slightly different use persist_fields 
> as ('schema', 'table', 'partition', 'metastore_conn_id')
> If we have two tasks that have same property value for all field in 
> persist_fields. That means these two tasks are poking the same item and they 
> are holding duplicate poking jobs in 

[GitHub] [airflow] kenjihiraoka commented on pull request #9943: Increase typing coverage for postgres provider

2020-09-08 Thread GitBox


kenjihiraoka commented on pull request #9943:
URL: https://github.com/apache/airflow/pull/9943#issuecomment-689170646


   Sorry man, I'll fix that wrong rebase :sweat: 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] ephraimbuddy opened a new pull request #10814: Add AzureDataLakeUploadOperator

2020-09-08 Thread GitBox


ephraimbuddy opened a new pull request #10814:
URL: https://github.com/apache/airflow/pull/10814


   This PR adds AzureDataLakeUploadOperator. This operator will help to add 
system test for ADLSToGCSOperator
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)**
 for more information.
   In case of fundamental code change, Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals))
 is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party 
License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in 
[UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (AIRFLOW-3964) Consolidate and de-duplicate sensor tasks

2020-09-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192510#comment-17192510
 ] 

ASF GitHub Bot commented on AIRFLOW-3964:
-

potiuk commented on pull request #5499:
URL: https://github.com/apache/airflow/pull/5499#issuecomment-689164512


   :tada: !



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Consolidate and de-duplicate sensor tasks 
> --
>
> Key: AIRFLOW-3964
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3964
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: dependencies, operators, scheduler
>Affects Versions: 1.10.0
>Reporter: Yingbo Wang
>Assignee: Yingbo Wang
>Priority: Critical
>
> h2. Problem
> h3. Airflow Sensor:
> Sensors are a certain type of operator that will keep running until a certain 
> criterion is met. Examples include a specific file landing in HDFS or S3, a 
> partition appearing in Hive, or a specific time of the day. Sensors are 
> derived from BaseSensorOperator and run a poke method at a specified 
> poke_interval until it returns True.
> Airflow Sensor duplication is a normal problem for large scale airflow 
> project. There are duplicated partitions needing to be detected from 
> same/different DAG. In Airbnb there are 88 boxes running four different types 
> of sensors everyday. The number of running sensor tasks ranges from 8k to 
> 16k, which takes great amount of resources. Although Airflow team had 
> redirected all sensors to a specific queue to allocate relatively minor 
> resource, there is still large room to reduce the number of workers and 
> relief DB pressure by optimizing the sensor mechanism.
> Existing sensor implementation creates an identical task for any sensor task 
> with specific dag_id, task_id and execution_date. This task is responsible of 
> keeping querying DB until the specified partitions exists. Even if two tasks 
> are waiting for same partition in DB, they are creating two connections with 
> the DB and checking the status in two separate processes. In one hand, DB 
> need to run duplicate jobs in multiple processes which will take both cpu and 
> memory resources. At the same time, Airflow need to maintain a process for 
> each sensor to query and wait for the partition/table to be created.
> h1. ***Design*
> There are several issues need to be resolved for our smart sensor. 
> h2. Persist sensor infor in DB and avoid file parsing before running
> Current Airflow implementation need to parse the DAG python file before 
> running a task. Parsing multiple python file in a smart sensor make the case 
> low efficiency and overload. Since sensor tasks need relatively more “light 
> weight” executing information -- less number of properties with simple 
> structure (most are built in type instead of function or object). We propose 
> to skip the parsing for smart sensor. The easiest way is to persist all task 
> instance information in airflow metaDB. 
> h3. Solution:
> It will be hard to dump the whole task instance object dictionary. And we do 
> not really need that much information. 
> We add two sets to the base sensor class as “persist_fields” and 
> “execute_fields”. 
> h4. “persist_fields”  dump to airflow.task_instance column “attr_dict”
> saves the attribute names that should be used to accomplish a sensor poking 
> job. For example:
>  #  the “NamedHivePartitionSensor” define its persist_fields as  
> ('partition_names', 'metastore_conn_id', 'hook') since these properties are 
> enough for its poking function. 
>  # While the HivePartitionSensor can be slightly different use persist_fields 
> as ('schema', 'table', 'partition', 'metastore_conn_id')
> If we have two tasks that have same property value for all field in 
> persist_fields. That means these two tasks are poking the same item and they 
> are holding duplicate poking jobs in senser. 
> *The persist_fields can help us in deduplicate sensor tasks*. In a more 
> broader way. If we can list persist_fields for all operators, it can help to 
> dedup all airflow tasks.
> h4. “Execute_fields” dump to airflow.task_instance column “exec_dict”
> Saves the execution configuration such as “poke_interval”, “timeout”, 
> “execution_timeout”
> Fields in this set do not contain information affecting the poking job 
> detail. They are related to how frequent should we poke, when should the task 
> timeout, how many times timeout should be a fail etc. We only put those logic 
> that we can easily handle in a smart sensor for now. This is a smart 

[GitHub] [airflow] potiuk commented on pull request #5499: [AIRFLOW-3964][AIP-17] Build smart sensor

2020-09-08 Thread GitBox


potiuk commented on pull request #5499:
URL: https://github.com/apache/airflow/pull/5499#issuecomment-689164512


   :tada: !



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Resolved] (AIRFLOW-3964) Consolidate and de-duplicate sensor tasks

2020-09-08 Thread Kaxil Naik (Jira)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Kaxil Naik resolved AIRFLOW-3964.
-
Resolution: Fixed

> Consolidate and de-duplicate sensor tasks 
> --
>
> Key: AIRFLOW-3964
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3964
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: dependencies, operators, scheduler
>Affects Versions: 1.10.0
>Reporter: Yingbo Wang
>Assignee: Yingbo Wang
>Priority: Critical
>
> h2. Problem
> h3. Airflow Sensor:
> Sensors are a certain type of operator that will keep running until a certain 
> criterion is met. Examples include a specific file landing in HDFS or S3, a 
> partition appearing in Hive, or a specific time of the day. Sensors are 
> derived from BaseSensorOperator and run a poke method at a specified 
> poke_interval until it returns True.
> Sensor duplication is a common problem in large-scale Airflow projects: 
> the same partitions need to be detected from the same or different DAGs. At 
> Airbnb, 88 boxes run four different types of sensors every day, and the number 
> of running sensor tasks ranges from 8k to 16k, which takes a great amount of 
> resources. Although the Airflow team has redirected all sensors to a dedicated 
> queue with relatively few resources, there is still large room to reduce the 
> number of workers and relieve DB pressure by optimizing the sensor mechanism.
> The existing implementation creates a separate task for every sensor with a 
> specific dag_id, task_id and execution_date. This task keeps querying the DB 
> until the specified partition exists. Even if two tasks are waiting for the 
> same partition, they create two connections to the DB and check the status in 
> two separate processes. On one hand, the DB has to run duplicate jobs in 
> multiple processes, which takes both CPU and memory; on the other hand, 
> Airflow has to maintain a process for each sensor to query and wait for the 
> partition/table to be created.
> h1. *Design*
> Several issues need to be resolved for our smart sensor. 
> h2. Persist sensor info in the DB and avoid file parsing before running
> The current Airflow implementation needs to parse the DAG Python file before 
> running a task. Parsing multiple Python files inside a smart sensor would be 
> inefficient and add overhead. Sensor tasks need relatively "lightweight" 
> execution information -- a small number of properties with simple structure 
> (mostly built-in types rather than functions or objects) -- so we propose to 
> skip parsing for the smart sensor. The easiest way is to persist all required 
> task instance information in the Airflow metaDB. 
> h3. Solution:
> It would be hard to dump the whole task instance object dictionary, and we do 
> not really need that much information. 
> We add two sets to the base sensor class: "persist_fields" and 
> "execute_fields". 
> h4. "persist_fields" -- dumped to the airflow.task_instance column "attr_dict"
> Saves the attribute names that are needed to accomplish a sensor's poking job. 
> For example:
>  # The "NamedHivePartitionSensor" defines its persist_fields as 
> ('partition_names', 'metastore_conn_id', 'hook'), since these properties are 
> enough for its poking function. 
>  # The HivePartitionSensor is slightly different and uses persist_fields of 
> ('schema', 'table', 'partition', 'metastore_conn_id').
> If two tasks have the same value for every field in persist_fields, they are 
> poking the same item and hold duplicate poking jobs in the sensor. 
> *The persist_fields can help us deduplicate sensor tasks.* More broadly, if 
> we can list persist_fields for all operators, they can help dedup all Airflow 
> tasks.
> h4. "execute_fields" -- dumped to the airflow.task_instance column "exec_dict"
> Saves the execution configuration, such as "poke_interval", "timeout" and 
> "execution_timeout".
> Fields in this set do not affect the details of the poking job itself; they 
> control how frequently we poke, when the task should time out, how many 
> timeouts count as a failure, and so on. For now we only include logic that a 
> smart sensor can easily handle. This is a smart sensor "doable whitelist" and 
> can be extended as more logic is "unlocked" by the smart sensor 
> implementation. 
> When we initialize a task instance object, we dump the attribute values of 
> these two sets and persist them in the Airflow metaDB. The smart sensor can 
> read all required information about running sensor tasks from the DB and does 
> not need to parse any DAG files. 
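To make the persist_fields deduplication above concrete, here is a minimal sketch. The field tuple is the NamedHivePartitionSensor example from the description; the hashing helper is illustrative only, not the actual Airflow implementation:

```python
import hashlib
import json

# persist_fields for a NamedHivePartitionSensor-style sensor, per the text above.
PERSIST_FIELDS = ('partition_names', 'metastore_conn_id', 'hook')

def dedup_key(task):
    """Two sensor tasks with equal keys are poking the same item."""
    attrs = {field: getattr(task, field, None) for field in PERSIST_FIELDS}
    # Serialize deterministically so equal attribute sets produce equal keys.
    payload = json.dumps(attrs, sort_keys=True, default=str)
    return hashlib.md5(payload.encode('utf-8')).hexdigest()

# Tasks whose keys collide can be served by one smart sensor poke instead of
# one polling process each.
```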
> h2. Airflow scheduler change
> We do not want to break any existing logic in the scheduler. 

[GitHub] [airflow] YingboWang commented on pull request #5499: [AIRFLOW-3964][AIP-17] Build smart sensor

2020-09-08 Thread GitBox


YingboWang commented on pull request #5499:
URL: https://github.com/apache/airflow/pull/5499#issuecomment-689162322


   > If anyone has any more suggestions or want to request changes, let's do it 
in a follow-up PR.
   > 
   > Thanks a lot @YingboWang and apologies for the long wait.
   
   Thank you @kaxil  for reviewing and merging this PR! Appreciate all the 
help.   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (AIRFLOW-3964) Consolidate and de-duplicate sensor tasks

2020-09-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192506#comment-17192506
 ] 

ASF GitHub Bot commented on AIRFLOW-3964:
-

YingboWang commented on pull request #5499:
URL: https://github.com/apache/airflow/pull/5499#issuecomment-689162322


   > If anyone has any more suggestions or want to request changes, let's do it 
in a follow-up PR.
   > 
   > Thanks a lot @YingboWang and apologies for the long wait.
   
   Thank you @kaxil  for reviewing and merging this PR! Appreciate all the 
help.   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Consolidate and de-duplicate sensor tasks 
> --
>
> Key: AIRFLOW-3964
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3964
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: dependencies, operators, scheduler
>Affects Versions: 1.10.0
>Reporter: Yingbo Wang
>Assignee: Yingbo Wang
>Priority: Critical
>

[GitHub] [airflow] nikste commented on issue #7907: End-to-end DAG testing

2020-09-08 Thread GitBox


nikste commented on issue #7907:
URL: https://github.com/apache/airflow/issues/7907#issuecomment-689160101


   any news on this?
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[jira] [Commented] (AIRFLOW-3964) Consolidate and de-duplicate sensor tasks

2020-09-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192499#comment-17192499
 ] 

ASF subversion and git services commented on AIRFLOW-3964:
--

Commit ac943c9e18f75259d531dbda8c51e650f57faa4c in airflow's branch 
refs/heads/master from Yingbo Wang
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=ac943c9 ]

[AIRFLOW-3964][AIP-17] Consolidate and de-dup sensor tasks using Smart Sensor 
(#5499)

Co-authored-by: Yingbo Wang 

> Consolidate and de-duplicate sensor tasks 
> --
>
> Key: AIRFLOW-3964
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3964
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: dependencies, operators, scheduler
>Affects Versions: 1.10.0
>Reporter: Yingbo Wang
>Assignee: Yingbo Wang
>Priority: Critical
>

[jira] [Commented] (AIRFLOW-3964) Consolidate and de-duplicate sensor tasks

2020-09-08 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192497#comment-17192497
 ] 

ASF subversion and git services commented on AIRFLOW-3964:
--

Commit ac943c9e18f75259d531dbda8c51e650f57faa4c in airflow's branch 
refs/heads/master from Yingbo Wang
[ https://gitbox.apache.org/repos/asf?p=airflow.git;h=ac943c9 ]

[AIRFLOW-3964][AIP-17] Consolidate and de-dup sensor tasks using Smart Sensor 
(#5499)

Co-authored-by: Yingbo Wang 

> Consolidate and de-duplicate sensor tasks 
> --
>
> Key: AIRFLOW-3964
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3964
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: dependencies, operators, scheduler
>Affects Versions: 1.10.0
>Reporter: Yingbo Wang
>Assignee: Yingbo Wang
>Priority: Critical
>

[airflow] branch master updated: [AIRFLOW-3964][AIP-17] Consolidate and de-dup sensor tasks using Smart Sensor (#5499)

2020-09-08 Thread kaxilnaik
This is an automated email from the ASF dual-hosted git repository.

kaxilnaik pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/airflow.git


The following commit(s) were added to refs/heads/master by this push:
 new ac943c9  [AIRFLOW-3964][AIP-17] Consolidate and de-dup sensor tasks 
using Smart Sensor (#5499)
ac943c9 is described below

commit ac943c9e18f75259d531dbda8c51e650f57faa4c
Author: Yingbo Wang 
AuthorDate: Tue Sep 8 14:47:59 2020 -0700

[AIRFLOW-3964][AIP-17] Consolidate and de-dup sensor tasks using Smart 
Sensor (#5499)

Co-authored-by: Yingbo Wang 
---
 airflow/config_templates/config.yml|  33 +
 airflow/config_templates/default_airflow.cfg   |  15 +
 airflow/exceptions.py  |   7 +
 airflow/jobs/scheduler_job.py  |   7 +-
 .../e38be357a868_update_schema_for_smart_sensor.py |  92 +++
 airflow/models/__init__.py |   1 +
 airflow/models/baseoperator.py |   7 +
 airflow/models/dagbag.py   |  12 +-
 airflow/models/sensorinstance.py   | 166 +
 airflow/models/taskinstance.py | 121 +++-
 .../apache/hive/sensors/metastore_partition.py |   1 +
 .../apache/hive/sensors/named_hive_partition.py|  10 +
 .../providers/elasticsearch/log/es_task_handler.py |  40 +-
 airflow/sensors/base_sensor_operator.py|  74 +-
 airflow/sensors/smart_sensor_operator.py   | 764 +
 airflow/smart_sensor_dags/__init__.py  |  17 +
 airflow/smart_sensor_dags/smart_sensor_group.py|  63 ++
 airflow/utils/file.py  |  16 +-
 airflow/utils/log/file_processor_handler.py|  15 +-
 airflow/utils/log/file_task_handler.py |  22 +-
 airflow/utils/log/log_reader.py|   3 +-
 airflow/utils/state.py |  25 +-
 airflow/www/static/css/graph.css   |   4 +
 airflow/www/static/css/tree.css|   4 +
 airflow/www/templates/airflow/ti_log.html  |  28 +-
 docs/img/smart_sensor_architecture.png | Bin 0 -> 80325 bytes
 docs/img/smart_sensor_single_task_execute_flow.png | Bin 0 -> 75462 bytes
 docs/index.rst |   1 +
 docs/logging-monitoring/metrics.rst|   5 +
 docs/operators-and-hooks-ref.rst   |   4 +
 docs/smart-sensor.rst  |  86 +++
 tests/api_connexion/endpoints/test_log_endpoint.py |  14 +-
 tests/jobs/test_scheduler_job.py   |  22 +-
 tests/models/test_dagbag.py|   8 +-
 tests/models/test_sensorinstance.py|  46 ++
 .../amazon/aws/log/test_cloudwatch_task_handler.py |   9 +-
 .../amazon/aws/log/test_s3_task_handler.py |  11 +-
 .../elasticsearch/log/test_es_task_handler.py  |  20 +-
 .../microsoft/azure/log/test_wasb_task_handler.py  |  10 +-
 tests/sensors/test_smart_sensor_operator.py| 326 +
 tests/test_config_templates.py |   3 +-
 tests/utils/log/test_log_reader.py |  58 +-
 tests/utils/test_log_handlers.py   |   2 +-
 tests/www/test_views.py|  22 +-
 44 files changed, 2062 insertions(+), 132 deletions(-)

diff --git a/airflow/config_templates/config.yml 
b/airflow/config_templates/config.yml
index 10b8987..25b24da 100644
--- a/airflow/config_templates/config.yml
+++ b/airflow/config_templates/config.yml
@@ -2337,3 +2337,36 @@
 to identify the task.
 Should be supplied in the format: ``key = value``
   options: []
+- name: smart_sensor
+  description: ~
+  options:
+- name: use_smart_sensor
+  description: |
+When `use_smart_sensor` is True, Airflow redirects multiple qualified 
sensor tasks to
+smart sensor task.
+  version_added: ~
+  type: boolean
+  example: ~
+  default: "False"
+- name: shard_code_upper_limit
+  description: |
+`shard_code_upper_limit` is the upper limit of `shard_code` value. The 
`shard_code` is generated
+by `hashcode % shard_code_upper_limit`.
+  version_added: ~
+  type: int
+  example: ~
+  default: "1"
+- name: shards
+  description: |
+The number of running smart sensor processes for each service.
+  version_added: ~
+  type: int
+  example: ~
+  default: "5"
+- name: sensors_enabled
+  description: |
+comma separated sensor classes support in smart_sensor.
+  version_added: ~
+  type: string
+  example: ~
+  default: "NamedHivePartitionSensor"
diff --git a/airflow/config_templates/default_airflow.cfg 
b/airflow/config_templates/default_airflow.cfg
index d5c5262..24ba5a1 100644
--- a/airflow/config_templates/default_airflow.cfg
+++ b/airflow/config_templates/default_airflow.cfg
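As a rough sketch of how a consumer might read the new `[smart_sensor]` options above and derive a task's shard_code (the call sites and the helper name are illustrative, not the committed code):

```python
from airflow.configuration import conf

# Read the options introduced by this commit's config.yml change.
use_smart_sensor = conf.getboolean('smart_sensor', 'use_smart_sensor', fallback=False)
shards = conf.getint('smart_sensor', 'shards', fallback=5)
shard_code_upper_limit = conf.getint('smart_sensor', 'shard_code_upper_limit')

def shard_code(hashcode: int) -> int:
    # Per the config description: shard_code = hashcode % shard_code_upper_limit.
    return hashcode % shard_code_upper_limit
```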

[jira] [Commented] (AIRFLOW-3964) Consolidate and de-duplicate sensor tasks

2020-09-08 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-3964?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17192495#comment-17192495
 ] 

ASF GitHub Bot commented on AIRFLOW-3964:
-

kaxil merged pull request #5499:
URL: https://github.com/apache/airflow/pull/5499


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Consolidate and de-duplicate sensor tasks 
> --
>
> Key: AIRFLOW-3964
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3964
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: dependencies, operators, scheduler
>Affects Versions: 1.10.0
>Reporter: Yingbo Wang
>Assignee: Yingbo Wang
>Priority: Critical
>

[GitHub] [airflow] kaxil merged pull request #5499: [AIRFLOW-3964][AIP-17] Build smart sensor

2020-09-08 Thread GitBox


kaxil merged pull request #5499:
URL: https://github.com/apache/airflow/pull/5499


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] ashb commented on pull request #10806: Allow airflow.providers to be installed in multiple python folders

2020-09-08 Thread GitBox


ashb commented on pull request #10806:
URL: https://github.com/apache/airflow/pull/10806#issuecomment-689143748


   > Hey @ash - I cancelled it due to "watching the watchers problem" so it's 
best you rebase it :)
   
   :+1: Done.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] potiuk opened a new pull request #10813: Make airflow testing Google Shell Guide compatible

2020-09-08 Thread GitBox


potiuk opened a new pull request #10813:
URL: https://github.com/apache/airflow/pull/10813


   Part of #10576 
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)**
 for more information.
   In case of fundamental code change, Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals))
 is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party 
License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in 
[UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] potiuk opened a new pull request #10812: Make various scripts Google Shell Guide compatible

2020-09-08 Thread GitBox


potiuk opened a new pull request #10812:
URL: https://github.com/apache/airflow/pull/10812


   Part of #10576
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)**
 for more information.
   In case of fundamental code change, Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals))
 is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party 
License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in 
[UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] jedcunningham commented on a change in pull request #10663: Optional import error tracebacks in web ui

2020-09-08 Thread GitBox


jedcunningham commented on a change in pull request #10663:
URL: https://github.com/apache/airflow/pull/10663#discussion_r485198197



##
File path: airflow/models/dagbag.py
##
@@ -269,7 +270,12 @@ def _load_modules_from_file(self, filepath, safe_mode):
 return [new_module]
 except Exception as e:  # pylint: disable=broad-except
 self.log.exception("Failed to import: %s", filepath)
-self.import_errors[filepath] = str(e)
+if conf.getboolean('webserver', 
'dagbag_import_error_tracebacks'):

Review comment:
   Yeah, looking with fresh eyes, `core` is a much better fit.
   
   Think we should also set attributes like we do for `DAGBAG_IMPORT_TIMEOUT` 
so we only hit the config system once?
   e.g. 
https://github.com/apache/airflow/blob/master/airflow/models/dagbag.py#L80
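A sketch of the attribute-caching pattern being suggested, assuming the option moves under `core` as discussed above (mirroring the linked DAGBAG_IMPORT_TIMEOUT handling; this is illustrative, not the merged code):

```python
import traceback

from airflow.configuration import conf

class DagBag:
    # Read once at class-definition time instead of hitting the
    # config system on every failed import.
    DAGBAG_IMPORT_ERROR_TRACEBACKS = conf.getboolean(
        'core', 'dagbag_import_error_tracebacks', fallback=True)

    def _record_import_error(self, filepath, exc):
        # Hypothetical helper: store the full traceback or just the message.
        if self.DAGBAG_IMPORT_ERROR_TRACEBACKS:
            self.import_errors[filepath] = traceback.format_exc()
        else:
            self.import_errors[filepath] = str(exc)
```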





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab edited a comment on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab edited a comment on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689133928


   So, I've set retry_delay to 10s. On scheduler restart the task is stuck in 
"running" state for ~4 minutes (while being "completed" on the kubernetes side 
before the scheduler restart), then it switches to up_for_retry and finally, 10s 
later, everything is fine.
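   For context, a minimal, hypothetical sketch of the kind of task configuration being exercised in this test (the DAG and task names are made up; retries/retry_delay are standard BaseOperator arguments and the contrib import path is the 1.10 one):

```python
from datetime import datetime, timedelta

from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

with DAG('kpo_retry_test', start_date=datetime(2020, 9, 1), schedule_interval=None) as dag:
    task = KubernetesPodOperator(
        task_id='example_pod_task',
        name='example-pod',
        namespace='default',
        image='alpine:3.12',
        cmds=['sleep', '60'],
        is_delete_operator_pod=True,        # the scenario described above
        retries=1,
        retry_delay=timedelta(seconds=10),  # the 10s retry delay being tested
    )
```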



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab edited a comment on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab edited a comment on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689124323


   @dimberman You were right! After ~10 minutes it got picked out of the 
"up_for_retry" state.
   
   I guess I was a bit confused by the logs showing that the scheduler is 
running and not taking up "up_for_retry" tasks.
   
   
[partial-scheduler.log](https://github.com/apache/airflow/files/5190960/scheduler.log)
   
   EDIT: must have been 5 minutes actually.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689133928


   So, I've set retry_delay to 10s. On scheduler restart the task is stuck in 
"running" state for ~4 minutes (while being "completed" on the kubernetes side 
before the scheduler restart), then it switches to up_for_retry and finally, 10s 
later, everything is fine.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] potiuk opened a new pull request #10811: Make scripts/ci/tools Google Shell Guide Compatible

2020-09-08 Thread GitBox


potiuk opened a new pull request #10811:
URL: https://github.com/apache/airflow/pull/10811


   Part of #10576
   
   
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)**
 for more information.
   In case of fundamental code change, Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals))
 is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party 
License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in 
[UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] potiuk opened a new pull request #10810: Make Clients Google Shell guide compatible

2020-09-08 Thread GitBox


potiuk opened a new pull request #10810:
URL: https://github.com/apache/airflow/pull/10810


   Part of #10576
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)**
 for more information.
   In case of fundamental code change, Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals))
 is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party 
License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in 
[UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689127606


   And the default `retry_delay` seems to be 300s so everything seems to be ok. 
Let's just try with a shorter retry delay.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab edited a comment on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab edited a comment on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689124323


   @dimberman You were right! After ~10 minutes it got picked out of the 
"up_for_retry" state.
   
   I guess I was a bit confused by the logs showing that the scheduler is 
running and not taking up "up_for_retry" tasks.
   
   
[partial-scheduler.log](https://github.com/apache/airflow/files/5190960/scheduler.log)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab edited a comment on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab edited a comment on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689124323


   @dimberman You were right! After ~10 minutes it got picked out of the 
"up_for_retry" state.
   
   I guess I was a bit confused by the logs showing that the scheduler is 
running and taking up "up_for_retry" tasks.
   
   
[partial-scheduler.log](https://github.com/apache/airflow/files/5190960/scheduler.log)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689124323


   @dimberman You were right! After ~10 minutes it got picked out of the 
"up_for_retry" state.
   
   I guess I was a bit confused by the logs showing that the scheduler is 
running and taking up "up_for_retry" tasks.
   
   
[partial-scheduler.log](https://github.com/apache/airflow/files/5190960/scheduler.log)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] potiuk commented on pull request #10708: Make breeze-complete Google Shell Guide compatible

2020-09-08 Thread GitBox


potiuk commented on pull request #10708:
URL: https://github.com/apache/airflow/pull/10708#issuecomment-689124114


   Also only quarantined test failed.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] potiuk commented on pull request #10734: Make dockerfiles Google Shell Guide Compliant

2020-09-08 Thread GitBox


potiuk commented on pull request #10734:
URL: https://github.com/apache/airflow/pull/10734#issuecomment-689123857


   Just quarantined tests failed :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab edited a comment on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab edited a comment on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689114819


   > @FloChehab Ok that's a good sign (thank you btw). One more question, have 
you tried leaving the task in `up_for_retry` and seeing if the scheduler 
eventually picks it up?
   
   Ok, I'll try that last one (in 1.10.12 + LocalExecutor + 
is_delete_operator_pod=True -- otherwise I won't get the task stuck in 
up_for_retry), take a swim and come back.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689114819


   > @FloChehab Ok that's a good sign (thank you btw). One more question, have 
you tried leaving the task in `up_for_retry` and seeing if the scheduler 
eventually picks it up?
   
   Ok, I'll try that last one (in 1.10.12 + is_delete_operator_pod=True -- 
otherwise I won't get the task stuck in up_for_retry), take a swim and come back.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] dimberman commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


dimberman commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689113055


   @FloChehab Ok that's a good sign (thank you btw). One more question, have 
you tried leaving the task in `up_for_retry` and seeing if the scheduler 
eventually picks it up?
   
   @ashb @kaxil this seems like it might just be the scheduler retry_timeout 
yeah? Like the clock to retry a failed task starts when the scheduler restarts 
and just takes a few minutes? 



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689112499


   And the scheduler logs on restart:
   
[scheduler.log](https://github.com/apache/airflow/files/5190839/scheduler.log)
   
   @dimberman I have to stop my investigations for today, but I'll be more than 
happy to help tomorrow.
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689111872


   So this time with the image from v1-10-test + helm + KEDA:
   * The task is stuck in "running" on scheduler restart (no tasks are queued on 
redis)
   * If run + ignore all deps => the task gets queued on redis again, and is 
quickly set to success as expected.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[airflow] branch master updated (7fd65d7 -> ff41361)

2020-09-08 Thread kamilbregula
This is an automated email from the ASF dual-hosted git repository.

kamilbregula pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/airflow.git.


from 7fd65d7  Don't include kubernetes_tests/ and backport_packages/ in our 
wheel (#10805)
 add ff41361  Add task logging handler to airflow info command (#10771)

No new revisions were added by this update.

Summary of changes:
 airflow/cli/commands/info_command.py  | 31 
 docs/logging-monitoring/logging-tasks.rst | 48 ++-
 tests/cli/commands/test_info_command.py   | 20 +
 3 files changed, 81 insertions(+), 18 deletions(-)



[GitHub] [airflow] mik-laj merged pull request #10771: Add task logging handler to airflow info command

2020-09-08 Thread GitBox


mik-laj merged pull request #10771:
URL: https://github.com/apache/airflow/pull/10771


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689106510


   Just tested with 1.10.12 (while the image is building) and 
is_delete_operator_pod=false. This time the task seemed stuck in "running" on 
the first scheduler restart, and I got this when I tried the suggested action. I 
guess I am going to test with a Celery setup: 
   ```
Only works with the Celery or Kubernetes executors, sorry 
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689101838


   (just need a bit more time to build the production image for 1.10-test)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689100097


   Ok let's see :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] dimberman commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


dimberman commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689099567


   @FloChehab what happens if you are running this with the helm chart, you get 
to the "up_for_retry" state, and then you manually rerun the task with "ignore 
all deps"
   
   (screenshot: https://user-images.githubusercontent.com/2644098/92521210-af50fa00-f1d1-11ea-9e11-88bad8e5d64f.png)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689097501


   So, with `is_delete_operator_pod=False` and doing the same process 
(including manually killing the zombie process), I do have the bug I was 
describing: it took me 4 scheduler restarts to have the task go from 
"up_for_retry" to "running", then quickly "success".
   
   
[scheduler.log](https://github.com/apache/airflow/files/5190669/scheduler.log)
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[airflow] branch master updated (2811851 -> 7fd65d7)

2020-09-08 Thread potiuk
This is an automated email from the ASF dual-hosted git repository.

potiuk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/airflow.git.


from 2811851  Move Impersonation test back to quarantine (#10809)
 add 7fd65d7  Don't include kubernetes_tests/ and backport_packages/ in our 
wheel (#10805)

No new revisions were added by this update.

Summary of changes:
 setup.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)



[GitHub] [airflow] potiuk commented on pull request #10806: Allow airflow.providers to be installed in multiple python folders

2020-09-08 Thread GitBox


potiuk commented on pull request #10806:
URL: https://github.com/apache/airflow/pull/10806#issuecomment-689097218


   Hey @ash - I cancelled it due to "watching the watchers problem" so it's 
best you rebase it :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] potiuk merged pull request #10805: Don't include kubernetes_tests/ and backport_packages/ in our wheel

2020-09-08 Thread GitBox


potiuk merged pull request #10805:
URL: https://github.com/apache/airflow/pull/10805


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] potiuk commented on pull request #10784: Requirements might get upgraded without setup.py change

2020-09-08 Thread GitBox


potiuk commented on pull request #10784:
URL: https://github.com/apache/airflow/pull/10784#issuecomment-689095091


   Yeah. I noticed some unexpected behaviour with the changes from yesterday and 
had to fix that first :).
   
   Pushed the changes now; I'm also testing it on my own fork (it can only be 
fully tested in master).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[airflow] branch master updated: Move Impersonation test back to quarantine (#10809)

2020-09-08 Thread potiuk
This is an automated email from the ASF dual-hosted git repository.

potiuk pushed a commit to branch master
in repository https://gitbox.apache.org/repos/asf/airflow.git


The following commit(s) were added to refs/heads/master by this push:
 new 2811851  Move Impersonation test back to quarantine (#10809)
2811851 is described below

commit 2811851f80d6f5852d2401f7c57d2e4520b4f2ab
Author: Jarek Potiuk 
AuthorDate: Tue Sep 8 21:33:44 2020 +0200

Move Impersonation test back to quarantine (#10809)

Seems that TestImpersonation is not stable even in isolation
Moving it back to quarantine for now.
---
 tests/test_impersonation.py | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/tests/test_impersonation.py b/tests/test_impersonation.py
index 30df63c..56c5cdf 100644
--- a/tests/test_impersonation.py
+++ b/tests/test_impersonation.py
@@ -109,7 +109,7 @@ def create_user():
 )
 
 
-@pytest.mark.heisentests
+@pytest.mark.quarantined
 class TestImpersonation(unittest.TestCase):
 
 def setUp(self):



[GitHub] [airflow] potiuk merged pull request #10809: Move Impersonation test back to quarantine

2020-09-08 Thread GitBox


potiuk merged pull request #10809:
URL: https://github.com/apache/airflow/pull/10809


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] dimberman commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


dimberman commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689087705


   Yeah agreed. For now if you set is_delete_operator_pod to false it fixes it.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689087054


   Hmm, I am not sure I would do that. I think the life of the worker / "object" that is starting / monitoring / etc. the pod shouldn't impact the pod itself (we have use cases with very long jobs started from airflow on kubernetes, and I don't think it would play nicely with this).



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] feluelle commented on a change in pull request #10784: Requirements might get upgraded without setup.py change

2020-09-08 Thread GitBox


feluelle commented on a change in pull request #10784:
URL: https://github.com/apache/airflow/pull/10784#discussion_r485144932



##
File path: .github/workflows/ci.yml
##
@@ -35,7 +35,7 @@ env:
   SKIP_CHECK_REMOTE_IMAGE: "true"
   DB_RESET: "true"
   VERBOSE: "true"
-  UPGRADE_TO_LATEST_CONSTRAINTS: ${{ github.event_name == 'push' || github.event_name == 'scheduled' }}
+  UPGRADE_TO_LATEST_CONSTRAINTS: ${{ githubgithub.event_name == 'push' || github.event_name == 'scheduled' }}

Review comment:
   ```suggestion
  UPGRADE_TO_LATEST_CONSTRAINTS: ${{ github.event_name == 'push' || github.event_name == 'scheduled' }}
   ```

##
File path: .github/workflows/build-images-workflow-run.yml
##
@@ -141,6 +142,15 @@ jobs:
   else
   echo "::set-output name=docker-cache::pulled"
   fi
+  - name: "Set upgrade to latest constaints"

Review comment:
   ```suggestion
 - name: "Set upgrade to latest constraints"
   ```





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] dimberman commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


dimberman commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689085620


   So if it receives an error from a SIGTERM, it deletes the pod because of 
`is_delete_operator_pod`.
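   
   A rough sketch of why a SIGTERM surfaces as an exception inside the operator: Airflow's task runner installs a handler along these lines (simplified; not the exact source):
   
   ```python
   import signal
   
   from airflow.exceptions import AirflowException
   
   
   def signal_handler(signum, frame):
       # the real handler also calls the task's on_kill() before raising
       raise AirflowException("Task received SIGTERM signal")
   
   
   signal.signal(signal.SIGTERM, signal_handler)
   ```
   
   The raised `AirflowException` is then what trips the `except`/`finally` blocks quoted in the next message.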



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] dimberman commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


dimberman commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689085429


   Oh wait, it's not the on_kill.
   
   It's these lines:
   
   ```python
   try:
       launcher.start_pod(
           pod,
           startup_timeout=self.startup_timeout_seconds)
       final_state, result = launcher.monitor_pod(pod=pod, get_logs=self.get_logs)
   except AirflowException as ex:
       if self.log_events_on_failure:
           for event in launcher.read_pod_events(pod).items:
               self.log.error("Pod Event: %s - %s", event.reason, event.message)
       raise AirflowException('Pod Launching failed: {error}'.format(error=ex))
   finally:
       # the finally block runs even when monitor_pod is interrupted,
       # so the pod is deleted whenever is_delete_operator_pod is set
       if self.is_delete_operator_pod:
           launcher.delete_pod(pod)
   return final_state, pod, result
   ```



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] kaxil commented on pull request #9847: Support for set in XCom serialization (fix #8703)

2020-09-08 Thread GitBox


kaxil commented on pull request #9847:
URL: https://github.com/apache/airflow/pull/9847#issuecomment-689083911


   Any updates?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] dimberman commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


dimberman commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689082795


   So, funny enough, I think because we added an on_kill to the 
KubernetesPodOperator, it now kills the pod if the process dies. Not sure if 
that counts as a solution or not; gonna need to think about this.
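   
   Roughly, an `on_kill` hook of this shape is what propagates the process death to the pod (an illustrative sketch, not the operator's actual source):
   
   ```python
   from kubernetes import client, config
   
   
   class PodCleanup:
       """Illustrative only: delete the tracked pod when the task is killed."""
   
       pod = None  # set when the pod is launched
   
       def on_kill(self):
           config.load_kube_config()  # or load_incluster_config() inside a cluster
           if self.pod is not None:
               client.CoreV1Api().delete_namespaced_pod(
                   name=self.pod.metadata.name,
                   namespace=self.pod.metadata.namespace,
               )
   ```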



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] dimberman commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


dimberman commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689079839


   Hmm... this might have to do with airflow leaving behind a zombie process, 
so it's harder to get a real interruption when running locally. Will test that 
now.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab edited a comment on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab edited a comment on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689078932


   So I have something a bit magical going on:
   * Same setup as you,
   * Same process.
   
   However, I don't even need to restart the webserver or the scheduler:
   
   * I start the task
   * Wait for the pod to be in running state (on a remote cluster)
   * Stop the scheduler / webserver
   * Check the state of the task in the db => running
   * Pod finishes
   * State in db => success
   * Pod is deleted
   
   I don't really get what is going on, nor how a pod in a remote cluster can 
talk to my local airflow db (which shouldn't be what is going on). I must have 
some airflow process in the background monitoring the pod, but I can't seem to 
find it... Too much weird stuff going on.
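   
   One way to hunt for such a leftover monitoring process (a hypothetical debugging step, assuming `psutil` is installed):
   
   ```python
   import psutil
   
   # list lingering airflow task processes that could still be watching the pod
   for proc in psutil.process_iter(attrs=["pid", "cmdline"]):
       cmdline = " ".join(proc.info["cmdline"] or [])
       if "airflow" in cmdline and ("run" in cmdline or "task" in cmdline):
           print(proc.info["pid"], cmdline)
   ```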



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] turbaszek commented on a change in pull request #10802: Rename "Beyond the Horizon" section and refactor content

2020-09-08 Thread GitBox


turbaszek commented on a change in pull request #10802:
URL: https://github.com/apache/airflow/pull/10802#discussion_r485137457



##
File path: README.md
##
@@ -132,20 +132,13 @@ Other ways of retrieving source code are "convenience" 
methods. For example, tag
 
 > Note: Airflow Summit 2020's "Production Docker Image" talk where context, 
 > architecture and customization/extension methods are 
 > [explained](https://youtu.be/wDr3Y7q2XoI).
 
-## Beyond the Horizon
+## Project Focus
 
-Airflow **is not** a data streaming solution. Tasks do not move data from
-one to the other (though tasks can exchange metadata!). Airflow is not
-in the [Spark Streaming](http://spark.apache.org/streaming/)
-or [Storm](https://storm.apache.org/) space, it is more comparable to
-[Oozie](http://oozie.apache.org/) or
-[Azkaban](https://azkaban.github.io/).
+Airflow works best with DAGs that are mostly static and slowly changing. When the DAG structure is similar from one run to the next, it allows for clarity around unit of work and continuity. Other similar projects include [Luigi](https://github.com/spotify/luigi), [Oozie](http://oozie.apache.org/) and [Azkaban](https://azkaban.github.io/).
 
-Workflows are expected to be mostly static or slowly changing. You can think
-of the structure of the tasks in your workflow as slightly more dynamic
-than a database structure would be. Airflow workflows are expected to look
-similar from a run to the next, this allows for clarity around
-unit of work and continuity.
+Airflow tasks are ideally idempotent, and do not pass large quantities of data from one task to the next (though tasks can pass metadata using Airflow's [Xcom feature](https://airflow.apache.org/docs/stable/concepts.html#xcoms)).

Review comment:
   You are right, I just think it would be wise to emphasize that it's not 
petabytes of data.
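   
   To make the metadata point concrete, a hedged sketch of passing a small value (not bulk data) between tasks via XCom (task names are illustrative):
   
   ```python
   def extract(**context):
       # push a small piece of metadata, not the dataset itself
       context["ti"].xcom_push(key="row_count", value=42)
   
   
   def report(**context):
       row_count = context["ti"].xcom_pull(task_ids="extract", key="row_count")
       print(f"Processed {row_count} rows")
   ```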





This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] FloChehab commented on pull request #10230: Fix KubernetesPodOperator reattachment

2020-09-08 Thread GitBox


FloChehab commented on pull request #10230:
URL: https://github.com/apache/airflow/pull/10230#issuecomment-689078932


   So I have something a bit magical going on:
   * Same setup as you,
   * Same process.
   
   However, I don't even need to restart the webserver or the scheduler:
   
   * I start the task
   * Wait for the pod to be in running state (on a remote cluster)
   * Stop the scheduler / webserver
   * Check the state of the task in the db => running
   * Pod finishes
   * State in db => success
   * Pod is deleted
   
   I don't really get what is going on, nor how a pod in a remote cluster can 
talk to my local airflow db (which shouldn't be what is going on). I must have 
some airflow process in the background monitoring the pod, but I can't seem to 
find it... Too much weird stuff going on.



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] potiuk commented on pull request #10809: Move Impersonation test back to quarantine

2020-09-08 Thread GitBox


potiuk commented on pull request #10809:
URL: https://github.com/apache/airflow/pull/10809#issuecomment-689075876


   TestImpersonation is failing far too often, even in isolation. Moving it 
back for now. :(



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] potiuk opened a new pull request #10809: Move Impersonation test back to quarantine

2020-09-08 Thread GitBox


potiuk opened a new pull request #10809:
URL: https://github.com/apache/airflow/pull/10809


   Seems that TestImpersonation is not stable even in isolation
   Moving it back to quarantine for now.
   
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)**
 for more information.
   In case of fundamental code change, Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals))
 is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party 
License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in 
[UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[airflow] branch master updated (3c6fdd8 -> c60fccc)

2020-09-08 Thread potiuk
This is an automated email from the ASF dual-hosted git repository.

potiuk pushed a change to branch master
in repository https://gitbox.apache.org/repos/asf/airflow.git.


from 3c6fdd8  Make ci/backport_packages Google Shell guide compliant 
(#10733)
 add c60fccc  Fix integration tests being accidentally excluded (#10807)

No new revisions were added by this update.

Summary of changes:
 scripts/ci/libraries/_initialization.sh  | 2 --
 scripts/ci/testing/ci_run_airflow_testing.sh | 6 --
 2 files changed, 4 insertions(+), 4 deletions(-)




[GitHub] [airflow] potiuk merged pull request #10807: Fix integration tests being accidentally excluded

2020-09-08 Thread GitBox


potiuk merged pull request #10807:
URL: https://github.com/apache/airflow/pull/10807


   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] marcjimz commented on pull request #10808: Adding write_file, write_string and allow for kwargs in initialization of Wasbhook

2020-09-08 Thread GitBox


marcjimz commented on pull request #10808:
URL: https://github.com/apache/airflow/pull/10808#issuecomment-689059424


   Hi @mik-laj, thoughts on this one?



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] potiuk commented on pull request #10807: Fix integration tests being accidentally excluded

2020-09-08 Thread GitBox


potiuk commented on pull request #10807:
URL: https://github.com/apache/airflow/pull/10807#issuecomment-689056179


   > Quis custodiet ipsos custodes? :)
   
   That's why we'll never run out of work. There will always be some layer 
where a human is needed :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] ashb commented on pull request #10807: Fix integration tests being accidentally excluded

2020-09-08 Thread GitBox


ashb commented on pull request #10807:
URL: https://github.com/apache/airflow/pull/10807#issuecomment-689055297


   Quis custodiet ipsos custodes? :)



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] marcjimz commented on pull request #10808: Adding write_file, write_string and allow for kwargs in initialization of Wasbhook

2020-09-08 Thread GitBox


marcjimz commented on pull request #10808:
URL: https://github.com/apache/airflow/pull/10808#issuecomment-689055075


   I have made the equivalent changes for Airflow 2.0; they are available here: 
https://github.com/marcjimz/airflow/blob/master/airflow/providers/microsoft/azure/hooks/wasb.py



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] boring-cyborg[bot] commented on pull request #10808: Adding write_file, write_string and allow for kwargs in initialization of Wasbhook

2020-09-08 Thread GitBox


boring-cyborg[bot] commented on pull request #10808:
URL: https://github.com/apache/airflow/pull/10808#issuecomment-689054826


   Congratulations on your first Pull Request and welcome to the Apache Airflow 
community! If you have any issues or are unsure about any anything please check 
our Contribution Guide 
(https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst)
   Here are some useful points:
   - Pay attention to the quality of your code (flake8, pylint and type 
annotations). Our [pre-commits]( 
https://github.com/apache/airflow/blob/master/STATIC_CODE_CHECKS.rst#prerequisites-for-pre-commit-hooks)
 will help you with that.
   - In case of a new feature add useful documentation (in docstrings or in 
`docs/` directory). Adding a new operator? Check this short 
[guide](https://github.com/apache/airflow/blob/master/docs/howto/custom-operator.rst)
 Consider adding an example DAG that shows how users should use it.
   - Consider using [Breeze 
environment](https://github.com/apache/airflow/blob/master/BREEZE.rst) for 
testing locally, it’s a heavy docker but it ships with a working Airflow and a 
lot of integrations.
   - Be patient and persistent. It might take some time to get a review or get 
the final approval from Committers.
   - Please follow [ASF Code of 
Conduct](https://www.apache.org/foundation/policies/conduct) for all 
communication including (but not limited to) comments on Pull Requests, Mailing 
list and Slack.
   - Be sure to read the [Airflow Coding style]( 
https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#coding-style-and-best-practices).
   Apache Airflow is a community-driven project and together we are making it 
better .
   In case of doubts contact the developers at:
   Mailing List: d...@airflow.apache.org
   Slack: https://apache-airflow-slack.herokuapp.com/
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] marcjimz opened a new pull request #10808: Adding write_file, write_string and allow for kwargs in initialization of Wasbhook

2020-09-08 Thread GitBox


marcjimz opened a new pull request #10808:
URL: https://github.com/apache/airflow/pull/10808


   This PR makes the following changes:
   
   1. Write a file from path to Azure Blob storage (WASBS)
   2. Write a file from text to Azure Blob storage (WASBS)
   3. Allow for kwargs to be passed into the WasbHook constructor
   
   We need the ability to write back to Blob storage, and this covers a gap in 
this hook.
   
   Regarding the kwargs, we need our Airflow pipelines to be generalizable at 
runtime, which is not possible when authorization requires a pre-existing 
stored connection. This change provides an alternative way to grant access. We 
prefer these credentials to be ephemeral rather than stored long-term, and 
passing them at execution time makes that possible.
   
   It should also be noted that these credentials should not be stored in the 
DAG itself but passed in as a property, such as a SAS Token.
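   
   A hypothetical usage sketch based on this description (method names come from the PR title; the exact signatures are assumptions, not the merged API):
   
   ```python
   from airflow.contrib.hooks.wasb_hook import WasbHook
   
   # extra kwargs (e.g. a SAS token) would be forwarded to the underlying client
   hook = WasbHook(wasb_conn_id="wasb_default", sas_token="")  # placeholder token
   
   # upload a local file / a string to blob storage
   hook.write_file("/tmp/report.csv", container_name="reports", blob_name="report.csv")
   hook.write_string("hello", container_name="reports", blob_name="hello.txt")
   ```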
   
   
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)**
 for more information.
   In case of fundamental code change, Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals))
 is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party 
License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in 
[UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org




[GitHub] [airflow] potiuk opened a new pull request #10807: Fix integration tests being accidentally excluded

2020-09-08 Thread GitBox


potiuk opened a new pull request #10807:
URL: https://github.com/apache/airflow/pull/10807


   The change from #10769 accidentally switched integration tests
   into far-longer-running unit tests (we effectively ran the tests
   twice and did not run the integration tests).
   
   This fixes the problem by removing the readonly status from
   INTEGRATIONS and only setting it after the integrations are
   set.
   
   
   ---
   **^ Add meaningful description above**
   
   Read the **[Pull Request 
Guidelines](https://github.com/apache/airflow/blob/master/CONTRIBUTING.rst#pull-request-guidelines)**
 for more information.
   In case of fundamental code change, Airflow Improvement Proposal 
([AIP](https://cwiki.apache.org/confluence/display/AIRFLOW/Airflow+Improvements+Proposals))
 is needed.
   In case of a new dependency, check compliance with the [ASF 3rd Party 
License Policy](https://www.apache.org/legal/resolved.html#category-x).
   In case of backwards incompatible changes please leave a note in 
[UPDATING.md](https://github.com/apache/airflow/blob/master/UPDATING.md).
   



This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



