[jira] [Created] (AIRFLOW-6586) GCSUploadSessionCompleteSensor breaks in reschedule mode.

2020-01-17 Thread Jacob Ferriero (Jira)
Jacob Ferriero created AIRFLOW-6586:
---

 Summary: GCSUploadSessionCompleteSensor breaks in reschedule mode.
 Key: AIRFLOW-6586
 URL: https://issues.apache.org/jira/browse/AIRFLOW-6586
 Project: Apache Airflow
  Issue Type: Bug
  Components: operators
Affects Versions: 1.10.3
Reporter: Jacob Ferriero


This sensor is stateful and loses its state between reschedules. 

We should: 
 # Warn about this in the docstring.
 # Add a `poke_mode_only` class decorator for sensors that aren't safe in 
reschedule mode.
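
A minimal sketch of what such a decorator could look like, written without Airflow so it runs on its own. `BaseSensor` and `UploadSessionSensor` are hypothetical stand-ins (not Airflow classes), and the property-based approach is one possible design, not necessarily the one that should be adopted:

```python
class BaseSensor:
    """Hypothetical stand-in for a sensor base class, reduced to `mode`."""

    def __init__(self, mode="poke"):
        self.mode = mode


def poke_mode_only(cls):
    """Class decorator that rejects any mode other than 'poke'."""

    def _get_mode(self):
        return "poke"

    def _set_mode(self, value):
        if value != "poke":
            raise ValueError(
                f"{cls.__name__} keeps state between pokes and "
                f"cannot run in {value!r} mode"
            )

    # A property on the decorated class intercepts every assignment to
    # `mode`, including the one made in the base class __init__.
    cls.mode = property(_get_mode, _set_mode)
    return cls


@poke_mode_only
class UploadSessionSensor(BaseSensor):
    """Hypothetical stateful sensor: remembers the previous object count."""

    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.previous_count = 0
```

With this sketch, `UploadSessionSensor(mode="reschedule")` raises a `ValueError` at construction time instead of silently losing state later.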



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Created] (AIRFLOW-5568) Add Hook / Operators for GCP Healthcare API

2019-09-28 Thread Jacob Ferriero (Jira)
Jacob Ferriero created AIRFLOW-5568:
---

 Summary: Add Hook / Operators for GCP Healthcare API
 Key: AIRFLOW-5568
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5568
 Project: Apache Airflow
  Issue Type: New Feature
  Components: hooks, operators
Affects Versions: 1.10.5
Reporter: Jacob Ferriero


It'd be useful to have a hook for the Healthcare API, plus some operators / 
sensors for its long-running operations 
(https://cloud.google.com/healthcare/docs/how-tos/long-running-operations):
 * import / export of various formats
 * de-identification of datasets

 [https://cloud.google.com/healthcare/docs/apis]

Note that this would be a good candidate to illustrate some sort of 
AsyncOperator, as described in AIRFLOW-5567.





[jira] [Created] (AIRFLOW-5567) Improved primitive for building Operators that benefit from reschedule mode

2019-09-27 Thread Jacob Ferriero (Jira)
Jacob Ferriero created AIRFLOW-5567:
---

 Summary: Improved primitive for building Operators that benefit 
from reschedule mode
 Key: AIRFLOW-5567
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5567
 Project: Apache Airflow
  Issue Type: Improvement
  Components: models, operators
Affects Versions: 1.10.5
Reporter: Jacob Ferriero
Assignee: Jacob Ferriero


Airflow operators (derived from BaseOperator) often kick off a long-running 
task and then wait / poll, blocking a worker slot until the task completes. 
This can be problematic in environments with many long-running tasks.

BaseSensorOperator was improved by implementing `reschedule` mode to solve a 
similar issue: long-running sensors blocking a worker while they poll for a 
long time.

This issue is to track how we could provide a primitive that would make it easy 
to develop operators for long running tasks that reschedule a `poll` operation 
rather than blocking in their `execute` method.
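
One way to picture such a primitive is sketched below, without Airflow so it runs on its own. Every name here (`RescheduleRequested`, `LongRunningJobOperator`, `FakeClient`, `run_until_done`, the `state_store` dict) is hypothetical. The key assumption is that the job id is persisted outside the operator instance, because each attempt gets a fresh instance:

```python
class RescheduleRequested(Exception):
    """Hypothetical signal that hands the worker slot back to the scheduler."""


class LongRunningJobOperator:
    """Hypothetical operator: submit once, then poll on each later attempt."""

    def __init__(self, client, state_store):
        self.client = client            # job client with submit() and is_done()
        self.state_store = state_store  # dict-like store that outlives attempts

    def execute(self):
        job_id = self.state_store.get("job_id")
        if job_id is None:
            # First attempt: start the job and persist its id externally.
            self.state_store["job_id"] = self.client.submit()
            raise RescheduleRequested()
        if not self.client.is_done(job_id):
            raise RescheduleRequested()
        return job_id


class FakeClient:
    """Test double: reports completion on the third status check."""

    def __init__(self):
        self.checks = 0

    def submit(self):
        return "job-1"

    def is_done(self, job_id):
        self.checks += 1
        return self.checks >= 3


def run_until_done(client, state_store, max_attempts=10):
    """Simulate a scheduler that builds a fresh operator for every attempt."""
    for attempt in range(1, max_attempts + 1):
        try:
            result = LongRunningJobOperator(client, state_store).execute()
            return result, attempt
        except RescheduleRequested:
            continue
    raise TimeoutError("job did not finish")
```

In this sketch no worker slot is held between attempts; the only thing that survives from one attempt to the next is the contents of `state_store`.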





[jira] [Created] (AIRFLOW-5520) DataflowPythonOperator dependency management requires side effects

2019-09-18 Thread Jacob Ferriero (Jira)
Jacob Ferriero created AIRFLOW-5520:
---

 Summary: DataflowPythonOperator dependency management requires 
side effects
 Key: AIRFLOW-5520
 URL: https://issues.apache.org/jira/browse/AIRFLOW-5520
 Project: Apache Airflow
  Issue Type: Improvement
  Components: gcp
Affects Versions: 1.10.2
Reporter: Jacob Ferriero


When using DataflowPythonOperator it is difficult to manage the Apache Beam 
version (and other Python dependencies) without affecting your entire Airflow 
environment. It seems the Dataflow hook just submits a subprocess and python 

The operator / hook should be improved to isolate Python dependencies for 
running py_file.

Perhaps this could be achieved in a virtual environment (similar to 
PythonVirtualEnvOperator).

For Beam it's customary to specify a --requirements_file or --setup_file 
to manage Python dependencies; we could use one of these to set up the 
venv. 
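
A rough sketch of the venv idea, assuming a POSIX layout (`bin/python` inside the venv) and a requirements file. `build_isolated_run` is a hypothetical helper, not part of the Dataflow hook; it only builds the command lists so the steps are easy to inspect before handing them to `subprocess`:

```python
import os


def build_isolated_run(venv_dir, py_file, requirements_file=None,
                       base_python="python3"):
    """Return the commands needed to run py_file inside a fresh venv."""
    # POSIX layout assumed; on Windows the interpreter lives under Scripts\.
    venv_python = os.path.join(venv_dir, "bin", "python")

    # 1. Create the virtual environment with the stdlib venv module.
    commands = [[base_python, "-m", "venv", venv_dir]]

    # 2. Install the pipeline's dependencies into the venv only.
    if requirements_file:
        commands.append(
            [venv_python, "-m", "pip", "install", "-r", requirements_file]
        )

    # 3. Launch the pipeline with the venv's interpreter.
    commands.append([venv_python, py_file])
    return commands
```

Each returned list could be passed to `subprocess.check_call` in order; the Airflow environment's own packages are never modified.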





[jira] [Created] (AIRFLOW-4983) DataflowPythonOperator should be able to submit pipelines with python3

2019-07-17 Thread Jacob Ferriero (JIRA)
Jacob Ferriero created AIRFLOW-4983:
---

 Summary: DataflowPythonOperator should be able to submit pipelines 
with python3
 Key: AIRFLOW-4983
 URL: https://issues.apache.org/jira/browse/AIRFLOW-4983
 Project: Apache Airflow
  Issue Type: Improvement
  Components: gcp, hooks, operators
Affects Versions: 1.10.2, 1.10.4, 2.0.0, 1.10.5
Reporter: Jacob Ferriero
Assignee: Jacob Ferriero


Currently the DataflowHook hard-codes the python2 interpreter.

Apache Beam is beginning to support the python3 interpreter, and we should 
support submitting those pipelines.

I think we should add a `py_interpreter` arg to the operator and hook that 
defaults to 'python2' (so as not to break the interface).
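
As an illustration of the proposed argument, here is a small sketch. `build_dataflow_command` is a hypothetical helper and does not reproduce the hook's actual code; only the `py_interpreter` name and its 'python2' default come from the proposal above:

```python
def build_dataflow_command(py_file, py_options=None, variables=None,
                           py_interpreter="python2"):
    """Assemble the argv used to launch a Beam pipeline file."""
    # The interpreter is a parameter instead of a hard-coded "python2".
    command = [py_interpreter] + list(py_options or []) + [py_file]
    # Pipeline options are appended as --name=value flags, sorted for
    # a deterministic command line.
    for name, value in sorted((variables or {}).items()):
        command.append(f"--{name}={value}")
    return command
```

Existing callers that do not pass `py_interpreter` would keep getting `python2`, which is what keeps the change backward compatible.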





[jira] [Created] (AIRFLOW-4397) Add GCSUploadSessionCompleteSensor

2019-04-23 Thread Jacob Ferriero (JIRA)
Jacob Ferriero created AIRFLOW-4397:
---

 Summary: Add GCSUploadSessionCompleteSensor
 Key: AIRFLOW-4397
 URL: https://issues.apache.org/jira/browse/AIRFLOW-4397
 Project: Apache Airflow
  Issue Type: New Feature
  Components: contrib
Reporter: Jacob Ferriero
Assignee: Jacob Ferriero


I'd like to contribute a Sensor for Google Cloud Storage that can poke a bucket 
until there has been sufficient time without a new file drop. There are often 
cases where a third-party vendor drops data to a bucket but doesn't send a 
success flag when they are done. This sensor would allow you to poke every n 
minutes to check whether more files have been added since the last poke and, 
if there have been `inactivity_period` minutes without a new file drop, return 
`True`. This could trigger SLA misses if data did not arrive by an expected 
time, and the sensor could have a configurable deadline past which it would 
fail. Optionally the user could specify a minimum number of files for the 
sensor to succeed. 
This would be my first time contributing to an OSS project, so please let me 
know if this is not the appropriate place to start.
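
The poke logic described above could be sketched roughly as follows. `UploadSessionTracker` and its attribute names are hypothetical, and the object count and current time are passed in rather than read from a bucket, so the example runs without GCS. The deadline is left out for brevity:

```python
class UploadSessionTracker:
    """Hypothetical poke logic: succeed after a quiet period with enough files."""

    def __init__(self, inactivity_period, min_objects=1):
        self.inactivity_period = inactivity_period  # same unit as `now`
        self.min_objects = min_objects
        self.previous_count = 0
        self.last_activity = None

    def poke(self, current_count, now):
        """Return True once no new files arrived for inactivity_period."""
        if self.last_activity is None or current_count > self.previous_count:
            # First poke, or new files appeared: restart the quiet-period clock.
            self.previous_count = current_count
            self.last_activity = now
            return False
        quiet_for = now - self.last_activity
        return (quiet_for >= self.inactivity_period
                and current_count >= self.min_objects)
```

For example, with `inactivity_period=10` and `min_objects=2`, pokes of `(1, 0)`, `(3, 5)` and `(3, 10)` return `False`, and a poke of `(3, 15)` returns `True`, where each pair is (file count, minutes elapsed).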


