[jira] [Created] (AIRFLOW-3502) Add config option to control celery pool used

2018-12-11 Thread Gabriel Silk (JIRA)
Gabriel Silk created AIRFLOW-3502:
-

 Summary: Add config option to control celery pool used
 Key: AIRFLOW-3502
 URL: https://issues.apache.org/jira/browse/AIRFLOW-3502
 Project: Apache Airflow
  Issue Type: Improvement
  Components: celery
Reporter: Gabriel Silk


This adds a config option for the "pool" to allow users to specify "prefork" vs 
"solo" etc. This is particularly useful for infrastructures that don't play 
nicely with celery's default prefork multi-processing model.
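
As a rough sketch of what such an option might look like in airflow.cfg (the 
option name and its placement under [celery] are assumptions for illustration, 
not the final implementation):

```
[celery]
# Celery pool implementation for workers.
# Choices mirror celery's own --pool flag: prefork, solo, eventlet, gevent.
pool = solo
```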



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (AIRFLOW-3418) Task stuck in running state, unable to clear

2018-12-06 Thread Gabriel Silk (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-3418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711972#comment-16711972
 ] 

Gabriel Silk commented on AIRFLOW-3418:
---

I'm seeing this issue as well, and I would reiterate the criticality of this 
issue. It's currently breaking our production clusters.

> Task stuck in running state, unable to clear
> 
>
> Key: AIRFLOW-3418
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3418
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: worker
>Affects Versions: 1.10.1
>Reporter: James Meickle
>Priority: Critical
>
> One of our tasks (a custom operator that sleep-waits until NYSE market close) 
> got stuck in a "running" state in the metadata db without making any 
> progress. This is what it looked like in the logs:
> {code:java}
> [2018-11-29 00:01:14,064] {{base_task_runner.py:101}} INFO - Job 38275: 
> Subtask after_close [2018-11-29 00:01:14,063] {{cli.py:484}} INFO - Running 
>  [running]> on host airflow-core-i-0a53cac37067d957d.dlg.fnd.dynoquant.com
> [2018-11-29 06:03:57,643] {{models.py:1355}} INFO - Dependencies not met for 
>  [running]>, dependency 'Task Instance State' FAILED: Task is in the 'running' 
> state which is not a valid state for execution. The task must be cleared in 
> order to be run.
> [2018-11-29 06:03:57,644] {{models.py:1355}} INFO - Dependencies not met for 
>  [running]>, dependency 'Task Instance Not Already Running' FAILED: Task is 
> already running, it started on 2018-11-29 00:01:10.876344+00:00.
> [2018-11-29 06:03:57,646] {{logging_mixin.py:95}} INFO - [2018-11-29 
> 06:03:57,646] {{jobs.py:2614}} INFO - Task is not able to be run
> {code}
> Seeing this state, we attempted to "clear" it in the web UI. This yielded a 
> complex backtrace:
> {code:java}
> Traceback (most recent call last):
>   File 
> "/home/airflow/virtualenvs/airflow/lib/python3.5/site-packages/flask/app.py", 
> line 1982, in wsgi_app
> response = self.full_dispatch_request()
>   File 
> "/home/airflow/virtualenvs/airflow/lib/python3.5/site-packages/flask/app.py", 
> line 1614, in full_dispatch_request
> rv = self.handle_user_exception(e)
>   File 
> "/home/airflow/virtualenvs/airflow/lib/python3.5/site-packages/flask/app.py", 
> line 1517, in handle_user_exception
> reraise(exc_type, exc_value, tb)
>   File 
> "/home/airflow/virtualenvs/airflow/lib/python3.5/site-packages/flask/_compat.py",
>  line 33, in reraise
> raise value
>   File 
> "/home/airflow/virtualenvs/airflow/lib/python3.5/site-packages/flask/app.py", 
> line 1612, in full_dispatch_request
> rv = self.dispatch_request()
>   File 
> "/home/airflow/virtualenvs/airflow/lib/python3.5/site-packages/flask/app.py", 
> line 1598, in dispatch_request
> return self.view_functions[rule.endpoint](**req.view_args)
>   File 
> "/home/airflow/virtualenvs/airflow/lib/python3.5/site-packages/flask_appbuilder/security/decorators.py",
>  line 26, in wraps
> return f(self, *args, **kwargs)
>   File 
> "/home/airflow/virtualenvs/airflow/lib/python3.5/site-packages/airflow/www_rbac/decorators.py",
>  line 55, in wrapper
> return f(*args, **kwargs)
>   File 
> "/home/airflow/virtualenvs/airflow/lib/python3.5/site-packages/airflow/www_rbac/views.py",
>  line 837, in clear
> include_upstream=upstream)
>   File 
> "/home/airflow/virtualenvs/airflow/lib/python3.5/site-packages/airflow/models.py",
>  line 4011, in sub_dag
> dag = copy.deepcopy(self)
>   File "/home/airflow/virtualenvs/airflow/lib/python3.5/copy.py", line 166, 
> in deepcopy
> y = copier(memo)
>   File 
> "/home/airflow/virtualenvs/airflow/lib/python3.5/site-packages/airflow/models.py",
>  line 3996, in __deepcopy__
> setattr(result, k, copy.deepcopy(v, memo))
>   File "/home/airflow/virtualenvs/airflow/lib/python3.5/copy.py", line 155, 
> in deepcopy
> y = copier(x, memo)
>   File "/home/airflow/virtualenvs/airflow/lib/python3.5/copy.py", line 243, 
> in _deepcopy_dict
> y[deepcopy(key, memo)] = deepcopy(value, memo)
>   File "/home/airflow/virtualenvs/airflow/lib/python3.5/copy.py", line 166, 
> in deepcopy
> y = copier(memo)
>   File 
> "/home/airflow/virtualenvs/airflow/lib/python3.5/site-packages/airflow/models.py",
>  line 2740, in __deepcopy__
> setattr(result, k, copy.deepcopy(v, memo))
>   File "/home/airflow/virtualenvs/airflow/lib/python3.5/copy.py", line 182, 
> in deepcopy
> y = _reconstruct(x, rv, 1, memo)
>   File "/home/airflow/virtualenvs/airflow/lib/python3.5/copy.py", line 297, 
> in _reconstruct
> state = deepcopy(state, memo)
>   File "/home/airflow/virtualenvs/airflow/lib/python3.5/copy.py", line 155, 
> in deepcopy
> y = copier(x, memo)
>   File "/home/airflow/virtualenvs/airflow/lib/python3.5/copy.py", line 243, 
> in _deepcopy_dict
> 

[jira] [Commented] (AIRFLOW-2747) Explicit re-schedule of sensors

2018-10-24 Thread Gabriel Silk (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-2747?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16662936#comment-16662936
 ] 

Gabriel Silk commented on AIRFLOW-2747:
---

This is awesome!

One thing – it looks like the "started_at" logic has only been implemented for 
BaseSensor. However, I would suggest we do the same for tasks as well, so that 
tasks can throw a reschedule exception during the course of their execution as 
well as sensors. This would be useful for tasks that figure out their own data 
dependencies at runtime – for example, Hive queries that depend on recent 
snapshot data. If we simply threw a reschedule exception, then we could attempt 
to execute these tasks periodically, rather than explicitly modeling the data 
dependency via a sensor.

Does that make sense?

If people are interested, I'd be happy to open a PR.
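
As a rough illustration of the idea, here is a minimal, self-contained sketch: 
the exception class below is a stub standing in for an 
AirflowRescheduleException-style signal, and the task body and its parameters 
are hypothetical, not Airflow's actual API.

```python
from datetime import datetime, timedelta

class RescheduleException(Exception):
    """Stub for an AirflowRescheduleException-style signal (illustrative)."""
    def __init__(self, reschedule_date):
        super().__init__("reschedule at %s" % reschedule_date)
        self.reschedule_date = reschedule_date

def run_task(snapshot_ready, now, retry_delay=timedelta(minutes=30)):
    # Hypothetical task body: if the upstream snapshot isn't ready yet,
    # signal the scheduler to try again later instead of blocking a worker
    # or burning a retry.
    if not snapshot_ready:
        raise RescheduleException(reschedule_date=now + retry_delay)
    return "ran"
```

The task discovers its data dependency at runtime and, when unmet, surfaces an 
explicit "try me again at time T" rather than sleeping or failing.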

> Explicit re-schedule of sensors
> ---
>
> Key: AIRFLOW-2747
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2747
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: core, operators
>Affects Versions: 1.9.0, 1.10.0
>Reporter: Stefan Seelmann
>Assignee: Stefan Seelmann
>Priority: Major
> Fix For: 2.0.0
>
> Attachments: Screenshot_2018-07-12_14-10-24.png, 
> Screenshot_2018-09-16_20-09-28.png, Screenshot_2018-09-16_20-19-23.png, 
> google_apis-23_r01.zip
>
>
> By default sensors block a worker and just sleep between pokes. This is very 
> inefficient, especially when there are many long-running sensors.
> There is a hacky workaround by setting a small timeout value and a high retry 
> number. But that has drawbacks:
>  * Errors raised by sensors are hidden and the sensor retries too often
>  * The sensor is retried in a fixed time interval (with optional exponential 
> backoff)
>  * There are many attempts and many log files are generated
>  I'd like to propose an explicit reschedule mechanism:
>  * A new "reschedule" flag for sensors, if set to True it will raise an 
> AirflowRescheduleException that causes a reschedule.
>  * AirflowRescheduleException contains the (earliest) re-schedule date.
>  * Reschedule requests are recorded in new `task_reschedule` table and 
> visualized in the Gantt view.
>  * A new TI dependency that checks if a sensor task is ready to be 
> re-scheduled.
> Advantages:
>  * This change is backward compatible. Existing sensors behave like before. 
> But it's possible to set the "reschedule" flag.
>  * The poke_interval, timeout, and soft_fail parameters are still respected 
> and used to calculate the next schedule time.
>  * Custom sensor implementations can even define the next sensible schedule 
> date by raising AirflowRescheduleException themselves.
>  * Existing TimeSensor and TimeDeltaSensor can also be changed to be 
> rescheduled when the time is reached.
>  * This mechanism can also be used by non-sensor operators (but then the new 
> ReadyToRescheduleDep has to be added to deps or BaseOperator).
> Design decisions and caveats:
>  * When handling AirflowRescheduleException the `try_number` is decremented. 
> That means that subsequent runs use the same try number and write to the same 
> log file.
>  * Sensor TI dependency check now depends on `task_reschedule` table. However 
> only the BaseSensorOperator includes the new ReadyToRescheduleDep for now.
> Open questions and TODOs:
>  * Should a dedicated state `UP_FOR_RESCHEDULE` be used instead of setting 
> the state back to `NONE`? This would require more changes in scheduler code 
> and especially in the UI, but the state of a task would be more explicit and 
> more transparent to the user.
>  * Add example/test for a non-sensor operator
>  * Document the new feature





[jira] [Updated] (AIRFLOW-3171) Flexible task log organization

2018-10-08 Thread Gabriel Silk (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriel Silk updated AIRFLOW-3171:
--
Description: 
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the task logs in a more flexible manner, rather than 
defaulting to a flat structure.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. For example, if the task 
logs in s3 were organized like /tasks/[owner]/... then we could provide access 
to a subset of the logs for each team, by creating s3 access rules prefixed 
with the appropriate path.

 

One possible implementation would be to have a configurable, templatized path 
structure for logs. We would also need to store the log location for each task 
instance, so we could easily change the log folder structure without breaking 
old log paths.

  was:
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the task logs in a more flexible manner, rather than 
defaulting to a flat structure.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. For example, if the log 
folders in s3 were organized like /tasks/[owner]/... then we could provide 
access to a subset of the logs for each team, by creating s3 access rules 
prefixed with the appropriate path.

 

One possible implementation would be to have a configurable, templatized path 
structure for logs. We would also need to store the log location for each task 
instance, so we could easily change the log folder structure without breaking 
old log paths.


> Flexible task log organization
> --
>
> Key: AIRFLOW-3171
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3171
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: logging
>Reporter: Gabriel Silk
>Priority: Minor
>
> Regardless of the backend (eg file system, s3, ...), it would be useful to be 
> able to organize the task logs in a more flexible manner, rather than 
> defaulting to a flat structure.
>  
> One use case of this would be to provide a better multi-tenancy experience 
> when deploying a single airflow cluster to several teams. For example, if the 
> task logs in s3 were organized like /tasks/[owner]/... then we could provide 
> access to a subset of the logs for each team, by creating s3 access rules 
> prefixed with the appropriate path.
>  
> One possible implementation would be to have a configurable, templatized path 
> structure for logs. We would also need to store the log location for each 
> task instance, so we could easily change the log folder structure without 
> breaking old log paths.
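
A minimal sketch of the templatized-path idea described above (the template 
fields and function names here are illustrative assumptions, not an actual 
Airflow setting):

```python
# Hypothetical log path template; the fields are illustrative.
LOG_PATH_TEMPLATE = "tasks/{owner}/{dag_id}/{task_id}/{execution_date}/{try_number}.log"

def task_log_path(owner, dag_id, task_id, execution_date, try_number):
    # Render one task instance's log location. The rendered path would be
    # stored on the task instance record, so the template could change later
    # without breaking links to previously written logs.
    return LOG_PATH_TEMPLATE.format(
        owner=owner, dag_id=dag_id, task_id=task_id,
        execution_date=execution_date, try_number=try_number)
```

With paths of this shape, a prefix-based s3 access rule on tasks/[owner]/ would 
scope each team to its own logs.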





[jira] [Updated] (AIRFLOW-3171) Flexible task log organization

2018-10-08 Thread Gabriel Silk (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriel Silk updated AIRFLOW-3171:
--
Description: 
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the task logs in a more flexible manner, rather than 
defaulting to a flat structure.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. For example, if the log 
folders in s3 were organized like /tasks/[owner]/... then we could provide 
access to a subset of the logs for each team, by creating s3 access rules 
prefixed with the appropriate path.

 

One possible implementation would be to have a configurable, templatized path 
structure for logs. We would also need to store the log location for each task 
instance, so we could easily change the log folder structure without breaking 
old log paths.

  was:
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the logs in a more flexible manner, rather than defaulting to 
a flat structure.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. For example, if the log 
folders in s3 were organized like /tasks/[owner]/... then we could provide 
access to a subset of the logs for each team, by creating s3 access rules 
prefixed with the appropriate path.

 

One possible implementation would be to have a configurable, templatized path 
structure for logs. We would also need to store the log location for each task 
instance, so we could easily change the log folder structure without breaking 
old log paths.


> Flexible task log organization
> --
>
> Key: AIRFLOW-3171
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3171
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: logging
>Reporter: Gabriel Silk
>Priority: Minor
>
> Regardless of the backend (eg file system, s3, ...), it would be useful to be 
> able to organize the task logs in a more flexible manner, rather than 
> defaulting to a flat structure.
>  
> One use case of this would be to provide a better multi-tenancy experience 
> when deploying a single airflow cluster to several teams. For example, if the 
> log folders in s3 were organized like /tasks/[owner]/... then we could 
> provide access to a subset of the logs for each team, by creating s3 access 
> rules prefixed with the appropriate path.
>  
> One possible implementation would be to have a configurable, templatized path 
> structure for logs. We would also need to store the log location for each 
> task instance, so we could easily change the log folder structure without 
> breaking old log paths.





[jira] [Updated] (AIRFLOW-3171) Flexible log organization

2018-10-08 Thread Gabriel Silk (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriel Silk updated AIRFLOW-3171:
--
Description: 
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the logs in a more flexible manner, rather than defaulting to 
a flat structure.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. For example, if the log 
folders in s3 were organized like /tasks/[owner]/... then we could provide 
access to a subset of the logs for each team, by creating s3 access rules 
prefixed with the appropriate path.

 

One possible implementation would be to have a configurable, templatized path 
structure for logs. We would also need to store the log location for each task 
instance, so we could easily change the log folder structure without breaking 
old log paths.

  was:
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the logs in a more flexible manner.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. For example, if the log 
folders in s3 were organized like /tasks/[owner]/... then we could provide 
access to a subset of the logs for each team, by creating s3 access rules 
prefixed with the appropriate path.

 

One possible implementation would be to have a configurable, templatized path 
structure for logs. We would also need to store the log location for each task 
instance, so we could easily change the log folder structure without breaking 
old log paths.


> Flexible log organization
> -
>
> Key: AIRFLOW-3171
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3171
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: logging
>Reporter: Gabriel Silk
>Priority: Minor
>
> Regardless of the backend (eg file system, s3, ...), it would be useful to be 
> able to organize the logs in a more flexible manner, rather than defaulting 
> to a flat structure.
>  
> One use case of this would be to provide a better multi-tenancy experience 
> when deploying a single airflow cluster to several teams. For example, if the 
> log folders in s3 were organized like /tasks/[owner]/... then we could 
> provide access to a subset of the logs for each team, by creating s3 access 
> rules prefixed with the appropriate path.
>  
> One possible implementation would be to have a configurable, templatized path 
> structure for logs. We would also need to store the log location for each 
> task instance, so we could easily change the log folder structure without 
> breaking old log paths.





[jira] [Updated] (AIRFLOW-3171) Flexible task log organization

2018-10-08 Thread Gabriel Silk (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriel Silk updated AIRFLOW-3171:
--
Summary: Flexible task log organization  (was: Flexible log organization)

> Flexible task log organization
> --
>
> Key: AIRFLOW-3171
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3171
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: logging
>Reporter: Gabriel Silk
>Priority: Minor
>
> Regardless of the backend (eg file system, s3, ...), it would be useful to be 
> able to organize the logs in a more flexible manner, rather than defaulting 
> to a flat structure.
>  
> One use case of this would be to provide a better multi-tenancy experience 
> when deploying a single airflow cluster to several teams. For example, if the 
> log folders in s3 were organized like /tasks/[owner]/... then we could 
> provide access to a subset of the logs for each team, by creating s3 access 
> rules prefixed with the appropriate path.
>  
> One possible implementation would be to have a configurable, templatized path 
> structure for logs. We would also need to store the log location for each 
> task instance, so we could easily change the log folder structure without 
> breaking old log paths.





[jira] [Updated] (AIRFLOW-3171) Flexible log organization

2018-10-08 Thread Gabriel Silk (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriel Silk updated AIRFLOW-3171:
--
Description: 
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the logs in a more flexible manner.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. For example, if the log 
folders in s3 were organized like /tasks/[owner]/... then we could provide 
access to a subset of the logs for each team, by creating s3 access rules 
prefixed with the appropriate path.

 

One possible implementation would be to have a configurable, templatized path 
structure for logs. We would also need to store the log location for each task 
instance, so we could easily change the log folder structure without breaking 
old log paths.

  was:
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the logs by owner.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. If the log folders in s3 
were organized like /tasks/[owner]/... then we could provide access to a subset 
of the logs for each team, by creating s3 access rules prefixed with the 
appropriate path.

 

I also think that this would be a good change regardless of the multi-tenancy 
aspect, just in terms of organizing logs (vs the current flat namespace).

 

One possible implementation would be to have a configurable, templatized path 
structure for logs. We would also need to store the log location for each task 
instance, so we could easily change the log folder structure without breaking 
old log paths.


> Flexible log organization
> -
>
> Key: AIRFLOW-3171
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3171
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: logging
>Reporter: Gabriel Silk
>Priority: Minor
>
> Regardless of the backend (eg file system, s3, ...), it would be useful to be 
> able to organize the logs in a more flexible manner.
>  
> One use case of this would be to provide a better multi-tenancy experience 
> when deploying a single airflow cluster to several teams. For example, if the 
> log folders in s3 were organized like /tasks/[owner]/... then we could 
> provide access to a subset of the logs for each team, by creating s3 access 
> rules prefixed with the appropriate path.
>  
> One possible implementation would be to have a configurable, templatized path 
> structure for logs. We would also need to store the log location for each 
> task instance, so we could easily change the log folder structure without 
> breaking old log paths.





[jira] [Updated] (AIRFLOW-3171) Flexible log organization

2018-10-08 Thread Gabriel Silk (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriel Silk updated AIRFLOW-3171:
--
Summary: Flexible log organization  (was: Organize logs by owner)

> Flexible log organization
> -
>
> Key: AIRFLOW-3171
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3171
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: logging
>Reporter: Gabriel Silk
>Priority: Minor
>
> Regardless of the backend (eg file system, s3, ...), it would be useful to be 
> able to organize the logs by owner.
>  
> One use case of this would be to provide a better multi-tenancy experience 
> when deploying a single airflow cluster to several teams. If the log folders 
> in s3 were organized like /tasks/[owner]/... then we could provide access to 
> a subset of the logs for each team, by creating s3 access rules prefixed with 
> the appropriate path.
>  
> I also think that this would be a good change regardless of the multi-tenancy 
> aspect, just in terms of organizing logs (vs the current flat namespace).
>  
> One possible implementation would be to have a configurable, templatized path 
> structure for logs. We would also need to store the log location for each 
> task instance, so we could easily change the log folder structure without 
> breaking old log paths.





[jira] [Updated] (AIRFLOW-3171) Organize logs by owner

2018-10-08 Thread Gabriel Silk (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriel Silk updated AIRFLOW-3171:
--
Description: 
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the logs by owner.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. If the log folders in s3 
were organized like /tasks/[owner]/... then we could provide access to a subset 
of the logs for each team, by creating s3 access rules prefixed with the 
appropriate path.

 

I also think that this would be a good change regardless of the multi-tenancy 
aspect, just in terms of organizing logs (vs the current flat namespace).

 

One possible implementation would be to have a configurable, templatized path 
structure for logs. We would also need to store the log location for each task 
instance, so we could easily change the log folder structure without breaking 
old log paths.

  was:
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the logs by owner.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. If the log folders in s3 
were organized like /tasks/[owner]/... then we could provide access to a subset 
of the logs for each team, by creating s3 access rules prefixed with the 
appropriate path.

 

I also think that this would be a good change regardless of the multi-tenancy 
aspect, just in terms of organizing logs (vs the current flat namespace).

 

One possible implementation would be to have a configurable, templatized path 
structure. We would also need to store the log location for each task instance, 
so we could easily change the log folder structure without breaking old log 
paths.


> Organize logs by owner
> --
>
> Key: AIRFLOW-3171
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3171
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: logging
>Reporter: Gabriel Silk
>Priority: Minor
>
> Regardless of the backend (eg file system, s3, ...), it would be useful to be 
> able to organize the logs by owner.
>  
> One use case of this would be to provide a better multi-tenancy experience 
> when deploying a single airflow cluster to several teams. If the log folders 
> in s3 were organized like /tasks/[owner]/... then we could provide access to 
> a subset of the logs for each team, by creating s3 access rules prefixed with 
> the appropriate path.
>  
> I also think that this would be a good change regardless of the multi-tenancy 
> aspect, just in terms of organizing logs (vs the current flat namespace).
>  
> One possible implementation would be to have a configurable, templatized path 
> structure for logs. We would also need to store the log location for each 
> task instance, so we could easily change the log folder structure without 
> breaking old log paths.





[jira] [Updated] (AIRFLOW-3171) Organize logs by owner

2018-10-08 Thread Gabriel Silk (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriel Silk updated AIRFLOW-3171:
--
Description: 
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the logs by owner.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. If the log folders in s3 
were organized like /tasks/[owner]/... then we could provide access to a subset 
of the logs for each team, by creating s3 access rules prefixed with the 
appropriate path.

 

I also think that this would be a good change regardless of the multi-tenancy 
aspect, just in terms of organizing logs (vs the current flat namespace).

 

One possible implementation would be to have a configurable, templatized path 
structure. We would also need to store the log location for each task instance, 
so we could easily change the log folder structure without breaking old log 
paths.

  was:
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the logs by owner.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. If the log folders in s3 
were organized like /tasks/[owner]/... then we could provide access to a subset 
of the logs for each team, by creating s3 access rules prefixed with the 
appropriate path.

 

I also think that this would be a good change regardless of the multi-tenancy 
aspect, just in terms of organizing logs (vs the current flat namespace).


> Organize logs by owner
> --
>
> Key: AIRFLOW-3171
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3171
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: logging
>Reporter: Gabriel Silk
>Priority: Minor
>
> Regardless of the backend (eg file system, s3, ...), it would be useful to be 
> able to organize the logs by owner.
>  
> One use case of this would be to provide a better multi-tenancy experience 
> when deploying a single airflow cluster to several teams. If the log folders 
> in s3 were organized like /tasks/[owner]/... then we could provide access to 
> a subset of the logs for each team, by creating s3 access rules prefixed with 
> the appropriate path.
>  
> I also think that this would be a good change regardless of the multi-tenancy 
> aspect, just in terms of organizing logs (vs the current flat namespace).
>  
> One possible implementation would be to have a configurable, templatized path 
> structure. We would also need to store the log location for each task 
> instance, so we could easily change the log folder structure without breaking 
> old log paths.





[jira] [Updated] (AIRFLOW-3171) Organize logs by owner

2018-10-08 Thread Gabriel Silk (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriel Silk updated AIRFLOW-3171:
--
Description: 
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the logs by owner.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. If the log folders in s3 
were organized like /tasks/[owner]/... then we could provide access to a subset 
of the logs for each team, by creating s3 access rules prefixed with the owner 
string.

 

I also think that this would be a good change regardless of the multi-tenancy 
aspect, just in terms of organizing logs (vs the current flat namespace).

  was:
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the logs by owner.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. If the log folders in s3 
were organized like /tasks/[owner]/... then we could provide access to a subset 
of the logs for each time, by creating s3 access rules prefixed with the owner 
string.

 

I also think that this would be a good change regardless of the multi-tenancy 
aspect, just in terms of organizing logs (vs the current flat namespace).


> Organize logs by owner
> --
>
> Key: AIRFLOW-3171
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3171
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: logging
>Reporter: Gabriel Silk
>Priority: Minor
>
> Regardless of the backend (eg file system, s3, ...), it would be useful to be 
> able to organize the logs by owner.
>  
> One use case of this would be to provide a better multi-tenancy experience 
> when deploying a single airflow cluster to several teams. If the log folders 
> in s3 were organized like /tasks/[owner]/... then we could provide access to 
> a subset of the logs for each team, by creating s3 access rules prefixed with 
> the owner string.
>  
> I also think that this would be a good change regardless of the multi-tenancy 
> aspect, just in terms of organizing logs (vs the current flat namespace).





[jira] [Updated] (AIRFLOW-3171) Organize logs by owner

2018-10-08 Thread Gabriel Silk (JIRA)


 [ 
https://issues.apache.org/jira/browse/AIRFLOW-3171?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gabriel Silk updated AIRFLOW-3171:
--
Description: 
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the logs by owner.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. If the log folders in s3 
were organized like /tasks/[owner]/... then we could provide access to a subset 
of the logs for each team, by creating s3 access rules prefixed with the 
appropriate path.

 

I also think that this would be a good change regardless of the multi-tenancy 
aspect, just in terms of organizing logs (vs the current flat namespace).

  was:
Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the logs by owner.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. If the log folders in s3 
were organized like /tasks/[owner]/... then we could provide access to a subset 
of the logs for each team, by creating s3 access rules prefixed with the owner 
string.

 

I also think that this would be a good change regardless of the multi-tenancy 
aspect, just in terms of organizing logs (vs the current flat namespace).


> Organize logs by owner
> --
>
> Key: AIRFLOW-3171
> URL: https://issues.apache.org/jira/browse/AIRFLOW-3171
> Project: Apache Airflow
>  Issue Type: Improvement
>  Components: logging
>Reporter: Gabriel Silk
>Priority: Minor
>
> Regardless of the backend (eg file system, s3, ...), it would be useful to be 
> able to organize the logs by owner.
>  
> One use case of this would be to provide a better multi-tenancy experience 
> when deploying a single airflow cluster to several teams. If the log folders 
> in s3 were organized like /tasks/[owner]/... then we could provide access to 
> a subset of the logs for each team, by creating s3 access rules prefixed with 
> the appropriate path.
>  
> I also think that this would be a good change regardless of the multi-tenancy 
> aspect, just in terms of organizing logs (vs the current flat namespace).





[jira] [Created] (AIRFLOW-3171) Organize logs by owner

2018-10-08 Thread Gabriel Silk (JIRA)
Gabriel Silk created AIRFLOW-3171:
-

 Summary: Organize logs by owner
 Key: AIRFLOW-3171
 URL: https://issues.apache.org/jira/browse/AIRFLOW-3171
 Project: Apache Airflow
  Issue Type: Improvement
  Components: logging
Reporter: Gabriel Silk


Regardless of the backend (eg file system, s3, ...), it would be useful to be 
able to organize the logs by owner.

 

One use case of this would be to provide a better multi-tenancy experience when 
deploying a single airflow cluster to several teams. If the log folders in s3 
were organized like /tasks/[owner]/... then we could provide access to a subset 
of the logs for each team, by creating s3 access rules prefixed with the owner 
string.

 

I also think that this would be a good change regardless of the multi-tenancy 
aspect, just in terms of organizing logs (vs the current flat namespace).
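The owner-prefixed layout would let each team's access be scoped with a simple prefix rule; here is a hedged sketch (the bucket and team names are made up, and the policy shape follows standard S3 IAM conventions rather than anything Airflow-specific):

```python
def team_log_policy(bucket, owner):
    """Build an IAM-style policy granting read access only to one
    owner's log prefix, e.g. tasks/data-eng/..."""
    return {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Action": ["s3:GetObject"],
            # The owner string becomes the access boundary.
            "Resource": f"arn:aws:s3:::{bucket}/tasks/{owner}/*",
        }],
    }
```

With a flat namespace no such prefix boundary exists, which is why the folder structure change and the multi-tenancy story go together.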





[jira] [Commented] (AIRFLOW-1701) CSRF Error on Dag Runs Page

2018-08-24 Thread Gabriel Silk (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592241#comment-16592241
 ] 

Gabriel Silk commented on AIRFLOW-1701:
---

This should be fixed in the RBAC UI when my PR is merged: 
https://github.com/apache/incubator-airflow/pull/3804

> CSRF Error on Dag Runs Page
> ---
>
> Key: AIRFLOW-1701
> URL: https://issues.apache.org/jira/browse/AIRFLOW-1701
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: DagRun
>Affects Versions: Airflow 1.8
> Environment: Ubuntu 16.04 LTS; Google Chrome 61.0.3163.100 (Official 
> Build) (64-bit)
>Reporter: James Crowley
>Priority: Minor
> Attachments: CSRF_Error.png
>
>
> When attempting to modify the state of a Dag Run on /admin/dagrun, I receive 
> the following error message:
> 
> 400 Bad Request
> Bad Request
> CSRF token missing or incorrect.
> I am able to perform AJAX requests on other pages without issue. The missing 
> CSRF token appears to be isolated to the AJAX call to the 
> /admin/dagrun/ajax/update endpoint.
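For context on the failure mode: Flask-WTF renders a hidden csrf_token field into each form, and any state-changing request (including the AJAX update above) must echo that value back or the server answers 400. A rough client-side sketch of extracting the token (the regex is illustrative, not Airflow's actual markup handling):

```python
import re

def extract_csrf_token(html):
    """Pull the hidden csrf_token value rendered into a form; a caller
    would then send it back, e.g. in an X-CSRFToken header or as a
    form field on the AJAX request."""
    m = re.search(r'name="csrf_token"[^>]*value="([^"]+)"', html)
    return m.group(1) if m else None
```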





[jira] [Commented] (AIRFLOW-2866) Missing CSRF Token Error on Web UI Create/Update Operations

2018-08-24 Thread Gabriel Silk (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592239#comment-16592239
 ] 

Gabriel Silk commented on AIRFLOW-2866:
---

Here's a PR: https://github.com/apache/incubator-airflow/pull/3804

> Missing CSRF Token Error on Web UI Create/Update Operations
> ---
>
> Key: AIRFLOW-2866
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2866
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: webapp
>Reporter: Jasper Kahn
>Priority: Major
>
> Attempting to modify or delete many resources (such as Connections or Users) 
> results in a 400 from the webserver:
> {quote}{{Bad Request}}
> {{The CSRF session token is missing.}}{quote}
> Logs report:
> {quote}{{[2018-08-07 18:45:15,771] \{csrf.py:251} INFO - The CSRF session 
> token is missing.}}
> {{192.168.9.1 - - [07/Aug/2018:18:45:15 +] "POST 
> /admin/connection/delete/ HTTP/1.1" 400 150 
> "http://localhost:8081/admin/connection/; "Mozilla/5.0 (X11; Linux x86_64) 
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 
> Safari/537.36"}}{quote}
> Chrome dev tools show the CSRF token is present in the request payload.





[jira] [Commented] (AIRFLOW-2866) Missing CSRF Token Error on Web UI Create/Update Operations

2018-08-24 Thread Gabriel Silk (JIRA)


[ 
https://issues.apache.org/jira/browse/AIRFLOW-2866?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16592233#comment-16592233
 ] 

Gabriel Silk commented on AIRFLOW-2866:
---

This doesn't resolve the issue when using the RBAC UI. I'll submit a patch for that.

> Missing CSRF Token Error on Web UI Create/Update Operations
> ---
>
> Key: AIRFLOW-2866
> URL: https://issues.apache.org/jira/browse/AIRFLOW-2866
> Project: Apache Airflow
>  Issue Type: Bug
>  Components: webapp
>Reporter: Jasper Kahn
>Priority: Major
>
> Attempting to modify or delete many resources (such as Connections or Users) 
> results in a 400 from the webserver:
> {quote}{{Bad Request}}
> {{The CSRF session token is missing.}}{quote}
> Logs report:
> {quote}{{[2018-08-07 18:45:15,771] \{csrf.py:251} INFO - The CSRF session 
> token is missing.}}
> {{192.168.9.1 - - [07/Aug/2018:18:45:15 +] "POST 
> /admin/connection/delete/ HTTP/1.1" 400 150 
> "http://localhost:8081/admin/connection/; "Mozilla/5.0 (X11; Linux x86_64) 
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/68.0.3440.84 
> Safari/537.36"}}{quote}
> Chrome dev tools show the CSRF token is present in the request payload.





[jira] [Created] (AIRFLOW-2430) Bad query patterns at scale prevent scheduler from starting

2018-05-07 Thread Gabriel Silk (JIRA)
Gabriel Silk created AIRFLOW-2430:
-

 Summary: Bad query patterns at scale prevent scheduler from 
starting
 Key: AIRFLOW-2430
 URL: https://issues.apache.org/jira/browse/AIRFLOW-2430
 Project: Apache Airflow
  Issue Type: Bug
  Components: scheduler
Reporter: Gabriel Silk


h2. Summary

Certain queries executed by the scheduler do not scale well with the number of 
tasks being operated on. Two example functions:
 * reset_state_for_orphaned_tasks
 * _execute_task_instances

 

Concretely — with a mere 75k tasks being operated on, the first query can take 
dozens of minutes to run, blocking the scheduler from making progress.

 

The cause is twofold:

1. As the query grows past a certain point, the MySQL planner will choose to do 
a full table scan as opposed to using an index. I assume the same is true of 
Postgres.

2. The query predicate size grows linearly with the number of tasks being 
operated on, thus increasing the amount of work that needs to be done per row.

 

In a sense, you’re left with an operation that scales as O(n^2).

 
h2. Proposed Fix

It appears that one of these bad query patterns was fixed in 
[3547cbffd|https://github.com/apache/incubator-airflow/commit/3547cbffdbffac2f98a8aa05526e8c9671221025]
 by introducing a configurable batch size which can be set via max_tis_per_query.

 

I propose we extend the suggested fix to include other poorly-performing 
queries in the scheduler.

 

I’ve identified two queries that are directly affecting my work and included 
them in the diff, though the same approach can be extended to more queries as 
we see fit.

 

Thanks!
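The batching approach described above can be sketched as follows. This is illustrative only: the function names are made up, not the scheduler's actual code, and the query is abstracted behind a callback.

```python
def chunks(items, size):
    """Yield successive fixed-size slices of a list."""
    for i in range(0, len(items), size):
        yield items[i:i + size]

def run_batched(keys, query_fn, max_tis_per_query=512):
    """Instead of one query whose IN-predicate grows with every task
    instance (eventually pushing the planner to a full table scan),
    run query_fn once per bounded batch of keys and merge the results."""
    results = []
    for batch in chunks(keys, max_tis_per_query):
        results.extend(query_fn(batch))
    return results
```

Each individual query now touches at most max_tis_per_query predicate terms, so the per-row predicate cost stays constant and total work returns to O(n) in the number of tasks.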


