[jira] [Created] (AIRFLOW-779) Task should fail with specific message if task instance is deleted
Alex Guziel created AIRFLOW-779: --- Summary: Task should fail with specific message if task instance is deleted Key: AIRFLOW-779 URL: https://issues.apache.org/jira/browse/AIRFLOW-779 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Priority: Trivial Right now, when a task instance is deleted from the DB (as can be done from the task instances page in the UI), the task fails because the state field is accessed on None. We should handle this case explicitly and fail with a specific message. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
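The explicit handling asked for in AIRFLOW-779 could look roughly like the sketch below. It is only an illustration: `TaskInstanceDeletedError` and `get_state_or_fail` are hypothetical names, and `task_instance` stands in for whatever the DB lookup returns (None once the row has been deleted).

```python
class TaskInstanceDeletedError(Exception):
    """Hypothetical error raised when the task instance row no longer exists."""


def get_state_or_fail(task_instance, task_id):
    """Return the task instance state, failing loudly if the row was deleted.

    `task_instance` is assumed to be None when the DB lookup finds no row.
    """
    if task_instance is None:
        raise TaskInstanceDeletedError(
            "Task instance for task %r was deleted from the database "
            "while the task was running" % task_id
        )
    return task_instance.state
```

The point is only that the None case is checked before `.state` is touched, so the failure carries a message that names the cause.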
[jira] [Created] (AIRFLOW-799) Workers re-set queue column in task_instance table
Alex Guziel created AIRFLOW-799: --- Summary: Workers re-set queue column in task_instance table Key: AIRFLOW-799 URL: https://issues.apache.org/jira/browse/AIRFLOW-799 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Priority: Minor Right now, the scheduler uses the policy file to set the queue field in task_instance. Workers, when updating the state, will set the queue according to the DAG information, changing it from the result that would be from applying the policy file. This reduces auditability. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AIRFLOW-799) Workers re-set queue column in task_instance table
[ https://issues.apache.org/jira/browse/AIRFLOW-799?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel updated AIRFLOW-799: Assignee: (was: Alex Guziel) > Workers re-set queue column in task_instance table > -- > > Key: AIRFLOW-799 > URL: https://issues.apache.org/jira/browse/AIRFLOW-799 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Priority: Minor > > Right now, the scheduler uses the policy file to set the queue field in > task_instance. Workers, when updating the state, will set the queue according > to the DAG information, changing it from the result that would be from > applying the policy file. This reduces auditability. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AIRFLOW-836) The two endpoints /paused and queryview perform state-changing action over HTTP GET
Alex Guziel created AIRFLOW-836: --- Summary: The two endpoints /paused and queryview perform state-changing action over HTTP GET Key: AIRFLOW-836 URL: https://issues.apache.org/jira/browse/AIRFLOW-836 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel These two endpoints change state and allow HTTP GET, allowing CSRF -- This message was sent by Atlassian JIRA (v6.3.15#6346)
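As a rough illustration of the AIRFLOW-836 concern, the toy handler below refuses to change state for safe HTTP methods. It does not use Airflow's or Flask's API; `set_paused`, `SAFE_METHODS`, and the `paused_dags` set are all hypothetical. Requiring POST only removes the easiest attack (a plain link or image tag); a CSRF token check is normally needed as well.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


def set_paused(method, dag_id, is_paused, paused_dags):
    """Toy handler for a state-changing endpoint (all names are hypothetical).

    Returns an HTTP-style status code. Requests using a safe method are
    rejected, so merely loading a URL cannot change state.
    """
    if method.upper() in SAFE_METHODS:
        return 405  # Method Not Allowed
    if is_paused:
        paused_dags.add(dag_id)
    else:
        paused_dags.discard(dag_id)
    return 200
```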
[jira] [Created] (AIRFLOW-837) Clear task in the UI should set state to None rather than deleting DB row
Alex Guziel created AIRFLOW-837: --- Summary: Clear task in the UI should set state to None rather than deleting DB row Key: AIRFLOW-837 URL: https://issues.apache.org/jira/browse/AIRFLOW-837 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Priority: Trivial -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-838) Race condition in LocalTaskJob
Alex Guziel created AIRFLOW-838: --- Summary: Race condition in LocalTaskJob Key: AIRFLOW-838 URL: https://issues.apache.org/jira/browse/AIRFLOW-838 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Priority: Minor Right now, a LocalTaskJob will terminate if the state is not "running", but only if it has previously observed the state as "running". If the state goes from "running" to another state before the job has observed it as "running", the job never terminates even though the state is not "running". -- This message was sent by Atlassian JIRA (v6.3.15#6346)
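The race in AIRFLOW-838 can be reproduced with a small model of the check. This is not the LocalTaskJob code; `should_terminate` is a hypothetical function that takes the sequence of states seen at successive polls and applies the logic described in the issue.

```python
RUNNING = "running"


def should_terminate(observed_states):
    """Model of the described check (hypothetical, not Airflow's code).

    Terminate only when a non-running state is seen after "running" has
    been observed at least once.
    """
    was_running = False
    for state in observed_states:
        if state == RUNNING:
            was_running = True
        elif was_running:
            return True
    return False
```

With polls of `["running", "success"]` the job terminates. If the task moves from queued through running to success between two polls, the job only ever sees `["queued", "success", ...]` and the function returns False no matter how many further polls follow.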
[jira] [Updated] (AIRFLOW-836) The paused endpoint is vulnerable to CSRF
[ https://issues.apache.org/jira/browse/AIRFLOW-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel updated AIRFLOW-836: Description: This endpoint uses GET and is state-changing which bad practice, and allows CSRF (was: These two endpoints change state and allow HTTP GET, allowing CSRF) Summary: The paused endpoint is vulnerable to CSRF (was: The two endpoints /paused and queryview perform state-changing action over HTTP GET) > The paused endpoint is vulnerable to CSRF > - > > Key: AIRFLOW-836 > URL: https://issues.apache.org/jira/browse/AIRFLOW-836 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > > This endpoint uses GET and is state-changing which bad practice, and allows > CSRF -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AIRFLOW-836) The paused endpoint is vulnerable to CSRF
[ https://issues.apache.org/jira/browse/AIRFLOW-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel updated AIRFLOW-836: Description: These endpoints use GET and are state-changing which is bad practice, and allows CSRF (was: This endpoint uses GET and is state-changing which bad practice, and allows CSRF) > The paused endpoint is vulnerable to CSRF > - > > Key: AIRFLOW-836 > URL: https://issues.apache.org/jira/browse/AIRFLOW-836 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > > These endpoints use GET and are state-changing which is bad practice, and > allows CSRF -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AIRFLOW-836) The paused and queryview endpoints are vulnerable to CSRF
[ https://issues.apache.org/jira/browse/AIRFLOW-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel updated AIRFLOW-836: Summary: The paused and queryview endpoints are vulnerable to CSRF (was: The paused endpoint is vulnerable to CSRF) > The paused and queryview endpoints are vulnerable to CSRF > - > > Key: AIRFLOW-836 > URL: https://issues.apache.org/jira/browse/AIRFLOW-836 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > > These endpoints use GET and are state-changing which is bad practice, and > allows CSRF -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-857) Use unittest.assert instead of assert
Alex Guziel created AIRFLOW-857: --- Summary: Use unittest.assert instead of assert Key: AIRFLOW-857 URL: https://issues.apache.org/jira/browse/AIRFLOW-857 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel Priority: Minor Right now, unit tests do something like `assert x == y` which gives less descriptive output in case of failure than `assertEqual(x, y)` -- This message was sent by Atlassian JIRA (v6.3.15#6346)
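To make the difference in AIRFLOW-857 concrete, the sketch below captures the message that `assertEqual` produces on a mismatch. `failure_message` is a hypothetical helper written for this illustration; a bare `assert x == y` run under the plain interpreter raises an AssertionError with no such detail unless a message is supplied by hand.

```python
import unittest


def failure_message(first, second):
    """Return the message assertEqual produces for a mismatch, else None."""
    case = unittest.TestCase()
    try:
        case.assertEqual(first, second)
    except AssertionError as exc:
        return str(exc)
    return None
```

For example, `failure_message(1, 2)` returns `"1 != 2"`, and for two lists the message includes both values and the first differing element.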
[jira] [Closed] (AIRFLOW-696) Monitor queue lengths in CeleryExecutor
[ https://issues.apache.org/jira/browse/AIRFLOW-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel closed AIRFLOW-696. --- Resolution: Won't Fix > Monitor queue lengths in CeleryExecutor > --- > > Key: AIRFLOW-696 > URL: https://issues.apache.org/jira/browse/AIRFLOW-696 > Project: Apache Airflow > Issue Type: Improvement > Components: celery >Reporter: Alex Guziel >Assignee: Alex Guziel > > Monitor queue lengths for CeleryExecutor. This will make it easier to see how > much of the cluster is being used. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Work started] (AIRFLOW-836) The paused and queryview endpoints are vulnerable to CSRF
[ https://issues.apache.org/jira/browse/AIRFLOW-836?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on AIRFLOW-836 started by Alex Guziel. --- > The paused and queryview endpoints are vulnerable to CSRF > - > > Key: AIRFLOW-836 > URL: https://issues.apache.org/jira/browse/AIRFLOW-836 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > > These endpoints use GET and are state-changing which is bad practice, and > allows CSRF -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Work started] (AIRFLOW-857) Use unittest.assert instead of assert
[ https://issues.apache.org/jira/browse/AIRFLOW-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on AIRFLOW-857 started by Alex Guziel. --- > Use unittest.assert instead of assert > - > > Key: AIRFLOW-857 > URL: https://issues.apache.org/jira/browse/AIRFLOW-857 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel >Assignee: Alex Guziel >Priority: Minor > > Right now, unit tests do something like > `assert x == y` > which gives less descriptive output in case of failure than > `assertEqual(x, y)` -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (AIRFLOW-679) Stop concurrent task instances from running due to race conditions
[ https://issues.apache.org/jira/browse/AIRFLOW-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel resolved AIRFLOW-679. - Resolution: Fixed > Stop concurrent task instances from running due to race conditions > -- > > Key: AIRFLOW-679 > URL: https://issues.apache.org/jira/browse/AIRFLOW-679 > Project: Apache Airflow > Issue Type: Improvement > Components: core >Reporter: Alex Guziel >Assignee: Alex Guziel >Priority: Minor > > Right now, multiple copies of the same task instance can run if someone > clicks on the UI multiple times. To fix this, I propose two things: > 1) record hostname and pid in TaskInstance table, then when heartbeating, > only continue running if it matches -- This message was sent by Atlassian JIRA (v6.3.15#6346)
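The hostname and pid check proposed in AIRFLOW-679 can be sketched as follows. A plain dict stands in for the TaskInstance row, and `record_owner` and `may_continue` are hypothetical names; the real change would read and write the database.

```python
import os
import socket


def record_owner(ti_row):
    """Store this process's identity on the task instance row (a dict here)."""
    ti_row["hostname"] = socket.gethostname()
    ti_row["pid"] = os.getpid()


def may_continue(ti_row, hostname=None, pid=None):
    """Heartbeat check: keep running only if this process still owns the row.

    `hostname` and `pid` default to the current process; they are parameters
    only so that another process can be simulated.
    """
    hostname = socket.gethostname() if hostname is None else hostname
    pid = os.getpid() if pid is None else pid
    return ti_row.get("hostname") == hostname and ti_row.get("pid") == pid
```

A second copy of the task started from the UI would overwrite the row with its own hostname and pid, so the first copy's next heartbeat would see a mismatch and stop.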
[jira] [Created] (AIRFLOW-861) Pickle_info endpoint is unauthenticated
Alex Guziel created AIRFLOW-861: --- Summary: Pickle_info endpoint is unauthenticated Key: AIRFLOW-861 URL: https://issues.apache.org/jira/browse/AIRFLOW-861 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Right now the admin/airflow/pickle_info is unauthenticated, allowing anyone to see the list of dags -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Work started] (AIRFLOW-861) Pickle_info endpoint is unauthenticated
[ https://issues.apache.org/jira/browse/AIRFLOW-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on AIRFLOW-861 started by Alex Guziel. --- > Pickle_info endpoint is unauthenticated > --- > > Key: AIRFLOW-861 > URL: https://issues.apache.org/jira/browse/AIRFLOW-861 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > > Right now the admin/airflow/pickle_info is unauthenticated, allowing anyone > to see the list of dags -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (AIRFLOW-857) Use unittest.assert instead of assert
[ https://issues.apache.org/jira/browse/AIRFLOW-857?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel resolved AIRFLOW-857. - Resolution: Fixed > Use unittest.assert instead of assert > - > > Key: AIRFLOW-857 > URL: https://issues.apache.org/jira/browse/AIRFLOW-857 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel >Assignee: Alex Guziel >Priority: Minor > > Right now, unit tests do something like > `assert x == y` > which gives less descriptive output in case of failure than > `assertEqual(x, y)` -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-900) Double run job should not terminate the existing running job
Alex Guziel created AIRFLOW-900: --- Summary: Double run job should not terminate the existing running job Key: AIRFLOW-900 URL: https://issues.apache.org/jira/browse/AIRFLOW-900 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Right now, jobs seem to get run a second time an hour after they start and, due to the current logic, both runs get killed. Since we can't isolate the cause, we should change the logic here to kill only the new job. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-924) (Named)HivePartitionSensor broken if hook attr not set
Alex Guziel created AIRFLOW-924: --- Summary: (Named)HivePartitionSensor broken if hook attr not set Key: AIRFLOW-924 URL: https://issues.apache.org/jira/browse/AIRFLOW-924 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Right now, the import statement for (Named)HivePartitionSensor uses the wrong path -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (AIRFLOW-921) 1.8.0rc Issues
[ https://issues.apache.org/jira/browse/AIRFLOW-921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15888782#comment-15888782 ] Alex Guziel commented on AIRFLOW-921: - Also add this https://issues.apache.org/jira/browse/AIRFLOW-924 > 1.8.0rc Issues > -- > > Key: AIRFLOW-921 > URL: https://issues.apache.org/jira/browse/AIRFLOW-921 > Project: Apache Airflow > Issue Type: Task >Reporter: Dan Davydov >Priority: Blocker > > These are the pending issues for the Airflow 1.8.0 release: > Blockers: > [~bolke] please merge into the next RC and then remove from the list the > issues below once they are merged into master > - Sub-tasks linked in this JIRA > - Skipped tasks potentially cause a dagrun to be marked as failure/success > prematurely (one theory is that this is the same issue as > https://issues.apache.org/jira/browse/AIRFLOW-872) > Other Issues: > - High DB Load (~8x more than 1.7) - We are still investigating but it's > probably not a blocker for the release - (Theories: Might need execution_date > index on dag_run (based on slow process list) OR it might be this query which > is long running SELECT union_ti.dag_id AS union_ti_dag_id, union_ti.state AS > union_ti_state, count( *) AS count_1 > FR)) > [~bolke] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (AIRFLOW-921) 1.8.0rc Issues
[ https://issues.apache.org/jira/browse/AIRFLOW-921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15893094#comment-15893094 ] Alex Guziel commented on AIRFLOW-921: - The high DB load looks pretty periodic on our end (i.e., it comes and goes every 10 minutes). I did some profiling and found a lot of areas to fix, so I'm working on those. > 1.8.0rc Issues > -- > > Key: AIRFLOW-921 > URL: https://issues.apache.org/jira/browse/AIRFLOW-921 > Project: Apache Airflow > Issue Type: Task >Reporter: Dan Davydov >Priority: Blocker > > These are the pending issues for the Airflow 1.8.0 release: > Blockers: > [~bolke] please merge into the next RC and then remove from the list the > issues below once they are merged into master > - Sub-tasks linked in this JIRA > - Skipped tasks potentially cause a dagrun to be marked as failure/success > prematurely (one theory is that this is the same issue as > https://issues.apache.org/jira/browse/AIRFLOW-872) > Other Issues: > - High DB Load (~8x more than 1.7) - We are still investigating but it's > probably not a blocker for the release - (Theories: Might need execution_date > index on dag_run (based on slow process list) OR it might be this query which > is long running SELECT union_ti.dag_id AS union_ti_dag_id, union_ti.state AS > union_ti_state, count( *) AS count_1 > FR)) > - Front page loading time is a lot slower > [~bolke] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-937) task_stats makes extremely large prepared query
Alex Guziel created AIRFLOW-937: --- Summary: task_stats makes extremely large prepared query Key: AIRFLOW-937 URL: https://issues.apache.org/jira/browse/AIRFLOW-937 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel Right now, the task_stats endpoint makes a few extremely long queries. We can give up some accuracy and get huge speed wins -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-938) SQLAlchemy query in task_stats should be compatible with Postgres
Alex Guziel created AIRFLOW-938: --- Summary: SQLAlchemy query in task_stats should be compatible with Postgres Key: AIRFLOW-938 URL: https://issues.apache.org/jira/browse/AIRFLOW-938 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Right now, we check for truthiness by comparing to 1, which is not portable and does not work on pgsql -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-939) Add swp files to gitignore
Alex Guziel created AIRFLOW-939: --- Summary: Add swp files to gitignore Key: AIRFLOW-939 URL: https://issues.apache.org/jira/browse/AIRFLOW-939 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-961) LocalTaskJob onkill should get run on TERM
Alex Guziel created AIRFLOW-961: --- Summary: LocalTaskJob onkill should get run on TERM Key: AIRFLOW-961 URL: https://issues.apache.org/jira/browse/AIRFLOW-961 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Right now, the on_kill happens in the finally block, when it should also be handled in a SIGTERM -- This message was sent by Atlassian JIRA (v6.3.15#6346)
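One way to run the cleanup hook on SIGTERM, as AIRFLOW-961 asks, is to register a signal handler that calls it and then unwinds the task. The sketch below uses only the standard `signal` module; `TaskTerminated`, `make_sigterm_handler`, and `install_sigterm_handler` are hypothetical names, and `on_kill` is any zero-argument callable.

```python
import signal


class TaskTerminated(Exception):
    """Hypothetical exception used to unwind the task after SIGTERM."""


def make_sigterm_handler(on_kill):
    """Build a signal handler that runs the cleanup hook, then unwinds."""
    def handler(signum, frame):
        on_kill()
        raise TaskTerminated("received signal %d" % signum)
    return handler


def install_sigterm_handler(on_kill):
    """Register the handler for SIGTERM; must be called from the main thread."""
    signal.signal(signal.SIGTERM, make_sigterm_handler(on_kill))
```

If the same hook is also called from a finally block, it would run twice on SIGTERM, so under this design the hook should be safe to call more than once.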
[jira] [Created] (AIRFLOW-970) Latest runs on homepage should load async and in batch
Alex Guziel created AIRFLOW-970: --- Summary: Latest runs on homepage should load async and in batch Key: AIRFLOW-970 URL: https://issues.apache.org/jira/browse/AIRFLOW-970 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel The latest_dag_run column on the homepage makes one query for each dag and does it synchronously. We should instead load it asynchronously and fetch the latest runs for all dags in one batched query. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
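The batching half of AIRFLOW-970 amounts to replacing one query per dag with a single grouped query. The sketch below shows the idea against an in-memory SQLite table; the two-column `dag_run` schema and the function names are assumptions made for the example, and the asynchronous loading in the browser is not covered.

```python
import sqlite3


def latest_runs(conn):
    """Return {dag_id: latest execution_date} using a single grouped query."""
    rows = conn.execute(
        "SELECT dag_id, MAX(execution_date) FROM dag_run GROUP BY dag_id"
    )
    return dict(rows.fetchall())


def demo_connection():
    """In-memory table standing in for dag_run (the schema is an assumption)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE dag_run (dag_id TEXT, execution_date TEXT)")
    conn.executemany(
        "INSERT INTO dag_run VALUES (?, ?)",
        [("a", "2017-03-01"), ("a", "2017-03-02"), ("b", "2017-02-15")],
    )
    return conn
```

The number of queries stays at one regardless of how many dags exist, where the per-dag approach grows linearly with the dag count.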
[jira] [Assigned] (AIRFLOW-976) Mark success running task causes it to fail
[ https://issues.apache.org/jira/browse/AIRFLOW-976?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel reassigned AIRFLOW-976: --- Assignee: Alex Guziel > Mark success running task causes it to fail > --- > > Key: AIRFLOW-976 > URL: https://issues.apache.org/jira/browse/AIRFLOW-976 > Project: Apache Airflow > Issue Type: Bug >Reporter: Dan Davydov >Assignee: Alex Guziel > > Marking success on a running task in the UI causes it to fail. > Expected Behavior: > Task instance is killed and marked as successful > Actual Behavior: > Task instance is killed and marked as failed > [~saguziel] [~bolke] -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (AIRFLOW-928) Same {task,execution_date} run multiple times in worker when using CeleryExecutor
[ https://issues.apache.org/jira/browse/AIRFLOW-928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925006#comment-15925006 ] Alex Guziel commented on AIRFLOW-928: - [~bolke] Did see a double trigger one hour after, will see if related. > Same {task,execution_date} run multiple times in worker when using > CeleryExecutor > - > > Key: AIRFLOW-928 > URL: https://issues.apache.org/jira/browse/AIRFLOW-928 > Project: Apache Airflow > Issue Type: Bug > Components: celery >Affects Versions: Airflow 1.7.1.3 > Environment: Docker >Reporter: Uri Shamay > Attachments: airflow.log, dag_runs.png, dummy_dag.py, processes.list, > rabbitmq.queue, scheduler.log, worker_2.log, worker.log > > > Hi, > When using with Airflow with CeleryExecutor, both RabbitMQ && Redis I tested, > I see that when workers are down, the scheduler run each period of time > **append** to the same key of {task,execution_date} in the broker, the same > {task,execution_date}, what means is that if workers are down/can't connect > to broker for few hours, I got in the broker thousands of same executions. > In my scenario I have just one dummy dag to run with dag_concurrency of 4, > I expected in that scenario that broker will hold just 4 messages, and the > scheduler shouldn't queuing another and another and another for same {task, > execution_date} > What happened is that when workers start to consume messages, they got > thousands of tasks for just 4 tasks, and when they trying to write to > database for task_instances - there are errors of integrity while such > {task,execution_date} already exist. > Note that in my test after let Airflow to consume works of just one dag > without workers for few hours, then I connect to the broker outside by custom > client and retrieve the messages - there was thousands of same > {dag,execution_date}. 
> Even if the case is that there are a lot of dag works on the same key that > run just one instance when poll thousands - it's still bad behavior, better > to produce one message to the queue, and if some timeout occurred (like > visibility), to set the key - and not append to it. > What happened is when workers are down for long time and have a lot of jobs > that scheduled each minute, when workers come back, they got thousands of > same jobs => cause to the worker to run the same dags a lot of times => a lot > of wasted python runners => utilized all celery worker threads/processes => > starve all other jobs till he understood that need just one instance from all > same. > Attached files: > 1. airflow.log - this is the task log, you can see that few instances > processes of same {task,execution_date} write to the same log file. > 2. worker.log - this is the worker log, you can see that worker trying to run > same {task,execution_date} multiple times + the errors from the database > integrity that said that those tasks on those dates already exists. > 3. scheduler.log to show that scheduler decided to send again and again and > again infinitely the same {job,execution_date} > 4. the dummy_dag.py of the test > 5. rabbitmq.queue - show that after 5 minutes the broker queue contains 40 > messages of same 4 {job,execution_date} > 6. dag_runs.png - show that there are only 4 jobs that need to be run, while > there are much more messages in the queue > 7. processes.list - show that when start worker and doing: ps -ef | grep > "airflow run", it show that worker run multiple times same > {job,execution_date} > 8. worker_2.log - show that when worker started - the same > {job,execution_date} keys shown multiple times > Thanks. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (AIRFLOW-867) Tons of unit tests are ignored
[ https://issues.apache.org/jira/browse/AIRFLOW-867?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15925649#comment-15925649 ] Alex Guziel commented on AIRFLOW-867: - I wonder if this is bad news or good news > Tons of unit tests are ignored > -- > > Key: AIRFLOW-867 > URL: https://issues.apache.org/jira/browse/AIRFLOW-867 > Project: Apache Airflow > Issue Type: Bug > Components: tests >Reporter: George Sakkis >Assignee: George Sakkis > > I was poking around in tests and found out that lots of tests are not > discovered by nosetests: > {noformat} > $ nosetests -q --collect-only > -- > Ran 254 tests in 0.948s > $ grep -R 'def test' tests/ | wc -l > 360 > {noformat} > Initially I thought it might be related to not having installed all extra > dependencies but it turns out it's because apparently nosetests expects > explicit import of the related modules instead of discovering them > automatically (like py.test). For example, when adding an {{from > .ti_deps.deps.runnable_exec_date_dep import *}} in {{tests/__init__.py}} it > finds 260 tests, while when commenting out all imports in this module it > finds only 15! > h4. Possible options > * Quick fix: Add the necessary missing "import *" to discover all current > tests. > * Better fix: Rename all test modules to start with "test_" > -Move from nosetests to py.test and get rid of the ugly error-prone 'import > *' hack.- -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-991) Mark_success while a task is running leads to failure state
Alex Guziel created AIRFLOW-991: --- Summary: Mark_success while a task is running leads to failure state Key: AIRFLOW-991 URL: https://issues.apache.org/jira/browse/AIRFLOW-991 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (AIRFLOW-991) Mark_success while a task is running leads to failure state
[ https://issues.apache.org/jira/browse/AIRFLOW-991?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel closed AIRFLOW-991. --- Resolution: Duplicate > Mark_success while a task is running leads to failure state > --- > > Key: AIRFLOW-991 > URL: https://issues.apache.org/jira/browse/AIRFLOW-991 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (AIRFLOW-938) SQLAlchemy query in task_stats should be compatible with Postgres
[ https://issues.apache.org/jira/browse/AIRFLOW-938?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel resolved AIRFLOW-938. - Resolution: Fixed > SQLAlchemy query in task_stats should be compatible with Postgres > - > > Key: AIRFLOW-938 > URL: https://issues.apache.org/jira/browse/AIRFLOW-938 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > > Right now, we check for truthiness by comparing to 1, which is not portable > and does not work on pgsql -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (AIRFLOW-861) Pickle_info endpoint is unauthenticated
[ https://issues.apache.org/jira/browse/AIRFLOW-861?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel resolved AIRFLOW-861. - Resolution: Fixed > Pickle_info endpoint is unauthenticated > --- > > Key: AIRFLOW-861 > URL: https://issues.apache.org/jira/browse/AIRFLOW-861 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > > Right now the admin/airflow/pickle_info is unauthenticated, allowing anyone > to see the list of dags -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1007) Jinja sandbox is vulnerable to RCE
Alex Guziel created AIRFLOW-1007: Summary: Jinja sandbox is vulnerable to RCE Key: AIRFLOW-1007 URL: https://issues.apache.org/jira/browse/AIRFLOW-1007 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Right now, the jinja template functionality in chart_data takes arbitrary strings and executes them. We should use the sandbox functionality to prevent this. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1035) Exponential backoff retry logic should use 2 as base
Alex Guziel created AIRFLOW-1035: Summary: Exponential backoff retry logic should use 2 as base Key: AIRFLOW-1035 URL: https://issues.apache.org/jira/browse/AIRFLOW-1035 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Right now, the exponential backoff logic computes it as (retry_period) ^ (retry_number) instead of retry_period * 2 ^ retry_number. See https://en.wikipedia.org/wiki/Exponential_backoff -- This message was sent by Atlassian JIRA (v6.3.15#6346)
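The formula proposed in AIRFLOW-1035 is short enough to state directly. In the sketch below `backoff_delay` is a hypothetical name, and treating `retry_number` as 0-based (so the first retry waits exactly `retry_period`) is an assumption.

```python
from datetime import timedelta


def backoff_delay(retry_period, retry_number):
    """Delay before a retry: retry_period * 2 ** retry_number.

    `retry_number` is assumed to start at 0 for the first retry.
    """
    return retry_period * (2 ** retry_number)
```

With a 5 minute period the delays are 5, 10, 20, and 40 minutes for retries 0 through 3. Raising the period itself to a power instead makes the result depend on the unit the period is expressed in, which is one reason the base-2 form is preferable.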
[jira] [Created] (AIRFLOW-1036) Exponential backoff should use randomization
Alex Guziel created AIRFLOW-1036: Summary: Exponential backoff should use randomization Key: AIRFLOW-1036 URL: https://issues.apache.org/jira/browse/AIRFLOW-1036 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel This prevents the thundering herd problem. I think with the current way this is used, we would need to use some hashing function based on some subset of the dag_run, task_id, dag_id, and execution_date to emulate the RNG. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
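The hashing idea in AIRFLOW-1036 could be sketched as below: the task's identity is hashed to a fraction between 0 and 1, which then scales the delay. `jittered_delay` is a hypothetical name, and both the choice of SHA-1 and scaling the whole delay (rather than adding a small offset) are assumptions.

```python
import hashlib
from datetime import timedelta


def jittered_delay(base_delay, dag_id, task_id, execution_date):
    """Scale base_delay by a pseudo-random fraction derived from the task identity."""
    key = "{}#{}#{}".format(dag_id, task_id, execution_date)
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    # The first 32 bits of the digest, mapped to a fraction in [0, 1).
    fraction = int(digest[:8], 16) / 2 ** 32
    return timedelta(seconds=base_delay.total_seconds() * fraction)
```

Because the fraction is a pure function of the identifiers, the same task instance always computes the same delay no matter which process evaluates it, while different tasks spread out across the interval.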
[jira] [Created] (AIRFLOW-1038) Specify celery serializers explicitly
Alex Guziel created AIRFLOW-1038: Summary: Specify celery serializers explicitly Key: AIRFLOW-1038 URL: https://issues.apache.org/jira/browse/AIRFLOW-1038 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Celery 3->4 upgrade changes the default task and result serializer from pickle to json. Pickle is faster and supports more types http://docs.celeryproject.org/en/latest/userguide/calling.html This also causes issues when different versions of celery are running on different hosts. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
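Setting the serializers explicitly, as AIRFLOW-1038 proposes, might look like the fragment below. The lowercase keys are Celery 4 setting names; Celery 3 used `CELERY_TASK_SERIALIZER`, `CELERY_RESULT_SERIALIZER`, and `CELERY_ACCEPT_CONTENT`. The dict name `CELERY_CONFIG` is an assumption, and how the values reach the Celery app is left out.

```python
# Explicit serializer settings so that behaviour does not depend on the
# defaults of whichever Celery version a host happens to run.
CELERY_CONFIG = {
    "task_serializer": "pickle",
    "result_serializer": "pickle",
    "accept_content": ["pickle"],
}
```

Pickle deserialization can execute arbitrary code, so this choice assumes that only trusted parties can write to the broker.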
[jira] [Updated] (AIRFLOW-1038) Specify celery serializers explicitly and pin version
[ https://issues.apache.org/jira/browse/AIRFLOW-1038?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel updated AIRFLOW-1038: - Summary: Specify celery serializers explicitly and pin version (was: Specify celery serializers explicitly) > Specify celery serializers explicitly and pin version > - > > Key: AIRFLOW-1038 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1038 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > > Celery 3->4 upgrade changes the default task and result serializer from > pickle to json. Pickle is faster and supports more types > http://docs.celeryproject.org/en/latest/userguide/calling.html > This also causes issues when different versions of celery are running on > different hosts. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (AIRFLOW-1036) Exponential backoff should use randomization
[ https://issues.apache.org/jira/browse/AIRFLOW-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel reassigned AIRFLOW-1036: Assignee: (was: Alex Guziel) > Exponential backoff should use randomization > > > Key: AIRFLOW-1036 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1036 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel > > This prevents the thundering herd problem. I think with the current way this > is used, we would need to use some hashing function based on some subset of > the dag_run, task_id, dag_id, and execution_date to emulate the RNG. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1047) Airflow logs vulnerable to XSS
Alex Guziel created AIRFLOW-1047: Summary: Airflow logs vulnerable to XSS Key: AIRFLOW-1047 URL: https://issues.apache.org/jira/browse/AIRFLOW-1047 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Navigating to a page with the dag_id param set to an HTML tag causes that tag to be rendered, because the value is wrapped in Markup (which labels the HTML as safe) -- This message was sent by Atlassian JIRA (v6.3.15#6346)
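The fix for the issue in AIRFLOW-1047 comes down to escaping user-supplied values before they are placed in HTML. The sketch below shows this with the standard library's `html.escape` rather than the templating stack Airflow uses; `render_dag_title` and the surrounding markup are hypothetical.

```python
import html


def render_dag_title(dag_id):
    """Build an HTML fragment with the user-supplied dag_id escaped."""
    return "<h3>DAG: {}</h3>".format(html.escape(dag_id))
```

A dag_id of `<script>alert(1)</script>` is then emitted as inert text (`&lt;script&gt;...`) instead of being interpreted by the browser.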
[jira] [Created] (AIRFLOW-1059) Reset_state_for_orphaned_task should operate in batch for the scheduler
Alex Guziel created AIRFLOW-1059: Summary: Reset_state_for_orphaned_task should operate in batch for the scheduler Key: AIRFLOW-1059 URL: https://issues.apache.org/jira/browse/AIRFLOW-1059 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel Scheduler startup is very slow because resetting state makes a query for each dag run. We should be able to do this in a constant number of queries, which will reduce scheduler startup time significantly. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1064) TaskInstanceModelView is slow
Alex Guziel created AIRFLOW-1064: Summary: TaskInstanceModelView is slow Key: AIRFLOW-1064 URL: https://issues.apache.org/jira/browse/AIRFLOW-1064 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel Due to a bad query (a full table scan), the TaskInstanceModelView is very slow. Adding an index is costly, and job_id is a good approximation. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1069) Pool slots not obeyed
Alex Guziel created AIRFLOW-1069: Summary: Pool slots not obeyed Key: AIRFLOW-1069 URL: https://issues.apache.org/jira/browse/AIRFLOW-1069 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Right now, the decrement is done in an incorrect way that is not preserved across iterations -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (AIRFLOW-1069) Pool slots not obeyed
[ https://issues.apache.org/jira/browse/AIRFLOW-1069?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel closed AIRFLOW-1069. Resolution: Invalid > Pool slots not obeyed > - > > Key: AIRFLOW-1069 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1069 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > > Right now, the decrement is done in an incorrect way that is not preserved > across iterations -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1074) Do not count queued tasks in scheduler concurrency check
Alex Guziel created AIRFLOW-1074: Summary: Do not count queued tasks in scheduler concurrency check Key: AIRFLOW-1074 URL: https://issues.apache.org/jira/browse/AIRFLOW-1074 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1077) Subdags can deadlock
Alex Guziel created AIRFLOW-1077: Summary: Subdags can deadlock Key: AIRFLOW-1077 URL: https://issues.apache.org/jira/browse/AIRFLOW-1077 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Given a concurrency of n, if all n running tasks are Subdags, the subdags block any of their tasks from executing, leading to deadlock -- This message was sent by Atlassian JIRA (v6.3.15#6346)
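The deadlock condition described above can be written down as a small model. This is a simplification, assuming each running SubDag operator holds one slot until its child tasks finish and each child task needs one slot from the same limit; `can_make_progress` is a hypothetical name.

```python
def can_make_progress(concurrency, running_subdags):
    """Return True if at least one slot remains for a subdag's child task.

    Simplified model: every running SubDag operator occupies one slot and
    only releases it once its children have run, and children draw from
    the same pool of `concurrency` slots.
    """
    free_slots = concurrency - running_subdags
    return free_slots > 0
```

With a concurrency of 4, three running subdags still leave one slot for child tasks, but four running subdags leave none, so no child can start and no parent can finish.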
[jira] [Created] (AIRFLOW-1078) Latest_runs endpoint broken in old flask versions
Alex Guziel created AIRFLOW-1078: Summary: Latest_runs endpoint broken in old flask versions Key: AIRFLOW-1078 URL: https://issues.apache.org/jira/browse/AIRFLOW-1078 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1081) Task duration page is slow
Alex Guziel created AIRFLOW-1081: Summary: Task duration page is slow Key: AIRFLOW-1081 URL: https://issues.apache.org/jira/browse/AIRFLOW-1081 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel It makes a number of queries proportional to the data size, instead of just 2. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1104) Concurrency check in scheduler should count queued tasks as well as running
Alex Guziel created AIRFLOW-1104: Summary: Concurrency check in scheduler should count queued tasks as well as running Key: AIRFLOW-1104 URL: https://issues.apache.org/jira/browse/AIRFLOW-1104 Project: Apache Airflow Issue Type: Bug Environment: see https://github.com/apache/incubator-airflow/pull/2221 "Tasks with the QUEUED state should also be counted below, but for now we cannot count them. This is because there is no guarantee that queued tasks in failed dagruns will or will not eventually run and queued tasks that will never run will consume slots and can stall a DAG. Once we can guarantee that all queued tasks in failed dagruns will never run (e.g. make sure that all running/newly queued TIs have running dagruns), then we can include QUEUED tasks here, with the constraint that they are in running dagruns." Reporter: Alex Guziel Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1105) Consolidate airflow run "raw" and "local"
Alex Guziel created AIRFLOW-1105: Summary: Consolidate airflow run "raw" and "local" Key: AIRFLOW-1105 URL: https://issues.apache.org/jira/browse/AIRFLOW-1105 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1109) kill_process_tree should use KILL signal and log results
Alex Guziel created AIRFLOW-1109: Summary: kill_process_tree should use KILL signal and log results Key: AIRFLOW-1109 URL: https://issues.apache.org/jira/browse/AIRFLOW-1109 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Closed] (AIRFLOW-1109) kill_process_tree should use KILL signal and log results
[ https://issues.apache.org/jira/browse/AIRFLOW-1109?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel closed AIRFLOW-1109. Resolution: Fixed > kill_process_tree should use KILL signal and log results > > > Key: AIRFLOW-1109 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1109 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1112) Log which pool is full in scheduler when pool slots are full
Alex Guziel created AIRFLOW-1112: Summary: Log which pool is full in scheduler when pool slots are full Key: AIRFLOW-1112 URL: https://issues.apache.org/jira/browse/AIRFLOW-1112 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (AIRFLOW-1112) Log which pool is full in scheduler when pool slots are full
[ https://issues.apache.org/jira/browse/AIRFLOW-1112?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel resolved AIRFLOW-1112. -- Resolution: Fixed Fix Version/s: 1.9.0 Issue resolved by pull request #2242 [https://github.com/apache/incubator-airflow/pull/2242] > Log which pool is full in scheduler when pool slots are full > > > Key: AIRFLOW-1112 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1112 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > Fix For: 1.9.0 > > -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Commented] (AIRFLOW-1131) DockerOperator jobs time out and get stuck in "running" forever
[ https://issues.apache.org/jira/browse/AIRFLOW-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15977160#comment-15977160 ] Alex Guziel commented on AIRFLOW-1131: -- Did you verify it was not actually running? Are you using celery_executor? The reload task actually also fails because of ```{models.py:1140} INFO - Dependencies not met for , dependency 'Task Instance Not Already Running' FAILED: Task is already running, it started on 2017-04-20 11:19:59.597425.``` so it never actually gets run. The original continues to run in our case. > DockerOperator jobs time out and get stuck in "running" forever > --- > > Key: AIRFLOW-1131 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1131 > Project: Apache Airflow > Issue Type: Bug > Components: scheduler >Affects Versions: 1.9.0 > Environment: Python 2.7.12 > git+git://github.com/apache/incubator-airflow.git@35e43f5067f4741640278b765c0e54e4fd45ffa3#egg=airflow[async,password,celery,crypto,postgres,hive,hdfs,jdbc] >Reporter: Vitor Baptista > > With the following DAG and task: > {code} > import os > from datetime import datetime, timedelta > from airflow.models import DAG > from airflow.operators.docker_operator import DockerOperator > default_args = { > 'owner': 'airflow', > 'depends_on_past': False, > 'start_date': datetime(2017, 1, 1), > 'retries': 3, > 'retry_delay': timedelta(minutes=10), > } > dag = DAG( > dag_id='smoke_test', > default_args=default_args, > max_active_runs=1, > schedule_interval='@daily' > ) > sleep_forever_task = DockerOperator( > task_id='sleep_forever', > dag=dag, > image='alpine:latest', > api_version=os.environ.get('DOCKER_API_VERSION', '1.23'), > command='sleep {}'.format(60 * 60 * 24), > ) > {code} > When I run it, this is what I get: > {code} > *** Log file isn't local. 
> *** Fetching here: > http://589ea17432ec:8793/log/smoke_test/sleep_forever/2017-04-18T00:00:00 > [2017-04-20 11:19:58,258] {models.py:172} INFO - Filling up the DagBag from > /usr/local/airflow/dags/smoke_test.py > [2017-04-20 11:19:58,438] {base_task_runner.py:112} INFO - Running: ['bash', > '-c', u'airflow run smoke_test sleep_forever 2017-04-18T00:00:00 --job_id > 2537 --raw -sd DAGS_FOLDER/smoke_test.py'] > [2017-04-20 11:19:58,888] {base_task_runner.py:95} INFO - Subtask: > /usr/local/airflow/src/airflow/airflow/configuration.py:128: > DeprecationWarning: This method will be removed in future versions. Use > 'parser.read_file()' instead. > [2017-04-20 11:19:58,888] {base_task_runner.py:95} INFO - Subtask: > self.readfp(StringIO.StringIO(string)) > [2017-04-20 11:19:59,214] {base_task_runner.py:95} INFO - Subtask: > [2017-04-20 11:19:59,214] {__init__.py:56} INFO - Using executor > CeleryExecutor > [2017-04-20 11:19:59,227] {base_task_runner.py:95} INFO - Subtask: > [2017-04-20 11:19:59,227] {driver.py:120} INFO - Generating grammar tables > from /usr/lib/python2.7/lib2to3/Grammar.txt > [2017-04-20 11:19:59,244] {base_task_runner.py:95} INFO - Subtask: > [2017-04-20 11:19:59,244] {driver.py:120} INFO - Generating grammar tables > from /usr/lib/python2.7/lib2to3/PatternGrammar.txt > [2017-04-20 11:19:59,377] {base_task_runner.py:95} INFO - Subtask: > [2017-04-20 11:19:59,377] {models.py:172} INFO - Filling up the DagBag from > /usr/local/airflow/dags/smoke_test.py > [2017-04-20 11:19:59,597] {base_task_runner.py:95} INFO - Subtask: > [2017-04-20 11:19:59,597] {models.py:1146} INFO - Dependencies all met for > > [2017-04-20 11:19:59,605] {base_task_runner.py:95} INFO - Subtask: > [2017-04-20 11:19:59,605] {models.py:1146} INFO - Dependencies all met for > > [2017-04-20 11:19:59,605] {base_task_runner.py:95} INFO - Subtask: > [2017-04-20 11:19:59,605] {models.py:1338} INFO - > [2017-04-20 11:19:59,606] {base_task_runner.py:95} INFO - Subtask: > > [2017-04-20 
11:19:59,606] {base_task_runner.py:95} INFO - Subtask: Starting > attempt 1 of 4 > [2017-04-20 11:19:59,606] {base_task_runner.py:95} INFO - Subtask: > > [2017-04-20 11:19:59,606] {base_task_runner.py:95} INFO - Subtask: > [2017-04-20 11:19:59,620] {base_task_runner.py:95} INFO - Subtask: > [2017-04-20 11:19:59,620] {models.py:1362} INFO - Executing > on 2017-04-18 00:00:00 > [2017-04-20 11:19:59,662] {base_task_runner.py:95} INFO - Subtask: > [2017-04-20 11:19:59,661] {docker_operator.py:132} INFO - Starting docker > container from image alpine:latest > [2017-04-20 12:21:25,661] {models.py:172} INFO - Filling up the DagBag from > /usr/local/airflow/dags/smoke_test.py > [2017-04-20 12:21:25,809] {base_task_runner.py:112} I
[jira] [Created] (AIRFLOW-1133) More tasks than the concurrency limit can run
Alex Guziel created AIRFLOW-1133: Summary: More tasks than the concurrency limit can run Key: AIRFLOW-1133 URL: https://issues.apache.org/jira/browse/AIRFLOW-1133 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel There are two ways to enforce dag concurrency: 1) The scheduler not queueing too many tasks (by checking the number of tasks running) 2) The worker checking that not too many tasks are running (via the db) Right now, both have issues. Check 1 doesn't consider queued tasks, which may not be running now but will be running soon. Ideally, check 2 would catch this, but it does not check the condition properly as it only locks the row, and locking the dag would likely be expensive. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
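The gap in the scheduler-side check can be illustrated with a toy slot counter. This is a sketch under assumptions: task states are plain strings, and `slots_available` and its `count_queued` flag are hypothetical names, not scheduler code.

```python
def slots_available(concurrency, states, count_queued):
    """Return how many more tasks may be queued under a concurrency limit.

    `states` is an iterable of task-instance state strings. When
    `count_queued` is False only running tasks use up slots, which is the
    behaviour the issue describes for the scheduler-side check.
    """
    active = {"running", "queued"} if count_queued else {"running"}
    used = sum(1 for state in states if state in active)
    return max(concurrency - used, 0)


# Two running and two queued tasks against a limit of four.
states = ["running", "running", "queued", "queued"]
ignoring_queued = slots_available(4, states, count_queued=False)
counting_queued = slots_available(4, states, count_queued=True)
```

Ignoring queued tasks reports two free slots here, so two more tasks get queued and six tasks can end up running against a limit of four.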
[jira] [Assigned] (AIRFLOW-1036) Exponential backoff should use randomization
[ https://issues.apache.org/jira/browse/AIRFLOW-1036?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel reassigned AIRFLOW-1036: Assignee: Alex Guziel > Exponential backoff should use randomization > > > Key: AIRFLOW-1036 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1036 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel >Assignee: Alex Guziel > > This prevents the thundering herd problem. I think with the current way this > is used, we would need to use some hashing function based on some subset of > the dag_run, task_id, dag_id, and execution_date to emulate the RNG. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
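One way to emulate randomness with a hash, as suggested above, is sketched below. The function name, the choice of SHA-1, the key layout, and the "between half and full delay" spread are all assumptions for illustration; the issue does not specify any of them.

```python
import hashlib


def jittered_delay(base_seconds, try_number, dag_id, task_id, execution_date):
    """Exponential backoff with deterministic jitter derived from a hash.

    The same task instance and try number always produce the same delay,
    while different task instances are spread across the interval
    [ceiling / 2, ceiling].
    """
    ceiling = base_seconds * (2 ** (try_number - 1))
    key = "{}#{}#{}#{}".format(dag_id, task_id, execution_date, try_number)
    digest = hashlib.sha1(key.encode("utf-8")).hexdigest()
    # Map the hash onto a fraction between 0 and 1.
    fraction = int(digest, 16) / float(16 ** len(digest))
    return ceiling * (0.5 + 0.5 * fraction)
```

Because the jitter is a pure function of the task instance, the retry time can be recomputed anywhere without storing a random value, yet tasks that failed together no longer retry at the same instant.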
[jira] [Commented] (AIRFLOW-1155) Add Tails.com to community
[ https://issues.apache.org/jira/browse/AIRFLOW-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15989420#comment-15989420 ] Alex Guziel commented on AIRFLOW-1155: -- Hmm, for some reason I can't close this issue > Add Tails.com to community > -- > > Key: AIRFLOW-1155 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1155 > Project: Apache Airflow > Issue Type: Wish > Components: Documentation >Reporter: Alan Cruickshank >Assignee: Alex Guziel >Priority: Trivial > Fix For: 1.9.0 > > > Add to README.md > ``` > 1. [Tails.com](https://tails.com/) > [[@alanmcruickshank](https://github.com/alanmcruickshank)] > ``` -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (AIRFLOW-1155) Add Tails.com to community
[ https://issues.apache.org/jira/browse/AIRFLOW-1155?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel reassigned AIRFLOW-1155: Assignee: Alex Guziel (was: Alan Cruickshank) > Add Tails.com to community > -- > > Key: AIRFLOW-1155 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1155 > Project: Apache Airflow > Issue Type: Wish > Components: Documentation >Reporter: Alan Cruickshank >Assignee: Alex Guziel >Priority: Trivial > Fix For: 1.9.0 > > > Add to README.md > ``` > 1. [Tails.com](https://tails.com/) > [[@alanmcruickshank](https://github.com/alanmcruickshank)] > ``` -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1268) Celery bug can cause tasks to be delayed indefinitely
Alex Guziel created AIRFLOW-1268: Summary: Celery bug can cause tasks to be delayed indefinitely Key: AIRFLOW-1268 URL: https://issues.apache.org/jira/browse/AIRFLOW-1268 Project: Apache Airflow Issue Type: Bug Components: celery Environment: With celery_executor with redis Reporter: Alex Guziel Priority: Blocker With celery, tasks can get delayed indefinitely (or default 1 hour) due to a bug with celery, see https://github.com/celery/celery/issues/3765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Updated] (AIRFLOW-1268) Celery bug can cause tasks to be delayed indefinitely
[ https://issues.apache.org/jira/browse/AIRFLOW-1268?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel updated AIRFLOW-1268: - Priority: Critical (was: Blocker) > Celery bug can cause tasks to be delayed indefinitely > - > > Key: AIRFLOW-1268 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1268 > Project: Apache Airflow > Issue Type: Bug > Components: celery > Environment: With celery_executor with redis >Reporter: Alex Guziel >Priority: Critical > > With celery, tasks can get delayed indefinitely (or default 1 hour) due to a > bug with celery, see https://github.com/celery/celery/issues/3765 -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1269) Job table should be reloaded into memory on Scheduler start
Alex Guziel created AIRFLOW-1269: Summary: Job table should be reloaded into memory on Scheduler start Key: AIRFLOW-1269 URL: https://issues.apache.org/jira/browse/AIRFLOW-1269 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Right now, running jobs that stop heartbeating will not be restarted if the scheduler has been restarted since then. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1289) Don't restrict scheduler threads to CPU cores
Alex Guziel created AIRFLOW-1289: Summary: Don't restrict scheduler threads to CPU cores Key: AIRFLOW-1289 URL: https://issues.apache.org/jira/browse/AIRFLOW-1289 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel There's really no reason to, and DAG processing can have blocking IO -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Assigned] (AIRFLOW-1265) Exception happens when loading celery configurations.
[ https://issues.apache.org/jira/browse/AIRFLOW-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel reassigned AIRFLOW-1265: Assignee: Alex Guziel (was: Chienhsiung Chao) > Exception happens when loading celery configurations. > - > > Key: AIRFLOW-1265 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1265 > Project: Apache Airflow > Issue Type: Bug > Environment: OSX >Reporter: Chienhsiung Chao >Assignee: Alex Guziel > > airflow@f300f25ced3a:/usr/local/script$ airflow > [2017-06-02 02:25:59,263] {configuration.py:199} WARNING - section/key > [celery/celery_ssl_key] not found in config > Traceback (most recent call last): > File "/incubator-airflow/airflow/executors/celery_executor.py", line 52, in > CeleryConfig > BROKER_USE_SSL = {'keyfile': configuration.get('celery', > 'CELERY_SSL_KEY'), > File "/incubator-airflow/airflow/configuration.py", line 398, in get > return conf.get(section, key, **kwargs) > File "/incubator-airflow/airflow/configuration.py", line 203, in get > "in config".format(**locals())) > airflow.exceptions.AirflowConfigException: section/key > [celery/celery_ssl_key] not found in config > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "/usr/local/bin/airflow", line 6, in > exec(compile(open(__file__).read(), __file__, 'exec')) > File "/incubator-airflow/airflow/bin/airflow", line 18, in > from airflow.bin.cli import CLIFactory > File "/incubator-airflow/airflow/bin/cli.py", line 46, in > from airflow import jobs, settings > File "/incubator-airflow/airflow/jobs.py", line 66, in > class BaseJob(Base, LoggingMixin): > File "/incubator-airflow/airflow/jobs.py", line 98, in BaseJob > executor=executors.GetDefaultExecutor(), > File "/incubator-airflow/airflow/executors/__init__.py", line 43, in > GetDefaultExecutor > DEFAULT_EXECUTOR = _get_executor(executor_name) > File "/incubator-airflow/airflow/executors/__init__.py", line 60, in > _get_executor > 
from airflow.executors.celery_executor import CeleryExecutor > File "/incubator-airflow/airflow/executors/celery_executor.py", line 38, in > > class CeleryConfig(object): > File "/incubator-airflow/airflow/executors/celery_executor.py", line 60, in > CeleryConfig > raise AirflowException('Exception: There was an unknown Celery SSL Error. > Please ensure you want to use ' > airflow.exceptions.AirflowException: Exception: There was an unknown Celery > SSL Error. Please ensure you want to use SSL and/or have all necessary certs > and key. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Resolved] (AIRFLOW-1289) Don't restrict scheduler threads to CPU cores
[ https://issues.apache.org/jira/browse/AIRFLOW-1289?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel resolved AIRFLOW-1289. -- Resolution: Fixed Fix Version/s: 1.9.0 Issue resolved by pull request #2353 [https://github.com/apache/incubator-airflow/pull/2353] > Don't restrict scheduler threads to CPU cores > - > > Key: AIRFLOW-1289 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1289 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > Fix For: 1.9.0 > > > There's really no reason to, and DAG processing can have blocking IO -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (AIRFLOW-1309) Add optional hive_tblproperties in HiveToDruidTransfer
Alex Guziel created AIRFLOW-1309: Summary: Add optional hive_tblproperties in HiveToDruidTransfer Key: AIRFLOW-1309 URL: https://issues.apache.org/jira/browse/AIRFLOW-1309 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel Priority: Minor We should accept tblproperties for the tmp table in druid -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AIRFLOW-1334) Improve efficiency of checking for backfills on scheduler loop
Alex Guziel created AIRFLOW-1334: Summary: Improve efficiency of checking for backfills on scheduler loop Key: AIRFLOW-1334 URL: https://issues.apache.org/jira/browse/AIRFLOW-1334 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel Right now, it makes a query for each TI, which is quite slow. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AIRFLOW-1335) Use buffered logger
Alex Guziel created AIRFLOW-1335: Summary: Use buffered logger Key: AIRFLOW-1335 URL: https://issues.apache.org/jira/browse/AIRFLOW-1335 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AIRFLOW-1334) Improve efficiency of checking for backfills on scheduler loop
[ https://issues.apache.org/jira/browse/AIRFLOW-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel updated AIRFLOW-1334: - Fix Version/s: 1.8.3 > Improve efficiency of checking for backfills on scheduler loop > -- > > Key: AIRFLOW-1334 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1334 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel >Assignee: Alex Guziel > Fix For: 1.8.3 > > > Right now, it makes a query for each TI, which is quite slow. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Updated] (AIRFLOW-1335) Use buffered logger
[ https://issues.apache.org/jira/browse/AIRFLOW-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel updated AIRFLOW-1335: - Fix Version/s: 1.9.0 > Use buffered logger > --- > > Key: AIRFLOW-1335 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1335 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel >Assignee: Alex Guziel > Fix For: 1.9.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (AIRFLOW-1335) Use buffered logger
[ https://issues.apache.org/jira/browse/AIRFLOW-1335?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel resolved AIRFLOW-1335. -- Resolution: Fixed Issue resolved by pull request #2386 [https://github.com/apache/incubator-airflow/pull/2386] > Use buffered logger > --- > > Key: AIRFLOW-1335 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1335 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel >Assignee: Alex Guziel > Fix For: 1.9.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AIRFLOW-1345) Don't commit on each loop
Alex Guziel created AIRFLOW-1345: Summary: Don't commit on each loop Key: AIRFLOW-1345 URL: https://issues.apache.org/jira/browse/AIRFLOW-1345 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel Fix For: 1.8.3 Right now, in the main scheduler loop, we commit for each TI. While this minimizes the time a lock is held, this expires all TIs, forcing us to do an n+1 query. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AIRFLOW-1348) Paginated UI has broken toggles after first page
[ https://issues.apache.org/jira/browse/AIRFLOW-1348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16063831#comment-16063831 ] Alex Guziel commented on AIRFLOW-1348: -- I don't think this is a new issue. It has been like this on Airbnb production for quite a while. > Paginated UI has broken toggles after first page > > > Key: AIRFLOW-1348 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1348 > Project: Apache Airflow > Issue Type: Bug >Affects Versions: 1.8.2 >Reporter: Chris Riccomini > Attachments: page1.png, page2.png > > > After upgrading to 1.8.2rc2, I'm seeing the main page paginate my list of > Airflow DAGs. Unfortunately, the toggles turn to checkboxes after the first > page. I'm attaching some screenshots to illustrate. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (AIRFLOW-1334) Improve efficiency of checking for backfills on scheduler loop
[ https://issues.apache.org/jira/browse/AIRFLOW-1334?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel resolved AIRFLOW-1334. -- Resolution: Fixed Issue resolved by pull request #2384 [https://github.com/apache/incubator-airflow/pull/2384] > Improve efficiency of checking for backfills on scheduler loop > -- > > Key: AIRFLOW-1334 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1334 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel >Assignee: Alex Guziel > Fix For: 1.8.3 > > > Right now, it makes a query for each TI, which is quite slow. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AIRFLOW-1352) Revert bad logging Handler
Alex Guziel created AIRFLOW-1352: Summary: Revert bad logging Handler Key: AIRFLOW-1352 URL: https://issues.apache.org/jira/browse/AIRFLOW-1352 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Right now it uses some weird API so I'll revert rather than fix -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (AIRFLOW-1352) Revert bad logging Handler
[ https://issues.apache.org/jira/browse/AIRFLOW-1352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel resolved AIRFLOW-1352. -- Resolution: Fixed Fix Version/s: 1.9.0 Issue resolved by pull request #2403 [https://github.com/apache/incubator-airflow/pull/2403] > Revert bad logging Handler > -- > > Key: AIRFLOW-1352 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1352 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > Fix For: 1.9.0 > > > Right now it uses some weird API so I'll revert rather than fix -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AIRFLOW-1265) Exception happens when loading celery configurations.
[ https://issues.apache.org/jira/browse/AIRFLOW-1265?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16070582#comment-16070582 ] Alex Guziel commented on AIRFLOW-1265: -- This is done but celery messed up > Exception happens when loading celery configurations. > - > > Key: AIRFLOW-1265 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1265 > Project: Apache Airflow > Issue Type: Bug > Environment: OSX >Reporter: Chienhsiung Chao >Assignee: Alex Guziel > > airflow@f300f25ced3a:/usr/local/script$ airflow > [2017-06-02 02:25:59,263] {configuration.py:199} WARNING - section/key > [celery/celery_ssl_key] not found in config > Traceback (most recent call last): > File "/incubator-airflow/airflow/executors/celery_executor.py", line 52, in > CeleryConfig > BROKER_USE_SSL = {'keyfile': configuration.get('celery', > 'CELERY_SSL_KEY'), > File "/incubator-airflow/airflow/configuration.py", line 398, in get > return conf.get(section, key, **kwargs) > File "/incubator-airflow/airflow/configuration.py", line 203, in get > "in config".format(**locals())) > airflow.exceptions.AirflowConfigException: section/key > [celery/celery_ssl_key] not found in config > During handling of the above exception, another exception occurred: > Traceback (most recent call last): > File "/usr/local/bin/airflow", line 6, in > exec(compile(open(__file__).read(), __file__, 'exec')) > File "/incubator-airflow/airflow/bin/airflow", line 18, in > from airflow.bin.cli import CLIFactory > File "/incubator-airflow/airflow/bin/cli.py", line 46, in > from airflow import jobs, settings > File "/incubator-airflow/airflow/jobs.py", line 66, in > class BaseJob(Base, LoggingMixin): > File "/incubator-airflow/airflow/jobs.py", line 98, in BaseJob > executor=executors.GetDefaultExecutor(), > File "/incubator-airflow/airflow/executors/__init__.py", line 43, in > GetDefaultExecutor > DEFAULT_EXECUTOR = _get_executor(executor_name) > File 
"/incubator-airflow/airflow/executors/__init__.py", line 60, in > _get_executor > from airflow.executors.celery_executor import CeleryExecutor > File "/incubator-airflow/airflow/executors/celery_executor.py", line 38, in > > class CeleryConfig(object): > File "/incubator-airflow/airflow/executors/celery_executor.py", line 60, in > CeleryConfig > raise AirflowException('Exception: There was an unknown Celery SSL Error. > Please ensure you want to use ' > airflow.exceptions.AirflowException: Exception: There was an unknown Celery > SSL Error. Please ensure you want to use SSL and/or have all necessary certs > and key. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (AIRFLOW-1059) Reset_state_for_orphaned_task should operate in batch for the scheduler
[ https://issues.apache.org/jira/browse/AIRFLOW-1059?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel resolved AIRFLOW-1059. -- Resolution: Fixed Fix Version/s: 1.9.0 Issue resolved by pull request #2205 [https://github.com/apache/incubator-airflow/pull/2205] > Reset_state_for_orphaned_task should operate in batch for the scheduler > --- > > Key: AIRFLOW-1059 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1059 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel >Assignee: Alex Guziel > Fix For: 1.9.0 > > > Scheduler startup is very slow because resetting state makes a query for each > dag run. We should be able to do this in a constant number of queries, which > will reduce scheduler startup time significantly. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (AIRFLOW-1345) Don't commit on each loop
[ https://issues.apache.org/jira/browse/AIRFLOW-1345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel resolved AIRFLOW-1345. -- Resolution: Fixed Fix Version/s: (was: 1.8.3) 1.9.0 Issue resolved by pull request #2397 [https://github.com/apache/incubator-airflow/pull/2397] > Don't commit on each loop > - > > Key: AIRFLOW-1345 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1345 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel >Assignee: Alex Guziel > Fix For: 1.9.0 > > > Right now, in the main scheduler loop, we commit for each TI. While this > minimizes the time a lock is held, this expires all TIs, forcing us to do an > n+1 query. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AIRFLOW-1438) Scheduler batch queries should have a limit
Alex Guziel created AIRFLOW-1438: Summary: Scheduler batch queries should have a limit Key: AIRFLOW-1438 URL: https://issues.apache.org/jira/browse/AIRFLOW-1438 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Each batch is sent as one query, which is subject to a length limit, and the queries hold locks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
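A common way to bound batch size is to split the work into fixed-size chunks and issue one query per chunk. The helper below is a generic sketch; `chunked` and `max_size` are hypothetical names, and the actual limit chosen in the pull request is not stated here.

```python
def chunked(items, max_size):
    """Yield consecutive slices of `items`, each at most `max_size` long.

    Issuing one query per slice keeps every statement under a length
    limit and shortens the time any single statement holds its locks.
    """
    for start in range(0, len(items), max_size):
        yield items[start:start + max_size]


# Five task-instance keys processed at most two per query.
batches = list(chunked(["ti1", "ti2", "ti3", "ti4", "ti5"], 2))
```

The number of queries is then the item count divided by `max_size`, rounded up, which stays far below one query per item while avoiding a single unbounded statement.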
[jira] [Resolved] (AIRFLOW-1438) Scheduler batch queries should have a limit
[ https://issues.apache.org/jira/browse/AIRFLOW-1438?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel resolved AIRFLOW-1438. -- Resolution: Fixed Fix Version/s: 1.9.0 Issue resolved by pull request #2462 [https://github.com/apache/incubator-airflow/pull/2462] > Scheduler batch queries should have a limit > --- > > Key: AIRFLOW-1438 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1438 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > Fix For: 1.9.0 > > > Since they are in one query and there's a length limit, and they hold locks. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AIRFLOW-1492) Add metric for task success/failure
Alex Guziel created AIRFLOW-1492: Summary: Add metric for task success/failure Key: AIRFLOW-1492 URL: https://issues.apache.org/jira/browse/AIRFLOW-1492 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AIRFLOW-1493) Fix race condition with airflow run
Alex Guziel created AIRFLOW-1493: Summary: Fix race condition with airflow run Key: AIRFLOW-1493 URL: https://issues.apache.org/jira/browse/AIRFLOW-1493 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Currently, airflow run spawns a process `airflow run --local` which spawns `airflow run --raw`. Local manages the heartbeat. Raw performs a series of checks, sets the state to running, runs the task, then sets the state to failed or success. The problem is the heartbeat check on `airflow run --local` has to monitor the state in the DB, but because the change of state to running happens asynchronously, it must first observe the state in the DB to be running before it has the power of termination. However, there is no guarantee that it will observe this state. Thus, we should move the pre-execution logic to airflow run --local -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (AIRFLOW-1492) Add metric for task success/failure
[ https://issues.apache.org/jira/browse/AIRFLOW-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel resolved AIRFLOW-1492. -- Resolution: Fixed Fix Version/s: 1.9.0 Issue resolved by pull request #2504 [https://github.com/apache/incubator-airflow/pull/2504] > Add metric for task success/failure > --- > > Key: AIRFLOW-1492 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1492 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel >Assignee: Alex Guziel > Fix For: 1.9.0 > > -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AIRFLOW-1512) Add operator for running Python functions in a virtualenv
Alex Guziel created AIRFLOW-1512: Summary: Add operator for running Python functions in a virtualenv Key: AIRFLOW-1512 URL: https://issues.apache.org/jira/browse/AIRFLOW-1512 Project: Apache Airflow Issue Type: New Feature Components: operators Reporter: Alex Guziel Assignee: Alex Guziel -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AIRFLOW-1522) Increase size of val column for variable table in MySQL
Alex Guziel created AIRFLOW-1522: Summary: Increase size of val column for variable table in MySQL Key: AIRFLOW-1522 URL: https://issues.apache.org/jira/browse/AIRFLOW-1522 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel Right now, it's 64KB, which is a bit too small. This increases it to 16MB. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AIRFLOW-1569) Requeue Celery tasks in RESERVED state
Alex Guziel created AIRFLOW-1569: Summary: Requeue Celery tasks in RESERVED state Key: AIRFLOW-1569 URL: https://issues.apache.org/jira/browse/AIRFLOW-1569 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Right now, "fair" scheduling in Celery doesn't quite work (some tasks get RESERVED, which means they will get blocked from execution even if there are open slots). We should requeue them after 2 heartbeats. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
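One way to read "requeue them after 2 heartbeats" is to count how many consecutive heartbeats a task has been observed in the RESERVED state. The sketch below assumes that reading; the class `ReservedTracker`, its method names, and the plain dict of task states are hypothetical and do not use the Celery or Airflow APIs.

```python
from collections import defaultdict
from typing import Dict, List

RESERVED = "RESERVED"


class ReservedTracker:
    """Counts consecutive heartbeats on which a task was seen as RESERVED."""

    def __init__(self, max_heartbeats: int = 2) -> None:
        self.max_heartbeats = max_heartbeats
        self._seen: Dict[str, int] = defaultdict(int)

    def on_heartbeat(self, states: Dict[str, str]) -> List[str]:
        """Return the ids of tasks that should be requeued on this heartbeat."""
        to_requeue = []
        for task_id, state in states.items():
            if state == RESERVED:
                self._seen[task_id] += 1
                if self._seen[task_id] >= self.max_heartbeats:
                    to_requeue.append(task_id)
                    # Start counting from zero again after a requeue.
                    self._seen.pop(task_id)
            else:
                # The task left RESERVED, so its streak is broken.
                self._seen.pop(task_id, None)
        # Forget tasks that are no longer reported at all.
        for task_id in list(self._seen):
            if task_id not in states:
                del self._seen[task_id]
        return to_requeue
```

The caller would be responsible for actually revoking and resubmitting the returned task ids; that part depends on the executor and is left out here.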
[jira] [Created] (AIRFLOW-1570) Tasks that finish unnaturally don't surface error messages.
Alex Guziel created AIRFLOW-1570: Summary: Tasks that finish unnaturally don't surface error messages. Key: AIRFLOW-1570 URL: https://issues.apache.org/jira/browse/AIRFLOW-1570 Project: Apache Airflow Issue Type: Bug Reporter: Alex Guziel Assignee: Alex Guziel Right now, there are two Airflow wrapper tasks to run tasks: local and raw (local runs raw and checks its status). Local's checks work fine when raw changes the state, but do not surface errors when raw exits abruptly (i.e., in the event of a SIGKILL). This should be changed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Commented] (AIRFLOW-1570) Tasks that finish unnaturally don't surface error messages.
[ https://issues.apache.org/jira/browse/AIRFLOW-1570?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16154572#comment-16154572 ] Alex Guziel commented on AIRFLOW-1570: -- This is fixed by another PR > Tasks that finish unnaturally don't surface error messages. > --- > > Key: AIRFLOW-1570 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1570 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > > Right now, there are two Airflow wrapper tasks to run tasks: local and raw > (local runs raw and checks its status). Local's checks work fine when raw > changes the state, but do not surface errors when raw exits abruptly (i.e., > in the event of a SIGKILL). This should be changed. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Resolved] (AIRFLOW-1493) Fix race condition with airflow run
[ https://issues.apache.org/jira/browse/AIRFLOW-1493?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel resolved AIRFLOW-1493. -- Resolution: Fixed Fix Version/s: 1.9.0 Issue resolved by pull request #2505 [https://github.com/apache/incubator-airflow/pull/2505] > Fix race condition with airflow run > --- > > Key: AIRFLOW-1493 > URL: https://issues.apache.org/jira/browse/AIRFLOW-1493 > Project: Apache Airflow > Issue Type: Bug >Reporter: Alex Guziel >Assignee: Alex Guziel > Fix For: 1.9.0 > > > Currently, airflow run spawns a process `airflow run --local` which spawns > `airflow run --raw`. > Local manages the heartbeat. Raw performs a series of checks, sets the state > to running, runs the task, then sets the state to failed or success. > The problem is the heartbeat check on `airflow run --local` has to monitor > the state in the DB, but because the change of state to running happens > asynchronously, it must first observe the state in the DB to be running > before it has the power of termination. However, there is no guarantee that > it will observe this state. Thus, we should move the pre-execution logic to > airflow run --local -- This message was sent by Atlassian JIRA (v6.4.14#64029)
[jira] [Created] (AIRFLOW-422) Expose email for ingestion by performance monitoring tools
Alex Guziel created AIRFLOW-422: --- Summary: Expose email for ingestion by performance monitoring tools Key: AIRFLOW-422 URL: https://issues.apache.org/jira/browse/AIRFLOW-422 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel Right now, we surface some data for third party tools (https://issues.apache.org/jira/browse/AIRFLOW-244). We want emails as well so these tools can notify the right people about jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AIRFLOW-422) Expose task information for ingestion by performance monitoring tools
[ https://issues.apache.org/jira/browse/AIRFLOW-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel updated AIRFLOW-422: Summary: Expose task information for ingestion by performance monitoring tools (was: Expose email for ingestion by performance monitoring tools) > Expose task information for ingestion by performance monitoring tools > - > > Key: AIRFLOW-422 > URL: https://issues.apache.org/jira/browse/AIRFLOW-422 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel >Assignee: Alex Guziel > Original Estimate: 1h > Remaining Estimate: 1h > > Right now, we surface some data for third party tools > (https://issues.apache.org/jira/browse/AIRFLOW-244). We want emails as well > so these tools can notify the right people about jobs. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AIRFLOW-422) Expose task information for ingestion by performance monitoring tools
[ https://issues.apache.org/jira/browse/AIRFLOW-422?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel updated AIRFLOW-422: Description: Right now, we surface some data for third party tools (https://issues.apache.org/jira/browse/AIRFLOW-244). We want other fields to be accessible by these tools. For example, emails will allow us to send digests. was:Right now, we surface some data for third party tools (https://issues.apache.org/jira/browse/AIRFLOW-244). We want emails as well so these tools can notify the right people about jobs. > Expose task information for ingestion by performance monitoring tools > - > > Key: AIRFLOW-422 > URL: https://issues.apache.org/jira/browse/AIRFLOW-422 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel >Assignee: Alex Guziel > Original Estimate: 1h > Remaining Estimate: 1h > > Right now, we surface some data for third party tools > (https://issues.apache.org/jira/browse/AIRFLOW-244). We want other fields to > be accessible by these tools. For example, emails will allow us to send > digests. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AIRFLOW-677) Kill zombie tasks after missing heartbeats
Alex Guziel created AIRFLOW-677: --- Summary: Kill zombie tasks after missing heartbeats Key: AIRFLOW-677 URL: https://issues.apache.org/jira/browse/AIRFLOW-677 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel If there's a connection error while heartbeating, it should retry. Also, if it hasn't been able to heartbeat for a while, it should kill the child processes so that we don't have 2 of the same task running. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
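The behaviour requested in AIRFLOW-677 (retry on connection errors, then kill the child processes once heartbeats have been failing for too long) might look roughly like the loop below. This is a sketch under assumptions, not Airflow's implementation: `heartbeat`, `kill_children`, `should_stop`, `interval` and `max_silence` are hypothetical names, and the clock and sleep functions are injectable only so the loop can be exercised without real waiting.

```python
import time
from typing import Callable


def heartbeat_loop(
    heartbeat: Callable[[], None],
    kill_children: Callable[[], None],
    should_stop: Callable[[], bool],
    interval: float = 5.0,
    max_silence: float = 300.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Heartbeat until told to stop.

    Returns True on a normal stop, or False if the children were killed
    because no heartbeat succeeded for more than `max_silence` seconds.
    """
    last_ok = clock()
    while not should_stop():
        try:
            heartbeat()
            last_ok = clock()
        except ConnectionError:
            # Transient failure: retry on the next iteration, unless we
            # have been unable to heartbeat for too long.
            if clock() - last_ok > max_silence:
                kill_children()
                return False
        sleep(interval)
    return True
```

Killing the children once the silence threshold is exceeded is what prevents two copies of the same task from running if the scheduler has already treated the silent one as dead.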
[jira] [Created] (AIRFLOW-679) Stop concurrent task instances from running due to race conditions
Alex Guziel created AIRFLOW-679: --- Summary: Stop concurrent task instances from running due to race conditions Key: AIRFLOW-679 URL: https://issues.apache.org/jira/browse/AIRFLOW-679 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel Priority: Minor Right now, multiple copies of the same task instance can run if someone clicks on the UI multiple times. To fix this, I propose two things: 1) Use a transaction to set state to running, and don't run otherwise 2) record hostname and pid in TaskInstance table, then when heartbeating, only continue running if it matches -- This message was sent by Atlassian JIRA (v6.3.4#6332)
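Both proposals in AIRFLOW-679 can be illustrated with a small sketch. It uses the standard-library `sqlite3` module and a hypothetical `task_instance` table with `task_id`, `state`, `hostname` and `pid` columns; the real Airflow schema and ORM code differ. The function names `try_claim` and `still_owner` are likewise invented for illustration.

```python
import sqlite3


def try_claim(conn: sqlite3.Connection, task_id: str, hostname: str, pid: int) -> bool:
    """Atomically move a task from 'queued' to 'running'.

    The UPDATE only matches while the row is still 'queued', so when two
    processes race, exactly one of them sees rowcount == 1.
    """
    with conn:  # commits on success, rolls back on an exception
        cur = conn.execute(
            "UPDATE task_instance SET state = 'running', hostname = ?, pid = ? "
            "WHERE task_id = ? AND state = 'queued'",
            (hostname, pid, task_id),
        )
    return cur.rowcount == 1


def still_owner(conn: sqlite3.Connection, task_id: str, hostname: str, pid: int) -> bool:
    """Heartbeat-time check: keep running only if the recorded owner is us."""
    row = conn.execute(
        "SELECT hostname, pid FROM task_instance WHERE task_id = ?",
        (task_id,),
    ).fetchone()
    return row is not None and row[0] == hostname and row[1] == pid
```

A process that loses the claim should not start the task, and a running process whose `still_owner` check fails during a heartbeat should stop, since another host or pid has taken over the task instance.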
[jira] [Closed] (AIRFLOW-677) Kill zombie tasks after missing heartbeats
[ https://issues.apache.org/jira/browse/AIRFLOW-677?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel closed AIRFLOW-677. --- Resolution: Fixed > Kill zombie tasks after missing heartbeats > -- > > Key: AIRFLOW-677 > URL: https://issues.apache.org/jira/browse/AIRFLOW-677 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel >Assignee: Alex Guziel > > If there's a connection error while heartbeating, it should retry. Also, if > it hasn't been able to heartbeat for a while, it should kill the child > processes so that we don't have 2 of the same task running. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AIRFLOW-679) Stop concurrent task instances from running due to race conditions
[ https://issues.apache.org/jira/browse/AIRFLOW-679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel updated AIRFLOW-679: Description: Right now, multiple copies of the same task instance can run if someone clicks on the UI multiple times. To fix this, I propose two things: 1) record hostname and pid in TaskInstance table, then when heartbeating, only continue running if it matches was: Right now, multiple copies of the same task instance can run if someone clicks on the UI multiple times. To fix this, I propose two things: 1) Use a transaction to set state to running, and don't run otherwise 2) record hostname and pid in TaskInstance table, then when heartbeating, only continue running if it matches > Stop concurrent task instances from running due to race conditions > -- > > Key: AIRFLOW-679 > URL: https://issues.apache.org/jira/browse/AIRFLOW-679 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel >Assignee: Alex Guziel >Priority: Minor > > Right now, multiple copies of the same task instance can run if someone > clicks on the UI multiple times. To fix this, I propose two things: > 1) record hostname and pid in TaskInstance table, then when heartbeating, > only continue running if it matches -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (AIRFLOW-696) Monitor queue lengths in CeleryExecutor
[ https://issues.apache.org/jira/browse/AIRFLOW-696?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Alex Guziel updated AIRFLOW-696: Summary: Monitor queue lengths in CeleryExecutor (was: Monitoring for queue lengths in CeleryExecutor) > Monitor queue lengths in CeleryExecutor > --- > > Key: AIRFLOW-696 > URL: https://issues.apache.org/jira/browse/AIRFLOW-696 > Project: Apache Airflow > Issue Type: Improvement >Reporter: Alex Guziel >Assignee: Alex Guziel > > Monitor queue lengths for CeleryExecutor. This will make it easier to see how > much of the cluster is being used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AIRFLOW-696) Monitoring for queue lengths in CeleryExecutor
Alex Guziel created AIRFLOW-696: --- Summary: Monitoring for queue lengths in CeleryExecutor Key: AIRFLOW-696 URL: https://issues.apache.org/jira/browse/AIRFLOW-696 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel Monitor queue lengths for CeleryExecutor. This will make it easier to see how much of the cluster is being used. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (AIRFLOW-1601) Add configurable time between SIGTERM and SIGKILL in task killer
Alex Guziel created AIRFLOW-1601: Summary: Add configurable time between SIGTERM and SIGKILL in task killer Key: AIRFLOW-1601 URL: https://issues.apache.org/jira/browse/AIRFLOW-1601 Project: Apache Airflow Issue Type: Improvement Reporter: Alex Guziel Assignee: Alex Guziel Right now, you can only wait 5 seconds, which might not be enough for some things to clean up. -- This message was sent by Atlassian JIRA (v6.4.14#64029)
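A configurable grace period between SIGTERM and SIGKILL can be sketched with the standard-library `subprocess` module. This is an illustration only, not Airflow's task killer; the function name `stop_process` and the parameter `grace_seconds` are hypothetical, with the default of 5 seconds taken from the wait described above.

```python
import subprocess
import sys


def stop_process(proc: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Ask `proc` to exit, then force it if it does not exit in time.

    On POSIX, terminate() sends SIGTERM and kill() sends SIGKILL.
    Returns the process's exit code.
    """
    proc.terminate()
    try:
        # Give the process time to clean up before escalating.
        return proc.wait(timeout=grace_seconds)
    except subprocess.TimeoutExpired:
        proc.kill()
        return proc.wait()


if __name__ == "__main__":
    # Usage: allow a long-running child up to 30 seconds to clean up.
    child = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])
    print(stop_process(child, grace_seconds=30.0))
```

Exposing `grace_seconds` as a setting lets tasks that need longer cleanup (flushing buffers, releasing external resources) finish before the forced kill.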