jaketf edited a comment on issue #10381:
URL: https://github.com/apache/airflow/issues/10381#issuecomment-675654161


   Dataproc jobs are kind of a wild west and may have significant side effects. From a documentation perspective, we should call out that `on_kill` simply kills the job but will not "roll back" changes in external systems (GCS, Hive Metastore, BQ, Pub/Sub, etc.) that may have already occurred. Users should be careful to handle any such scenarios in the logic of their pipelines. This may seem obvious to us but may not be clear to users (contrast with a BigQuery query job, where everything is controlled internally and a cancelled job leaves nothing behind because all results are committed atomically).
   
   A few examples:
   - even before completing, a Spark driver could make arbitrary calls mutating data on GCS or a database (e.g. it could write some sort of lock file that ends up being abandoned).
   - if you snipe a MapReduce job in the middle, any intermediate files that were flushed to GCS will not get cleaned up.
   - a Hive job can contain multiple query statements (e.g. a CREATE TABLE and an INSERT INTO), which may leave behind a new, empty table in the Hive metastore as a side effect.
   - sniping a Spark streaming job subscribing to Pub/Sub may leave ACKed messages whose corresponding outputs were never committed.

