turbaszek commented on a change in pull request #14492:
URL: https://github.com/apache/airflow/pull/14492#discussion_r584102199
##########
File path: airflow/providers/airbyte/hooks/airbyte.py
##########
@@ -0,0 +1,92 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements. See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership. The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License. You may obtain a copy of the License at
+#
+# http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied. See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import time
+from typing import Optional
+
+from airflow.exceptions import AirflowException
+from airflow.providers.http.hooks.http import HttpHook
+
+
+class AirbyteJobController:
+    """Airbyte job status"""
+
+    RUNNING = "running"
+    SUCCEEDED = "succeeded"
+    CANCELLED = "canceled"
+    PENDING = "pending"
+    FAILED = "failed"
+    ERROR = "error"
+
+
+class AirbyteHook(HttpHook, AirbyteJobController):
+    """Hook for Airbyte API"""
+
+    def __init__(self, airbyte_conn_id: str) -> None:
+        super().__init__(http_conn_id=airbyte_conn_id)
+
+    def wait_for_job(self, job_id: str, wait_time: int = 3, timeout: Optional[int] = None) -> None:

Review comment:
> is there an example of this you can point me to @turbaszek?
Configurable waiting:
https://github.com/apache/airflow/blob/13854c32a38787af6d8a52ab2465cb6185c0b74c/airflow/providers/google/cloud/operators/dataproc.py#L895
https://github.com/apache/airflow/blob/13854c32a38787af6d8a52ab2465cb6185c0b74c/airflow/providers/google/cloud/operators/dlp.py#L263
https://github.com/apache/airflow/blob/13854c32a38787af6d8a52ab2465cb6185c0b74c/airflow/providers/google/cloud/operators/dataflow.py#L93

Waiting in operator:
https://github.com/apache/airflow/blob/13854c32a38787af6d8a52ab2465cb6185c0b74c/airflow/providers/google/cloud/operators/datafusion.py#L104

In general, search for any `.result()` in GCP that does polling:
https://github.com/apache/airflow/blob/13854c32a38787af6d8a52ab2465cb6185c0b74c/airflow/providers/google/cloud/operators/bigquery.py#L2143

> i'm curious how it has been implemented with google.

I assume your question is mostly about cases like rerunning tasks. The answer is that idempotence is the key. Some examples:
https://github.com/apache/airflow/blob/13854c32a38787af6d8a52ab2465cb6185c0b74c/airflow/providers/google/cloud/operators/bigquery.py#L2030-L2037

This requires an API that supports some form of idempotency handling, which is not always possible. As a rule of thumb, it can be done by:
1. Generating a unique, deterministic id from (dag_id, task_id, exec_date) that also includes a hash of the user's input
1. Reattaching to existing jobs/operations

That has a lot of edge cases, like "reattach only to running jobs" or letting users override the id of the operation. I'm personally leaning towards the `op >> sensor` approach, but many users want to do "atomic" operations like `create` or `submit`.
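
A rough sketch of the two steps from the rule of thumb (a deterministic id plus reattaching), in case it helps. This is not the code from the linked operators and not the Airbyte API: `build_job_id`, `submit_or_reattach` and `InMemoryJobApi` are hypothetical names, and the sketch assumes the remote API lets the client choose the job id, which is not always the case.

```python
import hashlib
import json
from typing import Any, Dict, Optional, Tuple


def build_job_id(dag_id: str, task_id: str, exec_date: str, user_input: Dict[str, Any]) -> str:
    """Return the same id for the same task instance and the same user input."""
    # sort_keys keeps the hash stable regardless of dict key order.
    payload = json.dumps(user_input, sort_keys=True, default=str)
    digest = hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]
    return f"airflow_{dag_id}_{task_id}_{exec_date}_{digest}"


class InMemoryJobApi:
    """Hypothetical stand-in for a remote API that accepts client-supplied job ids."""

    def __init__(self) -> None:
        self.jobs: Dict[str, Dict[str, Any]] = {}
        self.submit_calls = 0

    def get_job(self, job_id: str) -> Optional[Dict[str, Any]]:
        return self.jobs.get(job_id)

    def submit_job(self, job_id: str, user_input: Dict[str, Any]) -> Dict[str, Any]:
        self.submit_calls += 1
        job = {"id": job_id, "status": "pending", "input": user_input}
        self.jobs[job_id] = job
        return job


def submit_or_reattach(
    api: InMemoryJobApi,
    job_id: str,
    user_input: Dict[str, Any],
    reattach_states: Tuple[str, ...] = ("pending", "running"),
) -> Dict[str, Any]:
    """Submit the job, or reattach when a job with this id is still in flight."""
    existing = api.get_job(job_id)
    if existing is None:
        return api.submit_job(job_id, user_input)
    if existing["status"] in reattach_states:
        # A retry or rerun of the same task instance: pick up the existing job.
        return existing
    # Edge case: the job exists but already finished; refuse instead of guessing.
    raise RuntimeError(f"Job {job_id} exists with status {existing['status']!r}")
```

Calling `submit_or_reattach` twice with the same id submits only once; what to do when the existing job has already finished (fail, rerun, or let the user override the id) is exactly the kind of edge case mentioned above, and raising here is only one possible choice.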
