turbaszek commented on a change in pull request #14492:
URL: https://github.com/apache/airflow/pull/14492#discussion_r584102199



##########
File path: airflow/providers/airbyte/hooks/airbyte.py
##########
@@ -0,0 +1,92 @@
+#
+# Licensed to the Apache Software Foundation (ASF) under one
+# or more contributor license agreements.  See the NOTICE file
+# distributed with this work for additional information
+# regarding copyright ownership.  The ASF licenses this file
+# to you under the Apache License, Version 2.0 (the
+# "License"); you may not use this file except in compliance
+# with the License.  You may obtain a copy of the License at
+#
+#   http://www.apache.org/licenses/LICENSE-2.0
+#
+# Unless required by applicable law or agreed to in writing,
+# software distributed under the License is distributed on an
+# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
+# KIND, either express or implied.  See the License for the
+# specific language governing permissions and limitations
+# under the License.
+import time
+from typing import Optional
+
+from airflow.exceptions import AirflowException
+from airflow.providers.http.hooks.http import HttpHook
+
+
+class AirbyteJobController:
+    """Airbyte job status"""
+
+    RUNNING = "running"
+    SUCCEEDED = "succeeded"
+    CANCELLED = "canceled"
+    PENDING = "pending"
+    FAILED = "failed"
+    ERROR = "error"
+
+
+class AirbyteHook(HttpHook, AirbyteJobController):
+    """Hook for Airbyte API"""
+
+    def __init__(self, airbyte_conn_id: str) -> None:
+        super().__init__(http_conn_id=airbyte_conn_id)
+
+    def wait_for_job(self, job_id: str, wait_time: int = 3, timeout: Optional[int] = None) -> None:

Review comment:
       >  is there an example of this you can point me to @turbaszek?
   
   Configurable waiting:
   
https://github.com/apache/airflow/blob/13854c32a38787af6d8a52ab2465cb6185c0b74c/airflow/providers/google/cloud/operators/dataproc.py#L895
   
https://github.com/apache/airflow/blob/13854c32a38787af6d8a52ab2465cb6185c0b74c/airflow/providers/google/cloud/operators/dlp.py#L263
   
https://github.com/apache/airflow/blob/13854c32a38787af6d8a52ab2465cb6185c0b74c/airflow/providers/google/cloud/operators/dataflow.py#L93
   
   Waiting in operator:
   
https://github.com/apache/airflow/blob/13854c32a38787af6d8a52ab2465cb6185c0b74c/airflow/providers/google/cloud/operators/datafusion.py#L104
   In general, search for any `.result()` in the GCP providers that does polling (a rough sketch of such a polling loop follows the links).
   
https://github.com/apache/airflow/blob/13854c32a38787af6d8a52ab2465cb6185c0b74c/airflow/providers/google/cloud/operators/bigquery.py#L2143
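   
   A minimal sketch of that polling pattern, reusing the `wait_time`/`timeout` parameters from the diff above. `hook.get_job()` and the response shape are assumptions for illustration, not the code under review:
   
```python
import time
from typing import Optional

from airflow.exceptions import AirflowException


def wait_for_job(hook, job_id: str, wait_time: int = 3, timeout: Optional[int] = None) -> None:
    """Poll the job until it reaches a terminal state or the timeout expires."""
    start = time.monotonic()
    while True:
        if timeout is not None and time.monotonic() - start > timeout:
            raise AirflowException(f"Timeout: job {job_id} not finished after {timeout}s")
        time.sleep(wait_time)
        job = hook.get_job(job_id=job_id)        # hypothetical accessor returning a Response
        state = job.json()["job"]["status"]      # response shape assumed for illustration
        if state == "succeeded":
            return
        if state in ("failed", "error", "canceled"):
            raise AirflowException(f"Job {job_id} ended in state {state}")
        # "running" / "pending": keep polling
```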
   
   > i'm curious how it has been implemented with google.
   
   I assume your question is mostly about cases like rerunning tasks. The answer is that idempotence is the key... some examples:
   
https://github.com/apache/airflow/blob/13854c32a38787af6d8a52ab2465cb6185c0b74c/airflow/providers/google/cloud/operators/bigquery.py#L2030-L2037
   
   This of course requires an API that allows you to do some idempotency handling, which is not always possible. As a rule of thumb this can be done by:
   1. Generating a unique id that is deterministic, built from (dag_id, task_id, exec_date) plus some hash of the user's input
   1. Reattaching to existing jobs/operations
   
   That of course has a lot of edge cases, like "reattach only to running jobs", or letting users override the id of the operation; a rough sketch of the idea follows.
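   
   Something like this, with hypothetical `find_job`/`submit_job` helpers (the real API calls will differ):
   
```python
import hashlib
import json


def build_job_id(dag_id: str, task_id: str, exec_date: str, config: dict) -> str:
    """Deterministic id: same task instance + same user input => same id."""
    config_hash = hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:8]
    return f"{dag_id}-{task_id}-{exec_date}-{config_hash}"


def submit_or_reattach(hook, dag_id: str, task_id: str, exec_date: str, config: dict) -> str:
    job_id = build_job_id(dag_id, task_id, exec_date, config)
    existing = hook.find_job(job_id)                  # hypothetical lookup by client-supplied id
    if existing is not None and existing["status"] == "running":
        return job_id                                 # reattach only to running jobs
    hook.submit_job(job_id=job_id, config=config)     # hypothetical submit call
    return job_id
```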
   
   I'm personally leaning toward the `op >> sensor` approach, but many users want to do "atomic" operations like `create` or `submit`.




----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]

