Re: [PR] Refine Type Handling in OpenAI Embedding Operator to Match OpenAI Typings [airflow]

via GitHub Mon, 13 Nov 2023 06:23:31 -0800


ylnsnv commented on code in PR #35547:
URL: https://github.com/apache/airflow/pull/35547#discussion_r1391177321



##########
airflow/providers/openai/operators/openai.py:
##########
@@ -31,41 +32,46 @@ class OpenAIEmbeddingOperator(BaseOperator):
     """
     Operator that accepts input text to generate OpenAI embeddings using the 
specified model.
 
+    :param conn_id: The OpenAI connection ID to use.
+    :param input_text: The text to generate OpenAI embeddings for. This can be 
a string, a list of strings,
+                    a list of integers, or a list of lists of integers.
+    :param model: The OpenAI model to be used for generating the embeddings.
+    :param embedding_kwargs: Additional keyword arguments to pass to the 
OpenAI `create_embeddings` method.
+
     .. seealso::
         For more information on how to use this operator, take a look at the 
guide:
         :ref:`howto/operator:OpenAIEmbeddingOperator`
-
-    :param conn_id: The OpenAI connection.
-    :param input_text: The text to generate OpenAI embeddings on. Either 
input_text or input_callable
-        should be provided.
-    :param model: The OpenAI model to be used for generating the embeddings.
-    :param embedding_kwargs: For possible option check
-        .. seealso:: 
https://platform.openai.com/docs/api-reference/embeddings/create
+        For possible options for `embedding_kwargs`, see:
+        https://platform.openai.com/docs/api-reference/embeddings/create
     """
 
     template_fields: Sequence[str] = ("input_text",)
 
     def __init__(
         self,
         conn_id: str,
-        input_text: str | list[Any],
+        input_text: str | list[str] | list[int] | list[list[int]],
         model: str = "text-embedding-ada-002",
         embedding_kwargs: dict | None = None,
         **kwargs: Any,
     ):
-        self.embedding_kwargs = embedding_kwargs or {}
         super().__init__(**kwargs)
         self.conn_id = conn_id
         self.input_text = input_text
         self.model = model
+        self.embedding_kwargs = embedding_kwargs or {}
 
     @cached_property
     def hook(self) -> OpenAIHook:
         """Return an instance of the OpenAIHook."""
         return OpenAIHook(conn_id=self.conn_id)
 
     def execute(self, context: Context) -> list[float]:
-        self.log.info("Input text: %s", self.input_text)
+        if not self.input_text or not isinstance(self.input_text, (str, list)):

Review Comment:
   So I've carefully considered your suggestion about relocating the validation 
to the `__init__` method. However, there's a potential issue with this 
approach, particularly in the context of how Airflow handles dynamic inputs 
through XComs and templating.
   
   In the `example_openai_dag` within 
`airflow/tests/system/providers/openai/example_openai.py`, we have a scenario 
where the `input_text` for `OpenAIEmbeddingOperator` is dynamically determined 
at runtime. This is achieved by pulling a value from XCom that is produced by 
another task (`task_to_store_input_text_in_xcom`). If we place the validation 
in the `__init__` method, it would attempt to validate this dynamic reference 
(e.g., `{{ task_instance.xcom_pull(task_ids='task_to_store_input_text_in_xcom') 
}}`) as a literal value, which is not feasible since the actual value is only 
resolved during task execution.
   
   This leads to a dilemma: if we keep the validation in `__init__`, we might 
reject valid DAG definitions that use dynamic inputs. On the other hand, 
placing the validation in the `execute` method allows for more flexibility, as 
it validates the actual runtime value after all templating and XCom pulls have 
been resolved.
   
   I've adjusted the test to reflect this understanding, but I'm open to 
further discussion. The key question is whether we prioritize strict validation 
at DAG parsing time (with potential limitations on dynamic inputs) or more 
flexible, runtime validation that accommodates dynamic DAG patterns. Your 
insights on this trade-off, especially in the context of our current test cases 
and typical use cases for the `OpenAIEmbeddingOperator`, would be greatly 
appreciated.
   Thanks again! :)



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Refine Type Handling in OpenAI Embedding Operator to Match OpenAI Typings [airflow]

Reply via email to