damccorm commented on code in PR #35677:
URL: https://github.com/apache/beam/pull/35677#discussion_r2298486500


##########
sdks/python/apache_beam/ml/transforms/embeddings/vertex_ai.py:
##########
@@ -281,3 +294,194 @@ def get_ptransform_for_processing(self, **kwargs) -> beam.PTransform:
     return RunInference(
         model_handler=_ImageEmbeddingHandler(self),
         inference_args=self.inference_args)
+
+
+@dataclass
+class VertexAIMultiModalInput:
+  image: Optional[Image] = None
+  video: Optional[Video] = None
+  contextual_text: Optional[str] = None

Review Comment:
   > It is not part of the image object today, but given how get_embeddings() 
works there wouldn't be a way to specify that otherwise unless they really 
expanded the function signature.
   
   Yeah, I recognize that there is no way of specifying this today. The 
advantage of using our own dataclass to capture information like this is that 
we could support it in a forward-compatible way through a single config object 
in the future, regardless of how Vertex's client decides to support it.
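   
   As a rough sketch of what I mean (illustrative only; the docstring and the 
`dimension` field are hypothetical additions to show the idea, not part of 
this PR):
   
   ```python
   from dataclasses import dataclass
   from typing import Optional
   
   from vertexai.vision_models import Image, Video
   
   
   @dataclass
   class VertexAIMultiModalInput:
     """Single config object wrapping the inputs to get_embeddings().
   
     Because Beam owns this dataclass, new per-request options can be added
     as optional fields later without changing the transform's public
     signature, however the Vertex client ends up exposing them.
     """
     image: Optional[Image] = None
     video: Optional[Video] = None
     contextual_text: Optional[str] = None
     # Hypothetical future field, shown only to illustrate the
     # forward-compatibility point (e.g. the embedding dimension that
     # get_embeddings() accepts as a keyword argument).
     dimension: Optional[int] = None
   ```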
   
   > There was some conversation with @claudevdm around expanding Chunks to be 
capable of containing multimodal data and writing a separate input/output 
adapter pair for them, but I suppose there's separate merit around just 
accepting text Chunks here.
   
   I would lean towards unifying this now rather than adding an extra way of 
doing this wrapping later. That will also let this play nicely with writing 
(text) to a vector db, though we'd still need to do similar work for 
images/videos eventually.
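   
   On the Chunk point, a minimal sketch of what accepting text Chunks here 
could look like (assuming the existing `apache_beam.ml.rag.types.Chunk`/
`Content` types; the adapter function is hypothetical):
   
   ```python
   from apache_beam.ml.rag.types import Chunk
   
   # The dataclass added in this diff.
   from apache_beam.ml.transforms.embeddings.vertex_ai import (
       VertexAIMultiModalInput)
   
   
   # Hypothetical adapter: a text-only Chunk maps onto the same multimodal
   # input object, so text embeddings produced here can feed the existing
   # vector DB write path without a separate wrapping step.
   def chunk_to_multimodal_input(chunk: Chunk) -> VertexAIMultiModalInput:
     return VertexAIMultiModalInput(contextual_text=chunk.content.text)
   ```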


