damccorm commented on code in PR #35677:
URL: https://github.com/apache/beam/pull/35677#discussion_r2323335789


##########
sdks/python/apache_beam/ml/transforms/embeddings/vertex_ai.py:
##########
@@ -281,3 +297,212 @@ def get_ptransform_for_processing(self, **kwargs) -> beam.PTransform:
     return RunInference(
         model_handler=_ImageEmbeddingHandler(self),
         inference_args=self.inference_args)
+
+
+@dataclass
+class VertexImage:
+  image: Image
+  embedding: Optional[list[float]] = None
+
+
+@dataclass
+class VertexVideo:
+  video: Video
+  config: VideoSegmentConfig
+  embeddings: Optional[list[VideoEmbedding]] = None

Review Comment:
   IMO, it makes sense to structure this so that your input is a `VertexVideo` with no embeddings attached, and your output is a `VertexVideo` with embeddings attached (following the pattern we use with `Chunk`, for example in this [notebook overview](https://cloud.google.com/dataflow/docs/notebooks/alloydb_product_catalog_embeddings#pipeline_overview)). This (along with corresponding types for text/image) would be wrapped in your `VertexAIMultiModalData`. Then you are only ever working with one familiar data type, which should be pretty straightforward to construct.
   
   Thoughts?
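   As a minimal sketch of that shape (the `VertexText` type and the exact fields of `VertexAIMultiModalData` here are illustrative assumptions, not a final API):
   
   ```python
   from dataclasses import dataclass
   from typing import Optional
   
   @dataclass
   class VertexText:  # illustrative; mirrors VertexImage/VertexVideo above
     text: str
     embedding: Optional[list[float]] = None
   
   @dataclass
   class VertexAIMultiModalData:
     # One element per input; exactly one modality is expected to be set.
     # The embedding transform fills in the matching embedding field(s),
     # so the same type flows through the whole pipeline.
     text: Optional[VertexText] = None
     image: Optional["VertexImage"] = None
     video: Optional["VertexVideo"] = None
   ```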



##########
sdks/python/apache_beam/ml/transforms/embeddings/vertex_ai.py:
##########
@@ -281,3 +297,212 @@ def get_ptransform_for_processing(self, **kwargs) -> beam.PTransform:
     return RunInference(
         model_handler=_ImageEmbeddingHandler(self),
         inference_args=self.inference_args)
+
+
+@dataclass
+class VertexImage:
+  image: Image
+  embedding: Optional[list[float]] = None
+
+
+@dataclass
+class VertexVideo:
+  video: Video
+  config: VideoSegmentConfig
+  embeddings: Optional[list[VideoEmbedding]] = None

Review Comment:
   I'm a bit confused - it seems like we never actually use this field?
   
   Rather than just returning the generated embeddings as part of the output, I think it makes sense to return this as an output object. That way, downstream users can access the original video alongside the embeddings.
   
   Usually, I'd expect people to write embeddings to a vector db with a reference to the video stored somewhere, so they would need both the embeddings and the original input data to do that.
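   
   For illustration, a handler sketch along those lines (the `get_embeddings` call and response fields are assumptions about the Vertex AI client, not the PR's actual code):
   
   ```python
   def run_inference(self, batch, model, inference_args=None):
     # Hypothetical sketch: rather than emitting bare embeddings, attach
     # them to the input VertexVideo and emit the same object, so
     # downstream stages still hold a reference to the original video.
     results = []
     for vertex_video in batch:
       response = model.get_embeddings(  # assumed client method/fields
           video=vertex_video.video,
           video_segment_config=vertex_video.config)
       vertex_video.embeddings = response.video_embeddings
       results.append(vertex_video)
     return results
   ```
   
   A downstream vector-db sink could then read both `vertex_video.embeddings` and whatever video reference it needs from `vertex_video.video`.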



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
