Re: [PR] Fix ml transform [beam]

via GitHub Thu, 07 Aug 2025 07:43:19 -0700


damccorm commented on code in PR #35774:
URL: https://github.com/apache/beam/pull/35774#discussion_r2260544316



##########
sdks/python/apache_beam/yaml/yaml_ml.py:
##########
@@ -514,11 +509,36 @@ def ml_transform(
   options.YamlOptions.check_enabled(pcoll.pipeline, 'ML')
   # TODO(robertwb): Perhaps _config_to_obj could be pushed into MLTransform
   # itself for better cross-language support?
-  return pcoll | MLTransform(
+  result = pcoll | MLTransform(
       write_artifact_location=write_artifact_location,
       read_artifact_location=read_artifact_location,
       transforms=[_config_to_obj(t) for t in transforms] if transforms else [])
 
+  if transforms and any(t.get('type') == 'SentenceTransformerEmbeddings'
+                        for t in transforms):
+    from apache_beam.typehints import List
+    try:
+      if pcoll.element_type:
+        new_fields = named_fields_from_element_type(pcoll.element_type)
+        columns_to_change = set()
+        for t_spec in transforms:
+          if t_spec.get('type') == 'SentenceTransformerEmbeddings':
+            columns_to_change.update(
+                t_spec.get('config', {}).get('columns', []))
+
+        final_fields = []
+        for name, typ in new_fields:
+          if name in columns_to_change:
+            final_fields.append((name, List[float]))
+          else:
+            final_fields.append((name, typ))
+        output_schema = RowTypeConstraint.from_fields(final_fields)
+        return result | beam.Map(lambda x: x).with_output_types(output_schema)

Review Comment:
   Rather than applying an extra identity map, could we store the transform 
above with something like:
   
   ```
   to_be_applied = MLTransform(
         write_artifact_location=write_artifact_location,
         read_artifact_location=read_artifact_location,
         transforms=[_config_to_obj(t) for t in transforms] if transforms else 
[])
   ```
   
   Then here we could:
   
   ```
   return result | to_be_applied.with_output_types(output_schema)
   ```



##########
sdks/python/apache_beam/yaml/yaml_ml.py:
##########
@@ -514,11 +509,36 @@ def ml_transform(
   options.YamlOptions.check_enabled(pcoll.pipeline, 'ML')
   # TODO(robertwb): Perhaps _config_to_obj could be pushed into MLTransform
   # itself for better cross-language support?
-  return pcoll | MLTransform(
+  result = pcoll | MLTransform(
       write_artifact_location=write_artifact_location,
       read_artifact_location=read_artifact_location,
       transforms=[_config_to_obj(t) for t in transforms] if transforms else [])
 
+  if transforms and any(t.get('type') == 'SentenceTransformerEmbeddings'

Review Comment:
   We can probably get rid of the tft check above (or at least modify it) since 
you don't need tft installed for embeddings



##########
sdks/python/apache_beam/yaml/yaml_ml.py:
##########
@@ -514,11 +509,36 @@ def ml_transform(
   options.YamlOptions.check_enabled(pcoll.pipeline, 'ML')
   # TODO(robertwb): Perhaps _config_to_obj could be pushed into MLTransform
   # itself for better cross-language support?
-  return pcoll | MLTransform(
+  result = pcoll | MLTransform(
       write_artifact_location=write_artifact_location,
       read_artifact_location=read_artifact_location,
       transforms=[_config_to_obj(t) for t in transforms] if transforms else [])
 
+  if transforms and any(t.get('type') == 'SentenceTransformerEmbeddings'
+                        for t in transforms):

Review Comment:
   Can we do a generic match of anything ending in `Embeddings`?



##########
sdks/python/apache_beam/yaml/yaml_ml.py:
##########
@@ -514,11 +509,36 @@ def ml_transform(
   options.YamlOptions.check_enabled(pcoll.pipeline, 'ML')
   # TODO(robertwb): Perhaps _config_to_obj could be pushed into MLTransform
   # itself for better cross-language support?
-  return pcoll | MLTransform(
+  result = pcoll | MLTransform(
       write_artifact_location=write_artifact_location,
       read_artifact_location=read_artifact_location,
       transforms=[_config_to_obj(t) for t in transforms] if transforms else [])

Review Comment:
   Will _config_to_obj work? I don't think it will right now.



##########
sdks/python/apache_beam/yaml/yaml_ml.py:
##########
@@ -514,11 +509,36 @@ def ml_transform(
   options.YamlOptions.check_enabled(pcoll.pipeline, 'ML')
   # TODO(robertwb): Perhaps _config_to_obj could be pushed into MLTransform
   # itself for better cross-language support?
-  return pcoll | MLTransform(
+  result = pcoll | MLTransform(
       write_artifact_location=write_artifact_location,
       read_artifact_location=read_artifact_location,
       transforms=[_config_to_obj(t) for t in transforms] if transforms else [])

Review Comment:
   Oh I see, it will work but only if tft is installed. Again, so we should 
probably also address that so it is not required.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] Fix ml transform [beam]

Reply via email to