damccorm commented on code in PR #25919:
URL: https://github.com/apache/beam/pull/25919#discussion_r1144933210
########## sdks/python/gen_protos.py: ##########

```diff
@@ -123,7 +123,19 @@ def generate_urn_files(out_dir, api_path):
   This is executed at build time rather than dynamically on import to
   ensure that it is compatible with static type checkers like mypy.
   """
-  from google._upb import _message
+  try:
+    from google._upb import _message
+    list_types = (
+        list,
+        _message.RepeatedScalarContainer,
+        _message.RepeatedCompositeContainer,
+    )  # pylint: disable=c-extension-no-member
+  except ImportError:
+    from google.protobuf.internal import containers
+    list_types = (
+        list,
+        containers.RepeatedScalarFieldContainer,
+        containers.RepeatedCompositeFieldContainer)
```

Review Comment:
Trying to follow why we need this - is it from having an old protobuf installed? It looks like we previously used `google.protobuf.pyext._message` before https://github.com/apache/beam/commit/5cb1711c39c2ebc3005363f938ca4a4eee774333. Is there a reason we're now using `google.protobuf.internal.containers`?

cc/ @AnandInguva @tvalentyn

########## sdks/python/apache_beam/dataframe/frames.py: ##########

```diff
@@ -5394,6 +5396,24 @@ def func(df, *args, **kwargs):
       frame_base._elementwise_method(inplace_name, inplace=True, base=pd.DataFrame))
 
+# Allow dataframe | SchemaTransform
+def _create_maybe_elementwise_or(base):
+  elementwise = frame_base._elementwise_method(
+      '__or__', restrictions={'level': None}, base=base)
+
+  def _maybe_elementwise_or(self, right):
+    if isinstance(right, PTransform):
+      return convert.to_dataframe(convert.to_pcollection(self) | right)
```

Review Comment:
I think this is fine, but it is worth calling out that we're opening users up to an inefficient pattern:

```python
df = convert.to_dataframe(pc)
df2 = df | MyDfOperation()
df3 = df2 | MySchemaTransform() | MySchemaTransform2() | MySchemaTransform3()
result = convert.to_pcollection(df3)
```

which would be (much?) more efficient written as:

```python
df = convert.to_dataframe(pc)
df2 = df | MyDfOperation()
result = convert.to_pcollection(df2) | MySchemaTransform() | MySchemaTransform2() | MySchemaTransform3()
```

because this avoids the repeated `to_dataframe`/`to_pcollection` transitions. The former is probably more natural, though, especially if real dataframe operations are mixed in, even if it's less efficient. I think the user experience still trumps the efficiency loss, but it might be something we want to document.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
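As an aside, the `__or__` dispatch pattern the `frames.py` hunk introduces can be sketched without Beam at all. The sketch below is purely illustrative: `MiniFrame` and `Transform` are invented stand-ins for a deferred DataFrame and a `PTransform`, and the "round-trip" is modeled as a plain function call rather than the real `to_pcollection`/`to_dataframe` conversion.

```python
class Transform:
    """Stand-in for a Beam PTransform: a callable over a list of rows."""
    def __init__(self, fn):
        self.fn = fn


class MiniFrame:
    """Stand-in for a deferred DataFrame wrapping a list of values."""
    def __init__(self, values):
        self.values = list(values)

    def __or__(self, right):
        if isinstance(right, Transform):
            # Transform branch: hand the underlying values to the
            # transform and wrap the result back up, analogous to the
            # to_pcollection / to_dataframe round-trip in the diff.
            return MiniFrame(right.fn(self.values))
        # Elementwise branch: fall back to per-element bitwise or,
        # analogous to frame_base._elementwise_method('__or__', ...).
        return MiniFrame(v | r for v, r in zip(self.values, right.values))


doubled = Transform(lambda rows: [r * 2 for r in rows])
df = MiniFrame([1, 2, 4])
print((df | doubled).values)                # -> [2, 4, 8] (transform branch)
print((df | MiniFrame([4, 4, 4])).values)   # -> [5, 6, 4] (elementwise branch)
```

The design point the review is making falls out of the sketch: every trip through the transform branch costs a conversion round-trip, so chaining many transforms one `|` at a time pays that cost repeatedly.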

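Returning to the `gen_protos.py` hunk: the reason a *tuple* of list-like types is needed is that `isinstance()` dispatch must accept both plain Python lists and protobuf repeated-field containers, whose concrete classes differ between the upb and pure-Python implementations. A minimal sketch of that dispatch, assuming an invented `FakeRepeated` container and an invented `render_constant` helper in place of the real protobuf classes and code-generation logic:

```python
class FakeRepeated:
    """Stand-in for a protobuf repeated-field container (not a list subclass)."""
    def __init__(self, values):
        self._values = list(values)

    def __iter__(self):
        return iter(self._values)


# The real code builds this tuple from either google._upb._message or
# google.protobuf.internal.containers, depending on what imports.
list_types = (list, FakeRepeated)


def render_constant(value):
    # List-valued values are rendered as Python list literals; anything
    # else is rendered with repr(). This mirrors the kind of
    # isinstance(value, list_types) check the fallback import enables.
    if isinstance(value, list_types):
        return '[%s]' % ', '.join(repr(v) for v in value)
    return repr(value)


print(render_constant(FakeRepeated(['a', 'b'])))  # -> ['a', 'b']
print(render_constant('beam:urn:x'))              # -> 'beam:urn:x'
```

Without the container classes in the tuple, a repeated field coming straight off a protobuf message would fall through to the scalar branch even though it is logically a list.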