[GitHub] [beam] TheNeuralBit commented on a change in pull request #14356: Better dataframe support for beam notebooks.

GitBox Mon, 26 Apr 2021 14:03:51 -0700


TheNeuralBit commented on a change in pull request #14356:
URL: https://github.com/apache/beam/pull/14356#discussion_r620647659




##########
File path: sdks/python/apache_beam/runners/interactive/recording_manager.py
##########
@@ -289,10 +291,16 @@ def _watch(self, pcolls):
     """
 
     watched_pcollections = set()
+    watched_dataframes = set()
     for watching in ie.current_env().watching():
       for _, val in watching:
         if isinstance(val, beam.pvalue.PCollection):
           watched_pcollections.add(val)
+        elif isinstance(val, DeferredBase):
+          watched_dataframes.add(val)
+    # Convert them all in a single step for efficiency.
+    for pcoll in to_pcollection(*watched_dataframes, always_return_tuple=True):
+      watched_pcollections.add(pcoll)

Review comment:
I think it would be preferable to use 
`to_pcollection(..., yield_elements='pandas')` here and then just concat the 
raw DataFrames we collect. That would avoid issues with the imperfect mapping 
back to Beam schemas (e.g. pandas allows duplicate column names, and 
`to_pcollection` doesn't have a good story for mapping the index).
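
A minimal sketch of the "concat the raw DataFrames" idea, using plain pandas 
only. The two partitions below are hypothetical stand-ins for the pandas 
objects that would be collected from a PCollection produced with 
`yield_elements='pandas'` (that parameter name is my reading of the 
`to_pcollection` signature, not something shown in this diff), and 
`combine_partitions` is an illustrative helper, not part of Beam. It shows that 
duplicate column names and the index survive a plain concat, with no schema 
mapping involved:

```python
import pandas as pd

# Hypothetical partitions, as they might arrive when the PCollection
# elements are raw DataFrames rather than schema'd rows.
# Both use a duplicate column name ("x") and a meaningful index.
part_a = pd.DataFrame([[1, 2], [3, 4]], columns=["x", "x"], index=["a", "b"])
part_b = pd.DataFrame([[5, 6]], columns=["x", "x"], index=["c"])


def combine_partitions(partitions):
    """Concatenate collected DataFrame partitions row-wise.

    Illustrative helper (an assumption, not a Beam API). No column
    renaming or index flattening is applied, so the result keeps the
    original pandas structure.
    """
    return pd.concat(list(partitions))


combined = combine_partitions([part_a, part_b])
print(combined)
```

Because the partitions share identical column labels, `pd.concat` does not 
need to realign them, so the duplicate names are kept as-is.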




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
