TheNeuralBit commented on a change in pull request #14356:
URL: https://github.com/apache/beam/pull/14356#discussion_r620647659
##########
File path: sdks/python/apache_beam/runners/interactive/recording_manager.py
##########
@@ -289,10 +291,16 @@ def _watch(self, pcolls):
     """
     watched_pcollections = set()
+    watched_dataframes = set()
     for watching in ie.current_env().watching():
       for _, val in watching:
         if isinstance(val, beam.pvalue.PCollection):
           watched_pcollections.add(val)
+        elif isinstance(val, DeferredBase):
+          watched_dataframes.add(val)
+    # Convert them all in a single step for efficiency.
+    for pcoll in to_pcollection(*watched_dataframes, always_return_tuple=True):
+      watched_pcollections.add(pcoll)
Review comment:
I think it would be preferable to use `to_pcollection(..., yield_elements='pandas')`
here and then just concat the raw DataFrames we collect. That would avoid
issues with the imperfect mapping back to Beam schemas (e.g. pandas allows
duplicate column names, and `to_pcollection` doesn't have a good story for
mapping the index).
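
To illustrate the concern, here is a minimal pandas-only sketch. It assumes the collected elements arrive as raw DataFrame partitions; the data and the names `partitions`, `collected`, and `as_fields` are hypothetical and not taken from the PR. It shows that concatenating raw DataFrames keeps duplicate column names and the index, while a mapping keyed by field name (a rough stand-in for a schema'd row) cannot.

```python
import pandas as pd

# Hypothetical stand-ins for the partitions a runner might hand back when
# the PCollection elements are raw pandas objects.
partitions = [
    pd.DataFrame([[1, 2], [3, 4]], columns=["a", "a"], index=["x", "y"]),
    pd.DataFrame([[5, 6]], columns=["a", "a"], index=["z"]),
]

# Concatenating the raw DataFrames keeps both "a" columns and the index.
collected = pd.concat(partitions)

# A row keyed by field name needs unique names, so the duplicate column
# collapses into a single entry and the index label is not carried along.
as_fields = dict(zip(collected.columns, collected.iloc[0]))
```

The dictionary here is only an analogy for name-keyed rows; it is not how Beam builds schemas internally.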