[GitHub] [beam] TheNeuralBit commented on a change in pull request #14778: [BEAM-12246] Fix ib.collect(dataframe) indexing

GitBox Tue, 11 May 2021 15:32:19 -0700


TheNeuralBit commented on a change in pull request #14778:
URL: https://github.com/apache/beam/pull/14778#discussion_r630582170




##########
File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
##########
@@ -529,6 +534,11 @@ def collect(pcoll, n='inf', duration='inf', include_window_info=False):
     n: (optional) max number of elements to visualize. Default 'inf'.
     duration: (optional) max duration of elements to read in integer seconds or
         a string duration. Default 'inf'.
+    include_window_info: (optional) if True, appends the windowing information
+        to each row. Default False.
+    reset_unnamed_indexes: (optional) If True, resets unnamed indices. This is
+        useful because the Beam DataFrame model has non-deterministic index
+        values for DataFrames with unnamed indexes. Default True.

Review comment:
       This seems odd to me. Could you clarify this:
   
   > There is one problem with passing dataframes, however, which is that the indexing isn't kept and is reset per bundle.
   
   I'm not aware of any logic that resets the index, so this could be a bug.
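
For reference, here is a minimal pandas-only sketch of the symptom the quote seems to describe. It assumes, without confirmation from the Beam code, that each bundle arrives as its own DataFrame with a default unnamed index. `bundle_1`, `bundle_2`, and `reset_if_unnamed` are hypothetical names used only for illustration; the helper is a guess at what a `reset_unnamed_indexes`-style option might do, not the PR's implementation.

```python
import pandas as pd

# Hypothetical stand-ins for two bundles, each with a default unnamed index.
bundle_1 = pd.DataFrame({'word': ['a', 'b']})
bundle_2 = pd.DataFrame({'word': ['c', 'd']})

# Concatenating keeps each bundle's own index, so values repeat.
combined = pd.concat([bundle_1, bundle_2])
print(list(combined.index))  # [0, 1, 0, 1]


def reset_if_unnamed(df):
  """Hypothetical helper: reset the index only when every level is unnamed."""
  if all(name is None for name in df.index.names):
    return df.reset_index(drop=True)
  return df


fixed = reset_if_unnamed(combined)
print(list(fixed.index))  # [0, 1, 2, 3]

# A named index is treated as meaningful and left untouched.
named = combined.rename_axis('row_id')
kept = reset_if_unnamed(named)
```

If the repeated values come from somewhere other than per-bundle construction, this sketch does not apply and the root cause would still need to be found.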

##########
File path: sdks/python/apache_beam/runners/interactive/recording_manager.py
##########
@@ -298,8 +298,12 @@ def _watch(self, pcolls):
           watched_pcollections.add(val)
         elif isinstance(val, DeferredBase):
           watched_dataframes.add(val)
-    # Convert them all in a single step for efficiency.
-    for pcoll in to_pcollection(*watched_dataframes, always_return_tuple=True):
+
+    # Convert them one-by-one to generate a unique label for each. This allows
+    # caching at a more fine-grained granularity.
+    for df in watched_dataframes:
+      pcoll = to_pcollection(
+          df, yield_elements='pandas', label=str(id(df._expr._id)))

Review comment:
       Does converting all the DataFrames at once break PCollection caching? If so, can we fix that with a change in `to_pcollection` instead?
   
   As noted in the existing code, converting in a single step can be much more efficient.
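
To illustrate the efficiency concern, here is a toy model in plain Python. It is not Beam code. It assumes the saving comes from related DataFrames sharing upstream expressions, which a single conversion call can compute once. `Expr`, `evaluate`, and `convert` are hypothetical names invented for this sketch.

```python
class Expr:
  """Toy deferred expression: a named function applied to input expressions."""
  def __init__(self, name, fn, inputs=()):
    self.name = name
    self.fn = fn
    self.inputs = tuple(inputs)


evaluations = []  # Records the name of every expression actually computed.


def evaluate(expr, cache):
  """Computes expr, reusing results already present in this call's cache."""
  if expr.name not in cache:
    args = [evaluate(i, cache) for i in expr.inputs]
    evaluations.append(expr.name)
    cache[expr.name] = expr.fn(*args)
  return cache[expr.name]


def convert(*exprs):
  """Converts all given expressions in one step, sharing a single cache."""
  cache = {}
  return tuple(evaluate(e, cache) for e in exprs)


source = Expr('source', lambda: [1, 2, 3, 4])
expensive = Expr('expensive', lambda xs: [x * x for x in xs], [source])
total = Expr('total', sum, [expensive])
largest = Expr('largest', max, [expensive])

# Single step: 'source' and 'expensive' are computed once and shared.
evaluations.clear()
batched = convert(total, largest)
batched_count = len(evaluations)

# One-by-one: each call starts with an empty cache and redoes shared work.
evaluations.clear()
separate = convert(total) + convert(largest)
separate_count = len(evaluations)

print(batched_count, separate_count)  # 4 6
```

Both paths return the same values, but the one-by-one path repeats the shared upstream work. That is the cost to weigh against per-DataFrame labels.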

##########
File path: sdks/python/apache_beam/runners/interactive/interactive_beam.py
##########
@@ -516,8 +516,13 @@ def show(
       recording.cancel()
 
 
-@progress_indicated
-def collect(pcoll, n='inf', duration='inf', include_window_info=False):
+# @progress_indicated

Review comment:
       Is this intentional?




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]
