[jira] [Work logged] (BEAM-12246) ib.collect doesn't preserve the index from DeferredDataFrame instances

ASF GitHub Bot (Jira) Wed, 12 May 2021 12:57:15 -0700


     [ 
https://issues.apache.org/jira/browse/BEAM-12246?focusedWorklogId=595592&page=com.atlassian.jira.plugin.system.issuetabpanels:worklog-tabpanel#worklog-595592
 ]


ASF GitHub Bot logged work on BEAM-12246:
-----------------------------------------

                Author: ASF GitHub Bot
            Created on: 12/May/21 19:56
            Start Date: 12/May/21 19:56
    Worklog Time Spent: 10m 
      Work Description: rohdesamuel commented on a change in pull request 
#14778:
URL: https://github.com/apache/beam/pull/14778#discussion_r631355891



##########
File path: sdks/python/apache_beam/runners/interactive/recording_manager.py
##########
@@ -298,8 +298,12 @@ def _watch(self, pcolls):
           watched_pcollections.add(val)
         elif isinstance(val, DeferredBase):
           watched_dataframes.add(val)
-    # Convert them all in a single step for efficiency.
-    for pcoll in to_pcollection(*watched_dataframes, always_return_tuple=True):
+
+    # Convert them one-by-one to generate a unique label for each. This allows
+    # caching at a more fine-grained granularity.
+    for df in watched_dataframes:
+      pcoll = to_pcollection(
+          df, yield_elements='pandas', label=str(id(df._expr._id)))

Review comment:
       Right, I found that the default label generated by to_pcollection 
wasn't unique enough for notebooks. In the 
InteractiveRunnerTest.test_dataframes_same_cell_twice test, the default label 
is the same for both df['square'] and df['cube']. I can take a look at the 
_var_name method; the problem may be there.
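
To illustrate the collision described above, here is a minimal standalone sketch. It does not use Beam: FakeDeferred, default_label, unique_label and cache_by are hypothetical names, and the assumption that the default label depends only on the source variable name is a simplification for illustration, not a description of what to_pcollection or _var_name actually does.

```python
import itertools


class FakeDeferred:
    """Hypothetical stand-in for a deferred dataframe; not a Beam class."""

    _ids = itertools.count()

    def __init__(self, source_var, column):
        self.source_var = source_var  # name of the variable it was derived from
        self.column = column          # column selected from that variable
        self.expr_id = f"expr_{next(self._ids)}"  # unique per instance


def default_label(deferred):
    # Assumption: the label is derived only from the source variable name,
    # so df['square'] and df['cube'] both map to the same label.
    return f"ToPCollection({deferred.source_var})"


def unique_label(deferred):
    # Label derived from a per-expression identifier instead.
    return f"ToPCollection({deferred.expr_id})"


def cache_by(label_fn, deferreds):
    """Build a label -> column cache; a repeated label overwrites the entry."""
    cache = {}
    for deferred in deferreds:
        cache[label_fn(deferred)] = deferred.column
    return cache


square = FakeDeferred("df", "square")
cube = FakeDeferred("df", "cube")

collided = cache_by(default_label, [square, cube])
separate = cache_by(unique_label, [square, cube])
```

With the name-based label the two results share one cache entry, while the per-expression label keeps them apart, which is the motivation for labelling each conversion individually.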




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


Issue Time Tracking
-------------------

    Worklog Id:     (was: 595592)
    Time Spent: 4h 20m  (was: 4h 10m)

> ib.collect doesn't preserve the index from DeferredDataFrame instances
> ----------------------------------------------------------------------
>
>                 Key: BEAM-12246
>                 URL: https://issues.apache.org/jira/browse/BEAM-12246
>             Project: Beam
>          Issue Type: Bug
>          Components: sdk-py-core
>    Affects Versions: 2.29.0
>            Reporter: Brian Hulette
>            Assignee: Sam Rohde
>            Priority: P2
>              Labels: dataframe-api
>          Time Spent: 4h 20m
>  Remaining Estimate: 0h
>
> This happens because it uses {{to_pcollection(yield_elements='schemas', 
> include_indexes=False)}} (the default values for those arguments). To fix 
> this we should avoid converting to Beam schemas and collect the raw 
> dataframes with {{to_pcollection(yield_elements='pandas')}}.
> See https://github.com/apache/beam/pull/14356#discussion_r620647659
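
The index loss described in the issue can be sketched with plain pandas. This is only an analogy, under the assumption that emitting rows without the index behaves like the schema-based path with include_indexes=False; the names rows, rebuilt, partitions and collected are illustrative and are not part of ib.collect.

```python
import pandas as pd

# A small frame with a meaningful, named index.
df = pd.DataFrame(
    {"square": [1, 4, 9]},
    index=pd.Index(["a", "b", "c"], name="key"),
)

# Row-wise path: emit each row without its index value, then rebuild a frame.
# The original index is gone and pandas substitutes a default integer index.
rows = list(df.itertuples(index=False, name="Row"))
rebuilt = pd.DataFrame(rows)

# Raw-dataframe path: keep each partition as a pandas object and concatenate.
# The index travels with the data.
partitions = [df.iloc[:2], df.iloc[2:]]
collected = pd.concat(partitions)
```

Collecting whole pandas partitions avoids having to carry the index through a separate row schema.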



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
