[GitHub] [beam] damccorm opened a new issue, #20875: Memoize DataFrame operations

GitBox Sat, 04 Jun 2022 13:02:36 -0700


damccorm opened a new issue, #20875:
URL: https://github.com/apache/beam/issues/20875


   Currently, performing an operation on a deferred dataframe always produces a 
_new_ deferred dataframe. This means a call like to_pcollection(df.mean(), 
df.mean()) will produce two distinct PCollections that duplicate the same 
computation.
   
   This is particularly problematic for the interactive use case, where 
to_pcollection is used inside ib.collect() in combination with PCollection 
caching. Collecting df.mean() twice will duplicate the computation 
unnecessarily.
   
   We should cache the output expressions produced by operations so that 
repeated calls to the same operation reuse one expression.
   
   We need to be mindful of in-place operations when implementing this:
   - Two calls to df.mean() should produce the same result iff df has not been 
mutated in between.
   - If the output of one call to df.mean() is mutated, it must not mutate the 
output of another call to df.mean().
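   
   A minimal sketch of how these two constraints could fit together, using a 
toy model rather than Beam's actual classes. Everything here (Expr, 
ToyDeferredFrame, _memoized, fillna_inplace) is hypothetical and only 
illustrates the idea: the cache maps an operation to its output expression, 
an in-place operation rebinds the frame's expression and clears the cache, 
and each call returns a fresh wrapper so results can be mutated independently.

```python
class Expr:
    """Hypothetical stand-in for a deferred expression (one computation)."""

    def __init__(self, op, inputs=()):
        self.op = op
        self.inputs = tuple(inputs)


class ToyDeferredFrame:
    """Toy deferred frame; not Beam's DeferredDataFrame."""

    def __init__(self, expr):
        self._expr = expr
        # Maps operation name -> output Expr, valid only for the current _expr.
        self._cache = {}

    def _memoized(self, op):
        # Reuse the cached output expression if this op was already applied.
        if op not in self._cache:
            self._cache[op] = Expr(op, [self._expr])
        # Return a fresh wrapper around the shared expression, so mutating
        # one result in place cannot affect another result.
        return ToyDeferredFrame(self._cache[op])

    def mean(self):
        return self._memoized("mean")

    def fillna_inplace(self, value):
        # An in-place operation rebinds this frame to a new expression, so
        # every cached output is stale and must be dropped.
        self._expr = Expr(f"fillna({value!r})", [self._expr])
        self._cache.clear()


df = ToyDeferredFrame(Expr("source"))
a = df.mean()
b = df.mean()
same_before_mutation = a._expr is b._expr   # shared computation

a.fillna_inplace(0)                         # mutate one result only
b_unchanged = b._expr.op == "mean"

df.fillna_inplace(0)                        # mutate the input
c = df.mean()
recomputed_after_mutation = c._expr is not b._expr
```

   In this sketch, a and b share one expression until a is mutated, b is 
unaffected by that mutation, and df.mean() yields a new expression once df 
itself has been mutated.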
   
   Imported from Jira 
[BEAM-12245](https://issues.apache.org/jira/browse/BEAM-12245). Original Jira 
may contain additional context.
   Reported by: bhulette.


