Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/14579
Yeah, it would break pipelining, though I don't think it will necessarily throw an error. e.g.
```
In [22]: rdd = sc.parallelize(["b", "a", "c"])
In [23]: type(rdd)
Out[23]: pyspark.rdd.RDD
In [24]: mapped = rdd.map(lambda x: x)
In [25]: type(mapped)
Out[25]: pyspark.rdd.PipelinedRDD
In [26]: mapped._is_pipelinable()
Out[26]: True
In [27]: p = mapped.cache()
In [28]: type(p)
Out[28]: pyspark.rdd.PersistedRDD
In [29]: p._is_pipelinable()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-29-02496125eccd> in <module>()
----> 1 p._is_pipelinable()
AttributeError: 'PersistedRDD' object has no attribute '_is_pipelinable'
In [30]: mapped2 = p.map(lambda x: x)
In [31]: type(mapped2)
Out[31]: pyspark.rdd.PipelinedRDD
```
So I think chaining will work, but the pipelined RDD thinks `mapped2` is the
1st transformation, when it is actually the 2nd. I _think_ this is just an
efficiency issue rather than a correctness issue, however.
We could possibly work around it with some type checking etc., but then it
starts to feel like we're adding more complexity than the feature is worth...
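To illustrate the point, here is a toy sketch (not PySpark's real implementation; class names and the `depth` counter are made up for this example) of how map-fusion loses track of the chain once a wrapper class like the hypothetical `PersistedRDD` sits between two maps:

```python
# Toy sketch (NOT PySpark's actual code) of pipelined map-fusion and how a
# wrapper class that is not a PipelinedRDD restarts the transformation chain.

class RDD:
    """Base RDD: a map on it starts a new pipeline."""
    def map(self, f):
        return PipelinedRDD(self, f)


class PipelinedRDD(RDD):
    """Fuses consecutive maps into one function over the original parent."""
    def __init__(self, prev, f):
        if isinstance(prev, PipelinedRDD) and prev._is_pipelinable():
            prev_f = prev.func
            self.func = lambda x: f(prev_f(x))   # compose with earlier maps
            self.prev = prev.prev                # keep the original parent
            self.depth = prev.depth + 1          # count of fused transformations
        else:
            self.func = f
            self.prev = prev
            self.depth = 1                       # "thinks" it is the 1st map

    def _is_pipelinable(self):
        return True


class PersistedRDD(RDD):
    """Stand-in for the wrapper discussed above. It defines no
    _is_pipelinable, matching the AttributeError in the session."""
    def __init__(self, prev):
        self.prev = prev


rdd = RDD()
mapped = rdd.map(lambda x: x)     # depth 1, pipelinable
p = PersistedRDD(mapped)          # the caching wrapper interrupts the chain
mapped2 = p.map(lambda x: x)      # chaining still works...
print(mapped2.depth)              # ...but the new PipelinedRDD restarts at depth 1
```

So nothing breaks outright; the second map simply can't be fused with the first, which is the efficiency (not correctness) cost described above.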