Github user MLnick commented on the issue:
https://github.com/apache/spark/pull/14579
Yeah, it would break pipelining, though I don't think it will necessarily throw an error. e.g.
```
In [22]: rdd = sc.parallelize(["b", "a", "c"])
In [23]: type(rdd)
Out[23]: pyspark.rdd.RDD
In [24]: mapped = rdd.map(lambda x: x)
In [25]: type(mapped)
Out[25]: pyspark.rdd.PipelinedRDD
In [26]: mapped._is_pipelinable()
Out[26]: True
In [27]: p = mapped.cache()
In [28]: type(p)
Out[28]: pyspark.rdd.PersistedRDD
In [29]: p._is_pipelinable()
---------------------------------------------------------------------------
AttributeError Traceback (most recent call last)
<ipython-input-29-02496125eccd> in <module>()
----> 1 p._is_pipelinable()
AttributeError: 'PersistedRDD' object has no attribute '_is_pipelinable'
In [30]: mapped2 = p.map(lambda x: x)
In [31]: type(mapped2)
Out[31]: pyspark.rdd.PipelinedRDD
```
So I think chaining will work, but the pipelined RDD thinks `mapped2` is the
1st transformation, when it is actually the 2nd. I _think_ this is just an
efficiency issue rather than a correctness issue, however.
We could possibly work around it with some type checking etc., but then it
starts to feel like we're adding more complexity than the feature is worth...
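To illustrate the point, here is a toy sketch (not PySpark's real implementation; class names and the `depth` counter are made up for this example) of how map-fusion loses track of the chain once a wrapper class like the hypothetical `PersistedRDD` sits between two maps:

```python
# Toy sketch (NOT PySpark's actual code) of pipelined map-fusion and how a
# wrapper class that is not a PipelinedRDD restarts the transformation chain.

class RDD:
    """Base RDD: a map on it starts a new pipeline."""
    def map(self, f):
        return PipelinedRDD(self, f)


class PipelinedRDD(RDD):
    """Fuses consecutive maps into one function over the original parent."""
    def __init__(self, prev, f):
        if isinstance(prev, PipelinedRDD) and prev._is_pipelinable():
            prev_f = prev.func
            self.func = lambda x: f(prev_f(x))   # compose with earlier maps
            self.prev = prev.prev                # keep the original parent
            self.depth = prev.depth + 1          # count of fused transformations
        else:
            self.func = f
            self.prev = prev
            self.depth = 1                       # "thinks" it is the 1st map

    def _is_pipelinable(self):
        return True


class PersistedRDD(RDD):
    """Stand-in for the wrapper discussed above. It defines no
    _is_pipelinable, matching the AttributeError in the session."""
    def __init__(self, prev):
        self.prev = prev


rdd = RDD()
mapped = rdd.map(lambda x: x)     # depth 1, pipelinable
p = PersistedRDD(mapped)          # the caching wrapper interrupts the chain
mapped2 = p.map(lambda x: x)      # chaining still works...
print(mapped2.depth)              # ...but the new PipelinedRDD restarts at depth 1
```

So nothing breaks outright; the second map simply can't be fused with the first, which is the efficiency (not correctness) cost described above.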