You are saying the RDD lineage must be serialized, otherwise we could not
recreate it after a node failure. This is false. The RDD lineage is not
serialized; it is only relevant to the driver application, so it is simply
kept in memory there. If the driver application stops, the lineage is lost
and there is no recovery.
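
To make the point concrete, here is a minimal, Spark-free Python sketch of
the idea (the `ToyRDD` class and everything in it are hypothetical names
invented for illustration, not Spark's actual implementation): lineage is
just an object graph held in the driver process, and recomputation walks it.

```python
# Toy illustration: "lineage" is nothing but an in-memory chain of
# (parent, transformation) records, analogous to what the Spark driver keeps.
class ToyRDD:
    def __init__(self, data=None, parent=None, func=None):
        self.data = data      # only the root holds source data
        self.parent = parent  # lineage pointer; lives only in driver memory
        self.func = func      # the transformation to re-apply

    def map(self, f):
        # Recording the transformation is all that "building lineage" means;
        # nothing is serialized or written anywhere.
        return ToyRDD(parent=self, func=lambda xs: [f(x) for x in xs])

    def compute(self):
        # Recomputation after a loss: walk the lineage back to the source
        # and re-apply each transformation in order.
        if self.parent is None:
            return self.data
        return self.func(self.parent.compute())

root = ToyRDD(data=[1, 2, 3])
derived = root.map(lambda x: x * 10).map(lambda x: x + 1)
print(derived.compute())  # [11, 21, 31]
```

If the process holding `root` and `derived` dies, the chain of objects dies
with it, which is exactly why losing the driver means losing the lineage.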

On Wed, Aug 24, 2016 at 10:20 AM, kant kodali <kanth...@gmail.com> wrote:

> can you please elaborate a bit more?
>
>
>
> On Wed, Aug 24, 2016 12:41 AM, Sean Owen so...@cloudera.com wrote:
>
>> Byte code, no. It's sufficient to store the information that the RDD
>> represents, which can include serialized function closures, but that's not
>> quite storing byte code.
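
[Editor's illustration of the distinction above: serializing a function is
not the same thing as storing its byte code. The sketch below uses the
standard-library pickle module, which serializes a named, importable
function by reference (module plus qualified name); PySpark actually uses
cloudpickle, which handles more cases, so treat this only as an analogy.]

```python
import pickle

# A plain top-level function. Stdlib pickle stores it by *reference*
# (module name + function name), not by copying its compiled byte code.
def double(x):
    return x * 2

payload = pickle.dumps(double)
print(b"double" in payload)                # True: the name is in the payload
print(double.__code__.co_code in payload)  # False: the byte code is not
```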
>>
>> On Wed, Aug 24, 2016 at 2:00 AM, kant kodali <kanth...@gmail.com> wrote:
>>
>> Hi Guys,
>>
>> I have had this question for a very long time, and after diving into the
>> source code (specifically via the links below) I have a feeling that the
>> lineage of an RDD (the transformations) is converted into byte code and
>> stored in memory or on disk. Or, to ask a similar question: do we ever
>> store JVM byte code or Python byte code in memory or on disk? This makes
>> sense to me because, to reconstruct an RDD after a node failure, we need
>> to go through the lineage and execute the respective transformations, so
>> storing their byte code does make sense. However, many people seem to
>> disagree with me, so it would be great if someone could clarify.
>>
>> https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L1452
>>
>> https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L1471
>>
>> https://github.com/apache/spark/blob/6ee40d2cc5f467c78be662c1639fc3d5b7f796cf/python/pyspark/rdd.py#L229
>>
>> https://github.com/apache/spark/blob/master/python/pyspark/cloudpickle.py#L241
>>
>>
>>
