Thanks Vadim. Yes, this is a good option for us.
From: Vadim Semenov <vadim.seme...@datadoghq.com>
Sent: Wednesday, August 2, 2017 6:24:40 PM
To: Suzen, Mehmet
Cc: jeff saremi; user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
So if you just save an RDD to HDFS via `saveAsSequenceFile`, you would have
to create a new RDD that reads that data; this way you'll avoid recomputing
the RDD but may lose time on saving/loading.
Exactly the same thing happens in `checkpoint`; `checkpoint` is just a
convenient method that gives you that behavior out of the box.
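The trade-off described above (skip recomputation, pay save/load cost) can be sketched with a plain-Scala analogy. This is NOT Spark code; the file, the numbers, and the `SaveReloadDemo` object are purely illustrative:

```scala
// Plain-Scala analogy (NOT Spark): an "expensive" derivation that we can
// either recompute on every use, or write out once and re-read.
import java.nio.file.Files
import scala.jdk.CollectionConverters._

object SaveReloadDemo {
  var computations = 0 // how often the expensive step actually runs

  // Stands in for an RDD lineage: re-evaluated from scratch on every action
  def expensive(): Seq[Int] = { computations += 1; (1 to 5).map(x => x * x) }

  def run(): (Int, Int) = {
    // Two "actions" without saving: the lineage runs twice
    expensive().sum
    expensive().max
    // Save once (third and last computation), then serve further uses from disk
    val path = Files.createTempFile("demo", ".txt")
    Files.write(path, expensive().map(_.toString).asJava)
    def reload(): Seq[Int] = Files.readAllLines(path).asScala.toSeq.map(_.toInt)
    (reload().sum + reload().max, computations) // two more uses, no recomputation
  }
}
```

As in the Spark case, the second pair of uses costs one serialization round-trip instead of two recomputations.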
On 3 August 2017 at 03:00, Vadim Semenov wrote:
> `saveAsObjectFile` doesn't save the DAG, it acts as a typical action, so it
> just saves data to some destination.
Yes, that's what I thought, so the statement "...otherwise saving it on
a file will require recomputation" makes sense now.
`cache/persist` allow you to cache data and keep the DAG, so that if an
executor holding the data goes down, Spark is still able to recalculate
the missing partitions.
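The point about `cache/persist` keeping the DAG can be illustrated with another plain-Scala sketch (an analogy only, not Spark internals; `CacheDemo` and its names are made up): the cached value can be dropped and rebuilt from the retained recipe:

```scala
// Plain-Scala analogy (NOT Spark): a cache holds the computed value, but the
// recipe (the "DAG") is kept too, so a lost cache entry can be rebuilt.
object CacheDemo {
  var computations = 0

  // The retained "lineage": always able to reproduce the data
  def recipe(): Seq[Int] = { computations += 1; (1 to 5).map(x => x * x) }

  var cached: Option[Seq[Int]] = None
  def get(): Seq[Int] = cached.getOrElse {
    val v = recipe(); cached = Some(v); v
  }

  def run(): (Int, Int, Int) = {
    val first = get().sum  // computes once and caches
    get()                  // cache hit: no recomputation
    cached = None          // simulate the executor holding the data going down
    val second = get().sum // rebuilt from the recipe, the job still succeeds
    (first, second, computations)
  }
}
```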
On 3 August 2017 at 01:05, jeff saremi wrote:
> Vadim:
>
> This is from the Mastering Spark book:
>
> "It is strongly recommended that a checkpointed RDD is persisted in memory,
> otherwise saving it on a file will require recomputation."
Is this really true?
Thanks Mark. I'll examine the status more carefully to observe this.
From: Mark Hamstra <m...@clearstorydata.com>
Sent: Tuesday, August 1, 2017 11:25:46 AM
To: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
Very likely
Thanks Vadim. I'll try that.
From: Vadim Semenov <vadim.seme...@datadoghq.com>
Sent: Tuesday, August 1, 2017 12:05:17 PM
To: jeff saremi
Cc: user@spark.apache.org
Subject: Re: How can i remove the need for calling cache
You can use `.checkpoint()`:
```
val sc: SparkContext
sc.setCheckpointDir("hdfs:///tmp/checkpointDirectory")
myrdd.checkpoint()
val result1 = myrdd.map(op1(_))
result1.count() // Will compute `myrdd`, save it to HDFS, and then apply map(op1)
val result2 = myrdd.map(op2(_))
result2.count() // Will load `myrdd` from the checkpoint instead of recomputing it
```
Very likely, much of the potential duplication is already being avoided
even without calling cache/persist. When running the above code without
`myrdd.cache`, have you looked at the Spark web UI for the Jobs? For at
least one of them you will likely see that many Stages are marked as
"skipped", which means their results were reused from an earlier Job
instead of being recomputed.
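The skipped-Stages behavior can be mimicked in plain Scala (an analogy only; real Spark skips a stage when its shuffle output already exists on disk, and `SkippedStagesDemo` is an invented name):

```scala
// Plain-Scala analogy (NOT Spark): once a "stage" has materialized its output
// (like shuffle files), later jobs over the same lineage skip that stage.
object SkippedStagesDemo {
  var stageRuns = 0
  var materialized: Map[String, Seq[Int]] = Map.empty

  def runStage(id: String)(body: => Seq[Int]): Seq[Int] =
    materialized.get(id) match {
      case Some(out) => out // stage skipped: output already available
      case None =>
        stageRuns += 1
        val out = body
        materialized += (id -> out)
        out
    }

  // Two different jobs sharing the same upstream stage
  def job1(): Int = runStage("shuffle-map") { (1 to 5).map(x => x * x) }.sum
  def job2(): Int = runStage("shuffle-map") { (1 to 5).map(x => x * x) }.max
}
```

Running `job1()` then `job2()` executes the shared stage only once, which is the deduplication Mark describes happening even without `cache`.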
Hi Jeff, that looks sane to me. Do you have additional details?
On 1 August 2017 at 11:05, jeff saremi wrote:
Calling cache/persist fails all our jobs (I have posted 2 threads on this).
And we're giving up hope in finding a solution.
So I'd like to find a workaround for that:
If I save an RDD to hdfs and read it back, can I use it in more than one
operation?
Example: (using cache)
// do a whole bunch