Hi,
Besides caching, is it possible for an RDD to have multiple child RDDs? Then
I could read the input once and produce multiple outputs for multiple jobs
that share the same input.
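
For context, a rough sketch of the shape I have in mind (the paths and the
two transformations are only placeholders, not my real code):

// One parent RDD feeding two child RDDs, each consumed by its own job.
// Without cache(), each action below rescans the input, hence the question.
val input = sc.textFile("hdfs:///data/input")   // shared parent

val childA = input.map(_.length)                // child RDD 1
val childB = input.filter(_.nonEmpty)           // child RDD 2

childA.saveAsTextFile("hdfs:///data/outA")      // job 1
childB.saveAsTextFile("hdfs:///data/outB")      // job 2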
On May 5, 2015 6:24 PM, "Evan R. Sparks" <evan.spa...@gmail.com> wrote:

> Scan sharing can indeed be a useful optimization in Spark, because you
> amortize not only the time spent scanning over the data, but also the time
> spent on task launch and scheduling overheads.
>
> Here's a trivial example in Scala. I'm not aware of a place in SparkSQL
> where this is used - I'd imagine that most development effort is focused on
> single-query optimization right now.
>
> // This function takes a sequence of functions of type A => B and returns a
> // function of type A => Seq[B], where each item in the output sequence is
> // the result of applying the corresponding function from the input list.
> def combineFunctions[A, B](fns: Seq[A => B]): A => Seq[B] = {
>   def combf(a: A): Seq[B] = {
>     fns.map(f => f(a))
>   }
>   combf
> }
>
> def plusOne(x: Int) = x + 1
> def timesFive(x: Int) = x * 5
>
> val sharedF = combineFunctions(Seq[Int => Int](plusOne, timesFive))
>
> val data = sc.parallelize(Array(1,2,3,4,5,6,7))
>
> //Apply this combine function to each of your data elements.
> val res = data.map(sharedF)
>
> res.take(5)
>
> The result will look something like this.
>
> res5: Array[Seq[Int]] = Array(List(2, 5), List(3, 10), List(4, 15),
> List(5, 20), List(6, 25))
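>
> If you need the two result streams as separate RDDs afterwards, one purely
> illustrative option is to project each position back out of the combined
> sequence:
>
> // Assumes `res` is the RDD[Seq[Int]] computed above. These are plain
> // transformations, so actions on them re-run the combined map unless
> // res is cached.
> val plusOneResults   = res.map(_(0))   // outputs of plusOne
> val timesFiveResults = res.map(_(1))   // outputs of timesFive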
>
>
>
> On Tue, May 5, 2015 at 8:53 AM, Quang-Nhat HOANG-XUAN <
> hxquangn...@gmail.com> wrote:
>
>> Hi everyone,
>>
>> I have two Spark jobs inside a Spark Application, which read from the same
>> input file.
>> They are executed in 2 threads.
>>
>> Right now, I cache the input file into memory before executing these two
>> jobs.
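>>
>> A rough sketch of what that setup looks like (the file path and the two
>> per-job computations below are just placeholders, not my real code):
>>
>> // Hypothetical sketch: cache the input once, then let two threads each
>> // submit their own job over the shared cached RDD.
>> import scala.concurrent.{Await, Future}
>> import scala.concurrent.duration.Duration
>> import scala.concurrent.ExecutionContext.Implicits.global
>>
>> val input = sc.textFile("hdfs:///data/input").cache()
>>
>> val job1 = Future { input.map(_.length).reduce(_ + _) }
>> val job2 = Future { input.filter(_.nonEmpty).count() }
>>
>> Await.result(job1, Duration.Inf)
>> Await.result(job2, Duration.Inf)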
>>
>> Is there another way to share the same input with only one read?
>> I know there is something called Multiple Query Optimization, but I don't
>> know whether it is applicable to Spark (or SparkSQL).
>>
>> Thank you.
>>
>> Quang-Nhat
>>
>
>
