Thanks, JaeBoo. In our case, the shuffle writes are similar in both versions.

2015-01-21 17:01 GMT+08:00 JaeBoo Jung <itsjb.j...@samsung.com>:
> I was recently faced with a similar issue, but unfortunately I could
> not find out why it happened.
>
> Here's the JIRA ticket from my previous post:
> https://issues.apache.org/jira/browse/SPARK-5081
>
> Please check the shuffle I/O differences between the two versions in the
> Spark web UI, because it may be related to my case.
>
> Thanks
>
> Kevin
>
> ------- *Original Message* -------
>
> *Sender* : Fengyun RAO<raofeng...@gmail.com>
>
> *Date* : 2015-01-21 17:41 (GMT+09:00)
>
> *Title* : Re: spark 1.2 three times slower than spark 1.1, why?
>
> Maybe you mean a different spark-submit script?
>
> We also use the same spark-submit script, and thus the same memory, cores,
> etc. configuration.
>
> 2015-01-21 15:45 GMT+08:00 Sean Owen <so...@cloudera.com>:
>
>> I don't know of any reason to think the singleton pattern doesn't work or
>> works differently. I wonder if, for example, task scheduling is different
>> in 1.2 and you have more partitions across more workers and so are loading
>> more copies more slowly into your singletons.
>> On Jan 21, 2015 7:13 AM, "Fengyun RAO" <raofeng...@gmail.com> wrote:
>>
>>> The LogParser instance is not serializable, and thus cannot be a
>>> broadcast.
>>>
>>> What’s worse, it contains an LRU cache, which is essential to the
>>> performance, and which we would like to share among all the tasks on
>>> the same node.
>>>
>>> If that is the case, what’s the recommended way to share a variable among
>>> all the tasks within the same executor?
>>>
>>> 2015-01-21 15:04 GMT+08:00 Davies Liu <dav...@databricks.com>:
>>>
>>>> Maybe some change related to closure serialization means LogParser is
>>>> no longer a singleton, so it is initialized for every task.
>>>>
>>>> Could you change it to a Broadcast?
>>>>
>>>> On Tue, Jan 20, 2015 at 10:39 PM, Fengyun RAO <raofeng...@gmail.com>
>>>> wrote:
>>>> > Currently we are migrating from spark 1.1 to spark 1.2, but found the
>>>> > program 3x slower, with nothing else changed.
>>>> > Note: our program in spark 1.1 has successfully processed a whole
>>>> > year of data, and is quite stable.
>>>> >
>>>> > The main script is as below:
>>>> >
>>>> > sc.textFile(inputPath)
>>>> >   .flatMap(line => LogParser.parseLine(line))
>>>> >   .groupByKey(new HashPartitioner(numPartitions))
>>>> >   .mapPartitionsWithIndex(...)
>>>> >   .foreach(_ => {})
>>>> >
>>>> > where LogParser is a singleton which may take some time to
>>>> > initialize and is shared across the executor.
>>>> >
>>>> > The flatMap stage is 3x slower.
>>>> >
>>>> > We tried to change spark.shuffle.manager back to hash, and
>>>> > spark.shuffle.blockTransferService back to nio, but it didn’t help.
>>>> >
>>>> > Could somebody explain possible causes, or what we should test or
>>>> > change to find out?
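
One way to test the suggestion quoted above, that LogParser is being initialized more than once per executor, might be to log every initialization and count the log lines per executor. Below is a minimal Scala sketch of that idea, not the real LogParser: `LogParserSketch`, `initCount`, `expensiveLookup`, the cache capacity, and the parsing logic are all hypothetical stand-ins, and the sketch deliberately has no Spark dependency so it can be run on its own.

```scala
import java.lang.management.ManagementFactory
import java.util.{Collections, LinkedHashMap, Map => JMap}
import java.util.concurrent.atomic.AtomicInteger

// Hypothetical stand-in for LogParser; every name below is an assumption.
object LogParserSketch {
  // How many times the expensive setup ran for this loaded copy of the object.
  val initCount = new AtomicInteger(0)

  private val capacity = 1000 // assumed cache size

  // A lazy val is initialized at most once, on first use, in a thread-safe way.
  private lazy val cache: JMap[String, String] = {
    initCount.incrementAndGet()
    // Log to stderr so each initialization is visible in the process log.
    System.err.println(
      s"LogParserSketch initialized in ${ManagementFactory.getRuntimeMXBean.getName}")
    // An access-ordered LinkedHashMap that evicts its eldest entry acts as an LRU cache.
    val lru = new LinkedHashMap[String, String](16, 0.75f, true) {
      override def removeEldestEntry(eldest: JMap.Entry[String, String]): Boolean =
        size() > capacity
    }
    Collections.synchronizedMap(lru)
  }

  // Placeholder for whatever costly work the real cache is meant to avoid.
  private def expensiveLookup(key: String): String = key.toUpperCase

  // Splits "key rest-of-line"; returns None for lines without a space.
  def parseLine(line: String): Option[(String, String)] = {
    val parts = line.split(" ", 2)
    if (parts.length < 2) None
    else {
      val key = parts(0)
      var resolved = cache.get(key)
      if (resolved == null) {
        resolved = expensiveLookup(key)
        cache.put(key, resolved)
      }
      Some((resolved, parts(1)))
    }
  }
}

object Demo {
  def main(args: Array[String]): Unit = {
    val lines = Seq("host1 GET /a", "host2 GET /b", "host1 GET /c", "malformed")
    val parsed = lines.flatMap(line => LogParserSketch.parseLine(line))
    println(parsed)
    println(s"initializations: ${LogParserSketch.initCount.get}")
  }
}
```

If the same kind of log line were added to the real LogParser, an executor's stderr showing it many times under 1.2 but only once under 1.1 would support the re-initialization theory. Seeing it once per executor in both versions would point elsewhere, for example at scheduling or partitioning differences. Note that the counter only covers one loaded copy of the object, so the log lines are the more reliable signal.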