What are your spark-submit commands in both cases? Is it Spark Standalone or YARN (both support client and cluster deploy modes)? Accordingly, what is the number of executors/cores requested?
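For reference, on a standalone cluster the two submissions would differ only in --deploy-mode, something like the following (the class name, jar, master URL and resource numbers are placeholders, not taken from your setup):

  # Client mode: the driver runs inside this spark-submit process.
  spark-submit \
    --master spark://<master-host>:7077 \
    --deploy-mode client \
    --class com.example.StreamingJob \
    --executor-memory 8g \
    --total-executor-cores 16 \
    streaming-job.jar

  # Cluster mode: the driver is launched on one of the worker machines.
  spark-submit \
    --master spark://<master-host>:7077 \
    --deploy-mode cluster \
    --class com.example.StreamingJob \
    --executor-memory 8g \
    --total-executor-cores 16 \
    streaming-job.jar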
TD

On Wed, Dec 31, 2014 at 10:36 AM, Enno Shioji <eshi...@gmail.com> wrote:
> Also, the job was deployed from the master machine in the cluster.
>
> On Wed, Dec 31, 2014 at 6:35 PM, Enno Shioji <eshi...@gmail.com> wrote:
>> Oh sorry, that was an edit mistake. The code is essentially:
>>
>>   val msgStream = kafkaStream
>>     .map { case (k, v) => v }
>>     .map(DatatypeConverter.printBase64Binary)
>>     .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
>>
>> I.e. there is essentially no original code (I was calling saveAsTextFile
>> in a "save" function, but that was just a remnant from previous debugging).
>>
>> On Wed, Dec 31, 2014 at 6:21 PM, Sean Owen <so...@cloudera.com> wrote:
>>> -dev, +user
>>>
>>> A decent guess: does your 'save' function entail collecting data back
>>> to the driver? And are you running this from a machine that's not in
>>> your Spark cluster? Then in client mode you're shipping data back to a
>>> less-nearby machine, compared to cluster mode. That could explain the
>>> bottleneck.
>>>
>>> On Wed, Dec 31, 2014 at 4:12 PM, Enno Shioji <eshi...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>> I have a very, very simple streaming job. When I deploy it on the
>>>> exact same cluster, with the exact same parameters, I see a big (40%)
>>>> performance difference between "client" and "cluster" deploy mode.
>>>> This seems a bit surprising... Is this expected?
>>>>
>>>> The streaming job is:
>>>>
>>>>   val msgStream = kafkaStream
>>>>     .map { case (k, v) => v }
>>>>     .map(DatatypeConverter.printBase64Binary)
>>>>     .foreachRDD(save)
>>>>     .saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])
>>>>
>>>> I tried several times, but the job deployed in "client" mode can only
>>>> write at 60% of the throughput of the job deployed in "cluster" mode,
>>>> and this happens consistently. I'm logging at INFO level, but my
>>>> application code doesn't log anything, so these are only Spark logs.
>>>> The logs I see in "client" mode don't seem like a crazy amount.
>>>>
>>>> The setup is:
>>>>
>>>>   spark-ec2 [...] \
>>>>     --copy-aws-credentials \
>>>>     --instance-type=m3.2xlarge \
>>>>     -s 2 launch test_cluster
>>>>
>>>> And all the deployment was done from the master machine.
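(To make Sean's guess above concrete: the distinction he is drawing is roughly the one between the two "save" variants below. Both bodies are hypothetical sketches, not Enno's actual code; LzoCodec is assumed to be hadoop-lzo's com.hadoop.compression.lzo.LzoCodec.)

  import org.apache.spark.rdd.RDD
  import com.hadoop.compression.lzo.LzoCodec

  // Funnels everything through the driver: every record crosses the
  // network to the driver machine before being written, so in client
  // mode a less-nearby driver becomes the bottleneck.
  def saveViaDriver(rdd: RDD[String]): Unit = {
    val all = rdd.collect()   // pulls the whole RDD back to the driver
    all.foreach(println)      // stand-in for some driver-side write
  }

  // Distributed save: each executor writes its own partitions to S3
  // directly, so where the driver runs barely affects write throughput.
  def saveDistributed(rdd: RDD[String]): Unit =
    rdd.saveAsTextFile("s3n://some.bucket/path", classOf[LzoCodec])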