Also check if the compression algorithm you use is splittable? Thanks, Sonal Nube Technologies <http://www.nubetech.co>
<http://in.linkedin.com/in/sonalgoyal> On Thu, Sep 28, 2017 at 2:17 PM, Tejeshwar J1 < tejeshwar...@globallogic.com.invalid> wrote: > Hi Miller, > > > > Try using > > 1.*coalesce(numberOfPartitions*) to reduce the number of partitions in > order to avoid idle cores . > > 2.Try reducing executor memory as you increase the number of executors. > > 3. Try performing GC or changing naïve java serialization to *kryo* > serialization. > > > > > > Thanks, > > Tejeshwar > > > > > > *From:* Jeroen Miller [mailto:bluedasya...@gmail.com] > *Sent:* Thursday, September 28, 2017 2:11 PM > *To:* user@spark.apache.org > *Subject:* More instances = slower Spark job > > > > Hello, > > > > I am experiencing a disappointing performance issue with my Spark jobs > > as I scale up the number of instances. > > > > The task is trivial: I am loading large (compressed) text files from S3, > > filtering out lines that do not match a regex, counting the numbers > > of remaining lines and saving the resulting datasets as (compressed) > > text files on S3. Nothing that a simple grep couldn't do, except that > > the files are too large to be downloaded and processed locally. > > > > On a single instance, I can process X GBs per hour. When scaling up > > to 10 instances, I noticed that processing the /same/ amount of data > > actually takes /longer/. > > > > This is quite surprising as the task is really simple: I was expecting > > a significant speed-up. My naive idea was that each executors would > > process a fraction of the input file, count the remaining lines /locally/, > > and save their part of the processed file /independently/, thus no data > > shuffling would occur. > > > > Obviously, this is not what is happening. > > > > Can anyone shed some light on this or provide pointers to relevant > > information? > > > > Regards, > > > > Jeroen > > >