Also check whether the compression algorithm you use is splittable.
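For context on the splittability point: a gzip file is a single DEFLATE stream, so a reader must start at byte 0, and Spark ends up assigning each whole file to a single task. A stdlib-only sketch (no Spark required) of why a gzip stream cannot be decoded from an arbitrary offset:

```python
import gzip

data = b"some log line\n" * 100_000
gz = gzip.compress(data)

# Decoding from byte 0 works: the gzip header and DEFLATE stream are intact.
assert gzip.decompress(gz) == data

# Decoding from the middle fails: there is no gzip header at that offset,
# and DEFLATE has no self-synchronizing block boundaries to resume from.
try:
    gzip.decompress(gz[len(gz) // 2:])
    mid_readable = True
except Exception:
    mid_readable = False
print(mid_readable)  # False
```

Block-oriented codecs such as bzip2 are splittable, and splittable container formats (e.g. SequenceFile) can wrap non-splittable codecs, which is why those parallelize across tasks while plain gzip does not.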

Thanks,
Sonal
Nube Technologies <http://www.nubetech.co>

<http://in.linkedin.com/in/sonalgoyal>



On Thu, Sep 28, 2017 at 2:17 PM, Tejeshwar J1 <
tejeshwar...@globallogic.com.invalid> wrote:

> Hi Miller,
>
>
>
> Try using:
>
> 1. *coalesce(numberOfPartitions)* to reduce the number of partitions in
> order to avoid idle cores.
>
> 2. Try reducing executor memory as you increase the number of executors.
>
> 3. Try performing GC tuning, or changing the default Java serialization
> to *kryo* serialization.
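A sketch of how tips 2 and 3 might look as a spark-defaults.conf fragment (the property names are standard Spark configuration; the values are purely illustrative and depend on cluster size and workload):

```properties
# Illustrative values only -- tune for your cluster.
spark.serializer           org.apache.spark.serializer.KryoSerializer
spark.executor.instances   10
spark.executor.memory      4g
```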
>
>
>
>
>
> Thanks,
>
> Tejeshwar
>
>
>
>
>
> *From:* Jeroen Miller [mailto:bluedasya...@gmail.com]
> *Sent:* Thursday, September 28, 2017 2:11 PM
> *To:* user@spark.apache.org
> *Subject:* More instances = slower Spark job
>
>
>
> Hello,
>
>
>
> I am experiencing a disappointing performance issue with my Spark jobs
> as I scale up the number of instances.
>
> The task is trivial: I am loading large (compressed) text files from S3,
> filtering out lines that do not match a regex, counting the number of
> remaining lines, and saving the resulting datasets as (compressed) text
> files on S3. Nothing that a simple grep couldn't do, except that the
> files are too large to be downloaded and processed locally.
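[The per-split work described above is essentially a local grep. A minimal plain-Python stand-in, with a hypothetical pattern and sample lines, for what one task would do on its partition:]

```python
import re

# Hypothetical pattern and sample lines standing in for one input split.
pattern = re.compile(r"ERROR")
lines = ["ERROR disk full", "INFO all good", "ERROR timeout"]

# Filter and count locally: no data leaves the task, so no shuffle is needed.
kept = [line for line in lines if pattern.search(line)]
print(len(kept))  # 2
```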
>
>
>
> On a single instance, I can process X GBs per hour. When scaling up to
> 10 instances, I noticed that processing the /same/ amount of data
> actually takes /longer/.
>
> This is quite surprising as the task is really simple: I was expecting
> a significant speed-up. My naive idea was that each executor would
> process a fraction of the input file, count the remaining lines
> /locally/, and save its part of the processed file /independently/,
> thus no data shuffling would occur.
>
> Obviously, this is not what is happening.
>
> Can anyone shed some light on this or provide pointers to relevant
> information?
>
>
>
> Regards,
>
>
>
> Jeroen
>
>
>
