Hi,

no matter how many nodes you start, if the input is a single compressed
text file, Spark cannot split it, so the job will not use any parallelism.

Your options are therefore to convert the text file to Parquet (or another
splittable format), or simply to split the text file itself into several
smaller individual files.
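
For illustration, here is a rough Scala sketch along those lines. The
bucket paths, the regex and the partition count are placeholders I made
up, not anything taken from your job. It reads the single gzipped file,
repartitions it so that the filtering, counting and writing run in
parallel, and writes the result out as many part files (Parquet, or
compressed text if you prefer). Bear in mind that decompressing a single
gzip file still happens in one task; only the work after the repartition
is spread across the executors.

import org.apache.spark.sql.SparkSession

object SplitAndFilter {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("split-and-filter").getOrCreate()

    // A single gzipped file arrives as one partition, so spread it out
    // explicitly before doing the real work. The decompression itself
    // still runs in a single task; everything after the repartition is
    // distributed.
    val lines = spark.read
      .textFile("s3a://my-bucket/input/huge-file.txt.gz")   // placeholder path
      .repartition(200)                                     // tune to your cluster

    val kept = lines.filter(_.matches(".*ERROR.*"))         // placeholder regex

    println(s"matching lines: ${kept.count()}")

    // Writing from many partitions yields many part files, which any
    // later job can read in parallel.
    kept.write.parquet("s3a://my-bucket/output/filtered-parquet/")
    // or keep it as compressed text:
    // kept.write.option("compression", "gzip").text("s3a://my-bucket/output/filtered-text/")

    spark.stop()
  }
}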

Please do let me know how things are progressing.

Regards,
Gourav Sengupta

On Thu, Sep 28, 2017 at 9:41 AM, Jeroen Miller <bluedasya...@gmail.com>
wrote:

> Hello,
>
> I am experiencing a disappointing performance issue with my Spark jobs
> as I scale up the number of instances.
>
> The task is trivial: I am loading large (compressed) text files from S3,
> filtering out lines that do not match a regex, counting the number
> of remaining lines and saving the resulting datasets as (compressed)
> text files on S3. Nothing that a simple grep couldn't do, except that
> the files are too large to be downloaded and processed locally.
>
> On a single instance, I can process X GBs per hour. When scaling up
> to 10 instances, I noticed that processing the /same/ amount of data
> actually takes /longer/.
>
> This is quite surprising as the task is really simple: I was expecting
> a significant speed-up. My naive idea was that each executor would
> process a fraction of the input file, count the remaining lines /locally/,
> and save their part of the processed file /independently/, thus no data
> shuffling would occur.
>
> Obviously, this is not what is happening.
>
> Can anyone shed some light on this or provide pointers to relevant
> information?
>
> Regards,
>
> Jeroen
>
>
