Hello, I am seeing disappointing performance from my Spark jobs as I scale up the number of instances.
The task is trivial: I am loading large (compressed) text files from S3, filtering out the lines that do not match a regex, counting the number of remaining lines, and saving the resulting datasets back to S3 as (compressed) text files. Nothing that a simple grep couldn't do, except that the files are too large to be downloaded and processed locally.

On a single instance, I can process X GB per hour. When I scale up to 10 instances, I noticed that processing the /same/ amount of data actually takes /longer/. This is quite surprising for such a simple task: I was expecting a significant speed-up.

My naive mental model was that each executor would process a fraction of the input files, count its remaining lines /locally/, and save its part of the output /independently/, so no data shuffling would occur. That is clearly not what is happening. Can anyone shed some light on this, or point me to relevant information?

Regards,
Jeroen