Hi Tom,

HDFS and Spark don't actually have a minimum block size -- so in that first 
dataset, the files won't each be costing you 64 MB. However, the main reason 
for the difference in performance here is probably the number of RDD 
partitions. In the first case, Spark will create an RDD with 10,000 
partitions, one per file, while in the second case it will likely have only 
1-2 partitions. The number of partitions affects the level of parallelism of 
operations like reduceByKey (by default, reduceByKey uses as many partitions 
as the parent RDD it runs on), and in this case I think it's causing 
reduceByKey to spill to disk within tasks in the second program but not in 
the first, because each task has more input data.

You can see the number of partitions in each stage of your computation on the 
application web UI at http://<driver>:4040 if you want to confirm this. Also, 
for both programs, you can manually set the number of partitions for the 
reduce by passing a second argument to reduceByKey (e.g. reduceByKey(myFunc, 
100)). In general, it's best to choose this value so that the input data for 
each reduce task fits in memory, to avoid spilling.
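
For example, in Java (a minimal sketch -- "pairs" stands for whatever 
JavaPairRDD your initialization step produces):

  import org.apache.spark.api.java.JavaPairRDD;

  // Explicitly request 100 reduce partitions rather than inheriting
  // the parent RDD's partition count:
  JavaPairRDD<String, Integer> counts =
      pairs.reduceByKey((a, b) -> a + b, 100);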

Matei

On Oct 5, 2014, at 1:58 PM, Tom Hubregtsen <thubregt...@gmail.com> wrote:

> Hi,
> 
> I ran the same version of a program with two different types of input
> containing equivalent information. 
> Program 1: 10,000 files with on average 50 IDs each, one per line
> Program 2: 1 file containing 10,000 lines, with on average 50 IDs per line
> 
> My program takes the input, creates key/value pairs from it, and performs
> about 7 more steps. I have compared the sets of key/value pairs after the
> initialization phase, and they are equivalent. All other steps are identical.
> The only difference is using wholeTextFiles versus textFile, and the
> initialization map function itself.
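> 
> For reference, the two read paths look roughly like this (a minimal sketch
> in Java; "sc" is assumed to be a JavaSparkContext and the paths are
> illustrative):
> 
>   // Program 1: one (filename, contents) record per file
>   JavaPairRDD<String, String> files = sc.wholeTextFiles("hdfs:///input/dir");
> 
>   // Program 2: one record per line
>   JavaRDD<String> lines = sc.textFile("hdfs:///input/file.txt");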
> 
> Since I am reading in from HDFS, and the minimum part size there is 64 MB,
> every file in program 1 will take up 64 MB, even though they are only KBs in
> size originally. Because of this, I expected program 2 to be faster, or at
> least equivalent, as the initialization phase may be negligible.
> 
> In my first comparison, I provided the program with sufficient memory, and
> they both take around 2 minutes. No surprises here.
> 
> In my second comparison, I limited the memory, in this case to 4 GB. Program 1
> executes in a little over 4 minutes, but program 2 takes over 15 minutes
> (at which point I terminated it: I could see it was getting there, but it was
> spilling massively in every stage). The difference between the two runs is
> the amount of spilling in later phases of the program (*not* in the
> initialization phase). Program 1 spills 2 chunks per stage:
> 14/10/05 14:52:30 INFO ExternalAppendOnlyMap: Thread 241 spilling in-memory
> map of 327 MB to disk (1 time so far)
> 14/10/05 14:52:30 INFO ExternalAppendOnlyMap: Thread 242 spilling in-memory
> map of 328 MB to disk (1 time so far)
> Program 2 also spills these 2 chunks:
> 14/10/05 14:44:35 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory
> map of 320 MB to disk (1 time so far)
> 14/10/05 14:44:35 INFO ExternalAppendOnlyMap: Thread 241 spilling in-memory
> map of 323 MB to disk (1 time so far)
> But then it spills 2 * ~15,000 chunks of <1 MB each:
> 14/10/05 14:44:41 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory
> map of 0 MB to disk (15561 times so far)
> 14/10/05 14:44:41 INFO ExternalAppendOnlyMap: Thread 240 spilling in-memory
> map of 0 MB to disk (13866 times so far)
> 
> I understand that RDDs with narrow dependencies on their parents are
> pipelined and form a stage, and that these all trace back to the parent RDD.
> This parent RDD will have different partitioning and different data placement
> between the two programs. Because of this, I did expect a difference in this
> stage, but since we change from JavaRDD to JavaPairRDD and do a reduceByKey
> after the initialization, I expected the difference to be gone in later stages.
> Can anyone explain this behavior, or point me in a direction?
> 
> Thanks in advance!
> 

