Andrzej Bialecki wrote:
The reason is that if you pack this file into your job JAR, the job jar will become very large (presumably this 40MB is already compressed?). The job jar needs to be copied to each tasktracker for each task, so you will experience a performance hit just because of the size of the job jar ... whereas if this file sits on DFS and is highly replicated, its content will always be available locally.
Note that the job jar is copied into HDFS with a highish replication (10?), and that it is only copied to each tasktracker node once per *job*, not per task. So it's only faster to manage this yourself if you have a sequence of jobs that share this data, and if the time to re-replicate it per job is significant.
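To make the trade-off concrete, here is a rough back-of-the-envelope sketch in Python. It only counts megabytes moved to tasktracker nodes and ignores the initial HDFS replication of the jar. The 40MB size comes from the thread; the node, task, and job counts are hypothetical, and the function names are made up for illustration. It also assumes the simplest case where a self-managed DFS file ends up on every node exactly once.

```python
def jar_copy_mb(jar_mb, nodes, jobs):
    """File packed in the job jar: copied to each node once per job."""
    return jar_mb * nodes * jobs


def per_task_copy_mb(jar_mb, tasks_per_job, jobs):
    """What the cost would be if the jar were copied once per task.
    This is the assumption being corrected, shown only for comparison."""
    return jar_mb * tasks_per_job * jobs


def dfs_copy_mb(file_mb, nodes):
    """File managed by hand on DFS: assumed to reach each node once
    and then be reused by every later job (a simplifying assumption)."""
    return file_mb * nodes


jar_mb = 40          # size mentioned in the thread
nodes = 20           # hypothetical cluster size
tasks_per_job = 500  # hypothetical
jobs = 10            # hypothetical sequence of jobs sharing the data

print("in job jar, per job :", jar_copy_mb(jar_mb, nodes, jobs), "MB")
print("if it were per task :", per_task_copy_mb(jar_mb, tasks_per_job, jobs), "MB")
print("self-managed on DFS :", dfs_copy_mb(jar_mb, nodes), "MB")
```

With a single job the first and third figures are identical, which is the point: managing the file yourself only pays off across a sequence of jobs that share it.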
Doug