Using hadoop-0.20.2+737 on Red Hat's distribution. I'm trying to use a dictionary.csv file from a Lucene index inside a map function, plus another comma-delimited file.
It's just a simple loop: read a line, split it on commas, and add the dictionary entry to a hash map. The file is about 8M with 1.5M lines. I'm using an absolute path, so the read is local (not HDFS); I've verified from the job status that no HDFS reads occur.

When I run this outside of Hadoop it executes in 6 seconds. Inside Hadoop it takes 13 seconds, with the java process at 100% CPU the whole time. This makes no sense to me; I would have thought it would execute in the same time frame, since it's just reading a local file (I'm only running one task at the moment).

I'm also reading another file in a similar fashion (longer lines that also get split) and see 3.4 seconds vs. 0.3 seconds. That one is 45 lines and 278K. It looks like the split call may be the slow part, since the smaller file with more columns runs 10X slower, while the large file is "only" 2X slower. Does anybody have any idea why file input is slower under Hadoop?
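For reference, here's roughly what the loading loop looks like. This is a minimal sketch; the class name, path, and two-column layout are placeholders, not the actual code:

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;

    // Sketch of the dictionary-loading loop described above.
    // Path and column layout are illustrative only.
    public class DictionaryLoader {
        public static Map<String, String> load(String path) throws IOException {
            Map<String, String> dict = new HashMap<String, String>();
            BufferedReader reader = new BufferedReader(new FileReader(path));
            try {
                String line;
                while ((line = reader.readLine()) != null) {
                    // Split each line on commas and keep the first two fields.
                    String[] fields = line.split(",");
                    if (fields.length >= 2) {
                        dict.put(fields[0], fields[1]);
                    }
                }
            } finally {
                reader.close();
            }
            return dict;
        }
    }

In the job itself this runs once per task, before any records are processed, so the timing above is just for this loop.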
