I have a job which takes an XML file: the splitter breaks the file into
tags, the mapper parses each tag and sends the data to the
reducer. I am using a custom splitter which reads the file looking for
start and end tags.
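
For reference, the record reader behind that splitter is essentially a byte
scanner for the start and end tags. A minimal sketch of that pattern is below;
TagRecordReader and the <scan> tag literals are placeholders, not the actual
code from this job:

// Minimal sketch of a tag-scanning RecordReader (new org.apache.hadoop.mapreduce API).
// TagRecordReader and the <scan> tag literals are placeholders, not the job's real code.
import java.io.IOException;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class TagRecordReader extends RecordReader<LongWritable, Text> {

    private static final byte[] START_TAG = "<scan>".getBytes(StandardCharsets.UTF_8);
    private static final byte[] END_TAG   = "</scan>".getBytes(StandardCharsets.UTF_8);

    private FSDataInputStream in;
    private long start, end;
    private final DataOutputBuffer buffer = new DataOutputBuffer();
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        start = fileSplit.getStart();
        end = start + fileSplit.getLength();
        FileSystem fs = fileSplit.getPath().getFileSystem(context.getConfiguration());
        in = fs.open(fileSplit.getPath());
        in.seek(start);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        // Only start a new record if an opening tag begins inside this split.
        if (in.getPos() < end && readUntilMatch(START_TAG, false)) {
            buffer.reset();
            buffer.write(START_TAG);
            if (readUntilMatch(END_TAG, true)) {       // copy everything up to the end tag
                key.set(in.getPos());
                value.set(buffer.getData(), 0, buffer.getLength());
                return true;
            }
        }
        return false;
    }

    // Scan forward a byte at a time until the given tag is matched;
    // optionally copy the bytes read into the record buffer.
    private boolean readUntilMatch(byte[] match, boolean insideRecord) throws IOException {
        int i = 0;
        while (true) {
            int b = in.read();
            if (b == -1) return false;                 // end of file
            if (insideRecord) buffer.write(b);
            if (b == match[i]) {
                if (++i >= match.length) return true;  // whole tag matched
            } else {
                i = 0;
            }
            // Stop hunting for a start tag once we pass the split boundary.
            if (!insideRecord && i == 0 && in.getPos() >= end) return false;
        }
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public void close() throws IOException { in.close(); }

    @Override
    public float getProgress() throws IOException {
        return end == start ? 0f : Math.min(1f, (in.getPos() - start) / (float) (end - start));
    }
}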

When I run the splitter and mapper code standalone - generating separate
tags and parsing them - I can read a file of about 500 MB containing
12,000 tags on my local system in 23 seconds.

When I read the file from HDFS on a local cluster, I can read and parse it
in 38 seconds.

When I run the same code on an eight-node cluster I get 7 map tasks. The
mappers take about 190 seconds to handle 100 tags, of which roughly
200 ms is parsing; almost all of the rest of the time is spent in
context.write. A mapper handling 1,600 tags takes about 3 hours.
The statistics for one such map task are below. It is true that each tag
will be sent to about 300 keys, but 3 hours to write 1.5 million records
and 5 GB still seems way excessive.

*FileSystemCounters*
FILE_BYTES_READ        816,935,457
HDFS_BYTES_READ        439,554,860
FILE_BYTES_WRITTEN   1,667,745,197

*Performance*
TotalScoredScans   1,660

*Map-Reduce Framework*
Map input records          6,134
Map output records       571,475
Map output bytes   5,517,423,780
Combine input records          0
Combine output records         0
Spilled Records        1,690,063
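
To make concrete where the time is going, this is roughly the write pattern
described above: one parsed tag fanned out to about 300 keys through
context.write. TagMapper, keysFor(), and the key strings are hypothetical
stand-ins for the real parsing and key-generation code:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class TagMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text tagXml, Context context)
            throws IOException, InterruptedException {
        // Parsing a tag takes ~200 ms; nearly all the rest of the time is spent
        // in the loop below, where every record is serialized into the map-side
        // sort buffer and eventually spilled to local disk.
        String parsed = tagXml.toString();              // placeholder for the real parse
        for (String k : keysFor(parsed)) {              // ~300 keys per tag
            outKey.set(k);
            outValue.set(parsed);                       // the payload goes out with every key
            context.write(outKey, outValue);
        }
    }

    private List<String> keysFor(String parsed) {
        // Hypothetical key generation standing in for the real logic.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 300; i++) {
            keys.add("key-" + i);
        }
        return keys;
    }
}

If that shape is close, the Map output bytes counter above (roughly 9.6 KB per
output record) is just the cost of re-emitting the payload under every key.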

Does anyone want to offer suggestions on how to tune the job better?

-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
