I have a job which takes an XML file - the splitter breaks the file into tags, the mapper parses each tag and sends the data to the reducer. I am using a custom splitter which reads the file looking for start and end tags.
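In case it matters, the splitter's record reader boils down to something like the following (a stripped-down sketch, not my actual class - the class name and the start/end tag strings are placeholders):

import java.io.IOException;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.DataOutputBuffer;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileSplit;

public class XmlTagRecordReader extends RecordReader<LongWritable, Text> {
    private static final byte[] START_TAG = "<scan>".getBytes();   // placeholder tag
    private static final byte[] END_TAG = "</scan>".getBytes();    // placeholder tag

    private FSDataInputStream in;
    private long start, end;
    private final DataOutputBuffer buffer = new DataOutputBuffer();
    private final LongWritable key = new LongWritable();
    private final Text value = new Text();

    @Override
    public void initialize(InputSplit split, TaskAttemptContext context) throws IOException {
        FileSplit fileSplit = (FileSplit) split;
        start = fileSplit.getStart();
        end = start + fileSplit.getLength();
        Path file = fileSplit.getPath();
        in = file.getFileSystem(context.getConfiguration()).open(file);
        in.seek(start);
    }

    @Override
    public boolean nextKeyValue() throws IOException {
        // scan forward to the next start tag, then copy bytes up to and including the end tag
        if (in.getPos() < end && readUntilMatch(START_TAG, false)) {
            buffer.reset();
            buffer.write(START_TAG);
            if (readUntilMatch(END_TAG, true)) {
                key.set(in.getPos());
                value.set(buffer.getData(), 0, buffer.getLength());
                return true;
            }
        }
        return false;
    }

    // advance through the stream looking for 'match'; if withinBlock, copy bytes into buffer
    private boolean readUntilMatch(byte[] match, boolean withinBlock) throws IOException {
        int i = 0;
        while (true) {
            int b = in.read();
            if (b == -1) return false;                 // end of file
            if (withinBlock) buffer.write(b);
            if (b == match[i]) {
                if (++i >= match.length) return true;  // matched the whole tag
            } else {
                i = 0;
            }
            // stop looking for new start tags once we are past the end of the split
            if (!withinBlock && i == 0 && in.getPos() >= end) return false;
        }
    }

    @Override public LongWritable getCurrentKey() { return key; }
    @Override public Text getCurrentValue() { return value; }
    @Override public float getProgress() throws IOException {
        return end == start ? 1.0f : (in.getPos() - start) / (float) (end - start);
    }
    @Override public void close() throws IOException { in.close(); }
}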
When I run the splitter and mapper code - generating separate tags and parsing them - I can read a file of about 500 MB containing 12,000 tags on my local system in 23 seconds. When I read the same file from HDFS on a local cluster I can read and parse it in 38 seconds. When I run the same code on an eight-node cluster I get 7 map tasks. The mappers are taking 190 seconds to handle 100 tags, of which about 200 ms is parsing and almost all of the rest of the time is in context.write. A mapper handling 1,600 tags takes about 3 hours.

These are the statistics for a map task. It is true that one tag will be sent to about 300 keys, but still, 3 hours to write 1.5 million records and 5 GB seems way excessive.

*FileSystemCounters*
FILE_BYTES_READ            816,935,457
HDFS_BYTES_READ            439,554,860
FILE_BYTES_WRITTEN       1,667,745,197

*Performance*
TotalScoredScans                 1,660

*Map-Reduce Framework*
Combine output records               0
Map input records                6,134
Spilled Records              1,690,063
Map output bytes         5,517,423,780
Combine input records                0
Map output records             571,475

Anyone want to offer suggestions on how to tune the job better?

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
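P.S. In case it clarifies the write pattern, the map side boils down to something like this (a stripped-down sketch - the class name, key strings, and the fixed fan-out count stand in for my real code):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ScanMapper extends Mapper<LongWritable, Text, Text, Text> {
    private static final int KEYS_PER_TAG = 300;   // each tag fans out to roughly 300 keys
    private final Text outKey = new Text();
    private final Text outValue = new Text();

    @Override
    protected void map(LongWritable offset, Text tagXml, Context context)
            throws IOException, InterruptedException {
        // parsing is cheap (about 200 ms per 100 tags); the loop of context.write
        // calls below is where nearly all of the 190 seconds per 100 tags goes
        String payload = tagXml.toString();          // stands in for the real parse step
        for (int i = 0; i < KEYS_PER_TAG; i++) {
            outKey.set("key-" + i);                  // stands in for the real scoring keys
            outValue.set(payload);
            context.write(outKey, outValue);
        }
    }
}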