Hello, I am trying to make my web crawling go faster with Hadoop. My mapper consists of just a few lines, and my reducer is an IdentityReducer:
    while read line
    do
        result="`wget -O - --timeout=500 http://$line 2>&1`"
        echo $result
    done

I am crawling about 50,000 sites, but my mapper always seems to time out after a while; I guess the crawler just becomes unresponsive at some point. I am not able to see which site is causing the problem, because the mapper's output is deleted when the job fails. I am currently running a single-node Hadoop cluster. Is that the problem? Has anyone else run into something similar? I am not sure why this is happening. Can I prevent the mapper from deleting its intermediate outputs? I tried running the mapper against 10-20 sites instead of 50,000, and that worked fine.

Thanks,
Aishwarya
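P.S. In case it makes the question clearer, here is a rough sketch of what I was planning to try next so I can at least see which URL hangs. It assumes wget's --tries/--connect-timeout flags and the Hadoop Streaming "reporter:status" stderr convention behave the way I think they do, so treat it as a guess rather than something I have verified:

    #!/bin/bash
    # Sketch of a mapper that emits one line per URL instead of raw page content,
    # so a hanging or failing site shows up in the output with its URL.
    while read line
    do
        # Tell the streaming framework the task is still alive while fetching.
        echo "reporter:status:fetching $line" >&2
        result="`wget -O - --tries=1 --connect-timeout=30 --timeout=60 http://$line 2>/dev/null`"
        status=$?
        # URL as the key, wget exit status and fetched byte count as the value.
        echo -e "$line\t$status\t${#result}"
    done

Would something along these lines help, or is the single-node setup itself the bottleneck?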