Hi Aishwarya
You don't necessarily need the intermediate output to debug this issue.
If there is an error or exception, you can get it directly from your job
logs. In your case the job becomes unresponsive, so to troubleshoot further
you can add log statements to your program, rerun it, and then find the
records that cause the problem in your logs.
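As a rough illustration, here is a minimal sketch of what that logging could
look like in a bash mapper. It assumes that whatever the mapper writes to
stderr ends up in the task's stderr log. The function name
crawl_with_logging, the FETCH override variable, and the shorter
--timeout=30 --tries=1 settings are my own assumptions, not something taken
from your script.

```shell
#!/usr/bin/env bash
# Sketch: log every site to stderr before and after fetching it, so the
# task log shows which site was in flight when the mapper stopped responding.
# FETCH is a hypothetical override hook so the loop can be tried offline.
FETCH=${FETCH:-"wget -q -O - --timeout=30 --tries=1"}

crawl_with_logging() {
  local line result
  while IFS= read -r line; do
    [ -z "$line" ] && continue            # skip blank input lines
    echo "START $line" >&2                # written before the fetch begins
    result=$($FETCH "http://$line" 2>&1)
    echo "END $line status=$?" >&2        # exit status of the fetch command
    # Emit one record per site: site, tab, page content flattened to one line.
    printf '%s\t%s\n' "$line" "$(printf '%s' "$result" | tr '\n\t' '  ')"
  done
}

# When used as the mapper, end the script with:
# crawl_with_logging
```

The last START line without a matching END line in the log is then the site
that the mapper was stuck on.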
You can get the logs directly from the JobTracker web UI at
http://<host>:50030/jobtracker.jsp. From your job, drill down to the task; on
the right side you will see options to display the task tracker logs.
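Once you have saved a task's stderr log to a file, a small helper like the
one below could pick out the suspect sites. This is only a sketch: it
assumes the mapper was instrumented to write "START <site>" before each
fetch and "END <site> ..." after it, which is a hypothetical convention, and
the function name unfinished_sites is likewise made up for illustration.

```shell
# Print every site that has a START line but no matching END line.
# Assumes log lines of the form "START <site>" and "END <site> ...".
unfinished_sites() {
  awk '$1 == "START" { pending[$2] = 1 }
       $1 == "END"   { delete pending[$2] }
       END           { for (site in pending) print site }' "$@"
}
```

For example, "unfinished_sites stderr.txt" would list the sites whose fetch
never completed in that task attempt.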
On top of this, I'd like to add: since you mentioned a single node, I
assume it is running in either standalone or pseudo-distributed mode. These
setups are basically for development and functional testing. If you are
looking for better performance from your jobs, you need to leverage the
parallel processing power of Hadoop. You need at least a mini cluster for
performance benchmarking and for processing relatively large volumes of data.
Hope it helps!
------Original Message------
From: Aishwarya Venkataraman
Sender: [email protected]
To: [email protected]
ReplyTo: [email protected]
Subject: Web crawler in hadoop - unresponsive after a while
Sent: Oct 14, 2011 08:20
Hello,
I am trying to make my web crawling go faster with Hadoop. My mapper just
consists of a single line and my reducer is an IdentityReducer:
while read -r line; do
  # Fetch the page; wget's own messages and errors are captured too (2>&1)
  result="$(wget -O - --timeout=500 "http://$line" 2>&1)"
  echo "$result"
done
I am crawling about 50,000 sites, but my mapper always seems to time out
after some time. I guess the crawler just becomes unresponsive.
I am not able to see which site is causing the problem, as the mapper
deletes the output if the job fails. I am currently running a single-node
Hadoop cluster.
Is this the problem?
Did anyone else have a similar problem? I am not sure why this is
happening. Can I prevent the mapper from deleting intermediate outputs?
I tried running the mapper against 10-20 sites as opposed to 50k sites and
that worked fine.
Thanks,
Aishwarya
Regards
Bejoy K S