Thanks for all the responses. I observed one more related issue.
Last month, before this post, I noticed that if Hadoop runs out of tmp space, Nutch goes into a very long loop in the reduce phase of the fetch. I don't have the exact string, but it basically says: on node A, available space = 4G, expected usage = 8G, then the same message for node B, C, D... and back to A. It probably retries a certain number of times before giving up. I haven't checked the code, but it felt like an infinite loop. In my case I killed the Nutch job using hadoop kill, and it looks like this leaves a massive amount of tmp space in use by Hadoop forever (a rough cleanup sketch is at the bottom of this mail). More details here:
http://www.nabble.com/intermediate-files-of-killed-tasks-not-purged-td23271289.html

ravi jagan wrote:
>
> Cluster Summary
>
> I am running a crawl on about 1 million web domains. After 30% of the map is
> done I see the following usage. The Non DFS usage seems very high, like 31G.
> This means Nutch is creating too many temporary files local to that node. Is
> this correct? Hoping someone will answer this post with at least an Ok/not Ok.
> This is the first crawl on the Hadoop cluster, with no other jobs running.
> DFS had 10G of data before this job started.
>
> 314 files and directories, 460 blocks = 774 total. Heap Size is 14.82 MB / 966.69 MB (1%)
> Configured Capacity : 377.91 GB
> DFS Used : 60.31 GB
> Non DFS Used : 31.58 GB
> DFS Remaining : 286.02 GB
> DFS Used% : 15.96 %
> DFS Remaining% : 75.69 %
> Live Nodes : 8
> Dead Nodes : 0
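For anyone who hits the same leftover-space problem, here is a minimal shell sketch of the kind of check/cleanup I ended up doing by hand. It assumes the usual TaskTracker layout where intermediate data sits under the directories listed in mapred.local.dir (taskTracker/jobcache/job_* on Hadoop 0.19/0.20); the LOCAL_DIR value and the job id are placeholders I made up, so check your own hadoop-site.xml and the "hadoop job -list" output before deleting anything.

    #!/bin/sh
    # Rough cleanup sketch -- LOCAL_DIR is a placeholder; use the real
    # mapred.local.dir entries from hadoop-site.xml on each node.
    LOCAL_DIR=/var/hadoop/mapred/local

    # 1. See what is actually eating the Non DFS space on this node.
    du -sk "$LOCAL_DIR"/* 2>/dev/null | sort -rn | head

    # 2. List the per-job intermediate directories that were left behind.
    ls -d "$LOCAL_DIR"/taskTracker/jobcache/job_* 2>/dev/null

    # 3. Confirm the job is really gone, then remove its leftovers
    #    (the job id below is made up).
    hadoop job -list
    # rm -rf "$LOCAL_DIR"/taskTracker/jobcache/job_200905121234_0001

If I remember right, restarting the TaskTracker also clears out its mapred.local.dir, which may be the simpler workaround, but I haven't verified that on this version.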
