Thanks for all the responses. I observed one more related issue.

About a month before this post, I noticed that if Hadoop runs out of tmp space,
Nutch goes into a very long loop in the reduce phase of the fetch job. I don't
have the exact message, but it basically says, for node A, available space = 4G,
expected usage = 8G, and then prints the same message for nodes B, C, D and so
on, and comes back to A. It probably retries a certain number of times before
giving up; I haven't checked the code, but it felt like an infinite loop.
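My guess (I have not read the scheduler code either) is that the job tracker is
skipping nodes whose mapred.local.dir has less free space than the estimated
reduce output, so if every node is short it just keeps cycling. A quick way to
sanity-check the slaves is something like the lines below; the path is only an
example, use whatever hadoop.tmp.dir / mapred.local.dir point to in your
hadoop-site.xml:

  # free space on the partition holding the tasktracker's local dir
  df -h /data/hadoop/tmp
  # how much of it the map/reduce intermediate files are taking
  du -sh /data/hadoop/tmp/mapred/local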

In my case I killed the Nutch job using hadoop kill. It looks like this leaves
a massive amount of tmp space in use by Hadoop forever.

More details here:
http://www.nabble.com/intermediate-files-of-killed-tasks-not-purged-td23271289.html
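Until that gets fixed, the leftover files can be cleaned up by hand. A rough
sketch of the kind of cleanup that should work; the path and job id below are
just examples, the jobcache layout may differ between Hadoop versions, and
nothing should be deleted while jobs are still running:

  # see which per-job directories the killed job left behind on a slave
  # (example path; adjust to your mapred.local.dir setting)
  du -sh /data/hadoop/tmp/mapred/local/taskTracker/jobcache/*
  # stop the tasktracker, then remove the directory of the dead job
  rm -rf /data/hadoop/tmp/mapred/local/taskTracker/jobcache/job_200905081234_0001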



ravi jagan wrote:
> 
> Cluster Summary
> 
> I am running a crawl on about 1 million web domains. After 30% of the map
> phase is done, I see the usage below. The Non DFS Used seems very high, at
> about 31 GB, which suggests Nutch is creating a lot of temporary files local
> to each node. Is this correct? Hoping someone will answer this post with at
> least an OK/not OK.
> This is the first crawl on this Hadoop cluster, no other jobs are running,
> and DFS held about 10 GB of data before this job started.
> 
> 314 files and directories, 460 blocks = 774 total. Heap Size is 14.82 MB /
> 966.69 MB (1%)
>   Configured Capacity :  377.91 GB
>   DFS Used            :   60.31 GB
>   Non DFS Used        :   31.58 GB
>   DFS Remaining       :  286.02 GB
>   DFS Used%           :   15.96 %
>   DFS Remaining%      :   75.69 %
>   Live Nodes          :   8
>   Dead Nodes          :   0
> 
> 
> 
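For what it is worth, the numbers in that summary are at least self-consistent:

  60.31 GB (DFS Used) + 31.58 GB (Non DFS Used) + 286.02 GB (DFS Remaining)
    = 377.91 GB = Configured Capacity

and 31.58 GB of non-DFS data spread over 8 live nodes is only about 4 GB of
local/temporary files per node.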

-- 
View this message in context: 
http://www.nabble.com/Nutch1.0-hadoop-dfs-usage-doesnt-seem-right-.-experience-users-please-comment-tp23454975p23488757.html
Sent from the Nutch - User mailing list archive at Nabble.com.
