As a side question, is it possible to control where the temporary files are stored? I recently ran into a full-filesystem situation, so I would like to move those files to a bigger partition.
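(I am assuming the relevant properties are hadoop.tmp.dir and mapred.local.dir; a minimal sketch of what I would add to hadoop-site.xml, with /data/hadoop as a hypothetical path on the bigger partition:

  <property>
    <name>hadoop.tmp.dir</name>
    <!-- base directory for Hadoop's local temporary files; hypothetical path -->
    <value>/data/hadoop/tmp</value>
  </property>
  <property>
    <name>mapred.local.dir</name>
    <!-- where intermediate map output is spilled; may be a comma-separated list of dirs -->
    <value>/data/hadoop/mapred/local</value>
  </property>

Please correct me if those are not the right knobs.)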
-Ray-

2009/5/11 Andrzej Bialecki <[email protected]>

> ravi jagan wrote:
>
>> Cluster Summary
>>
>> I am running a crawl on about 1 million web domains. After 30% of the map
>> phase is done, I see the following usage. The non-DFS usage seems very
>> high, around 31 GB, which would mean Nutch is creating many temporary
>> files local to that node. Is this correct? Hoping someone will answer
>> this post with at least an OK / not OK.
>>
>> This is the first crawl on this Hadoop cluster. No other jobs are
>> running. DFS held 10 GB of data before this job started.
>>
>> 314 files and directories, 460 blocks = 774 total.
>> Heap Size           : 14.82 MB / 966.69 MB (1%)
>> Configured Capacity : 377.91 GB
>> DFS Used            : 60.31 GB
>> Non DFS Used        : 31.58 GB
>> DFS Remaining       : 286.02 GB
>> DFS Used%           : 15.96 %
>> DFS Remaining%      : 75.69 %
>> Live Nodes          : 8
>> Dead Nodes          : 0
>>
> It depends on how much content you are downloading. The 31 GB that you see
> at 300,000 pages works out to roughly 100 kB per page, which is a
> relatively large number. Is this with parsing turned on or off? If you are
> crawling with an unlimited content size, it's possible that large
> documents (such as PDFs and Office documents) inflate this number.
>
> The issue of non-DFS vs. DFS usage - this is normal. Intermediate job
> output is stored locally, so only during the reduce phase will you start
> to see DFS usage climb and non-DFS usage decrease progressively as the
> reduce tasks complete their work.
>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
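(On Andrzej's point about unlimited content size: a minimal sketch of capping the downloaded size per document in nutch-site.xml, assuming the standard http.content.limit property; 65536 bytes is only an illustrative value:

  <property>
    <name>http.content.limit</name>
    <!-- maximum number of bytes to download per document; a negative value means no limit -->
    <value>65536</value>
  </property>

Content beyond the limit is truncated, so large PDFs and Office documents stop inflating the per-page footprint.)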
