As a side question, is it possible to control where temp files are stored? I
recently ran into a full-filesystem situation, so I wish to move those files
to a bigger partition.
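
I am guessing the relevant knobs are hadoop.tmp.dir and mapred.local.dir in
conf/hadoop-site.xml; a sketch, assuming 0.19-era property names (the paths
below are just placeholders for my bigger partition):

  <property>
    <name>hadoop.tmp.dir</name>
    <!-- base directory for Hadoop's temporary files -->
    <value>/bigdisk/hadoop-tmp</value>
  </property>

  <property>
    <name>mapred.local.dir</name>
    <!-- where intermediate map output is spilled locally -->
    <value>/bigdisk/mapred-local</value>
  </property>

Can someone confirm these are the right ones?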

-Ray-

2009/5/11 Andrzej Bialecki <[email protected]>

> ravi jagan wrote:
>
>> Cluster Summary
>>
>> I am running a crawl on about 1 million web domains. After 30% of the
>> map phase is done, I see the usage below. The non-DFS usage seems very
>> high, around 31 GB, which suggests Nutch is creating many temporary
>> files locally on each node. Is this correct? Hoping someone will answer
>> this post with at least an OK/not OK.
>> This is the first crawl on this Hadoop cluster, no other jobs are
>> running, and DFS held 10 GB of data before this job started.
>>
>> 314 files and directories, 460 blocks = 774 total.
>> Heap Size: 14.82 MB / 966.69 MB (1%)
>>  Configured Capacity :  377.91 GB
>>  DFS Used            :   60.31 GB
>>  Non DFS Used        :   31.58 GB
>>  DFS Remaining       :  286.02 GB
>>  DFS Used%           :   15.96 %
>>  DFS Remaining%      :   75.69 %
>>  Live Nodes          :   8
>>  Dead Nodes          :   0
>>
>>
> It depends on how much content you are downloading. The ~31 GB that you
> see at 30% of the map phase, i.e. roughly 300,000 pages out of 1 million,
> works out to about 100 kB per page, which is a relatively large number.
> Is this with parsing turned on or off? If you are crawling with no size
> limit, then it's possible that large documents (such as PDFs and Office
> documents) inflate this number.
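>
> If document size turns out to be the culprit, the stock http.content.limit
> property in nutch-site.xml caps how much of each document is downloaded; a
> minimal sketch (value in bytes; a negative value disables truncation):
>
>   <property>
>     <name>http.content.limit</name>
>     <!-- truncate each fetched document at 64 kB -->
>     <value>65536</value>
>   </property>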
>
> The issue of non-DFS vs. DFS usage - this is normal. Intermediate job
> output is stored on local disk, so only during the reduce phase will you
> start to see DFS usage climb and non-DFS usage decrease progressively as
> reduce tasks complete their work.
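>
> For example, since that intermediate output lands in mapred.local.dir, a
> common setup is a comma-separated list of directories on separate
> partitions so the spill load is spread out (hypothetical paths):
>
>   <property>
>     <name>mapred.local.dir</name>
>     <!-- intermediate output is striped across these directories -->
>     <value>/disk1/mapred-local,/disk2/mapred-local</value>
>   </property>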
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
