The db update works by sorting the updates and then merging them into a new copy of the database. It therefore requires space proportional to roughly twice the number of updates (typically dominated by the number of outgoing links found in the crawl) while sorting, and twice the final database size while merging. You will use less disk space if you perform smaller updates, with fewer and/or smaller segments, but your updates will then take longer overall.
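
As a rough back-of-the-envelope sketch of that trade-off: the function names and the sizes below are made up for illustration and are not part of Nutch. The sketch also assumes the two phases do not overlap on disk and that the final database size stays the same regardless of batch size, which is a simplification.

```python
def phase_space(update_size, final_db_size):
    """Return (sort_space, merge_space) in the same unit as the inputs.

    Assumption: sorting needs about twice the size of the updates, and
    merging needs about twice the size of the final database.
    """
    sort_space = 2 * update_size
    merge_space = 2 * final_db_size
    return sort_space, merge_space


def peak_space(update_size, final_db_size):
    """Peak disk space for one update, assuming the phases do not overlap."""
    return max(phase_space(update_size, final_db_size))


# Hypothetical sizes in GB: one large update versus one of four smaller ones.
one_big_update = peak_space(update_size=20, final_db_size=6)
one_small_update = peak_space(update_size=5, final_db_size=6)

print("peak for one big update:", one_big_update, "GB")
print("peak for each small update:", one_small_update, "GB")
```

With smaller updates the sort phase shrinks, but every update still pays for a full merge into a new copy of the database, which is why the total time goes up.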

The web database will eventually remove URLs that cannot be fetched, and it also removes pages that are not linked to by other pages.

Doug

Matthias Jaekle wrote:
Hi,

analyzing the webdb seems to require a lot of free disk space on my system. It uses 6 times as much disk space as the webdb itself occupies.

I am running just a small Nutch system with an 80 GB hard disk. On it I have around 25 GB of segments, a 3 GB index and a 6 GB webdb. Together with the OS and the 30 GB I have to keep free for analyzing the webdb, the disk is full.

Is there any way to reduce the amount of space I have to keep free, or am I doing something wrong?

Is the webdb an always-growing system, or is it useful and possible to delete unimportant URLs?

Many thanks for your answers.

Matthias Jaekle


-------------------------------------------------------
This SF.Net email is sponsored by: Oracle 10g
Get certified on the hottest thing ever to hit the market... Oracle 10g. Take an Oracle 10g class now, and we'll give you the exam FREE.
http://ads.osdn.com/?ad_id=3149&alloc_id=8166&op=click
_______________________________________________
Nutch-developers mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-developers

