Hey again,

When you split a massive crawl into segments, is there any faster way to
update the database?  Right now it takes about 6 hours to update about
600,000 pages.  I did a crawl of 5 million pages and split it into several
segments, but the biggest issue is processing time.  If I had several
machines update the database over NFS, would that mess up the database,
or would it even work?

Also, on recrawls: once our allotted amount of time has passed and the
crawler has gone out and refetched a newer copy of a page, do I delete the
segments the page was originally in?  What is the best way to make sure I
don't keep a ton of old data that isn't even being used?  Can you explain
how recrawls work?

And last but not least, is it OK to index a segment even though the
database hasn't been updated from it yet?  I am trying to get the time for
each step down, and I hope several machines can share the load of indexing
and updating the db.

Thanks again for the awesome help.

I will buy the book :)

J

_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general