Hey again,

When you split a crawl into segments and do massive crawls, is there a faster way to update the database? Right now it takes about 6 hours to update about 600,000 pages. I did a crawl of 5 million pages and split it into several segments, but the biggest issue is processing time. If I put the database on NFS and had several machines update it, would that corrupt the database, or would it work at all?
Also, on recrawls: once the allotted time has passed and the crawler has gone out and refetched a newer copy of a page, do I delete the segments the pages were originally in? What is the best way to make sure I don't keep a ton of old data that isn't being used anymore? Can you explain how recrawls work?

And last but not least, is it OK to index a segment even though the database hasn't been updated from it yet? I am trying to get the amount of time for each section down, and I hope several machines can share the load of indexing and updating the db.

Thanks again for the awesome help. I will buy the book :)

J

_______________________________________________
Nutch-general mailing list
[EMAIL PROTECTED]
https://lists.sourceforge.net/lists/listinfo/nutch-general
