How does Nutch deal with calendars and similar sites where the pages go on endlessly? Does Nutch have an option to limit how deep the crawl goes on a particular domain?
Nutch does not yet explicitly deal with this, but I don't think it should be a problem. Typically one performs periodic link analysis on the database and then generates fetchlists containing only the pages with the highest link-analysis scores (e.g., 'bin/nutch generate db segments -topN 100000' to generate a fetchlist of the top-scoring 100,000 pages in the db). Calendar pages are unlikely to have high link-analysis scores, so they should not be fetched. Even if the first calendar page (e.g., the current month) has a high score, the score will diminish with each link to a subsequent month and eventually be too low to trigger fetching.
However, such pages could be a problem if one does not use the -topN option when generating fetchlists.
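To illustrate the reasoning above, here is a minimal Python sketch. It is not Nutch code and does not reproduce Nutch's link analysis: the fixed per-link decay factor, the example URLs, and the helper names chain_scores and top_n are all assumptions made up for illustration. It only shows why a top-N cutoff stops a crawler from walking an endless calendar chain.

```python
import heapq


def chain_scores(first_score, decay, length):
    """Scores for a chain of pages where each page links only to the next.

    Assumption: each hop multiplies the score by a fixed decay factor.
    Real link-analysis scoring is more involved than this.
    """
    return [first_score * decay ** i for i in range(length)]


def top_n(db, n):
    """Return the n highest-scoring (url, score) pairs from a url -> score map."""
    return heapq.nlargest(n, db.items(), key=lambda item: item[1])


# Hypothetical page database: five ordinary pages with a moderate score.
db = {f"http://example.com/article/{i}": 1.0 for i in range(5)}

# A calendar whose first month is well linked, followed by a long chain of
# months that are each reachable only from the previous month.
for month, score in enumerate(chain_scores(first_score=1.5, decay=0.5, length=12)):
    db[f"http://example.com/calendar?month={month}"] = score

# Analogous in spirit to generating a fetchlist with a top-N cutoff.
fetchlist = top_n(db, 6)
for url, score in fetchlist:
    print(f"{score:.3f}  {url}")
```

With these made-up numbers, only the first calendar month makes it into the fetchlist; the later months score below every ordinary page and are cut off. Without the cutoff (n equal to the size of the database), every calendar page would be selected.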
Doug
Nutch-general mailing list https://lists.sourceforge.net/lists/listinfo/nutch-general
