Sami Siren wrote:
> Andrzej Bialecki (JIRA) wrote:
>
>> [ http://issues.apache.org/jira/browse/NUTCH-293?page=comments#action_12422244 ]
>>
>> Andrzej Bialecki commented on NUTCH-293:
>> -----------------------------------------
>>
>> I'm working on this patch to commit it. Just a quick note to Sami:
>> Math.max() is not optimal, because it always picks up the longest
>> wait period. We are interested in getting a right period - it may be
>> longer, but it may also be shorter than the serverDelay. If it's
>> shorter then we win, because we are allowed to crawl this site faster.
>>
> I guess it depends on the angle you look at it :)
> "don't be polite, just as polite as it's required"
>
> I'm ok with the original logic.

Hmm. Let me try another explanation. When you crawl, you _are_ interested in getting all pages as quickly as possible, right? So you want to observe the minimum level of "politeness" per site, as specified by webmasters and netiquette, not the maximum level. If a site allows you to crawl it with a 5 sec. delay, you are not being impolite by doing so, even if you apply a 20 sec. delay to all other sites, and you will reach your goal much more quickly.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
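
To make the difference concrete, here is a minimal sketch of the two policies using the 20 sec. / 5 sec. figures from the example above. It is an illustration only, not the code from the NUTCH-293 patch: the class name, the method names, and the use of -1 to mean "the site did not specify a delay" are all assumptions made for this sketch.

```java
// Hypothetical illustration; none of these names come from the Nutch source.
final class PolitenessDelay {
    /** Assumed marker meaning the site did not specify its own delay. */
    static final long UNSPECIFIED = -1L;

    /** Math.max() policy: always waits the longer of the two periods. */
    static long maxDelay(long serverDelayMs, long siteDelayMs) {
        return Math.max(serverDelayMs, siteDelayMs);
    }

    /**
     * Site-aware policy: use the delay the site asked for when there is one,
     * whether it is longer or shorter than the default; otherwise fall back
     * to the default serverDelay.
     */
    static long siteAwareDelay(long serverDelayMs, long siteDelayMs) {
        return siteDelayMs >= 0 ? siteDelayMs : serverDelayMs;
    }

    public static void main(String[] args) {
        long serverDelayMs = 20_000L; // default applied to all other sites
        long siteDelayMs = 5_000L;    // what this particular site allows

        System.out.println("Math.max policy:   " + maxDelay(serverDelayMs, siteDelayMs) + " ms");
        System.out.println("site-aware policy: " + siteAwareDelay(serverDelayMs, siteDelayMs) + " ms");
    }
}
```

With these numbers the Math.max() policy still waits 20 sec. on a site that only asks for 5, while the site-aware policy waits 5 sec. there and would still wait longer than the default on a site that asks for more.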
