[ https://issues.apache.org/jira/browse/NUTCH-381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12482266 ]
Andrzej Bialecki commented on NUTCH-381:
-----------------------------------------

Your last comment confirms my suspicions. After analysis of the code in Fetcher I can confirm that this is indeed the effect of handling redirects immediately - Fetcher doesn't check whether the URLs we redirect to belong to the same host.

The solution is to disable immediate redirects (set http.redirect.max to 0 in your configuration).

> Ignore external link not work as expected
> -----------------------------------------
>
>                 Key: NUTCH-381
>                 URL: https://issues.apache.org/jira/browse/NUTCH-381
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8.1
>            Reporter: Uros Gruber
>            Priority: Critical
>
> Currently there is no way to properly limit the fetcher without regexp
> rules. We use the ignore.external.link option, but it seems that it doesn't
> work in all cases.
> Here are example URLs I'm seeing, but
> cat urls1 urls2 urls3 urls/urls |grep yahoo.com doesn't return any hit.
> fetching http://help.yahoo.com/help/sports
> fetching http://www.turkish-xxx.com/adult-traffic-trade.php
> fetching http://help.yahoo.com/help/us/astr/
> fetching http://www.polish-xxx.com/de-index.html
> fetching http://www.driversplanet.com/Articles/Software/SpareBackup2.4.aspx
> fetching http://help.yahoo.com/help/groups
> fetching http://help.yahoo.com/help/fin/
> fetching
> http://www.driversplanet.com/Articles/Software/WindowsStorageServer2003R2.aspx
> fetching http://help.yahoo.com/help/us/edit/
> fetching http://www.polish-xxx.com/es-index.html
> Has anyone else noticed this?
> I assume it has something to do with expired domains where pages are
> generated randomly. But still, why were URLs from another domain added? Maybe
> urlregexp filter +* exclude.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
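
For reference, the workaround suggested in the comment above might look roughly like the following in a site configuration file. This is only a sketch: the property name http.redirect.max and the value 0 come from the comment itself, while the file location (conf/nutch-site.xml) and the wording of the description are assumptions and may differ in your installation.

```xml
<?xml version="1.0"?>
<!-- Sketch of a site-level override, assumed to live in conf/nutch-site.xml -->
<configuration>
  <property>
    <name>http.redirect.max</name>
    <!-- 0 disables immediate redirects, as suggested in the comment -->
    <value>0</value>
    <!-- Description text is illustrative, not copied from Nutch -->
    <description>Do not follow redirects immediately during fetching.</description>
  </property>
</configuration>
```

With immediate redirects disabled, the fetcher should no longer jump straight to a redirect target on another host. How redirect targets are handled afterwards depends on the Nutch version, so it is worth confirming the behaviour with a small test crawl.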
_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers