[ http://issues.apache.org/jira/browse/NUTCH-381?page=comments#action_12440453 ]

Uros Gruber commented on NUTCH-381:
-----------------------------------

I tried to find out what happened from the logs, but because of the threads I couldn't find any connection. I also tried with linksdb. For example, I searched for www.polish-xxx.com but found only a fromUrl link, which is strange. If I understand this correctly, in this case no URL points to this URL.

I have the linksdb gzipped at 15 MB. I can send it to you somewhere or place it on our server if that's any help.

As for -noAdditions, I'm too late. I already ran updatedb with those links.

> Ignore external link not work as expected
> -----------------------------------------
>
>                 Key: NUTCH-381
>                 URL: http://issues.apache.org/jira/browse/NUTCH-381
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8.1
>            Reporter: Uros Gruber
>            Priority: Critical
>
> Currently there is no way to properly limit the fetcher without regexp rules, so we
> use the ignore.external.link option, but it seems that it doesn't work in all
> cases.
> Here are example URLs I'm seeing, but
> cat urls1 urls2 urls3 urls/urls |grep yahoo.com doesn't return any hit.
> fetching http://help.yahoo.com/help/sports
> fetching http://www.turkish-xxx.com/adult-traffic-trade.php
> fetching http://help.yahoo.com/help/us/astr/
> fetching http://www.polish-xxx.com/de-index.html
> fetching http://www.driversplanet.com/Articles/Software/SpareBackup2.4.aspx
> fetching http://help.yahoo.com/help/groups
> fetching http://help.yahoo.com/help/fin/
> fetching http://www.driversplanet.com/Articles/Software/WindowsStorageServer2003R2.aspx
> fetching http://help.yahoo.com/help/us/edit/
> fetching http://www.polish-xxx.com/es-index.html
> Anyone notice this?
> I assume that there must be something with expired domains where pages
> are generated randomly. But still, why were URLs from another domain added? Maybe
> urlregexp filter +* exclude.

--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira
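
For anyone who wants to repeat the check from the report (comparing the "fetching ..." log lines against the seed lists) on a larger log, here is a minimal Python sketch. It is only an illustration and not part of Nutch: the function names `hosts_of` and `external_fetches` are made up, the seed URL in the example is hypothetical because the real contents of urls1, urls2, and so on are not shown, and it assumes "external" simply means a hostname that does not appear in any seed URL. It does roughly what the `cat ... | grep yahoo.com` check does, but for every host at once.

```python
from urllib.parse import urlparse


def hosts_of(urls):
    """Return the set of lowercase hostnames found in an iterable of URLs."""
    hosts = set()
    for url in urls:
        host = urlparse(url.strip()).hostname
        if host:
            hosts.add(host)
    return hosts


def external_fetches(seed_urls, log_lines):
    """Return fetched URLs whose host does not appear among the seed hosts.

    Assumption: a fetch is logged as a line starting with "fetching ",
    as in the log excerpt quoted in the report.
    """
    seeds = hosts_of(seed_urls)
    external = []
    for line in log_lines:
        line = line.strip()
        if not line.startswith("fetching "):
            continue
        url = line[len("fetching "):].strip()
        host = urlparse(url).hostname
        if host and host not in seeds:
            external.append(url)
    return external


if __name__ == "__main__":
    # Hypothetical seed list; the real seed files are not shown in the report.
    seeds = ["http://www.example.org/"]
    log = [
        "fetching http://help.yahoo.com/help/sports",
        "fetching http://www.example.org/about.html",
    ]
    for url in external_fetches(seeds, log):
        print(url)
```

Note that this compares exact hostnames against the seed list only. It says nothing about which page each link came from, so it can show that off-seed hosts were fetched but not why they were added.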
