[ 
https://issues.apache.org/jira/browse/NUTCH-353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12466284
 ] 

Doug Cook commented on NUTCH-353:
---------------------------------

I have a local fix for this problem (partly Paul Gauthier's work, partly mine) 
that I have been testing for some time. It's a little bit of a hack, but it's 
much better than just indexing the redirect target (which is the wrong behavior 
in many instances; see comments earlier).

The fix is to index the page under both URLs, the redirect source and the 
redirect target, while assigning the page's outlinks only to the target. This 
way, in the (frequent) case that the redirect *source* is the canonical version 
of the page, with more anchor text, it will show up for searches. The fix seems 
to work pretty well, and solves a significant percentage of Nutch's "missing 
home pages" problem without using much extra space in the index. If it sounds 
useful to anyone, I'm happy to contribute it back.
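
To illustrate the idea, here is a minimal sketch. It is not the patch itself: 
the class, field, and method names below are made up for this example and are 
not Nutch APIs, and the real change would live in the fetch/index code path. 
It only shows the rule described above: one index entry per URL, with the 
outlinks attached to the target alone.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical sketch; none of these types are Nutch classes. */
class RedirectIndexingSketch {

    /** One entry to be written to the index (assumed shape). */
    static final class IndexEntry {
        final String url;
        final String content;
        final List<String> outlinks;

        IndexEntry(String url, String content, List<String> outlinks) {
            this.url = url;
            this.content = content;
            this.outlinks = outlinks;
        }
    }

    /**
     * Builds the index entries for one fetched page.
     *
     * @param sourceUrl the URL that was requested
     * @param targetUrl the URL the content was finally served from
     *                  (equal to sourceUrl when there was no redirect)
     */
    static List<IndexEntry> entriesFor(String sourceUrl, String targetUrl,
                                       String content, List<String> outlinks) {
        List<IndexEntry> entries = new ArrayList<>();

        // The target always gets an entry and owns all of the outlinks.
        entries.add(new IndexEntry(targetUrl, content, new ArrayList<>(outlinks)));

        if (!sourceUrl.equals(targetUrl)) {
            // The redirect source is indexed too, so it can match searches via
            // its own anchor text, but it carries no outlinks of its own.
            entries.add(new IndexEntry(sourceUrl, content,
                                       Collections.<String>emptyList()));
        }
        return entries;
    }
}
```

For a fetch of http://example.com/ that is redirected to 
http://example.com/index.html, entriesFor(...) returns two entries, and only 
the index.html entry lists the page's outlinks. With no redirect it returns a 
single entry.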

Doug

> pages that serverside forwards will be refetched every time
> -----------------------------------------------------------
>
>                 Key: NUTCH-353
>                 URL: https://issues.apache.org/jira/browse/NUTCH-353
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Stefan Groschupf
>         Assigned To: Andrzej Bialecki 
>            Priority: Blocker
>             Fix For: 0.9.0
>
>         Attachments: doNotRefecthForwarderPagesV1.patch
>
>
> Pages that do a server-side forward are not written back into the crawlDb 
> with a status change, and their nextFetchTime is not updated. 
> This causes the same page to be refetched again and again. As a result, Nutch 
> is not polite: it refetches both the forwarding page and the target page in 
> each segment iteration. It also affects scoring, since the forwarding page 
> contributes its score to all outlinks.

