[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439248 ]
            
Doug Cook commented on NUTCH-353:
---------------------------------

This is definitely a complex issue. It is also high priority -- issues with 
redirects and duplicates, which URL is chosen, and what happens to the anchor 
text for the pages involved are causing significant relevance issues.
A few observations:

(1) A redirect *target* is not always the canonical version of a URL. For 
example, it is very common for root-level pages to redirect to an internal home 
page (some 30% of the root pages in my index do so). However, the root pages 
have all the anchor text and are truly the canonical, permanent version of the 
page; the internal redirect target is just the "temporary" homepage, and could 
change at any time depending on the site implementation. Here are some examples:
    http://www.landwirtschaft-bw.info/
    http://www.dlr-rnh.rlp.de/
    http://www.niederoesterreich.at/
Because of the current policy of "discarding" the redirect source, I lose 30% 
of the home pages in my index, which makes my relevance very poor for 
navigational queries.

In this case, we would likely want to mark the internal redirect target as an 
alias as Andrzej suggests, and automatically transfer any link information to 
the root page.
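
A minimal sketch of that behaviour, assuming a simple in-memory alias table. 
The class AliasTable and its methods are hypothetical names used only for 
illustration; they are not part of Nutch's CrawlDb or LinkDb, and the 
example.com URLs are placeholders:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical helper, not a Nutch class: maps alias URLs to a canonical URL
// and keeps all anchor text under the canonical URL.
class AliasTable {
    private final Map<String, String> aliasToCanonical = new HashMap<>();
    private final Map<String, List<String>> anchors = new HashMap<>();

    // Follow alias links until a URL with no further alias is reached.
    String resolve(String url) {
        String current = url;
        Set<String> seen = new HashSet<>();
        while (aliasToCanonical.containsKey(current) && seen.add(current)) {
            current = aliasToCanonical.get(current);
        }
        return current;
    }

    // Mark 'alias' as an alias of 'canonical' and move any anchor text
    // already collected for the alias over to the canonical URL.
    void markAlias(String alias, String canonical) {
        String root = resolve(canonical);
        if (root.equals(alias)) {
            return; // would create a cycle; keep the existing mapping
        }
        aliasToCanonical.put(alias, root);
        List<String> moved = anchors.remove(alias);
        if (moved != null) {
            anchors.computeIfAbsent(root, k -> new ArrayList<>()).addAll(moved);
        }
    }

    // Record anchor text; text pointing at an alias is credited to the canonical URL.
    void addAnchor(String targetUrl, String anchorText) {
        anchors.computeIfAbsent(resolve(targetUrl), k -> new ArrayList<>()).add(anchorText);
    }

    List<String> anchorsFor(String url) {
        return anchors.getOrDefault(resolve(url), Collections.emptyList());
    }
}
```

With this table, marking http://www.example.com/home/index.php as an alias of 
http://www.example.com/ means later links to either URL add their anchor text 
to the root page, and a fetcher could skip any URL for which resolve() returns 
a different URL.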

(2) There may be other cases where we want to alias two pages, either to avoid 
recrawling them, or to merge anchor text. Suppose we crawl both 
     http://www.x.com/
and
     http://www.x.com/index.html
and these are the same document.

Right now we will always crawl both of these, and the dedup algorithm will pick 
one (sadly, often the /index.html version, due to strange score anomalies) and 
throw out the anchor text for the other. We can't safely normalize these two 
URLs to be the same before seeing the content. Once we see that the signatures 
match, however, we can and should merge them: the /index.html version is marked 
as an alias of the / version, and future crawls skip the /index.html version and 
transfer its link information to the / page.
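
The signature-based merge could be sketched roughly as follows. This is an 
illustration only: SignatureAliaser is a hypothetical class, the signatures are 
plain strings rather than Nutch's signature bytes, and "shortest path wins" is 
an assumed tie-breaking policy, not something Nutch's dedup does today. It also 
ignores the question of duplicates across different hosts:

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collection;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not a Nutch class.
class SignatureAliaser {

    // Length of the URL path; "/" is shorter than "/index.html".
    static int pathLength(String url) {
        String path = URI.create(url).getPath();
        return path == null ? 0 : path.length();
    }

    // Assumed policy: prefer the shortest path, then lexicographic order.
    static String pickCanonical(Collection<String> urls) {
        return urls.stream()
                .min(Comparator.comparingInt((String u) -> pathLength(u))
                        .thenComparing(Comparator.naturalOrder()))
                .orElseThrow(() -> new IllegalArgumentException("no URLs given"));
    }

    // Input: URL -> content signature. Output: alias URL -> canonical URL
    // for every group of URLs that share a signature.
    static Map<String, String> buildAliases(Map<String, String> signatureByUrl) {
        Map<String, List<String>> urlsBySignature = new HashMap<>();
        for (Map.Entry<String, String> e : signatureByUrl.entrySet()) {
            urlsBySignature.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        Map<String, String> aliases = new TreeMap<>();
        for (List<String> group : urlsBySignature.values()) {
            if (group.size() < 2) {
                continue; // unique content, nothing to merge
            }
            String canonical = pickCanonical(group);
            for (String url : group) {
                if (!url.equals(canonical)) {
                    aliases.put(url, canonical);
                }
            }
        }
        return aliases;
    }
}
```

Given http://www.x.com/ and http://www.x.com/index.html with the same 
signature, buildAliases() maps the /index.html URL to the / URL, so the root 
page is the one that survives.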

This problem, like the first one, is causing me to lose root-level URLs along 
with their anchor text, further affecting relevance for navigational queries.

In short, I agree with Andrzej that we need a way to mark a URL as an alias of 
another, to avoid recrawl, and to merge link information. We need to be 
careful, however, of *which* URL we pick. It is not always the redirect target 
that should win. And some of our current concept of "duplicates" should also be 
subsumed under the new notion of "alias."
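
One possible heuristic for choosing the winner of a redirect pair, sketched 
under the assumption that a root URL redirecting to a deeper page on the same 
host should stay canonical. RedirectCanonicalChooser is a hypothetical name, 
and a real policy would likely need more signals (inlink counts, redirect 
type, and so on):

```java
import java.net.URI;

// Hypothetical helper, not a Nutch class.
class RedirectCanonicalChooser {

    // True for URLs like http://host/ or http://host with no query string.
    static boolean isRoot(URI uri) {
        String path = uri.getPath();
        boolean emptyPath = path == null || path.isEmpty() || path.equals("/");
        return emptyPath && uri.getQuery() == null;
    }

    // Returns the URL that should be treated as canonical for a redirect
    // from 'source' to 'target'. Assumed rule: a root page that redirects
    // to an internal page on the same host keeps the canonical role;
    // otherwise the redirect target wins.
    static String choose(String source, String target) {
        URI s = URI.create(source);
        URI t = URI.create(target);
        boolean sameHost = s.getHost() != null && s.getHost().equalsIgnoreCase(t.getHost());
        if (sameHost && isRoot(s) && !isRoot(t)) {
            return source;
        }
        return target;
    }
}
```

Under this rule a redirect from http://www.example.com/ to an internal home 
page keeps the root URL, while a redirect between two non-root pages, or to a 
different host, still resolves to the target.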

I'm happy to help out in any way with a fix. I'm just looking at hacking 
together something in my own environment because the problems are affecting me 
so severely, but as I'm new-ish to Nutch, what I come up with might not be as 
elegant or flexible as what others might envision...

> pages that serverside forwards will be refetched every time
> -----------------------------------------------------------
>
>                 Key: NUTCH-353
>                 URL: http://issues.apache.org/jira/browse/NUTCH-353
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 0.8.1, 0.9.0
>            Reporter: Stefan Groschupf
>         Assigned To: Andrzej Bialecki 
>            Priority: Blocker
>             Fix For: 0.9.0
>
>         Attachments: doNotRefecthForwarderPagesV1.patch
>
>
> Pages that do a server-side forward are not written back into the crawlDb 
> with a status change. The nextFetchTime is not changed either. 
> This causes the same page to be refetched again and again. As a result, Nutch 
> is not polite: it refetches the forwarding page and its target in each 
> segment iteration. It also affects the scoring, since the forwarding page 
> contributes its score to all outlinks.
