[ http://issues.apache.org/jira/browse/NUTCH-353?page=comments#action_12439248 ]
Doug Cook commented on NUTCH-353:
---------------------------------
This is definitely a complex issue. It is also high priority -- issues with
redirects and duplicates, which URL is chosen, and what happens to the anchor
text for the pages involved are causing significant relevance issues.
A few observations:
(1) A redirect *target* is not always the canonical version of a URL. For
example, it is very common for root-level pages to redirect to an internal home
page (some 30% of the root pages in my index do so). However, the root pages
have all the anchor text and are truly the canonical, permanent version of the
page; the internal redirect target is just the "temporary" homepage, and could
change at any time depending on the site implementation. Here are some examples:
http://www.landwirtschaft-bw.info/
http://www.dlr-rnh.rlp.de/
http://www.niederoesterreich.at/
Because of the current policy of "discarding" the redirect source, I lose 30%
of the home pages in my index, which makes my relevance very poor for
navigational queries.
In this case, we would likely want to mark the internal redirect target as an
alias as Andrzej suggests, and automatically transfer any link information to
the root page.
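To make the aliasing idea in (1) concrete, here is a minimal Java sketch. None of this is existing Nutch code: AliasTable, recordRedirect, resolve, addAnchor and anchorsFor are hypothetical names, and the rule "a root-level page that redirects within its own host stays canonical" is only my assumption about a reasonable policy. The example URL http://www.x.com/home/start.html is made up as well.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical in-memory alias table; an illustration, not Nutch's CrawlDb. */
final class AliasTable {
    private final Map<String, String> aliasToCanonical = new HashMap<>();
    private final Map<String, List<String>> anchors = new HashMap<>();

    /** True for URLs such as http://host/ : empty or "/" path and no query. */
    static boolean isRootPage(String url) {
        URI u = URI.create(url);
        String path = u.getPath();
        boolean rootPath = path == null || path.isEmpty() || path.equals("/");
        return rootPath && u.getQuery() == null;
    }

    static boolean sameHost(String a, String b) {
        String ha = URI.create(a).getHost();
        String hb = URI.create(b).getHost();
        return ha != null && ha.equalsIgnoreCase(hb);
    }

    /** Records a redirect and returns the URL chosen as canonical. */
    String recordRedirect(String source, String target) {
        // Assumed policy: a root page redirecting inside its own host wins;
        // otherwise the redirect target wins, as it does today.
        boolean sourceWins = isRootPage(source) && sameHost(source, target);
        String canonical = resolve(sourceWins ? source : target);
        String alias = sourceWins ? target : source;
        if (!canonical.equals(alias)) {
            aliasToCanonical.put(alias, canonical);
            // Transfer anchor text already collected for the alias.
            List<String> moved = anchors.remove(alias);
            if (moved != null) {
                anchors.computeIfAbsent(canonical, k -> new ArrayList<>()).addAll(moved);
            }
        }
        return canonical;
    }

    /** Follows alias links to the canonical URL; the bound guards against cycles. */
    String resolve(String url) {
        String current = url;
        for (int i = 0; i < 16; i++) {
            String next = aliasToCanonical.get(current);
            if (next == null) {
                return current;
            }
            current = next;
        }
        return current;
    }

    /** Stores anchor text under the canonical form of the link target. */
    void addAnchor(String toUrl, String anchorText) {
        anchors.computeIfAbsent(resolve(toUrl), k -> new ArrayList<>()).add(anchorText);
    }

    List<String> anchorsFor(String url) {
        return anchors.getOrDefault(resolve(url), new ArrayList<>());
    }
}
```

With a table like this, a fetch-list generator could skip any URL for which resolve(url) differs from url, and anchor text pointing at the internal homepage would end up on the root page.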
(2) There may be other cases where we want to alias two pages, either to avoid
recrawling them, or to merge anchor text. Suppose we crawl both
http://www.x.com/
and
http://www.x.com/index.html
and these are the same document.
Right now we will always crawl both of these, and the dedup algorithm will pick
one (sadly often the /index.html version due to strange score anomalies), and
throw out the anchor text for the other. While we can't safely normalize these
two URLs to be the same in advance of seeing the content, once we see that the
signatures are the same, we can, and should, merge them so that the index.html
version is marked as an alias of the / version, and future crawls simply skip
crawling the /index.html version and transfer its link information to the /
page.
This problem, like the first one, is causing me to lose root-level URLs along
with their anchor text, further affecting relevance for navigational queries.
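The signature-based merge in (2) could look roughly like the Java sketch below. Again, this is an illustration under my own assumptions, not Nutch's dedup code: SignatureAliaser and buildAliases are hypothetical names, signatures are treated as opaque strings, and "shortest URL wins, ties broken alphabetically" is just one simple way to make / beat /index.html.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical helper that turns signature duplicates into aliases. */
final class SignatureAliaser {

    /** Assumed preference: shorter URL first, then lexicographic order. */
    static final Comparator<String> CANONICAL_ORDER = (a, b) -> {
        int byLength = Integer.compare(a.length(), b.length());
        return byLength != 0 ? byLength : a.compareTo(b);
    };

    /**
     * Takes a map of URL to content signature and returns a map of
     * alias URL to canonical URL for every group sharing a signature.
     */
    static Map<String, String> buildAliases(Map<String, String> urlToSignature) {
        // Group URLs by the signature of their fetched content.
        Map<String, List<String>> bySignature = new HashMap<>();
        for (Map.Entry<String, String> e : urlToSignature.entrySet()) {
            bySignature.computeIfAbsent(e.getValue(), k -> new ArrayList<>()).add(e.getKey());
        }
        // In each group, pick one canonical URL and alias the rest to it.
        Map<String, String> aliases = new HashMap<>();
        for (List<String> group : bySignature.values()) {
            String canonical = Collections.min(group, CANONICAL_ORDER);
            for (String url : group) {
                if (!url.equals(canonical)) {
                    aliases.put(url, canonical);
                }
            }
        }
        return aliases;
    }
}
```

The keys of the returned map are the URLs a later crawl could skip; their link information would be transferred to the corresponding values. A real policy would probably also want to look at score and inlink counts rather than URL length alone.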
In short, I agree with Andrzej that we need a way to mark a URL as an alias of
another, to avoid recrawl, and to merge link information. We need to be
careful, however, of *which* URL we pick. It is not always the redirect target
that should win. And some of our current concept of "duplicates" should also be
subsumed under the new notion of "alias."
I'm happy to help out in any way with a fix. I'm just looking at hacking
together something in my own environment because the problems are affecting me
so severely, but as I'm new-ish to Nutch, what I come up with might not be as
elegant or flexible as what others might envision...
> pages that serverside forwards will be refetched every time
> -----------------------------------------------------------
>
> Key: NUTCH-353
> URL: http://issues.apache.org/jira/browse/NUTCH-353
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 0.8.1, 0.9.0
> Reporter: Stefan Groschupf
> Assigned To: Andrzej Bialecki
> Priority: Blocker
> Fix For: 0.9.0
>
> Attachments: doNotRefecthForwarderPagesV1.patch
>
>
> Pages that do a server-side forward are not written back into the crawlDb
> with a status change, and their nextFetchTime is not updated.
> This causes the same page to be refetched again and again. As a result, Nutch
> is not polite: it refetches the forwarding page and the target page in each
> segment iteration. It also affects scoring, since the forwarding page
> contributes its score to all outlinks.