Doğacan Güney wrote:
If the same content is available under multiple urls, I think it makes
sense to assume that the url with the highest score should be 'the
representative' url.
Not necessarily - it depends on how you define your score.
http://www.ibm.com/ may actually have a low score, because it
immediately redirects to http://www.ibm.com/index.html (actually, it
redirects to http://www.ibm.com/us/index.html).
Also, "the shortest url wins" rule is not always true. Let's say I own a
domain a.biz, and I made a Wikipedia mirror there. Which of the pages is
more representative: http://a.biz/About_Wikipedia or
http://www.wikipedia.org/en/About_Wikipedia ?
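If we do end up needing such a rule, it would probably have to combine
several weak signals rather than rely on any single one. A rough sketch of
what I mean (plain Java, not a Nutch API; the signals and their ordering
below are arbitrary and only illustrative):

  import java.util.Comparator;

  /**
   * Illustrative only: picks a "representative" url among duplicates by
   * combining several weak signals instead of relying on a single rule.
   */
  public class RepresentativeUrlComparator implements Comparator<UrlCandidate> {
    public int compare(UrlCandidate a, UrlCandidate b) {
      // 1. Prefer a url that is not itself just a redirect source
      //    (e.g. http://www.ibm.com/ redirecting to /us/index.html).
      if (a.isRedirectSource != b.isRedirectSource)
        return a.isRedirectSource ? 1 : -1;
      // 2. Then prefer the higher score (OPIC or whatever is configured).
      int byScore = Float.compare(b.score, a.score);
      if (byScore != 0) return byScore;
      // 3. Only as a last tie-breaker: prefer the shorter url.
      return a.url.length() - b.url.length();
    }
  }

  /** Minimal value holder for the sketch above. */
  class UrlCandidate {
    String url;
    float score;
    boolean isRedirectSource;
  }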
3. Link and anchor information for aliases and redirects.
---------------------------------------------------------
This issue has been briefly discussed in NUTCH-353. Inlink information
should be "merged": link information from all "aliases" should be
aggregated and attached to a selected canonical target URL.
We should also merge their scores. If example.com (with score 4.0) is
an alias for www.example.com (with score 8.0), the selected url (which
I think, as I said before, should be www.example.com) should end up
with a score of 12.0. We may not want to do this for aliases in
different domains, but I think we should definitely do this if two urls
with the same content are under the same domain (like example.com).
I think you are right - at least with the OPIC scoring it would work ok.
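Roughly something like this, I suppose - just a sketch in plain Java (no
CrawlDb types), and the same-domain check is naive, it would have to
handle multi-part TLDs like .co.uk properly:

  import java.net.URL;

  /** Sketch: aggregate alias scores only when both urls share a domain. */
  public class AliasScoreMerger {

    /** Returns the score the canonical url should end up with. */
    public static float mergedScore(String canonicalUrl, float canonicalScore,
                                    String aliasUrl, float aliasScore)
        throws Exception {
      if (sameDomain(canonicalUrl, aliasUrl)) {
        // example.com (4.0) aliasing www.example.com (8.0) -> 12.0
        return canonicalScore + aliasScore;
      }
      // Aliases across different domains keep their scores separate for now.
      return canonicalScore;
    }

    /** Naive check: compare the last two host components. */
    private static boolean sameDomain(String u1, String u2) throws Exception {
      return domain(new URL(u1).getHost()).equals(domain(new URL(u2).getHost()));
    }

    private static String domain(String host) {
      String[] parts = host.split("\\.");
      if (parts.length <= 2) return host;
      return parts[parts.length - 2] + "." + parts[parts.length - 1];
    }
  }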
Regarding Lucene indexes - we could either duplicate all data for each
non-canonical URL, i.e. create as many full-blown Lucene documents as
there are aliases, or we could create special "redirect" documents
that would point to a URL which contains the full data ...
We can avoid both. Let's assume A redirects to B, C also redirects to
B, and B redirects to D. After the fetch/parse/updatedb cycle that
processes D we would probably have enough data to choose the 'canonical
url' (let's assume the canonical is B). Then during the Indexer's reduce
we can just index the parse text and parse data (and whatever else) of D
under url B, since we won't index B (or A or C) as itself (it doesn't
have any useful content after all).
Hmm. The index should somehow contain _all_ urls that point to the
same document. I.e. when you search for the url "http://example.com" it
should ideally return exactly the same Lucene document as when you
search for "http://www.example.com/index.html".
Similarly, the inlink information for all "aliased" urls should be the
same (but in our case it's not a Lucene issue, only the LinkDb aliasing
issue).
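On the Lucene side I imagine it could look roughly like this - a sketch
only, using plain Lucene 2.x Fields; the field names "url" and "content"
are placeholders and not necessarily what the indexing filters really use.
The content of D is indexed once, under the canonical url B, and every
alias is added as an extra indexed "url" value, so that a search for any
of them returns the same document:

  import java.util.List;

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  /** Sketch: one Lucene document per canonical url, carrying all aliases. */
  public class CanonicalDocBuilder {

    public static Document build(String canonicalUrl, List<String> aliasUrls,
                                 String parseText) {
      Document doc = new Document();

      // The canonical url (B in the example) is the stored, "primary" url.
      doc.add(new Field("url", canonicalUrl,
                        Field.Store.YES, Field.Index.UN_TOKENIZED));

      // Every alias (A, C and D) is indexed under the same field name, so
      // a query for any of them matches this single document.
      for (String alias : aliasUrls) {
        doc.add(new Field("url", alias,
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
      }

      // The parse text of D, indexed only once.
      doc.add(new Field("content", parseText,
                        Field.Store.NO, Field.Index.TOKENIZED));
      return doc;
    }
  }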
That's it for now ... Any comments or suggestions to the above are welcome!
Andrzej, have you written any code? I would suggest that we open a
JIRA and have some code (no matter how half-baked it is) as soon
as we can.
Not yet - I'll open the issue and put these initial thoughts there.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com