Doğacan Güney wrote:
If the same content is available under multiple urls, I think it makes
sense to assume that the url with the highest score should be 'the
representative' url.
Not necessarily - it depends on how you define your score.
http://www.ibm.com/ may actually have a low score, because it
immediately redirects to http://www.ibm.com/index.html (actually, it
redirects to http://www.ibm.com/us/index.html).
Also, "the shortest url wins" rule is not always true. Let's say I own a
domain a.biz, and I made a Wikipedia mirror there. Which of the pages is
more representative: http://a.biz/About_Wikipedia or
http://www.wikipedia.org/en/About_Wikipedia ?
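If we do end up needing such a rule, it would probably have to combine
several weak signals rather than rely on any single one. A rough sketch of
what I mean (plain Java, not a Nutch API; the signals and their ordering
below are arbitrary and only illustrative):

  import java.util.Comparator;

  /**
   * Illustrative only: picks a "representative" url among duplicates by
   * combining several weak signals instead of relying on a single rule.
   */
  public class RepresentativeUrlComparator implements Comparator<UrlCandidate> {
    public int compare(UrlCandidate a, UrlCandidate b) {
      // 1. Prefer a url that is not itself just a redirect source
      //    (e.g. http://www.ibm.com/ redirecting to /us/index.html).
      if (a.isRedirectSource != b.isRedirectSource)
        return a.isRedirectSource ? 1 : -1;
      // 2. Then prefer the higher score (OPIC or whatever is configured).
      int byScore = Float.compare(b.score, a.score);
      if (byScore != 0) return byScore;
      // 3. Only as a last tie-breaker: prefer the shorter url.
      return a.url.length() - b.url.length();
    }
  }

  /** Minimal value holder for the sketch above. */
  class UrlCandidate {
    String url;
    float score;
    boolean isRedirectSource;
  }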
3. Link and anchor information for aliases and redirects.
---------------------------------------------------------
This issue has been briefly discussed in NUTCH-353. Inlink information
should be "merged": link information from all "aliases" should be
aggregated and attached to a selected canonical target URL.
We should also merge their scores. If example.com (with score 4.0) is
an alias for www.example.com (with score 8.0), the selected url (which
I think, as I said before, should be www.example.com) should end up
with a score of 12.0. We may not want to do this for aliases in
different domains, but I think we should definitely do this if two urls
with the same content are under the same domain (like example.com).
I think you are right - at least with the OPIC scoring it would work ok.
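Roughly something like this, I suppose - just a sketch in plain Java (no
CrawlDb types), and the same-domain check is naive, it would have to
handle multi-part TLDs like .co.uk properly:

  import java.net.URL;

  /** Sketch: aggregate alias scores only when both urls share a domain. */
  public class AliasScoreMerger {

    /** Returns the score the canonical url should end up with. */
    public static float mergedScore(String canonicalUrl, float canonicalScore,
                                    String aliasUrl, float aliasScore)
        throws Exception {
      if (sameDomain(canonicalUrl, aliasUrl)) {
        // example.com (4.0) aliasing www.example.com (8.0) -> 12.0
        return canonicalScore + aliasScore;
      }
      // Aliases across different domains keep their scores separate for now.
      return canonicalScore;
    }

    /** Naive check: compare the last two host components. */
    private static boolean sameDomain(String u1, String u2) throws Exception {
      return domain(new URL(u1).getHost()).equals(domain(new URL(u2).getHost()));
    }

    private static String domain(String host) {
      String[] parts = host.split("\\.");
      if (parts.length <= 2) return host;
      return parts[parts.length - 2] + "." + parts[parts.length - 1];
    }
  }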
Regarding Lucene indexes - we could either duplicate all data for each
non-canonical URL, i.e. create as many full-blown Lucene documents as
there are aliases, or we could create special "redirect" documents
that would point to a URL which contains the full data ...
We can avoid both. Let's assume A redirects to B, C also redirects to
B, and B redirects to D. After the fetch/parse/updatedb cycle that
processes D we would probably have enough data to choose the 'canonical
url' (let's assume the canonical is B). Then during the Indexer's reduce
we can just index the parse text and parse data (and whatever else) of D
under url B, since we won't index B (or A or C) as itself (it doesn't
have any useful content after all).
Hmm. The index should somehow contain _all_ urls that point to the
same document. I.e. when you search for the url "http://example.com" it
should ideally return exactly the same Lucene document as when you
search for "http://www.example.com/index.html".
Similarly, the inlink information for all "aliased" urls should be the
same (but in our case it's not a Lucene issue, only the LinkDb aliasing
issue).
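On the Lucene side I imagine it could look roughly like this - a sketch
only, using plain Lucene 2.x Fields; the field names "url" and "content"
are placeholders and not necessarily what the indexing filters really use.
The content of D is indexed once, under the canonical url B, and every
alias is added as an extra indexed "url" value, so that a search for any
of them returns the same document:

  import java.util.List;

  import org.apache.lucene.document.Document;
  import org.apache.lucene.document.Field;

  /** Sketch: one Lucene document per canonical url, carrying all aliases. */
  public class CanonicalDocBuilder {

    public static Document build(String canonicalUrl, List<String> aliasUrls,
                                 String parseText) {
      Document doc = new Document();

      // The canonical url (B in the example) is the stored, "primary" url.
      doc.add(new Field("url", canonicalUrl,
                        Field.Store.YES, Field.Index.UN_TOKENIZED));

      // Every alias (A, C and D) is indexed under the same field name, so
      // a query for any of them matches this single document.
      for (String alias : aliasUrls) {
        doc.add(new Field("url", alias,
                          Field.Store.YES, Field.Index.UN_TOKENIZED));
      }

      // The parse text of D, indexed only once.
      doc.add(new Field("content", parseText,
                        Field.Store.NO, Field.Index.TOKENIZED));
      return doc;
    }
  }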
That's it for now ... Any comments or suggestions to the above are welcome!
Andrzej, have you written any code? I would suggest that we open a
JIRA and have some code (no matter how half-baked it is) as soon
as we can.
Not yet - I'll open the issue and put these initial thoughts there.
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com