On 8/21/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Doğacan Güney wrote:
>
> > If the same content is available under multiple urls, I think it makes
> > sense to assume that the url with the highest score should be 'the
> > representative' url.
>
> Not necessarily - it depends on how you define your score.
> http://www.ibm.com/ may actually have a low score, because it
> immediately redirects to http://www.ibm.com/index.html (actually, it
> redirects to http://www.ibm.com/us/index.html).
>
> Also, "the shortest url wins" rule is not always true. Let's say I own a
> domain a.biz, and I made a Wikipedia mirror there. Which of the pages is
> more representative: http://a.biz/About_Wikipedia or
> http://www.wikipedia.org/en/About_Wikipedia ?
>
> >> 3. Link and anchor information for aliases and redirects.
> >> ---------------------------------------------------------
> >> This issue has been briefly discussed in NUTCH-353. Inlink information
> >> should be "merged" so that all link information from all "aliases" is
> >> aggregated, so that it points to a selected canonical target URL.
> >
> > We should also merge their score. If example.com (with score 4.0) is
> > an alias for www.example.com (with score 8.0), the selected url (which
> > I think, as I said before, should be www.example.com) should end up
> > with the score 12.0. We may not want to do this for aliases in
> > different domains but I think we should definitely do this if two urls
> > with the same content are under the same domain (like example.com).
>
> I think you are right - at least with the OPIC scoring it would work ok.
>
> >> Regarding Lucene indexes - we could either duplicate all data for each
> >> non-canonical URL, i.e. create as many full-blown Lucene documents as
> >> there are aliases, or we could create special "redirect" documents
> >> that would point to a URL which contains the full data ...
> >
> > We can avoid doing both.
> > Let's assume A redirects to B, C also
> > redirects to B and B redirects to D. After the fetch/parse/updatedb
> > cycle that processes D we would probably have enough data to choose
> > the 'canonical url' (let's assume that canonical is B). Then during
> > Indexer's reduce we can just index parse text and parse data (and
> > whatever else) of D under url B since we won't index B (or A or C) as
> > itself (it doesn't have any useful content after all).
>
> Hmm. The index should somehow contain _all_ urls, which point to the
> same document. I.e. when you search for url "http://example.com" it
> should ideally return exactly the same Lucene document as when you
> search for "http://www.example.com/index.html".

Why would you do a search with the full name of the url? I also don't
understand why we need to have all urls in the index (we already eliminate
near-duplicates with dedup). I guess I am missing your use case here...

> Similarly, the inlink information for all "aliased" urls should be the
> same (but in our case it's not a Lucene issue, only the LinkDb aliasing
> issue).

I agree with you here.

> >> That's it for now ... Any comments or suggestions to the above are welcome!
> >
> > Andrzej, have you written any code? I would suggest that we open a
> > JIRA and have some code (no matter how half-baked it is) as soon
> > as we can.
>
> Not yet - I'll open the issue and put these initial thoughts there.
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

-- 
Doğacan Güney
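
To make the alias handling discussed in this thread more concrete (follow redirect chains such as A -> B -> D, group every url that ends at the same target, and sum the scores of aliases), here is a rough sketch. It is only an illustration in Python, not Nutch code: the function names, the plain dict used as a redirect map, and the "highest score wins" choice of canonical url are all assumptions made for the example. As noted above, the highest score is not always the right rule, and the caller is assumed to pass only aliases from the same domain to merge_scores.

```python
from collections import defaultdict


def resolve_target(url, redirects):
    """Follow the (hypothetical) redirect map until a url that does not
    redirect any further; stop if the chain loops back on itself."""
    seen = set()
    while url in redirects and url not in seen:
        seen.add(url)
        url = redirects[url]
    return url


def group_aliases(urls, redirects):
    """Group urls by the final target their redirect chain ends at."""
    groups = defaultdict(set)
    for url in urls:
        groups[resolve_target(url, redirects)].add(url)
    return dict(groups)


def merge_scores(aliases, scores):
    """Assumed policy: the highest-scoring alias becomes the canonical url
    and receives the sum of all alias scores. Sorting first makes ties
    resolve the same way on every run."""
    canonical = max(sorted(aliases), key=lambda u: scores.get(u, 0.0))
    total = sum(scores.get(u, 0.0) for u in aliases)
    return canonical, total


if __name__ == "__main__":
    # The chain from the thread: A and C redirect to B, B redirects to D.
    redirects = {"A": "B", "C": "B", "B": "D"}
    print(group_aliases(["A", "B", "C", "D"], redirects))

    # The score example from the thread: 4.0 + 8.0.
    scores = {"http://example.com/": 4.0, "http://www.example.com/": 8.0}
    print(merge_scores(set(scores), scores))
```

With the scores from the thread, merge_scores picks http://www.example.com/ and gives it 12.0. Keeping the whole alias group around (rather than only the canonical url) is what would let an index or LinkDb answer lookups for any of the aliased urls.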
