On 8/21/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Doğacan Güney wrote:
>
> > If the same content is available under multiple urls, I think it makes
> > sense to assume that the url with the highest score should be 'the
> > representative'  url.
>
> Not necessarily - it depends on how you define your score.
> http://www.ibm.com/ may actually have a low score, because it
> immediately redirects to http://www.ibm.com/index.html (actually, it
> redirects to http://www.ibm.com/us/index.html).
>
> Also, "the shortest url wins" rule is not always true. Let's say I own a
> domain a.biz, and I made a Wikipedia mirror there. Which of the pages is
> more representative: http://a.biz/About_Wikipedia or
> http://www.wikipedia.org/en/About_Wikipedia ?
>
>
> >> 3. Link and anchor information for aliases and redirects.
> >> ---------------------------------------------------------
> >> This issue has been briefly discussed in NUTCH-353. Inlink information
> >> should be "merged" so that all link information from all "aliases" is
> >> aggregated, so that it points to a selected canonical target URL.
> >
> > We should also merge their scores. If example.com (with score 4.0) is
> > an alias for www.example.com (with score 8.0), the selected url (which
> > I think, as I said before, should be www.example.com) should end up
> > with the score 12.0. We may not want to do this for aliases in
> > different domains but I think we should definitely do this if two urls
> > with the same content are under the same domain (like example.com).
>
> I think you are right - at least with the OPIC scoring it would work ok.
>
>
> >>
> >> Regarding Lucene indexes - we could either duplicate all data for each
> >> non-canonical URL, i.e. create as many full-blown Lucene documents as
> >> there are aliases, or we could create special "redirect" documents
> >> that would point to a URL which contains the full data ...
> >
> > We can avoid doing both. Let's assume A redirects to B, C also
> > redirects to B and B redirects to D. After the fetch/parse/updatedb
> > cycle that processes D we would probably have enough data to choose
> > the 'canonical url' (let's assume that canonical is B). Then during
> > Indexer's reduce we can just index parse text and parse data (and
> > whatever else) of D under url B since we won't index B (or A or C) as
> > itself (it doesn't have any useful content after all).
>
> Hmm. The index should somehow contain _all_ urls, which point to the
> same document. I.e. when you search for url "http://example.com" it
> should ideally return exactly the same Lucene document as when you
> search for "http://www.example.com/index.html".

Why would you do a search with the full url? I also don't
understand why we need to have all urls in the index (we already eliminate
near-duplicates with dedup). I guess I am missing your use case
here...
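
For reference, the redirect scenario quoted above (A and C redirect to B,
B redirects to D) could be sketched roughly as follows. This is only an
illustration under assumptions: the class and method names are hypothetical
and not part of Nutch, urls are plain strings, and the canonical url is
simply assumed to be B, since how to choose it is still open.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; these names are illustrative and not Nutch classes.
class RedirectChainSketch {

    /** Follows redirects from url until a url with no further redirect is reached. */
    static String resolveTarget(Map<String, String> redirects, String url) {
        Set<String> seen = new HashSet<>();
        String current = url;
        // seen.add() returns false on a revisit, which stops redirect loops.
        while (redirects.containsKey(current) && seen.add(current)) {
            current = redirects.get(current);
        }
        return current;
    }

    /** Groups every redirecting url under the final url it resolves to. */
    static Map<String, Set<String>> aliasesByTarget(Map<String, String> redirects) {
        Map<String, Set<String>> groups = new LinkedHashMap<>();
        for (String source : redirects.keySet()) {
            String target = resolveTarget(redirects, source);
            groups.computeIfAbsent(target, k -> new TreeSet<>()).add(source);
        }
        return groups;
    }

    public static void main(String[] args) {
        Map<String, String> redirects = new LinkedHashMap<>();
        redirects.put("A", "B");
        redirects.put("C", "B");
        redirects.put("B", "D");

        // Assumption: B was picked as canonical by some rule not shown here.
        String canonical = "B";

        // Index the content fetched from the final target under the canonical url.
        Map<String, String> index = new LinkedHashMap<>();
        index.put(canonical, "parse text of " + resolveTarget(redirects, canonical));

        System.out.println(aliasesByTarget(redirects));
        System.out.println(index);
    }
}
```

The point is that only one entry is indexed (D's content under B), while
the alias group for D stays available if we later decide we need it.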

>
> Similarly, the inlink information for all "aliased" urls should be the
> same (but in our case it's not a Lucene issue, only the LinkDb aliasing
> issue).

I agree with you here.
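
To make the merge concrete, here is a minimal sketch of what I have in
mind: once a canonical url is chosen, the inlinks of all its aliases are
unioned, and (for aliases under the same domain) their scores are summed.
The class and method names are hypothetical and are not part of Nutch or
the LinkDb API, and plain summing is just the simple rule from the
example.com case above, not a tested scoring policy.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Nutch.
class AliasMergeSketch {

    /** Sums the scores of all aliases, as in the 4.0 + 8.0 = 12.0 example. */
    static double mergeScores(Map<String, Double> aliasScores) {
        double total = 0.0;
        for (double score : aliasScores.values()) {
            total += score;
        }
        return total;
    }

    /** Unions the inlinks of all aliases, dropping duplicate source urls. */
    static Set<String> mergeInlinks(Map<String, List<String>> inlinksByAlias) {
        Set<String> merged = new LinkedHashSet<>();
        for (List<String> inlinks : inlinksByAlias.values()) {
            merged.addAll(inlinks);
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, Double> scores = new LinkedHashMap<>();
        scores.put("http://example.com/", 4.0);
        scores.put("http://www.example.com/", 8.0);

        // The inlink source urls below are made up for the example.
        Map<String, List<String>> inlinks = new LinkedHashMap<>();
        inlinks.put("http://example.com/", Arrays.asList("http://x.org/", "http://y.org/"));
        inlinks.put("http://www.example.com/", Arrays.asList("http://y.org/", "http://z.org/"));

        System.out.println("merged score: " + mergeScores(scores));
        System.out.println("merged inlinks: " + mergeInlinks(inlinks));
    }
}
```

Both merged values would then be attached to the canonical url only.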

>
>
> >
> >>
> >> That's it for now ... Any comments or suggestions to the above are welcome!
> >
> > Andrzej, have you written any code? I would suggest that we open a
> > JIRA and have some code (no matter how half-baked it is) as soon
> > as we can.
>
> Not yet - I'll open the issue and put these initial thoughts there.
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Doğacan Güney
