On 8/21/07, Andrzej Bialecki <[EMAIL PROTECTED]> wrote:
> Doğacan Güney wrote:
>
> >> Hmm. The index should somehow contain _all_ urls, which point to the
> >> same document. I.e. when you search for url "http://example.com" it
> >> should ideally return exactly the same Lucene document as when you
> >> search for "http://www.example.com/index.html".
> >
> > Why would you do a search with the full name of the url? I also don't
> > understand why we need to have all urls in index (we already eliminate
> > near-duplicates with dedup).  I guess I am missing your use case
> > here...
>
> Let's say I'm searching for "test" and I want to limit the search to a
> particular url. I enter a query:
>
>         test url:example.com
>
> It should yield the same results as for the following query:
>
>         test url:www.example.com
>
> (assuming they are "aliases").


I guess we can do something like this (continuing from my example
above): index D's data under B, then add an alias field to the Lucene
document with A, C and D in it. Then change query-url so that a "url:"
query also searches the alias field.
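
A minimal sketch of the alias-field idea, using a toy in-memory index
rather than Lucene. Everything here is hypothetical (the ToyIndex class,
the host_of helper, and the simplification that a "url:" clause matches
on the host only); it does not reflect how query-url is actually
implemented.

```python
from urllib.parse import urlsplit


def host_of(url):
    """Return the lowercased host part of a URL (simplified 'url:' term)."""
    return urlsplit(url).hostname


class ToyIndex:
    """Toy stand-in for an index; this is NOT the Lucene API."""

    def __init__(self):
        self.docs = []       # doc id -> stored fields
        self.postings = {}   # (field, term) -> set of doc ids

    def _post(self, field, term, doc_id):
        self.postings.setdefault((field, term), set()).add(doc_id)

    def add(self, url, aliases, text):
        """Index one document under its canonical url plus an alias field."""
        doc_id = len(self.docs)
        self.docs.append({"url": url, "alias": list(aliases), "text": text})
        self._post("url", host_of(url), doc_id)
        for alias in aliases:
            self._post("alias", host_of(alias), doc_id)
        for word in text.lower().split():
            self._post("text", word, doc_id)
        return doc_id

    def search(self, word, url_term=None):
        """Find docs containing word; a url clause checks url OR alias."""
        hits = set(self.postings.get(("text", word.lower()), set()))
        if url_term is not None:
            by_url = self.postings.get(("url", url_term), set())
            by_alias = self.postings.get(("alias", url_term), set())
            hits &= by_url | by_alias
        return sorted(hits)


index = ToyIndex()
index.add("http://www.example.com/index.html",
          aliases=["http://example.com/"],
          text="this is a test page")

# Both forms of the query reach the same single document.
same = index.search("test", "example.com") == index.search("test", "www.example.com")
```

The document is stored once; only the "url:" lookup is widened to the
alias field, so the two queries above return the same hit.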

>
> Another, more realistic example: I'm searching for IBM products. So I
> enter a query:
>
>         products site:ibm.com
>
> This should yield the same results as any of the following:
>
>         products site:www.ibm.com
>         products site:www-128.ibm.com
>         products site:www-304.ibm.com
>

Thanks for the explanation.

How do we know that www.ibm.com and www-128.ibm.com hosts are perfect
mirrors of one another? All we can know is that http://www.ibm.com/
and http://www-128.ibm.com/ *urls* are aliases of one another and that
for the urls that we have fetched *so far* they seem to mirror each
other. It is possible that the next URL we fetch from one of those
sites does not exist in the other. I don't think that we can ever be
certain that they are perfect mirrors of each other so, IMHO, we
shouldn't treat those queries as the same. Google also doesn't return
the same results for "products site:www.ibm.com" and "products
site:www-128.ibm.com".
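
To illustrate the distinction between URL-level and host-level aliases,
here is a small sketch. The mirror_evidence function, the url-to-digest
mapping and the digest values are all made up for illustration; the
point is only that any path fetched from one host but not the other
stays unverified.

```python
from urllib.parse import urlsplit


def mirror_evidence(fetched, host_a, host_b):
    """Compare two hosts using only the pages fetched so far.

    fetched maps url -> content digest (hypothetical values).
    """
    by_host = {host_a: {}, host_b: {}}
    for url, digest in fetched.items():
        parts = urlsplit(url)
        if parts.hostname in by_host:
            by_host[parts.hostname][parts.path or "/"] = digest
    a, b = by_host[host_a], by_host[host_b]
    shared = set(a) & set(b)
    return {
        # Paths fetched from both hosts with identical content.
        "agree": sorted(p for p in shared if a[p] == b[p]),
        # Paths fetched from both hosts with different content.
        "differ": sorted(p for p in shared if a[p] != b[p]),
        # Paths seen on one host only: nothing is known about the other.
        "unverified": sorted(set(a) ^ set(b)),
    }


fetched = {
    "http://www.ibm.com/": "d1",
    "http://www-128.ibm.com/": "d1",
    "http://www.ibm.com/products": "d2",
}
evidence = mirror_evidence(fetched, "www.ibm.com", "www-128.ibm.com")
```

Here the two root URLs agree, but "/products" has only been fetched
from one host, so the data cannot say whether the hosts mirror each
other there.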

(One small unrelated note: as discussed in NUTCH-439 and NUTCH-445, we
should treat site:ibm.com as matching all hosts under the domain
ibm.com, even if http://www.ibm.com/ and http://ibm.com/ are perfect
mirrors of each other.)
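
One plausible reading of that "site:" behaviour, as a sketch. The
host_in_site helper is hypothetical and is not taken from the
NUTCH-439 or NUTCH-445 patches.

```python
def host_in_site(host, site):
    """True if host equals site or is a subdomain of it."""
    host = host.lower().rstrip(".")
    site = site.lower().rstrip(".")
    # Require a dot boundary so that "notibm.com" does not match "ibm.com".
    return host == site or host.endswith("." + site)


hosts = ["ibm.com", "www.ibm.com", "www-128.ibm.com", "notibm.com"]
matched = [h for h in hosts if host_in_site(h, "ibm.com")]
```

With this rule site:ibm.com covers every host under the domain, while
site:www.ibm.com stays limited to that one host and its subdomains.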


> --
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Doğacan Güney
