Doğacan Güney wrote:

Hmm. The index should somehow contain _all_ urls, which point to the
same document. I.e. when you search for url "http://example.com" it
should ideally return exactly the same Lucene document as when you
search for "http://www.example.com/index.html".

Why would you do a search with the full name of the url? I also don't
understand why we need to have all urls in index (we already eliminate
near-duplicates with dedup).  I guess I am missing your use case
here...

Let's say I'm searching for "test" and I want to limit the search to a particular url. I enter a query:

        test url:example.com

It should yield the same results as for the following query:

        test url:www.example.com

(assuming they are "aliases").

Another, more realistic example: I'm searching for IBM products. So I enter a query:

        products site:ibm.com

This should yield the same results as any of the following:

        products site:www.ibm.com
        products site:www-128.ibm.com
        products site:www-304.ibm.com
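
As a rough illustration of the behaviour described above, here is a toy in-memory sketch in Python. It is not Nutch or Lucene code: the alias groups, the site_values() helper and the search() function are all hypothetical, and the alias data is simply taken from the examples in this message. The idea is that every document is indexed under all host names known to be aliases of its host, so a site: restriction on any alias matches the same documents.

```python
# Hypothetical alias groups: each set lists host names assumed to serve
# the same content. In a real system this data would have to come from
# somewhere (e.g. redirect or duplicate detection); here it is hard-coded.
ALIAS_GROUPS = [
    {"ibm.com", "www.ibm.com", "www-128.ibm.com", "www-304.ibm.com"},
    {"example.com", "www.example.com"},
]


def site_values(host):
    """Return every host name to store in the 'site' field for this host."""
    for group in ALIAS_GROUPS:
        if host in group:
            return set(group)
    # Unknown hosts have no aliases and are indexed under their own name.
    return {host}


def search(docs, term, site):
    """Toy search: ids of docs containing `term` and indexed under `site`."""
    return [
        doc["id"]
        for doc in docs
        if term in doc["text"].split() and site in doc["site"]
    ]


# A tiny stand-in for the index: the 'site' field holds all aliases.
docs = [
    {"id": 1, "text": "ibm products overview",
     "site": site_values("www-128.ibm.com")},
    {"id": 2, "text": "other products",
     "site": site_values("www.example.com")},
]
```

With this setup, search(docs, "products", "ibm.com") and search(docs, "products", "www-304.ibm.com") select the same document, even though it was fetched from www-128.ibm.com. The cost is a larger site field per document, and the result is only as good as the alias data.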

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
