Doğacan Güney wrote:
>> Hmm. The index should somehow contain _all_ URLs that point to the
>> same document. I.e. when you search for the URL "http://example.com" it
>> should ideally return exactly the same Lucene document as when you
>> search for "http://www.example.com/index.html".
>
> Why would you do a search with the full name of the URL? I also don't
> understand why we need to have all URLs in the index (we already eliminate
> near-duplicates with dedup). I guess I am missing your use case
> here...

Let's say I'm searching for "test" and I want to limit the search to a
particular url. I enter a query:
test url:example.com
It should yield the same results as for the following query:
test url:www.example.com
(assuming they are "aliases").
Another, more realistic example: I'm searching for IBM products. So I
enter a query:
products site:ibm.com
This should yield the same results as any of the following:
products site:www.ibm.com
products site:www-128.ibm.com
products site:www-304.ibm.com
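
To make the idea concrete, here is a rough sketch of what I mean, in plain
Python rather than real Nutch/Lucene code. Everything in it is hypothetical:
the ALIAS_GROUPS table, the aliases_of() helper and the TinyIndex class are
made up for illustration, and whether the IBM hosts really are aliases of
each other is only assumed. The point is that each document is indexed under
every host known to serve it, so a site: restriction matches whichever alias
the user types.

```python
# Hypothetical alias table: every host in a group is assumed to serve
# the same content. In practice this would have to be discovered somehow.
ALIAS_GROUPS = [
    {"ibm.com", "www.ibm.com", "www-128.ibm.com", "www-304.ibm.com"},
    {"example.com", "www.example.com"},
]


def aliases_of(host):
    """Return all hosts assumed equivalent to `host` (including itself)."""
    host = host.lower()
    for group in ALIAS_GROUPS:
        if host in group:
            return frozenset(group)
    return frozenset({host})


class TinyIndex:
    """Toy in-memory index; a stand-in for a real Lucene index."""

    def __init__(self):
        self.docs = []  # list of (site_hosts, words, url)

    def add(self, url, host, text):
        # Store the document under every alias of its host, not just one.
        words = set(text.lower().split())
        self.docs.append((aliases_of(host), words, url))

    def search(self, term, site=None):
        # Return URLs of documents containing `term`, optionally limited
        # to documents whose alias set contains `site`.
        term = term.lower()
        hits = []
        for hosts, words, url in self.docs:
            if term in words and (site is None or site.lower() in hosts):
                hits.append(url)
        return hits


index = TinyIndex()
index.add("http://www-128.ibm.com/products", "www-128.ibm.com", "IBM products list")
index.add("http://other.org/", "other.org", "other products")

# Both queries are restricted to an alias of the same site.
print(index.search("products", site="ibm.com"))
print(index.search("products", site="www-304.ibm.com"))
```

The same effect could be had by expanding the site: term at query time
instead of at indexing time; the sketch only picks the indexing side because
that is what "the index should contain all URLs" suggests.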
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com