You are correct that MCF considers these two URLs to be different documents.
Ideally, URL canonicalization would pick one form and map the other to it,
but this is hard, and I know of no automatic algorithm that would work in
general.

Other crawlers have this problem too, but let me think about it.  Maybe we can 
add some kind of optional rule to the canonicalization tab.
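
As a rough illustration of what such an optional rule might do (this is only
a sketch, not an existing MCF feature), the Python below maps URLs ending in
a directory index file to the bare directory URL. The function name
strip_index_file and the INDEX_NAMES list are hypothetical, and the rule
assumes the server really does serve the same content for both forms, which
is not guaranteed and is why it would have to be optional.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of "directory index" file names; not an MCF setting.
INDEX_NAMES = {"index.html", "index.htm"}


def strip_index_file(url: str) -> str:
    """Map .../index.html to .../ so both forms share one canonical URL."""
    parts = urlsplit(url)
    # Split the path into everything before the last "/" and the last segment.
    head, sep, last = parts.path.rpartition("/")
    if sep and last.lower() in INDEX_NAMES:
        # Drop the index file name but keep the trailing slash.
        parts = parts._replace(path=head + "/")
    return urlunsplit(parts)


if __name__ == "__main__":
    print(strip_index_file("http://www.example.org/foo/index.html"))
    print(strip_index_file("http://www.example.org/foo/"))
```

With this rule, both example URLs from the original message come out in the
same form, so a crawler applying it would see them as one document.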

Karl

Sent from my Nokia phone
-----Original Message-----
From: Erlend Garåsen
Sent:  01/07/2011, 8:09  AM
To: [email protected]
Subject: Duplicate documents and MCF



If I understand ManifoldCF correctly, a unique document is a document 
with a distinct URL such as
http://www.example.org/foo/index.html

Therefore I guess that MCF treats the following URL as a different 
document from the example above:
http://www.example.org/foo/

After I did a huge crawl, I now have a lot of duplicate documents in my 
Solr index, and I'm not quite sure how to cope with this problem. I 
guess I have several options:
1) Give root URLs a higher score. Duplicates such as the first example 
above will then be listed further down in the search results.
2) Filter out index.html documents, but then I have no guarantee that 
the root URL has been indexed (in case links to the documents only 
pointed to index.html).
3) Store a hash value generated from each document's content and use it 
as the document's unique ID.
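
Option 3 could look roughly like the Python sketch below. It is only an
illustration under assumptions: the function names content_signature and
drop_duplicates are made up, SHA-256 over whitespace-normalized text is one
possible choice of hash, and only byte-identical content (after
normalization) is treated as a duplicate.

```python
import hashlib


def content_signature(text: str) -> str:
    """Return a hex digest of the content, ignoring whitespace differences."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_duplicates(docs):
    """Keep the first URL seen for each distinct content signature.

    docs is an iterable of (url, content) pairs; the result maps each
    signature to the URL that was kept.
    """
    kept = {}
    for url, content in docs:
        kept.setdefault(content_signature(content), url)
    return kept


if __name__ == "__main__":
    docs = [
        ("http://www.example.org/foo/", "<html>Hello  world</html>"),
        ("http://www.example.org/foo/index.html", "<html>Hello world</html>"),
    ]
    for signature, url in drop_duplicates(docs).items():
        print(signature[:12], url)
```

Which URL survives depends on the order in which the documents arrive, so
this alone does not guarantee that the root URL is the one that is kept.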

Erlend
-- 
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
