You are correct that these two URLs are considered different documents by MCF. Ideally, URL canonicalization would choose one or the other and do the right mapping, but this is hard, and I know of no automatic algorithm that would work.
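For illustration, a hand-written mapping of this kind could collapse a trailing default document name into its directory URL. This is a minimal Python sketch, not ManifoldCF code: the `canonicalize` function and the `DEFAULT_DOCUMENTS` list are hypothetical, and the rule is only safe on sites where both URLs really serve the same page.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical list of file names treated as a directory's default document.
# Which names a server actually uses is site-specific; this is an assumption.
DEFAULT_DOCUMENTS = ("index.html", "index.htm")


def canonicalize(url: str) -> str:
    """Map a URL ending in a default document name to its directory URL."""
    parts = urlsplit(url)
    # Split the path into its directory part and its last segment.
    directory, _, filename = parts.path.rpartition("/")
    if filename.lower() in DEFAULT_DOCUMENTS:
        # Keep the trailing slash so /foo/index.html becomes /foo/
        parts = parts._replace(path=directory + "/")
    return urlunsplit(parts)


print(canonicalize("http://www.example.org/foo/index.html"))
print(canonicalize("http://www.example.org/foo/"))
```

Both calls yield the directory form, so the two URLs would collapse to a single document key. A server that returns different content for the two URLs would break this rule, which is why it would have to be optional.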
Other crawlers have this problem too, but let me think about it. Maybe we can add some kind of optional rule to the canonicalization tab.

Karl

Sent from my Nokia phone

-----Original Message-----
From: Erlend Garåsen
Sent: 01/07/2011, 8:09 AM
To: [email protected]
Subject: Duplicate documents and MCF

If I understand ManifoldCF correctly, a unique document is a document with a distinct URL such as
http://www.example.org/foo/index.html

Therefore I guess that MCF treats the following document as different compared to the example above:
http://www.example.org/foo/

After I did a huge crawl, I now have a lot of duplicate documents in my Solr index, and I'm not quite sure how to cope with this problem.

I guess I have several options:
1) Give root URLs a higher score. Then duplicates such as the first example above will be listed further down in the search result list.
2) Filter out index.html documents, but then I do not have any guarantee that the root URL has been indexed (in case links to the documents were only pointing to index.html).
3) Store a hashed value generated out of the documents' content in order to give them a unique id.

Erlend

--
Erlend Garåsen
Center for Information Technology Services
University of Oslo
P.O. Box 1086 Blindern, N-0317 OSLO, Norway
Ph: (+47) 22840193, Fax: (+47) 22852970, Mobile: (+47) 91380968, VIP: 31050
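
Option 3 in the quoted message (a content hash used as the unique id) might look roughly like the following. This is a minimal Python sketch under stated assumptions, not how ManifoldCF or Solr handles it: `content_id` and `deduplicate` are hypothetical helper names, the sample documents are made up, and an exact hash only catches byte-identical copies, so pages that differ by a timestamp or a session token would still count as distinct.

```python
import hashlib
from typing import Dict


def content_id(content: bytes) -> str:
    """Return the hex SHA-256 digest of the raw document bytes."""
    return hashlib.sha256(content).hexdigest()


def deduplicate(documents: Dict[str, bytes]) -> Dict[str, str]:
    """Map each distinct content digest to the first URL seen with that content."""
    kept: Dict[str, str] = {}
    for url, content in documents.items():
        # setdefault leaves the existing entry alone, so later duplicates are dropped.
        kept.setdefault(content_id(content), url)
    return kept


# Made-up sample crawl: the first two URLs serve identical bytes.
crawled = {
    "http://www.example.org/foo/": b"<html>foo</html>",
    "http://www.example.org/foo/index.html": b"<html>foo</html>",
    "http://www.example.org/bar/": b"<html>bar</html>",
}
unique = deduplicate(crawled)
print(sorted(unique.values()))
```

Which URL survives depends on crawl order here; combining this with a preference for the shorter (root) URL would address the concern raised in option 2.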
