On 2009-12-10 19:59, Jesse Hires wrote:
I'm seeing a lot of duplicates where a single site is getting recognized as
two different sites. Specifically I am seeing www.domain.com and
domain.com being recognized as two different sites.
I imagine there is a setting to prevent this. If so, what is the setting, if
not, what would you recommend doing to prevent this?

This is a surprisingly difficult problem to solve in the general case, because it's not always true that 'www.domain' equals 'domain'. If you do know this is true in your particular case, you can add a rule to regex-urlnormalizer that rewrites the matching URLs, e.g. to always drop the 'www.' part.
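
To make the idea concrete, here is a rough sketch of such a rewrite, written in Python only so the regular expression can be tried in isolation. The pattern and the strip_www name are my own illustration, not something shipped with the normalizer, and it assumes http(s) URLs where 'www.' is the first label of the host. The normalizer itself uses Java regex syntax, where the back-reference in the substitution would be written $1 rather than \1.

```python
import re

# Hypothetical rule: drop a leading "www." from the host of http(s) URLs.
# Only safe if www.domain and domain really serve the same content.
WWW_PREFIX = re.compile(r"^(https?://)www\.", re.IGNORECASE)


def strip_www(url: str) -> str:
    """Return url with a leading 'www.' removed from the host, if present."""
    return WWW_PREFIX.sub(r"\1", url, count=1)


if __name__ == "__main__":
    # Both spellings should collapse to the same normalized URL.
    for url in ("http://www.domain.com/page", "http://domain.com/page"):
        print(strip_www(url))
```

Hosts that merely start with the letters 'www' (e.g. 'wwwx.domain.com') are left alone, because the pattern requires the dot right after 'www'.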


--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
