On 2009-12-10 19:59, Jesse Hires wrote:
I'm seeing a lot of duplicates where a single site is getting recognized as two different sites. Specifically, I am seeing www.domain.com and domain.com being recognized as two different sites. I imagine there is a setting to prevent this. If so, what is the setting? If not, what would you recommend doing to prevent this?
This is a surprisingly difficult problem to solve in the general case, because it's not always true that 'www.domain' equals 'domain'. If you do know this is true in your particular case, you can add a rule to regex-urlnormalizer that rewrites the matching URLs, e.g. to always drop the 'www.' part.
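As a rough illustration of the kind of rewrite such a rule would perform, here is a minimal Python sketch. The pattern, the name WWW_RULE, and the function strip_www are assumptions made up for this example; they are not part of regex-urlnormalizer, whose rules are written as pattern/substitution pairs in its own configuration and use Java regex syntax (where the back-reference would be written $1 rather than \1).

```python
import re

# Hypothetical rule: match a leading "www." directly after the scheme.
# Only http and https are handled; other schemes are left untouched.
WWW_RULE = re.compile(r"^(https?://)www\.", re.IGNORECASE)


def strip_www(url: str) -> str:
    """Return url with a leading 'www.' removed from the host, if present."""
    # \1 keeps the scheme ("http://" or "https://") and drops the "www." prefix.
    return WWW_RULE.sub(r"\1", url, count=1)


if __name__ == "__main__":
    for u in ("http://www.domain.com/", "http://domain.com/"):
        print(u, "->", strip_www(u))
```

Because the pattern is anchored at the start of the URL and requires the dot after "www", hosts such as wwwx.domain.com and paths such as /www.html are left alone. Only apply a rule like this to sites where you have confirmed that both host names serve the same content.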
--
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
