For the specific case I was running into (a single known domain), using
regex-urlnormalizer did the trick. Thanks!
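
A rule of that kind might look roughly like the sketch below. It is only an
illustration under a few assumptions: domain.com is the placeholder host from
the original question, the rules live in conf/regex-normalize.xml, and the
urlnormalizer-regex plugin is enabled in plugin.includes. If the file already
exists, only the <regex> element needs to be added to it.

```xml
<?xml version="1.0"?>
<regex-normalize>
  <!-- Hypothetical rule: drop the leading "www." for one known host.
       "domain.com" is a placeholder; substitute the real host name. -->
  <regex>
    <!-- The lookahead keeps the rule from matching longer host names
         that merely start with www.domain.com -->
    <pattern>^(https?://)www\.domain\.com(?=[/:?#]|$)</pattern>
    <substitution>$1domain.com</substitution>
  </regex>
</regex-normalize>
```

With a rule like this, http://www.domain.com/page is rewritten to
http://domain.com/page, so both forms should end up as the same URL.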



Jesse

int GetRandomNumber()
{
   return 4; // Chosen by fair roll of dice
                // Guaranteed to be random
} // xkcd.com



On Thu, Dec 10, 2009 at 1:01 PM, Andrzej Bialecki <[email protected]> wrote:

> On 2009-12-10 19:59, Jesse Hires wrote:
>
>> I'm seeing a lot of duplicates where a single site is getting recognized as
>> two different sites. Specifically, I am seeing www.domain.com and domain.com
>> being recognized as two different sites.
>>
>> I imagine there is a setting to prevent this. If so, what is the setting? If
>> not, what would you recommend doing to prevent this?
>>
>
> This is a surprisingly difficult problem to solve in the general case,
> because it's not always true that 'www.domain' equals 'domain'. If you do
> know this is true in your particular case, you can add a rule to
> regex-urlnormalizer that rewrites the matching URLs, e.g. to always drop the
> 'www.' part.
>
>
>
> --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>
