Hi Otis,

On Sat, Apr 26, 2008 at 2:09 AM, <[EMAIL PROTECTED]> wrote:

> Hello,
>
> How are people dealing with avoiding page duplication where URLs are
> similar but the content is identical?  I know there is page content
> fingerprinting and shingling (MD5Signature and TextProfileSignature),
> but that assumes you have already fetched the content.  I am wondering
> if it's possible to detect things earlier than that, even if it's not
> 100% reliable.
>
> Concretely, imagine the following URLs:
>    http://example.com
>    http://www.example.com
>    http://www1.example.com
>    http://www2.example.com
>
> They are all, very likely, pointing to the same page.  One person may
> link to www.example.com and another may link to just example.com, so we
> parse 2 different URLs, when ideally we'd want just a single URL for
> each page.  Similarly, the example site may have multiple web servers
> (e.g. for load balancing), each with a slightly different name (e.g.
> www1.... , www2....), pointing to the same site.
>
> What's the best way to treat www1 and www2 as just www?
> Are people using regex-normalize.xml for that?
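
Something along these lines in regex-normalize.xml should collapse the
www1/www2-style hosts (untested, and it assumes the mirrors literally
follow a www<digit> naming scheme):

  <regex>
    <!-- fold www1.example.com, www2.example.com, ... into www.example.com -->
    <pattern>^(https?://)www\d+\.</pattern>
    <substitution>$1www.</substitution>
  </regex>

The bare example.com vs. www.example.com case is harder to handle with a
blanket rule, since the canonical form differs from site to site.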


I remember a similar discussion a while ago. IIRC, Andrzej suggested that we
could use a form of host analysis to detect whether different hosts are
mirrors of each other. That is, we run an extra job to check whether two
different hosts serve a high enough number of identical pages (identical
here meaning same fingerprint) under the same path. If so, we choose the
overall higher-scoring host (or, if the two hosts are under the same domain,
the shortest one or the one starting with www., etc.) as the representative
host and merge the URLs from both hosts under it.
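
Not Nutch code, but here is a rough sketch of that idea in plain Java, just
to make the grouping concrete (the class names, the score map and the
overlap threshold are all made up):

import java.util.*;

public class MirrorDetector {

  static class Page {
    final String host, path, fingerprint;
    Page(String host, String path, String fingerprint) {
      this.host = host; this.path = path; this.fingerprint = fingerprint;
    }
  }

  /** Maps each duplicate host to its chosen representative host. */
  static Map<String, String> findMirrors(List<Page> pages,
                                         Map<String, Double> hostScores,
                                         double minOverlap) {
    // Which hosts serve the same content (same fingerprint) under the same path?
    Map<String, Set<String>> hostsByPage = new HashMap<>();
    Map<String, Integer> pagesPerHost = new HashMap<>();
    for (Page p : pages) {
      hostsByPage.computeIfAbsent(p.path + "\t" + p.fingerprint,
                                  k -> new HashSet<>()).add(p.host);
      pagesPerHost.merge(p.host, 1, Integer::sum);
    }

    // Count shared (path, fingerprint) pairs for every pair of hosts.
    Map<String, Integer> shared = new HashMap<>();
    for (Set<String> hosts : hostsByPage.values()) {
      List<String> list = new ArrayList<>(hosts);
      Collections.sort(list);
      for (int i = 0; i < list.size(); i++)
        for (int j = i + 1; j < list.size(); j++)
          shared.merge(list.get(i) + "\t" + list.get(j), 1, Integer::sum);
    }

    // If two hosts share a high enough fraction of their pages, alias the
    // lower-scoring host to the higher-scoring one.
    Map<String, String> aliases = new HashMap<>();
    for (Map.Entry<String, Integer> e : shared.entrySet()) {
      String[] pair = e.getKey().split("\t");
      int smaller = Math.min(pagesPerHost.get(pair[0]), pagesPerHost.get(pair[1]));
      if ((double) e.getValue() / smaller >= minOverlap) {
        boolean firstWins = hostScores.getOrDefault(pair[0], 0.0)
            >= hostScores.getOrDefault(pair[1], 0.0);
        aliases.put(firstWins ? pair[1] : pair[0], firstWins ? pair[0] : pair[1]);
      }
    }
    return aliases;
  }

  public static void main(String[] args) {
    List<Page> pages = Arrays.asList(
        new Page("www.example.com", "/", "abc"),
        new Page("www1.example.com", "/", "abc"),
        new Page("www.example.com", "/about", "def"),
        new Page("www1.example.com", "/about", "def"));
    Map<String, Double> scores =
        Map.of("www.example.com", 2.0, "www1.example.com", 1.0);
    System.out.println(findMirrors(pages, scores, 0.8));
    // prints {www1.example.com=www.example.com}
  }
}

In Nutch this would of course be a MapReduce job over the segments/crawldb
rather than an in-memory pass, but the grouping and the tie-breaking would
look the same.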



>
>
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
>
>
>


-- 
Doğacan Güney
