Hello,
How are people dealing with avoiding page duplication where URLs are similar
but the content is identical? I know there is page content fingerprinting and
shingling (MD5Signature and TextProfileSignature), but that assumes you have
already fetched the content. I am wondering if it's possible to detect this
earlier than that, even if it's not 100% reliable.
Concretely, imagine the following URLs:
http://example.com
http://www.example.com
http://www1.example.com
http://www2.example.com
They are all, very likely, pointing to the same page. One person may link to
www.example.com and another may link to just example.com, so we end up parsing
two different URLs when ideally we'd want a single URL per page. Similarly,
the example site may have multiple web servers (e.g. for load balancing), each
with a slightly different name (e.g. www1..., www2...), all pointing to the
same site.
What's the best way to treat www1 and www2 as just www?
Are people using regex-normalize.xml for that?
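For instance, I imagine a rule along these lines inside the <regex-normalize>
root of regex-normalize.xml could collapse numbered hosts down to www (an
untested sketch, and it assumes the wwwN hosts really do serve identical
content):

  <regex>
    <!-- Rewrite hosts like www1.example.com or www2.example.com to www.example.com -->
    <pattern>^(https?://)www\d+\.</pattern>
    <substitution>$1www.</substitution>
  </regex>

But that feels brittle, so I'm curious what others do in practice.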
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch