Hello,
How are people dealing with avoiding page duplication where URLs are similar
but the content is identical? I know there is page content fingerprinting and
shingling (MD5Signature and TextProfileSignature), but that assumes you have
already fetched the content. I am wondering if it's possible to detect this
earlier than that, even if it's not 100% reliable.
Concretely, imagine the following URLs:
http://example.com
http://www.example.com
http://www1.example.com
http://www2.example.com
They are all, very likely, pointing to the same page. One person may link to
www.example.com and another may link to just example.com, so we end up parsing
two different URLs when ideally we'd want a single URL per page. Similarly,
the example site may have multiple web servers (e.g. for load balancing), each
with a slightly different name (e.g. www1..., www2...), all pointing to the
same site.
What's the best way to treat www1 and www2 as just www?
Are people using regex-normalize.xml for that?
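For instance, I imagine a rule along these lines inside the <regex-normalize>
root of regex-normalize.xml could collapse numbered hosts down to www (an
untested sketch, and it assumes the wwwN hosts really do serve identical
content):

  <regex>
    <!-- Rewrite hosts like www1.example.com or www2.example.com to www.example.com -->
    <pattern>^(https?://)www\d+\.</pattern>
    <substitution>$1www.</substitution>
  </regex>

But that feels brittle, so I'm curious what others do in practice.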
Thanks,
Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch