At 4:09 pm -0700 4/25/08, [EMAIL PROTECTED] wrote:
Hello,

How are people dealing with avoiding page duplication where URLs are similar
but the content is identical?  I know there is page content fingerprinting and
shingling (MD5Signature and TextProfileSignature), but that assumes you have
already fetched the content.  I am wondering if it's possible to detect duplicates
earlier than that, even if it's not 100% reliable.

Concretely, imagine the following URLs:
    http://example.com
    http://www.example.com
    http://www1.example.com
    http://www2.example.com

They are all, very likely, pointing to the same page.  One person may link to
www.example.com and another may link to just example.com, so we end up parsing
two different URLs when ideally we'd want a single URL per page.  Similarly, the
example site may run multiple web servers (e.g. for load balancing), each with a
slightly different name (e.g. www1..., www2...) but serving the same content.

What's the best way to treat www1 and www2 as just www?
Are people using regex-normalize.xml for that?

We do, for domains that we determine have a significant problem with this kind of multiple-URL identity. So (unfortunately) it's a manual process, applied only to domains that we deep crawl.
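
For what it's worth, here's a minimal sketch of what such rules could look like in regex-normalize.xml. The hostnames and patterns are illustrative (example.com stands in for the real domain), not production-ready rules: one rule collapses numbered front-end hosts to www, the other rewrites the bare host to the canonical www host.

    <regex-normalize>
      <!-- collapse www1.example.com, www2.example.com, ... to www.example.com -->
      <regex>
        <pattern>^(https?://)www\d+\.example\.com(/|$)</pattern>
        <substitution>$1www.example.com$2</substitution>
      </regex>
      <!-- rewrite the bare host example.com to the canonical www host -->
      <regex>
        <pattern>^(https?://)example\.com(/|$)</pattern>
        <substitution>$1www.example.com$2</substitution>
      </regex>
    </regex-normalize>

The idea is that the regex URL normalizer rewrites URLs as they flow into the crawl db, so example.com, www1.example.com and www2.example.com all collapse to a single www.example.com entry. The downside, as noted above, is that someone has to spot the affected domains and add rules like these by hand.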

IIRC, there was also the issue of link scoring, where you wanted to ensure that page A (e.g. normalized at www.example.com) got appropriate OPIC score contributions from links pointing to example.com, www1.example.com, etc. Currently that isn't the case, even if the page similarity calculation later determines that the two pages are the same.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
