At 4:09 pm -0700 4/25/08, [EMAIL PROTECTED] wrote:
Hello,

How are people dealing with avoiding page duplication where URLs are similar
but the content is identical?  I know there is page content fingerprinting and
shingling (MD5Signature and TextProfileSignature), but that assumes you have
already fetched the content.  I am wondering if it's possible to detect duplicates
earlier than that, even if it's not 100% reliable.

Concretely, imagine the following URLs:
    http://example.com
    http://www.example.com
    http://www1.example.com
    http://www2.example.com

They are all, very likely, pointing to the same page.  One person may link to
www.example.com and another may link to just example.com, so we end up parsing
two different URLs when ideally we'd want a single URL per page.  Similarly, the
example site may run multiple web servers (e.g. for load balancing), each with a
slightly different name (e.g. www1..., www2...) but serving the same content.

What's the best way to treat www1 and www2 as just www?
Are people using regex-normalize.xml for that?

We do, for domains that we determine have a significant problem with this kind of multiple-URL identity. So (unfortunately) it's a manual process, applied only to domains that we deep crawl.
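
For what it's worth, here's a minimal sketch of what such rules could look like in regex-normalize.xml. The hostnames and patterns are illustrative (example.com stands in for the real domain), not production-ready rules: one rule collapses numbered front-end hosts to www, the other rewrites the bare host to the canonical www host.

    <regex-normalize>
      <!-- collapse www1.example.com, www2.example.com, ... to www.example.com -->
      <regex>
        <pattern>^(https?://)www\d+\.example\.com(/|$)</pattern>
        <substitution>$1www.example.com$2</substitution>
      </regex>
      <!-- rewrite the bare host example.com to the canonical www host -->
      <regex>
        <pattern>^(https?://)example\.com(/|$)</pattern>
        <substitution>$1www.example.com$2</substitution>
      </regex>
    </regex-normalize>

The idea is that the regex URL normalizer rewrites URLs as they flow into the crawl db, so example.com, www1.example.com and www2.example.com all collapse to a single www.example.com entry. The downside, as noted above, is that someone has to spot the affected domains and add rules like these by hand.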

IIRC, there was also the issue of link scoring, where you wanted to ensure that page A (e.g. normalized at www.example.com) got appropriate OPIC score contributions from links pointing to example.com, www1.example.com, etc. Currently that isn't the case, even if the page similarity calculation later determines that the two pages are the same.

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-210-6378
"If you can't find it, you can't fix it"
