I think it's safe to strip anchors, as they simply point to a different portion 
of the same page for browser rendering.  I do that for Simpy while normalizing 
URLs, in order not to have duplicates like this.
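
A rough sketch of that kind of normalization, in Python rather than anything Simpy or Nutch actually runs. The helper names strip_fragment and dedupe are made up for illustration; urldefrag is the standard library call that splits off the #fragment part.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return the URL without its #fragment part (hypothetical helper)."""
    return urldefrag(url).url


def dedupe(urls):
    """Keep the first occurrence of each URL after stripping fragments."""
    seen = set()
    unique = []
    for url in urls:
        key = strip_fragment(url)
        if key not in seen:
            seen.add(key)
            unique.append(key)
    return unique
```

With this, the two example URLs from the original message collapse into a single entry.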

Otis

----- Original Message ----
From: Ken Krugler <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thu 05 Jan 2006 04:40:07 PM EST
Subject: Normalizing URLs with anchors

Hi all,

The default regex-normalize.xml currently strips out PHP session ids.

I'm wondering whether it would also make sense to remove anchors (the 
#fragment part) from URLs. For example, currently these two URLs are 
treated as different:

http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex

and

http://www.dina.kvl.dk/~sestoft/gcsharp/index.html

Is it safe to always strip # followed by (valid anchor characters) at 
the end of a URL?
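
To make the question concrete, here is a minimal sketch of such a rule as a regular expression, written in Python for illustration only. It is not the contents of regex-normalize.xml; the pattern and the name normalize are assumptions. It treats everything from the first # to the end of the string as the anchor, which is how the fragment is delimited in a URL.

```python
import re

# Assumed pattern: the first '#' and everything after it, up to the end.
FRAGMENT = re.compile(r"#.*$")


def normalize(url: str) -> str:
    """Strip a trailing #anchor, leaving path and query untouched."""
    return FRAGMENT.sub("", url)
```

A URL without an anchor passes through unchanged, and a query string before the # is preserved.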

Thanks,

-- Ken
-- 
Ken Krugler
Krugle, Inc.
+1 530-470-9200




_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
