I think it's safe to strip anchors. The part after the # only tells the browser which portion of the page to scroll to; it is not sent to the server, so both URLs fetch the same document. I do that for Simpy when normalizing URLs, so that I don't end up with duplicates like these.
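
For illustration only, here is a minimal sketch of that normalization step in Python. It is not the code Simpy or Nutch uses; `strip_anchor` is a hypothetical helper name, and it relies on the standard library's `urllib.parse.urldefrag` to drop the fragment.

```python
from urllib.parse import urldefrag


def strip_anchor(url: str) -> str:
    """Return the URL without its fragment (the '#...' part), if any.

    Hypothetical helper for illustration; not taken from Simpy or Nutch.
    """
    # urldefrag splits a URL into (url_without_fragment, fragment).
    return urldefrag(url).url


# The two URLs from the original question.
urls = [
    "http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex",
    "http://www.dina.kvl.dk/~sestoft/gcsharp/index.html",
]

# After stripping anchors, both collapse to a single normalized URL.
unique = {strip_anchor(u) for u in urls}
```

Using a URL parser rather than a hand-written pattern leaves the query string alone and only removes what follows the #.
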
Otis

----- Original Message ----
From: Ken Krugler <[EMAIL PROTECTED]>
To: [email protected]
Sent: Thu 05 Jan 2006 04:40:07 PM EST
Subject: Normalizing URLs with anchors

Hi all,

The default regex-normalize.xml currently strips out PHP session ids.

I'm wondering whether it would also make sense to remove anchor text from URLs. For example, currently these two URLs are treated as different:

http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex

and

http://www.dina.kvl.dk/~sestoft/gcsharp/index.html

Is it safe to always strip # followed by (valid anchor characters) at the end of a URL?

Thanks,

-- Ken
--
Ken Krugler
Krugle, Inc.
+1 530-470-9200
