Ken Krugler wrote:
I'm wondering whether it would also make sense to remove anchor text from URLs. For example, currently these two URLs are treated as different:

http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex

and

http://www.dina.kvl.dk/~sestoft/gcsharp/index.html

Is it safe to always strip # followed by (valid anchor characters) at the end of a URL?

Yes, I think so.  Please submit a patch.

Are there other common session ids that we should remove in this file?

Doug
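
For illustration only, here is a minimal sketch of the fragment stripping discussed above. It is written in Python for brevity and is not the Nutch patch itself. The helper name strip_fragment is hypothetical. The sketch relies on the fact that the part of a URL after "#" is resolved by the client and is not sent to the server, so both example URLs fetch the same document.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return the URL without its '#fragment' part.

    Hypothetical helper for illustration; not taken from Nutch.
    """
    # urldefrag splits the URL at the fragment delimiter and returns
    # a (url, fragment) pair; only the url part is kept here.
    return urldefrag(url).url


# The two URLs from the message above.
with_anchor = "http://www.dina.kvl.dk/~sestoft/gcsharp/index.html#wordindex"
without_anchor = "http://www.dina.kvl.dk/~sestoft/gcsharp/index.html"

# After stripping, both normalize to the same string, so a crawler
# comparing normalized URLs would treat them as one page.
print(strip_fragment(with_anchor) == strip_fragment(without_anchor))
```

A URL with no fragment passes through unchanged, so the helper can be applied to every URL without checking for "#" first.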


_______________________________________________
Nutch-developers mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/nutch-developers
