Re: How to treat # in URLs?

Carl Cerecke Wed, 15 Aug 2007 14:33:59 -0700

Doh! I had copied the plugin.includes property from nutch-default.xml to nutch-site.xml and somehow accidentally inserted a newline so it looked like [...]urlno

rmalizer[...]

Oops,
Carl.
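
For anyone hitting the same thing, here is a minimal sketch of why a stray newline in that property is enough to switch the normalizers off. It assumes the plugin.includes value is used as a regular expression that must match a plugin id in full; the two pattern strings are shortened, hypothetical stand-ins and not the real default value, and is_included is an illustrative helper, not a Nutch function.

```python
import re

# Hypothetical, shortened stand-in for a plugin.includes value.
INTACT = "protocol-http|urlnormalizer-(pass|regex|basic)"
# The same value with a newline accidentally inserted mid-word.
BROKEN = "protocol-http|urlno\nrmalizer-(pass|regex|basic)"


def is_included(pattern: str, plugin_id: str) -> bool:
    """Return True if the whole plugin id matches the include pattern."""
    return re.fullmatch(pattern, plugin_id) is not None


print(is_included(INTACT, "urlnormalizer-basic"))  # True
print(is_included(BROKEN, "urlnormalizer-basic"))  # False
print(is_included(BROKEN, "protocol-http"))        # True
```

The other alternatives in the pattern keep matching, so nothing fails loudly; only the plugin whose name was split stops being selected.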


Enis Soztutar wrote:

Technically, the fragment is part of the URL, but foo and foo#bar point to the same location, so it should be stripped out. Are you using url-normalizers? If not, could you please try them?
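
The effect of stripping the fragment can be sketched outside Nutch. This is only an illustration using Python's standard library, not the code the url-normalizer plugins run; strip_fragment is a hypothetical helper name, and the two URLs are the ones from the readdb output quoted below.

```python
from urllib.parse import urldefrag


def strip_fragment(url: str) -> str:
    """Return the URL without its #fragment part."""
    return urldefrag(url).url


urls = [
    "http://127.0.0.1:8000/about.html",
    "http://127.0.0.1:8000/about.html#top",
]

# After normalization both entries collapse into a single URL.
unique = sorted({strip_fragment(u) for u in urls})
print(unique)  # ['http://127.0.0.1:8000/about.html']
```

With the fragment removed before the URL is stored, the two entries become one key, so the page is not recorded twice.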
Carl Cerecke wrote:
Hi,
I noticed that URLs with a # in them are not handled any differently to normal URLs. See output of readdb:
http://127.0.0.1:8000/about.html        Version: 5
Status: 2 (db_fetched)
Fetch time: Thu Sep 13 14:41:55 NZST 2007
Modified time: Thu Jan 01 12:00:00 NZST 1970
Retries since fetch: 0
Retry interval: 2592000.0 seconds (30.0 days)
Score: 4.0
Signature: c79a4a20d6a19603120d1fdbaf19b0eb
Metadata: _pst_:success(1), lastModified=0

http://127.0.0.1:8000/about.html#top    Version: 5
Status: 2 (db_fetched)
Fetch time: Thu Sep 13 14:42:03 NZST 2007
Modified time: Thu Jan 01 12:00:00 NZST 1970
Retries since fetch: 0
Retry interval: 2592000.0 seconds (30.0 days)
Score: 4.0
Signature: c79a4a20d6a19603120d1fdbaf19b0eb
Metadata: _pst_:success(1), lastModified=0
I would have expected that, when doing an updatedb, the #foobar part of the URL would be stripped.
Is there a sensible reason for the current behaviour? Or have I found a bug?
Cheers,
Carl.
