Lucifersam wrote:
Andrzej Bialecki wrote:
Lucifersam wrote:
Finally - I seem to have a problem with identical pages with different URLs, i.e.

http://website/
http://website/default.htm

I was under the impression that these would be removed by the dedup process, but this does not seem to be working. Is there something I'm missing?
Most likely the pages are slightly different - you can save them to files, and then run a diff utility to check for differences.


You're right, there was a small difference in the HTML concerning a timing comment, e.g.:

<!--Exec time = 265.625-->

As this is not strictly content - is there a simple way to ignore anything within comments when looking at the content of a page?


You can provide your own implementation of a Signature - please see the javadocs for this class - and then set this class in nutch-site.xml.
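
For example, something along these lines (a rough sketch only - the exact Signature API and the config property name may differ between Nutch versions, so do check the javadocs; the package, class name and property name below are just placeholders):

  package org.example.nutch;

  import java.security.MessageDigest;

  import org.apache.nutch.crawl.Signature;
  import org.apache.nutch.parse.Parse;
  import org.apache.nutch.protocol.Content;

  // Hashes only the parsed plain text, so raw-HTML differences such as
  // <!--Exec time = 265.625--> comments do not change the signature.
  public class PlainTextSignature extends Signature {

    public byte[] calculate(Content content, Parse parse) {
      try {
        String text = parse.getText();      // extracted text, not raw HTML
        if (text == null) text = "";
        return MessageDigest.getInstance("MD5").digest(text.getBytes("UTF-8"));
      } catch (Exception e) {
        // fall back to an empty signature if hashing fails
        return new byte[0];
      }
    }
  }

and then point nutch-site.xml at it (assuming the property is called db.signature.class):

  <property>
    <name>db.signature.class</name>
    <value>org.example.nutch.PlainTextSignature</value>
  </property>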

A common trick is to use just the plain text version of the page, and further "normalize" it by replacing all whitespace with single spaces, bringing all tokens to lowercase, optionally filtering out all digits, and also optionally removing all words that occur only once.
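
In code that trick might look roughly like this (again just a sketch, not something out of the Nutch sources; the helper name is made up):

  import java.security.MessageDigest;
  import java.util.LinkedHashMap;
  import java.util.Map;

  public class NormalizedTextHash {

    public static byte[] hash(String text) throws Exception {
      String cleaned = text.toLowerCase()
          .replaceAll("[0-9]", "")        // optionally drop digits
          .replaceAll("\\s+", " ")        // collapse whitespace to single spaces
          .trim();

      // Count token frequencies, preserving first-seen order.
      Map<String, Integer> counts = new LinkedHashMap<String, Integer>();
      for (String token : cleaned.split(" ")) {
        if (token.length() == 0) continue;
        Integer c = counts.get(token);
        counts.put(token, c == null ? 1 : c + 1);
      }

      // Optionally keep only tokens that occur more than once.
      StringBuilder profile = new StringBuilder();
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        if (e.getValue() > 1) {
          profile.append(e.getKey()).append(' ');
        }
      }

      return MessageDigest.getInstance("MD5")
          .digest(profile.toString().getBytes("UTF-8"));
    }
  }

The MD5 of that "profile" is then what you would return from calculate() instead of hashing the raw text.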

--
Best regards,
Andrzej Bialecki     <><
___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

