Hi:
I'm running Nutch build #722 on a Macintosh under Mac OS X 10.4.11
and am indexing about 10,000 URLs from a single site. All is well,
except that some files are being indexed twice.
For example:
http://www.newsinc.net/morgue/2003/ni031110.html
and
http://www.newsinc.net/morgue/2003/NI031110.html
Because the web server is also a Mac-based system, these are the same
file from the viewpoint of Apache (and of the file system, which
ignores case). Nutch, however, sees them as two different files and
indexes them twice, so search results present both URLs.
Ideally, there would be a parameter somewhere that I can change to
make URLs case-insensitive. I have Googled Nutch URL normalization,
but the postings I found seem to deal with other issues, such as
http://my.domain.com:80/ vs. http://my.domain.com/ ...
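
To make the request concrete, here is a minimal sketch of the kind of
normalization I have in mind. It is plain Python, not Nutch code, and
the function name normalize_case is made up for illustration. It
assumes that lowercasing the whole path is safe, which holds only
when the server's file system ignores case.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize_case(url: str) -> str:
    """Lowercase scheme, host and path; leave query and fragment as-is.

    Hypothetical helper: only safe when the server treats paths as
    case-insensitive.
    """
    parts = urlsplit(url)
    return urlunsplit((
        parts.scheme.lower(),
        parts.netloc.lower(),
        parts.path.lower(),
        parts.query,
        parts.fragment,
    ))


urls = [
    "http://www.newsinc.net/morgue/2003/ni031110.html",
    "http://www.newsinc.net/morgue/2003/NI031110.html",
]

# Both spellings collapse to one entry after normalization.
unique = sorted({normalize_case(u) for u in urls})
print(unique)
```

With something like this applied before a URL is stored, the two
addresses above would be treated as one document.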
Any thoughts about how to resolve this (admittedly minor) problem
would be appreciated.
Thanks.
\dmc
--
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
David M. Cole d...@colegroup.com
Editor & Publisher, NewsInc. <http://newsinc.net> V: (650) 557-2993
Consultant: The Cole Group <http://colegroup.com/> F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+