Hi:

I'm running Nutch build #722 on a Macintosh under Mac OS X 10.4.11, indexing about 10,000 URLs from a single site. All is well, except that some files are being indexed twice.

For example:

http://www.newsinc.net/morgue/2003/ni031110.html

and

http://www.newsinc.net/morgue/2003/NI031110.html

Because the web server is also a Mac, whose file system ignores case, Apache treats these as the same file. Nutch sees them as two different URLs and indexes the file twice, so search results present both URLs.

Ideally, there is a parameter somewhere that I can change to make URLs case-insensitive. I have Googled "Nutch URL normalization", but those postings seem to deal with issues such as http://my.domain.com:80/ vs. http://my.domain.com/ ...
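
To make the request concrete, here is a minimal sketch of the kind of case folding I have in mind. It is plain Java and uses no Nutch API; the class name CaseFoldingUrl and the method fold are hypothetical, and it assumes absolute http URLs with no user-info part. It lowercases the scheme, host, and path, leaves the query string alone, and drops any fragment.

```java
import java.net.URI;
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Nutch.
final class CaseFoldingUrl {

    // Lowercases the scheme, host, and path of an absolute URL.
    // The query string is kept as-is; user-info and fragment are dropped.
    static String fold(String url) {
        URI u = URI.create(url);
        StringBuilder sb = new StringBuilder();
        sb.append(u.getScheme().toLowerCase(Locale.ROOT)).append("://");
        sb.append(u.getHost().toLowerCase(Locale.ROOT));
        if (u.getPort() != -1) {
            sb.append(':').append(u.getPort());
        }
        if (u.getRawPath() != null) {
            sb.append(u.getRawPath().toLowerCase(Locale.ROOT));
        }
        if (u.getRawQuery() != null) {
            sb.append('?').append(u.getRawQuery());
        }
        return sb.toString();
    }
}
```

With this, fold() returns the same string for both of the URLs above, so they would count as one document. Folding the path is only safe here because my server ignores case; on a case-sensitive server it could merge genuinely different files.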

Any thoughts about how to resolve this (admittedly minor) problem would be appreciated.

Thanks.

\dmc

--
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
   David M. Cole                                            d...@colegroup.com
   Editor & Publisher, NewsInc. <http://newsinc.net>        V: (650) 557-2993
   Consultant: The Cole Group <http://colegroup.com/>       F: (650) 475-8479
*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+*+
