>What C library are you using? Your message above implies libc5
>on Linux. I'm using Red Hat Linux 4.2 on my web server, which comes
>with libc 5.3.12. I've tried all sorts of things, and I've come to the
>conclusion that locale support in this C library is hopelessly broken.
>I could not get it to work despite all my attempts.
The web server where I run ht://Dig runs Slackware (2.0.35) with glibc1
(libc5).
>On the other hand, with Red Hat Linux 5.2, which uses glibc, locales
>seem to work without any difficulties at all.
I am now going to try ht://Dig on another Linux machine running Red Hat 6.0,
where locales seem to work. If it works, I can try to move to a more
recent platform, but the problem with broken locales is still there ...
>I've thought of how ht://Dig could be fixed to work with broken locales.
>The extra_word_characters attribute is a good first step. If you add
>all the accented characters to this, they'll get indexed.
Thanks a lot for this hint ... At least, now I get them indexed.
>The problem
>is ht://Dig won't know how to convert them from uppercase to lowercase,
>or vice versa. I've thought of adding extra_word_casemap as a means
>of specifying these mappings. In this way, the HtWordType functions
>would supplement all the ctype stuff, in a way that's user configurable.
>It's a shame that we'd need to resort to this, because this is exactly
>what the locale stuff is supposed to do for us, but with so many broken
>locales out there, I think there's a need for this.
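To make the extra_word_casemap idea above concrete, here is a minimal sketch in Python. The attribute does not exist yet, so the format used here (whitespace-separated pairs, uppercase letter first, lowercase second) and the names build_casemaps, ascii_lower and word_lower are all assumptions made for illustration. It is not how HtWordType works today.

```python
import unicodedata

# Hypothetical value for the proposed extra_word_casemap attribute:
# whitespace-separated pairs, uppercase letter first, lowercase second.
EXTRA_WORD_CASEMAP = "Àà Èè Éé Ìì Òò Ùù"


def build_casemaps(spec):
    """Turn the pair list into two lookup tables (upper->lower, lower->upper)."""
    to_lower, to_upper = {}, {}
    # Normalise to precomposed form so each letter is exactly one character.
    for pair in unicodedata.normalize("NFC", spec).split():
        if len(pair) != 2:
            raise ValueError(f"expected a 2-character pair, got {pair!r}")
        upper, lower = pair
        to_lower[upper] = lower
        to_upper[lower] = upper
    return to_lower, to_upper


def ascii_lower(ch):
    """Stand-in for tolower() under a broken locale: only A-Z are converted."""
    return chr(ord(ch) + 32) if "A" <= ch <= "Z" else ch


def word_lower(word, to_lower):
    """Lowercase a word, consulting the user-supplied map before the ASCII rule."""
    return "".join(to_lower.get(ch, ascii_lower(ch)) for ch in word)


if __name__ == "__main__":
    to_lower, _ = build_casemaps(EXTRA_WORD_CASEMAP)
    print(word_lower("CITTÀ", to_lower))
```

The point of the design is that the user-supplied map only supplements the ctype-style conversion: characters missing from the map still go through the plain ASCII rule.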
I am not very much into fuzzy algorithms (obviously, it goes without saying
... ;-) ), but would it be a problem to map a single char to a string of
2 chars? Let me try to explain better ...
If I want to search for a word ending in 'è', I also have to search for "e'",
which is 2 chars long. And so:
'è' <-> "e'"
'È' <-> "E'"
And vice versa ... On the other hand, I don't think we need this conversion in
the middle of a word (or rather, that is how we usually do it in Italian).
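The word-final mapping described above could be sketched roughly as follows (Python, just to show the idea). The table FINAL_PAIRS and the function final_variants are hypothetical names, and only the accented e / e-plus-apostrophe pairs from this message are included. Other vowels would have to be added by hand.

```python
# Word-final equivalences: accented letter <-> base letter plus apostrophe.
# Escapes are used so the keys are unambiguous single characters.
FINAL_PAIRS = {
    "\u00e8": "e'",  # lowercase e with grave accent
    "\u00c8": "E'",  # uppercase E with grave accent
}


def final_variants(word):
    """Return the word plus its alternative spelling, if it has one.

    Only the END of the word is examined; accented letters in the
    middle of a word are left alone.
    """
    variants = [word]
    for accented, ascii_form in FINAL_PAIRS.items():
        if word.endswith(accented):
            variants.append(word[: -len(accented)] + ascii_form)
        elif word.endswith(ascii_form):
            variants.append(word[: -len(ascii_form)] + accented)
    return variants


if __name__ == "__main__":
    for w in ("caff\u00e8", "caffe'", "casa"):
        print(w, "->", final_variants(w))
```

A search front end could then look up every variant returned for a query word, so that either spelling finds the other.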
>As for mapping accented to unaccented letters, as Geoff said, this has
>been discussed to some length about a week or so ago. My suggestion
>was to implement it something like soundex, where it will go through the
>word database after htdig/htmerge, and create another database keyed on
>the canonical (unindexed) form of all of these words. This algorithm
>could be configured either through a file, or perhaps better still,
>a config attribute (which could be taken from a file if desired) such
>as accent_map. This map would allow you to specify precisely how to map
>various accented letters or digraphs to certain canonical representations.
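A rough sketch of the soundex-like approach quoted above, in Python. The accent_map attribute is only a proposal, so the dictionary format, the sample entries (including the "ue" digraph) and the names canonical, build_canonical_index and expand are assumptions for illustration, not ht://Dig code. Words are assumed to be lowercased already.

```python
from collections import defaultdict

# Hypothetical accent_map: accented letters or digraphs -> canonical form.
ACCENT_MAP = {
    "\u00e0": "a",  # a with grave
    "\u00e8": "e",  # e with grave
    "\u00e9": "e",  # e with acute
    "\u00fc": "u",  # u with diaeresis
    "ue": "u",      # digraph treated like u with diaeresis (illustrative only)
}


def canonical(word, accent_map):
    """Rewrite a word into canonical form, trying the longest keys first."""
    keys = sorted(accent_map, key=len, reverse=True)
    out, i = [], 0
    while i < len(word):
        for key in keys:
            if word.startswith(key, i):
                out.append(accent_map[key])
                i += len(key)
                break
        else:
            # No map entry starts here: copy the character unchanged.
            out.append(word[i])
            i += 1
    return "".join(out)


def build_canonical_index(words, accent_map):
    """Second database: canonical form -> set of words as they were indexed."""
    index = defaultdict(set)
    for w in words:
        index[canonical(w, accent_map)].add(w)
    return index


def expand(query, index, accent_map):
    """Return every indexed word sharing the query's canonical form."""
    return sorted(index.get(canonical(query, accent_map), {query}))


if __name__ == "__main__":
    words = ["caff\u00e8", "caffe", "m\u00fcller", "mueller"]
    index = build_canonical_index(words, ACCENT_MAP)
    print(expand("caffe", index, ACCENT_MAP))
```

As with soundex, the extra database would be built in a pass over the word database after htdig/htmerge, and a query is expanded to all words with the same canonical key before searching.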
I wish I could contribute to this, but right now I am too busy and,
moreover, as soon as I can start contributing again, I have to set up the
HtHTTP and Transport classes and the retrieving code ... Shame on me !!! I
am also waiting for 2 big C++ books and, with the new year coming, I want to
dedicate more time to studying C++ and OO programming (I have been idle for a
long time ...).
And, as if that were not enough, I also want to spend time with my girlfriend:
HEY, she's American, 21 and really BEAUTIFUL !!! Will you ever forgive me if,
for now, I can't dedicate much spare time to programming? ;-@ - Just kidding ...
But if you have some guidelines, I will be very glad to help anybody who
wants to do that ...
Ciao
-Gabriele
-------------------------------------------------
Gabriele Bartolini
Computer Programmer (are U sure?)
U.O. Rete Civica - Comune di Prato
Prato - Italia - Europa
e-mail: [EMAIL PROTECTED]
http://www.po-net.prato.it
"Life teaches you never stop learning ..."
-------------------------------------------------