According to Roman Maeder:
> [EMAIL PROTECTED] said:
> > After some code-walking -thanks to open source!- I recognized that a
> > "locale: de_DE.ISO8859-1"  in the htdig.conf file will help. I was
> > just fondling around with the LC_* environment vars before. 

You shouldn't have to resort to walking through the code for this.
There is documentation:

http://www.htdig.org/attrs.html#locale
http://www.htdig.org/FAQ.html#q5.8
http://www.htdig.org/FAQ.html#q4.10

> > I suggest for upcoming htdig versions to introduce a
> > "setlocale(LC_ALL, "");" in the beginning of the htdig/htsearch mains,
> > since this would set the program locale according to the env vars ;-) 
> 
> hmm, this would make it hard to ensure that the environment used
> for digging (probably under cron) and for searching (under your http server)
> are the same. The config file seems to me a better place to specify
> collation and character class info.

I think what Matthias was suggesting was to use the environment variables
initially, so that it would set the default locale, which could then be
overridden with the config attribute.  That's how I interpreted it, as
he never said anything about removing support for the locale attribute.
This way, you can use either technique, which I think is a good idea.
In htdig, though, the code would have to set LC_TIME back to "C" after
setting LC_ALL, so that If-Modified-Since headers come out in the standard
format and not a locale-dependent one.
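
As a rough sketch of that ordering (environment first, then the config
attribute, then LC_TIME forced back to "C"), something along these lines
would do.  This is not the actual htdig code: init_locale() and
http_date() are hypothetical names, and config_locale just stands in for
whatever value the locale attribute holds (NULL or "" if it isn't set).

```c
#include <assert.h>
#include <locale.h>
#include <stdio.h>
#include <string.h>
#include <time.h>

/* Hypothetical helper, not taken from the htdig sources.
 * config_locale stands in for the value of the "locale" attribute;
 * pass NULL or "" when the attribute is not set. */
void init_locale(const char *config_locale)
{
    /* Start from the environment (LC_ALL, LC_*, LANG). */
    setlocale(LC_ALL, "");

    /* Let the config attribute override the environment default. */
    if (config_locale != NULL && config_locale[0] != '\0') {
        if (setlocale(LC_ALL, config_locale) == NULL)
            fprintf(stderr, "locale '%s' is not available\n", config_locale);
    }

    /* Keep date formatting locale-independent for HTTP headers. */
    setlocale(LC_TIME, "C");
}

/* Format a timestamp in the fixed English form used by HTTP date
 * headers such as If-Modified-Since.  Returns the number of characters
 * written, or 0 on failure. */
size_t http_date(time_t t, char *buf, size_t len)
{
    struct tm *gmt;

    assert(buf != NULL && len > 0);
    gmt = gmtime(&t);
    if (gmt == NULL)
        return 0;
    return strftime(buf, len, "%a, %d %b %Y %H:%M:%S GMT", gmt);
}
```

Calling init_locale(NULL) leaves everything but LC_TIME under the control
of the environment, while init_locale("de_DE.ISO8859-1") would behave like
the current config attribute.  Either way strftime() keeps producing
English day and month names.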

> Also, for efficiency, you probably always want to use the "C" locale
> for collation. It doesn't really matter what collation sequence you
> use as long as the one used for building the index is the same as the
> one used for searching it.

For htmerge, it's indeed very important to set LC_COLLATE to C if
you're in a different locale, provided your system "sort" program is
locale-aware.  Versions 3.1.5 and older of htmerge have a problem handling
accented characters otherwise.  In 3.1.6, htmerge is fixed not to lose
words in the word database when the wordlist is sorted according to a
different locale, but it will run slower and produce a bigger database
than if you sort using the C locale.  The rundig script in 3.1.6 sets
LC_COLLATE correctly, but if you run htmerge from other scripts, you
should take care to do likewise.

Actually, the same goes for a lot of shell scripts that use the sort
program, either directly or indirectly.  I'm in the process of migrating
some stuff from Red Hat 4.2 to 7.2, and many of my shell scripts are
breaking because Red Hat's sort program has been locale-aware since 6.x.
I still contend that making sort be locale-aware by default was a really
bad design decision for this very reason.  This feature should have been
enabled by a command-line option, somewhat akin to the -f option, because
conceptually the two are quite similar (accented characters are "folded"
into non-accented counterparts).
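
To illustrate what LC_COLLATE changes, here is a small C sketch that
sorts a word list with strcoll(), which follows whatever collation locale
is in effect.  It only approximates what a locale-aware sort program
does, and sort_words() is a made-up name for the example.  In the "C"
locale strcoll() orders strings by byte value, the same as strcmp().

```c
#include <assert.h>
#include <locale.h>
#include <stdlib.h>
#include <string.h>

/* Compare two strings using the current LC_COLLATE setting. */
int cmp_locale(const void *a, const void *b)
{
    return strcoll(*(const char *const *)a, *(const char *const *)b);
}

/* Hypothetical helper: sort n words under the given collation locale.
 * Returns 0 on success, -1 if that locale is not available. */
int sort_words(const char **words, size_t n, const char *collate_locale)
{
    assert(words != NULL);
    if (setlocale(LC_COLLATE, collate_locale) == NULL)
        return -1;
    qsort(words, n, sizeof *words, cmp_locale);
    return 0;
}
```

With the list "zebra", "\xe9t\xe9" (ISO 8859-1 for "été") and "apple",
sort_words(list, 3, "C") puts the accented word last, because byte 0xE9
is greater than 'z'.  Under a locale such as de_DE.ISO8859-1, if it is
installed, the accented word would typically sort next to its unaccented
form instead, which is the kind of reordering that trips up scripts
expecting byte order.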

The 3.2 versions of htdig will be immune to LC_COLLATE changes, as they
don't use an external sort program.  The DB package doesn't use LC_COLLATE.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

