Title: RE: [htdig] Accent problem.

According to Gilles Detillieux:

> According to "NEPOTE Charles (Neuilly Gestion)":
> > I am trying to solve some problems in ht://Dig 3.1.5.
> >
> > I tested and reproduced the following:
> >
> > If:
> >  -- more than one HTML file contains both words "tué" and "tue" in the
> >     same file;
> >  -- or an HTML file contains the word "tue" and the HTML file that refers
> >     to it contains the word "tué" (or the reverse case)
> >     [example: d0.htm containing "<a href="d1.htm">UN HOMME TUE</a>" and
> >     d1.htm containing "tué"]
> >
> > then a search for "tué" or a search for "tue" will only find the last file
> > indexed which contains both "tué" and "tue".
> >
> > In the file db.wordlist we can see, for example:
> > tue    i:0 [...]
> > tue    i:1 [...]
> > tué    i:1 [...]
> > tue    i:2 [...]
> > tué    i:2 [...]
> >
> > (only the file which corresponds to "i:2" will be found).
> >
> > Can this be solved?
> > (Note I have in htdig.conf :
> > locale: fr_FR
> > )
>
> Yes, the locale seems to be working fine, as accented letters are taken as
> part of the words in the word list.  I assume the entries from db.wordlist
> are as you find them after running htmerge.


Yes.


> It's odd, but the sort seems to treat accented and unaccented letters as
> equivalent, and I wonder if that isn't throwing off htmerge's creation of
> the db.words.db database.  Otherwise, it seems all the "tué" entries should
> come after the "tue" entries.


Yes, that's it.



> Either that, or the latter database is corrupted, and so isn't working right.
>
> Does the problem persist even after you regenerate the database from
> scratch?  (htdig -i; htmerge)


I ran a careful test with only 7 documents, always regenerating the database from scratch to rule out any corruption of the database. To do so I used:

time rundig -v -s -a -c /etc/htdig/htdig.essai.conf|tee /var/lib/htdig/essai1.txt

and I checked the process each time. The rundig script is the original, unmodified script.
I am quite sure the database is not corrupted.
So it should be a sorting problem...
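
To make the suspected mechanism concrete, here is a minimal sketch in Python. It is not ht://Dig code: `fold_accents` and `naive_merge` are hypothetical helpers, and the assumption that the merge step expects each word to form one contiguous run is only a guess that happens to reproduce the symptom. The (word, document) pairs mirror the db.wordlist excerpt above.

```python
import unicodedata

TUE = "tue"
TUE_ACCENT = "tu\u00e9"  # "tue" with an acute accent on the final e


def fold_accents(word):
    """Strip combining marks so the accented word compares equal to 'tue'."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def naive_merge(sorted_entries):
    """Hypothetical merge (an assumption, not htmerge's real logic).

    It expects all entries for a word to be adjacent: each time the word
    changes it starts a fresh record, overwriting any earlier record.
    """
    db = {}
    previous = None
    for word, doc in sorted_entries:
        if word != previous:
            db[word] = []  # an earlier run for this word is lost here
            previous = word
        db[word].append(doc)
    return db


# (word, document id) pairs as in the db.wordlist excerpt
entries = [(TUE, 0), (TUE, 1), (TUE_ACCENT, 1), (TUE, 2), (TUE_ACCENT, 2)]

# Sort that treats accented and unaccented letters as equivalent:
# the two words interleave by document id.
folded = sorted(entries, key=lambda e: (fold_accents(e[0]), e[1]))

# Plain code-point sort: every unaccented entry precedes the accented ones.
plain = sorted(entries)

print(naive_merge(folded))  # each word keeps only document 2
print(naive_merge(plain))   # each word keeps all of its documents
```

With the accent-folding order, both words end up pointing only at document 2, which is the behaviour I see; with the plain order, nothing is lost.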


My config:
Pentium Pro 200
Linux Mandrake 7.0; automatic install in French.
(As I am a Linux newbie, I don't know what information would help you. One thing I am quite sure of is that I didn't make many changes to the original config. In particular, I didn't make any "locale" changes (I don't know how to do it!).)

ht://Dig 3.1.5 installed via an RPM made specially for Mandrake 7.0 by MandrakeSoft, downloaded from:
ftp://ftp.ciril.fr/pub/linux/mandrake-devel/contrib/RPMS/htdig-3.1.5-2mdk.i586.rpm
(note that ftp.ciril.fr is an official mirror for MandrakeSoft).
I did a normal install of the RPM without changing anything but the htdig.conf file:
 -- I added locale: fr_FR
 -- I modified other attributes which do not deal with the locale problem.


> You may also want to try setting your LOCALE environment variable to
> something other than fr_FR (e.g. en_US), so that the sort will not do any
> accent folding, if indeed that is the problem.


Strangely, when I put locale: en_US in htdig.essai.conf, the result is the same!
And accented chars are still in db.wordlist, in the same order as before...
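
To compare word lists between runs without reading them by eye, a small checker like the following could help. It is only a sketch: `split_runs` is a hypothetical helper, and it assumes each line starts with the word followed by whitespace, as in the excerpt above.

```python
def split_runs(lines):
    """Return the words whose entries appear in more than one contiguous run."""
    seen = set()
    broken = []
    previous = None
    for line in lines:
        if not line.strip():
            continue  # skip blank lines
        word = line.split()[0]
        if word != previous:
            # The word starts a new run; if it was seen before, its
            # entries are not contiguous in this list.
            if word in seen and word not in broken:
                broken.append(word)
            seen.add(word)
            previous = word
    return broken


# Same order as the db.wordlist excerpt ("tu\u00e9" is the accented word)
sample = [
    "tue\ti:0",
    "tue\ti:1",
    "tu\u00e9\ti:1",
    "tue\ti:2",
    "tu\u00e9\ti:2",
]

print(split_runs(sample))          # both words are split into several runs
print(split_runs(sorted(sample)))  # a plain code-point sort leaves none split
```

For a real file, the lines could be read with something like open(path, encoding="latin-1"); both the path and the encoding are assumptions to adjust for the actual installation.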


 
> > <cultural parenthesis>
> > At the beginning of automatic typewriters (first half of the century),
> > there were no accented uppercase letters such as ÉÈ (the machines were
> > Anglo-Saxon) and so the use of accented lowercase disappeared from common
> > usage: nowadays, many teachers in France teach that "there is never an
> > accent on a lowercase". (In fact there is accented lowercase in all
> > newspapers and books printed by professionals, who know the rule that there
> > must be accented lowercase -- there has been accented lowercase in France
> > since the beginning of printing.)
> > This is a problem, as accents carry meaning:
> > "un homme tué" means "a man killed"
> > "un homme tue" means "a man kills".
> > How should one understand "UN HOMME TUE" if there is no accented lowercase?
> > </cultural parenthesis>.
>
> I believe you mean uppercase where you say lowercase.  Uppercase letters
> are capitals (majuscules), while lowercase letters are small (minuscules).


Ooops, yes ! Sorry.


> Some French teachers in Canada also taught not to put accents on capitals,
> but it didn't really catch on.  I never realized that convention came about
> just because of the difficulty of using accents on typewriters.


Current machines still work against cultural diversity: there is no easy way to type accented UPPERCASE letters on our French (and probably even Quebec) keyboards. (You have to remember Alt+0201 for an É...)

 
> --
> Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
> Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
> Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
>
