KH> Aleksey,
AC>> Before looking up a word in the .index file, dictd converts it to
AC>> lower case and removes non-alphanumeric characters from it (unless
AC>> 00-database-allchars is found, of course). This is necessary to
AC>> ignore non-alphanumeric characters in the search and to make the
AC>> search case-insensitive. 'dictfmt' builds the .index file the same
AC>> way. This is why 00-database-short could not be found in your
AC>> databases.
KH> I am curious as to how this is special for utf8. Don't the same
KH> requirements of a case-insensitive search apply to non-utf8? So, why
KH> then is the full 00-database-short allowed in a non-utf8 index even
KH> when 00-database-allchars is omitted?
'sort -df -k 1,3' is used to sort ASCII dictionaries.
This lets us keep non-alphanumeric characters in the .index file,
and all characters stay in their original case.
Some info from the sort manual:
  -d, --dictionary-order
      consider only blanks and alphanumeric characters
  -f, --ignore-case
      fold lower case to upper case characters
dictd, in turn, uses a matching comparison function;
see index.c:compare_alnumspace for details.
This is how dict/dictfmt was designed by Rick.
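To illustrate the idea, here is a rough sketch of such a comparison. It is not the code from index.c: the names is_significant and compare_dict_order are made up for this example, and it handles plain ASCII only. It skips everything except blanks and alphanumerics and folds case to upper case, which is what 'sort -df' does.

```c
#include <ctype.h>

/* Hypothetical helper: true for the characters 'sort -d' looks at,
   i.e. blanks and alphanumerics. */
static int is_significant(unsigned char c)
{
    return isalnum(c) || c == ' ' || c == '\t';
}

/* Hypothetical sketch (not dictd's compare_alnumspace): compare two
   ASCII strings in 'sort -df' order. Returns <0, 0 or >0. */
int compare_dict_order(const char *a, const char *b)
{
    const unsigned char *p = (const unsigned char *)a;
    const unsigned char *q = (const unsigned char *)b;

    for (;;) {
        /* Skip characters that dictionary order ignores. */
        while (*p && !is_significant(*p)) ++p;
        while (*q && !is_significant(*q)) ++q;

        /* One or both strings are exhausted: the shorter sorts first. */
        if (!*p || !*q)
            return (*p != 0) - (*q != 0);

        /* Fold case before comparing, as 'sort -f' does. */
        int ca = toupper(*p), cb = toupper(*q);
        if (ca != cb)
            return ca < cb ? -1 : 1;

        ++p;
        ++q;
    }
}
```

With a function like this, "00-database-short" and "00DATABASESHORT" compare as equal even though the entry in .index keeps its hyphens and its original case.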
The same method is possible for UTF-8 dictionaries
(and the very first version worked this way), but
later (before releasing anything) I changed the sorting order
in both dictfmt and dictd.
Now all words in .index are "normalized", i.e. lowercased,
and only alphanumeric characters are kept in them.
Benefits:
- The 'sort' utility doesn't need to be aware of UTF-8.
- The sorting order is trivial: byte by byte.
- A much simpler and much faster compare function in dictd;
  see index.c:compare_allchars.
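To show why the compare function gets simpler, here is a rough sketch of a byte-by-byte comparison. It is not the real compare_allchars: the name compare_bytes is made up, and I assume here that the headword in an index entry ends at the first tab or at 'end'. Bytes are compared as unsigned values, so UTF-8 strings come out in code point order without any decoding.

```c
#include <stddef.h>

/* Hypothetical sketch (not dictd's compare_allchars): compare an
   already normalized, NUL-terminated query against the index entry
   stored in [start, end).
   Assumption: the headword ends at the first tab or at 'end'. */
int compare_bytes(const char *word, const char *start, const char *end)
{
    const unsigned char *w = (const unsigned char *)word;
    const unsigned char *s = (const unsigned char *)start;
    const unsigned char *e = (const unsigned char *)end;

    /* Walk both strings while neither has ended. */
    while (*w && s < e && *s != '\t') {
        if (*w != *s)
            return *w < *s ? -1 : 1;
        ++w;
        ++s;
    }

    if (*w)
        return 1;   /* the query is longer than the headword */
    if (s < e && *s != '\t')
        return -1;  /* the headword is longer than the query */
    return 0;
}
```

No character classification and no case folding are needed at lookup time, because both the query and the index entries were normalized beforehand.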
Disadvantages:
- The MATCH command returns "normalized" words, not the original ones.
  I plan to add a fourth column to the .index file
  to keep the original word.
P.S.
Here the correct compare function is selected:

static int compare(
   const char *word,
   const dictIndex *dbindex,
   const char *start, const char *end )
{
   ...
   if (dbindex &&
       (dbindex -> flag_allchars || dbindex -> flag_utf8 ||
        dbindex -> flag_8bit))
   {
      return compare_allchars( word, start, end );
   }else{
      return compare_alnumspace( word, dbindex, start, end );
   }
}

Higher-level functions call 'tolower_alnumspace' to "normalize" the query.
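For illustration only, normalizing a query could look roughly like the sketch below. This is not the real tolower_alnumspace: the name normalize_word and its signature are made up, it handles ASCII only (the real code also has to deal with UTF-8), and keeping spaces is my assumption based on the "alnumspace" part of the function name.

```c
#include <ctype.h>
#include <stddef.h>

/* Hypothetical sketch, ASCII only: copy 'src' into 'dst' in lower
   case, dropping every character that is neither alphanumeric nor a
   space (keeping spaces is an assumption). Output is truncated to fit
   'dst_size'. Returns the number of bytes written, not counting the
   terminating NUL. */
size_t normalize_word(const char *src, char *dst, size_t dst_size)
{
    size_t n = 0;

    if (dst_size == 0)
        return 0;

    for (; *src && n + 1 < dst_size; ++src) {
        unsigned char c = (unsigned char)*src;
        if (isalnum(c) || c == ' ')
            dst[n++] = (char)tolower(c);
    }
    dst[n] = '\0';
    return n;
}
```

Under these assumptions, a query for "00-database-short" turns into "00databaseshort" before the lookup.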
--
Best regards, Aleksey Cheusov.