Re: [sword-devel] dictionary ordering revisited

Daniel Owens Thu, 19 Mar 2009 18:37:37 -0700


DM Smith wrote:

Daniel Owens wrote:
I'm working on a dictionary with keys based on the lemma in the MorphGNT module (which so far has no dictionary support). I am running into two problems:
1. Since the keys are polytonic Greek, the byte ordering method of creating the index totally destroys the ordering of the dictionary (same problem with Vietnamese). BPBible does a reasonable job of dictionary lookup (so far the only front-end coming near to supporting dictionary keys in polytonic Greek), but many obvious lookups like anthropos are thrown off because the words starting with alpha are separated in groups spread out over several places in the index. Looking back in the archives, I saw a comment from Troy from October 2007: "Generating a secondary index on a lexdict which preserves some other order and alternate key is great idea and an easy addition to the current code." Has anything been done with this?
2. The use of upper case for the display of keys in front-ends is totally unnatural. Can I plead that something be done about this? Surely it is an easy fix, or is it more than a display issue? Not fixing it makes SWORD totally unfriendly for Koine Greek students...
I was going to ask the same thing today, as I was looking at the wiki for TEI dictionaries.
The problem is a bit deeper than that.
Chris has pointed out that byte ordering and code point ordering of UTF-8 are the same.
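To illustrate the point about byte ordering, here is a minimal Python sketch (not SWORD code). The word list is an arbitrary sample chosen for illustration. It checks that sorting by UTF-8 bytes and sorting by code point give the same order, and prints the first code point of each word so the scattering of alpha words is visible: bare alpha and beta (U+03B1, U+03B2) sort ahead of the precomposed alphas with breathing marks, which live in the Greek Extended block (U+1F00 and up).

```python
import unicodedata

# Sample lemmas, chosen only for illustration. NFC ensures the literals
# are in precomposed form regardless of how this file was typed.
words = [unicodedata.normalize("NFC", w)
         for w in ["ἄνθρωπος", "αἷμα", "ἀγάπη", "βίος", "ἅγιος"]]

# Python compares str values code point by code point.
by_code_point = sorted(words)

# Sorting the encoded bytes compares UTF-8 byte sequences instead.
by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))

# The two orders agree, as noted above.
assert by_code_point == by_utf8_bytes

# Show where each word lands and the code point of its first letter.
for w in by_code_point:
    print(f"U+{ord(w[0]):04X}  {w}")
```

A word beginning with beta ends up between the alpha words, which is the kind of split index described in the original question.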
The first problem is that of normalization of the keys in the module. This has several aspects.
1) In UTF-8, several different code point sequences can result in the same glyphs. We have chosen to use ICU's NFC normalization. tei2mod does normalization; the other LD module creators (e.g. imp2ld) don't.
2) As you noted, UPPER CASE keys are ugly. Some are unreadable (e.g. multiply accented capital Greek letters). Worse than that, some languages don't have upper case representations of lower case letters. I haven't heard of any, but the reverse might be a problem. And others do, but it is not yet represented in Unicode (e.g. Cherokee).
3) Normalization can result in an odd ordering for end users. In some languages the ordering of code points is not proper. For example, German dictionaries, Spanish dictionaries and French dictionaries differ with respect to how they order accented characters. ICU supplies collation keys on a per language basis for this.
The second problem is that of normalizing the search request. This has several aspects.
1) The search request has to be normalized in exactly the same fashion as the creation of the module. Using the same technique, but a different normalizer, might result in a different normalization. It may be that a minimum version of ICU is necessary. (Hopefully, later versions are backward compatible.)
2) User input may ignore accents (e.g. do a dictionary lookup from a Greek text that lacks accents, or from a Hebrew text that has vowel points off). Or they may enter a transliteration (e.g. use oikos to look up house in a polytonic Greek dictionary).
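The two normalization steps described above can be sketched as follows. This is a minimal illustration that uses Python's standard unicodedata module as a stand-in for ICU, which is an assumption made for the example and not what tei2mod does. The function names normalize_key and strip_marks are hypothetical.

```python
import unicodedata

def normalize_key(text):
    """Return the NFC form, so equivalent code point sequences compare equal."""
    return unicodedata.normalize("NFC", text)

def strip_marks(text):
    """Drop nonspacing marks (accents, breathings, vowel points)."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", kept)

# The same word written with precomposed and with combining characters.
composed = "\u1f00\u03b3\u03ac\u03c0\u03b7"
decomposed = "\u03b1\u0313\u03b3\u03b1\u0301\u03c0\u03b7"

# The raw strings differ; after NFC they are identical.
assert composed != decomposed
assert normalize_key(composed) == normalize_key(decomposed)

# Stripping marks gives the form a user might type without accents.
print(strip_marks(composed))

# Hebrew with vowel points reduces to the consonantal text.
pointed = "\u05e9\u05b8\u05c1\u05dc\u05d5\u05b9\u05dd"
print(strip_marks(pointed))
```

Whatever normalizer is used, the same function has to be applied both when the module is built and when a search request comes in.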


All of this is helpful background to the potential issues involved.

I think a solution can be layered on top of the module as it is today. Basically, one or more secondary indexes are used to do lookup in the first. Maybe one is with accents and another without. Lucene can be used easily to create a single lookup with multiple fields where each field is a different representation of the key.
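A rough sketch of the layered idea, using plain Python dictionaries in place of Lucene (a deliberate substitution to keep the example self-contained). The helper names fold, build_indexes and lookup are hypothetical, and the key list is a small sample for illustration. Each secondary index maps one representation of a key back to the primary key stored in the module.

```python
import unicodedata
from collections import defaultdict

def fold(text):
    """Accent-free, case-folded representation of a key."""
    decomposed = unicodedata.normalize("NFD", text)
    bare = "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")
    return unicodedata.normalize("NFC", bare.casefold())

def build_indexes(primary_keys):
    """Build an exact (NFC) index and an unaccented secondary index."""
    exact = {}
    unaccented = defaultdict(list)
    for key in primary_keys:
        nfc_key = unicodedata.normalize("NFC", key)
        exact[nfc_key] = nfc_key
        unaccented[fold(nfc_key)].append(nfc_key)
    return exact, unaccented

def lookup(query, exact, unaccented):
    """Try the accented index first, then fall back to the unaccented one."""
    nfc_query = unicodedata.normalize("NFC", query)
    if nfc_query in exact:
        return [exact[nfc_query]]
    return list(unaccented.get(fold(nfc_query), []))

exact, unaccented = build_indexes(["οἶκος", "ἀγαπάω", "ἄνθρωπος"])
print(lookup("οικος", exact, unaccented))
print(lookup("ΑΝΘΡΩΠΟΣ", exact, unaccented))
```

A transliterated field could be added in the same way as one more dictionary, given a transliteration function.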

This makes sense to me. Add to accented/not accented: with vowels/without vowels (Hebrew) and transliterated (using SBL's scheme?).

I would like to see a solution that is part of the module or a part of the SWORD engine.
In Him,
   DM

I think the engine should support multiple keys in a single dictionary module so something like <entryFree n="ἀγαπάω|agapaō|G25"> would be a feasible entry. A user could look up the same word in both the MorphGNT and TR modules without having to switch dictionaries, and manual transliterated user input would return the correct entry (or near to it). Typically modules only use one sort of key for lemma, so the module could determine what key would be looked up in the dictionary, perhaps with a conf entry. The module creator would need to generate the keys, and the engine would be set up to handle these multiple keys (which it doesn't appear to be able to do now).
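To make the proposal concrete, here is a small Python sketch of how a pipe-separated n attribute could be split into several lookup keys that all point at one entry. This is only an illustration of the idea: the pipe syntax is the proposal above, not something the engine supports, the function index_entry is hypothetical, and the entry body is placeholder text.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Entry using the proposed multi-key syntax; the body is a placeholder.
SAMPLE = '<entryFree n="ἀγαπάω|agapaō|G25">(entry text placeholder)</entryFree>'

def index_entry(xml_text, index):
    """Register every pipe-separated key of one entry in the given index."""
    element = ET.fromstring(xml_text)
    body = "".join(element.itertext())
    keys = [unicodedata.normalize("NFC", part)
            for part in element.get("n", "").split("|") if part]
    # Treat the first key as the primary one; the rest are aliases.
    entry = {"primary": keys[0], "body": body}
    for key in keys:
        index[key] = entry
    return keys

index = {}
keys = index_entry(SAMPLE, index)

# The lemma, the transliteration and the Strong's number
# all resolve to the same entry.
for key in keys:
    print(key, "->", index[key]["primary"])
```

A conf entry could then tell the front-end which of the key types a given text module uses for its lemmas.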


Daniel

