Hello,
Sorry for the slightly long mail. I wanted to send it to context-dev,
but there is probably someone besides Adam out there who could
contribute (for example to re-check the Greek or Cyrillic sections of
Unicode, or even to add some missing Hebrew definitions). If someone
thinks that it's more appropriate, please feel free to continue the
discussion on context-dev.
I. In regi-utf it would be nice to add:
\defineregimesynonym[utf-8][utf]
\defineregimesynonym[utf8][utf]
II. After a long time I finally decided to write my first Ruby script. I
took UnicodeData.txt, the Adobe Glyph List and enco-uc.tex, collected
everything together, removed characters (in case someone needs
them they can trivially be added again, but I don't think that anyone is
planning to name them any time soon), did some manual corrections ... and
here are the results:
http://pub.mojca.org/tex/enco/contextlist/
http://pub.mojca.org/tex/enco/contextbase/regi-temp.tex
The motivation is that there is no definitive reference for the ConTeXt
glyph names, which means that every new regime that should be supported
needs a lot of manual work, and that leads to many inconsistencies.
The file contextnames.txt contains the Unicode hexadecimal number, the
PDF name (from the Adobe Glyph List), the ConTeXt name and the Unicode
name. This could then be a source of information when adding new
regimes, writing Unicode vectors (unic-*) or mapping to font encodings.
Uppercasing/lowercasing information for font encodings and other files
can now be derived directly from Unicode and this list (Unicode already
contains information about upper/lowercase variants of the letters) ...
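
To illustrate the last point, here is a rough sketch (in Python, not the
Ruby script mentioned above) of how lowercase/uppercase pairs of ConTeXt
names could be derived from such a list. The row layout and the sample
rows are only assumptions made for this example, not the actual format
of the file; the case information comes from Python's built-in Unicode
data.

```python
# Hypothetical rows: (hex code point, PDF/AGL name, ConTeXt name, Unicode name).
# The real file layout may differ; this tuple form is assumed for illustration.
ROWS = [
    ("00C1", "Aacute", "Aacute", "LATIN CAPITAL LETTER A WITH ACUTE"),
    ("00E1", "aacute", "aacute", "LATIN SMALL LETTER A WITH ACUTE"),
    ("010C", "Ccaron", "Ccaron", "LATIN CAPITAL LETTER C WITH CARON"),
    ("010D", "ccaron", "ccaron", "LATIN SMALL LETTER C WITH CARON"),
    ("00DF", "germandbls", "ssharp", "LATIN SMALL LETTER SHARP S"),
]


def case_pairs(rows):
    """Return (lowercase name, uppercase name) pairs of ConTeXt names."""
    name_by_code = {int(hex_code, 16): context for hex_code, _, context, _ in rows}
    pairs = []
    for code in sorted(name_by_code):
        upper = chr(code).upper()
        # Skip characters whose uppercase form is not a single character
        # (e.g. sharp s), that are already uppercase, or that are not listed.
        if len(upper) != 1 or ord(upper) == code:
            continue
        if ord(upper) in name_by_code:
            pairs.append((name_by_code[code], name_by_code[ord(upper)]))
    return pairs


print(case_pairs(ROWS))
```

Characters like sharp s, whose uppercase form is more than one
character, are skipped here and would still need manual treatment.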
There is some more info missing, which should be either packed within
the same file or in separate files:
- ConTeXt synonyms (like \Dcroat - \Dstroke, ...)
- pdf synonyms (dbar - dcroat), to help recognize the glyphs in .enc or
.afm and automate support for it
- faking the characters (\ccaron - \buildtextaccent\textcaron{c})
- unaccented version of the characters (\Aacute - A, ...)
- other characters not present in unicode (Caron, Acute - these are
accents for uppercase letters, ...)
- (I'm sure that I wanted to add some more points, but I don't remember
any other right now)
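
Some of the items above (the unaccented versions and the faked
characters) might be derivable from the Unicode decomposition data
instead of being typed by hand. The following is only a sketch in
Python: the table ACCENT_COMMANDS and the function fallback are names
made up for this example, and the mapping from combining marks to
ConTeXt accent commands is an assumption that would need to be checked
against the real definitions.

```python
import unicodedata

# Assumed mapping from combining marks to ConTeXt accent commands.
# Only two entries are shown; a real table would need to be verified.
ACCENT_COMMANDS = {
    0x0301: r"\textacute",  # COMBINING ACUTE ACCENT
    0x030C: r"\textcaron",  # COMBINING CARON
}


def fallback(code_point):
    """Return (base letter, faked definition), or None if not derivable."""
    decomposed = unicodedata.normalize("NFD", chr(code_point))
    # Only handle the simple case: one base letter plus one combining mark.
    if len(decomposed) != 2:
        return None
    base, mark = decomposed
    command = ACCENT_COMMANDS.get(ord(mark))
    if command is None:
        return None
    return base, r"\buildtextaccent" + command + "{" + base + "}"


print(fallback(0x010D))  # ccaron
print(fallback(0x0110))  # Dcroat/Dstroke has no decomposition
```

Characters without a canonical decomposition (such as Dcroat), the
synonyms and the accents for uppercase letters could not be obtained
this way and would still have to be listed manually.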
When I wanted to add the names from unic-34.tex, I realized that we
don't really need to have a command for every single Unicode character
(we certainly don't need to map math characters into that region), but
if someone already has a file with Unicode integrals, it costs nothing
to give them those characters in the output.
(In short: 0x2211, N-ARY SUMMATION, should expand into $\sum$, but not
the other way round.)
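
The one-way nature of this mapping can be pictured with a small Python
sketch. The table INPUT_EXPANSIONS and the function expand_input are
hypothetical names used only for illustration; the point is that there
is a table from code point to expansion and no table in the opposite
direction.

```python
# Hypothetical input-only table: Unicode code point -> TeX expansion.
# There is deliberately no reverse table from TeX commands to code points.
INPUT_EXPANSIONS = {
    0x2211: r"$\sum$",  # N-ARY SUMMATION
}


def expand_input(text):
    """Replace known Unicode characters by their TeX expansion."""
    return "".join(INPUT_EXPANSIONS.get(ord(ch), ch) for ch in text)


print(expand_input("a \u2211 b"))
```

Text that already contains $\sum$ passes through unchanged, since
nothing maps the command back to U+2211.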
I have to slightly change the syntax in the context glyph names file to
note this difference and to be able to define math (and other) signs
properly.
III. Now I need some help: someone should help me revise the file
contextname.txt (I prepared an HTML version of it), that is, correct
mistakes (if any are spotted), add new definitions, help to prepare a
list of synonyms, a list of expansions (\buildtextaccent), ...
Here are some points that I spotted but can't fix alone:
1. Characters missing (needed by some regimes):
0020-007F section
037A GREEK YPOGEGRAMMENI
0384 GREEK TONOS
0385 GREEK DIALYTIKA TONOS
2015 HORIZONTAL BAR
2017 DOUBLE LOW LINE
20AA NEW SHEQEL SIGN
20AB DONG SIGN
20AF DRACHMA SIGN
2116 NUMERO SIGN
200E LEFT-TO-RIGHT MARK
200F RIGHT-TO-LEFT MARK
1Exx section
2. Greek - there are some name inconsistencies when compared to the
unic-031 vector, but I don't know anything about old Greek. I didn't
check Cyrillic at all.
3. Punctuation and accents - mostly the names for quotes and their
language dependency (lowerleftuppersixquote in comparison to
lftdblquote ... or whatever they are called) (+ tricks; I already asked
about quote hyphenation approximately a week ago).
I have problems understanding the difference between letter modifiers
(U+02Cx) and the usual accents (U+00Bx). Combining Diacritical Marks
(U+0300-U+036F) should be supported somehow as well. I have no idea how
to make U+0065 U+0301 (e + combining acute accent) into eacute.
4. Should hungarumlaut be doubleacute, with hungarumlaut only as its
synonym, or the other way round?
5. tbar vs. tstroke: compare 0166 and 023E
6. cedilla/commaaccent dilemma: there's a huge problem with t with
cedilla (0162): t with comma below (021A) should be used instead (at
least this is stated in the Unicode reference), but most regimes map a
character to t with cedilla (0162), which seems stupid to me. The Adobe
Glyph List therefore uses tcommaaccent for t with cedilla, which looks
like t with comma accent, but is in the wrong place. lmr has both
tcommaaccent and tcedilla. \tcedilla should be t with cedilla in my
opinion and \tcommaaccent t with comma accent. That currently isn't
the