Re: [NTG-context] towards some more consistency in regimes unicode support

2005-09-14 Thread Mojca Miklavec
Thomas A. Schmitz wrote:
 Mojca,
 
 I'm not sure I've understood all you're trying to do, but I feel kind
 of responsible for the Greek.

Thank you very much, Thomas!

 I took the polytonic/ancient Greek
 basically from the Unicode names, but I left modern/monotonic Greek
 alone because the support was already there and I didn't want to mess
 up somebody else's work. As for the three slots you mention:
 
 037A GREEK YPOGEGRAMMENI
 0384 GREEK TONOS
 0385 GREEK DIALYTIKA TONOS
 
 These are characters that are never (?) used on their own, only to
 combine with vowels. But let me know if there are more
 inconsistencies, and I'll try and fix them for the 31-vector.

I would say that the same is true for the acute/grave/circumflex accents in
Latin, but they're there and we need names for them in order to be
able to compose (fake) characters out of them
(\buildtextaccent\textgrave{a} to get agrave). What do you do with
those characters in cp1253 encoding
http://www.microsoft.com/typography/unicode/1253.htm? Without those
definitions the cp1253 input encoding cannot be fully supported, but
is anyone using that regime at all? cp1250 (central european) is still
widely used for example.
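
To see which bytes are at stake, here is a minimal Python sketch (run outside ConTeXt) that lists the bytes of a single-byte regime decoding to the three unnamed slots. It assumes that Python's built-in cp1253 codec matches the Microsoft table linked above; the function name slots_in_codec is made up for illustration.

```python
import unicodedata

# The three code points discussed in this thread that have no ConTeXt name yet.
TARGETS = {0x037A, 0x0384, 0x0385}


def slots_in_codec(codec):
    """Map each byte of a single-byte codec that decodes to a target slot
    to the Unicode name of that slot."""
    found = {}
    for byte in range(256):
        try:
            ch = bytes([byte]).decode(codec)
        except UnicodeDecodeError:
            continue  # byte is undefined in this codec
        if ord(ch) in TARGETS:
            found[byte] = unicodedata.name(ch)
    return found


if __name__ == "__main__":
    for byte, name in sorted(slots_in_codec("cp1253").items()):
        print(f"0x{byte:02X} -> {name}")
```

Any byte reported here is one that the regime cannot map to a named glyph until the corresponding ConTeXt name exists.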

For combining there are some others (unnamed):
0342 COMBINING GREEK PERISPOMENI
0343 COMBINING GREEK KORONIS
0344 COMBINING GREEK DIALYTIKA TONOS
0345 COMBINING GREEK YPOGEGRAMMENI
but they need special treatment (not supported in ConTeXt yet) anyway.

I know just about nothing about Greek fonts and their quality
(coverage of Greek glyphs), but even with a pretty incomplete font you
can then say something like:
\definecharacter greekomegatonos \buildtextaccent\greektonos\greekomega
and perhaps even
\definecharacter greektonos \textacute
when no special glyph for tonos is present.

I guess that
\greekypogegrammeni, \greektonos and \greekdialytikatonos would be
just fine, I just asked because there may be some cases (like with
many latin cedilla or stroke letters or hacek that was later
renamed into caron), where Unicode is not as accurate as one would
want it to be.

An example of inconsistency of names:
1F0C \greekAlphapsilitonos GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA
1F0D \greekAlphadasiatonos GREEK CAPITAL LETTER ALPHA WITH DASIA AND OXIA

But I don't know anything about Greek, so I cannot judge which of the
names is more accurate.
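
Mismatches of this kind can be found mechanically. The Python sketch below derives a candidate name from the Unicode name and compares it with the ConTeXt name. The naming rule (concatenating the Unicode words) is only an assumption made for illustration, not an agreed ConTeXt convention, and the helpers candidate_name and check are hypothetical.

```python
def candidate_name(unicode_name):
    """Build a candidate name from a Unicode name of the form
    'GREEK (CAPITAL|SMALL) LETTER <BASE> [WITH <ACCENT> [AND <ACCENT>]]'.
    The concatenation rule is an assumption, not a ConTeXt standard."""
    words = unicode_name.split()
    if len(words) < 4 or words[0] != "GREEK" or words[2] != "LETTER":
        raise ValueError("unsupported name: " + unicode_name)
    base = words[3].lower()
    if words[1] == "CAPITAL":
        base = base.capitalize()
    accents = [w.lower() for w in words[4:] if w not in ("WITH", "AND")]
    return "greek" + base + "".join(accents)


def check(context_name, unicode_name):
    """Return None if the names agree, else the Unicode-derived candidate."""
    expected = candidate_name(unicode_name)
    return None if context_name == expected else expected


if __name__ == "__main__":
    print(check("greekAlphapsilitonos",
                "GREEK CAPITAL LETTER ALPHA WITH PSILI AND OXIA"))
```

A report like this only flags where the two naming schemes differ (tonos versus oxia here); deciding which name is the better one still needs someone who knows Greek.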

Thanks again for help,
Mojca
___
ntg-context mailing list
ntg-context@ntg.nl
http://www.ntg.nl/mailman/listinfo/ntg-context


[NTG-context] towards some more consistency in regimes unicode support

2005-09-13 Thread Mojca Miklavec


Hello,

Sorry for a slightly longer mail. I wanted to send it to context-dev, 
but probably there's someone else besides Adam out there who could 
contribute (for example to re-check the Greek or Cyrillic section of Unicode 
or even add some missing Hebrew definitions for example). If someone 
thinks that it's more appropriate, please feel free to continue the 
discussion on context-dev.



I. in regi-utf it would be fine to add:

\defineregimesynonym[utf-8][utf]
\defineregimesynonym[utf8][utf]

II. After a long time I finally decided to write my first ruby script. I 
took UnicodeData.txt, the Adobe Glyph List, enco-uc.tex, collected 
everything together, removed characters  (in case someone needs 
them they can trivially be added again, but I don't think that anyone is 
planning to name them shortly), did some manual corrections ... and here 
are the results:

http://pub.mojca.org/tex/enco/contextlist/
http://pub.mojca.org/tex/enco/contextbase/regi-temp.tex

The idea behind it is that there is no definitive reference for the ConTeXt 
glyph names, which means that every new regime that should be supported 
needs a lot of manual work and leads to many inconsistencies.


The file contextnames.txt contains the Unicode hexadecimal number, pdf 
name (from Adobe Glyph List), ConTeXt name and the Unicode name. This 
could then be a source of information when adding new regimes, writing 
unicode vectors (unic-*) or mapping to font encodings. 
Uppercasing/lowercasing information for font encodings and other files 
can then be derived directly from Unicode and this list (Unicode already 
contains information about upper/lowercase variants of the letters) ...
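
As an illustration of how the case information can be read from UnicodeData.txt, here is a minimal Python sketch (the script mentioned above was written in Ruby and is not reproduced here). It relies on the documented layout of 15 semicolon-separated fields per line; the record type UnicodeRecord and the function parse_unicodedata_line are names invented for this example.

```python
from typing import NamedTuple, Optional


class UnicodeRecord(NamedTuple):
    codepoint: int
    name: str
    decomposition: str          # raw field 5, e.g. "0065 0301"
    uppercase: Optional[int]    # simple uppercase mapping (field 12)
    lowercase: Optional[int]    # simple lowercase mapping (field 13)


def parse_unicodedata_line(line: str) -> UnicodeRecord:
    """Parse one line of UnicodeData.txt into the fields needed here."""
    fields = line.rstrip("\n").split(";")
    if len(fields) != 15:
        raise ValueError("expected 15 fields, got %d" % len(fields))

    def hex_or_none(text):
        return int(text, 16) if text else None

    return UnicodeRecord(
        codepoint=int(fields[0], 16),
        name=fields[1],
        decomposition=fields[5],
        uppercase=hex_or_none(fields[12]),
        lowercase=hex_or_none(fields[13]),
    )


# The UnicodeData.txt entry for U+00E9.
SAMPLE = ("00E9;LATIN SMALL LETTER E WITH ACUTE;Ll;0;L;0065 0301;;;;N;"
          "LATIN SMALL LETTER E ACUTE;;00C9;;00C9")

if __name__ == "__main__":
    print(parse_unicodedata_line(SAMPLE))
```

Joining such records with the ConTeXt names on the code point would give the upper/lowercase pairs without maintaining them by hand.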


There is some more info missing, which should be either packed within 
the same file or in separate files:

- ConTeXt synonyms (like \Dcroat - \Dstroke, ...)
- pdf synonyms (dbar - dcroat), to help recognize the glyphs in .enc or 
.afm and automate support for it

- faking the characters (\ccaron - \buildtextaccent\textcaron{c})
- unaccented version of the characters (\Aacute - A, ...)
- other characters not present in unicode (Caron, Acute - these are 
accents for uppercase letters, ...)
- (I'm sure that I wanted to add some more points, but I don't remember 
any other right now)
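
Part of this missing information (the accent used for faking and the unaccented base letter) could probably be derived from the canonical decompositions in Unicode. The Python sketch below shows the idea. The small table ACCENT_MACROS is an assumption covering only three accents, and split_accented is a hypothetical helper.

```python
import unicodedata

# Combining mark -> ConTeXt text accent macro. Only three entries, chosen
# for illustration; a real table would have to be much larger.
ACCENT_MACROS = {
    0x0300: r"\textgrave",
    0x0301: r"\textacute",
    0x030C: r"\textcaron",
}


def split_accented(ch):
    """Return (base letter, accent macro) for a single precomposed character,
    or None when Unicode has no base + one-mark decomposition for it or the
    mark is not in ACCENT_MACROS."""
    decomposed = unicodedata.normalize("NFD", ch)
    if len(decomposed) != 2:
        return None
    base, mark = decomposed
    macro = ACCENT_MACROS.get(ord(mark))
    if macro is None:
        return None
    return base, macro


if __name__ == "__main__":
    for ch in ("\u010d", "\u00c1", "\u0110"):  # ccaron, Aacute, Dstroke
        print("U+%04X" % ord(ch), split_accented(ch))
```

Letters such as Dstroke have no decomposition in Unicode, so entries like that would still need to be listed manually.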


When I wanted to add the names from unic-34.tex, I realized that we 
don't really need to have a command for every single unicode character 
(we certainly don't need to map math characters into that region), but 
if someone already has a file with unicode integrals, it costs nothing 
to give him those characters in output.
(In short: 0x2211, N-ARY SUMMATION should expand into $\sum$, but not 
the other way round.)
I have to slightly change the syntax in the context glyph names file to 
note this difference and to be able to define math (and other) signs 
properly.
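
One way to picture the difference is as two separate tables, sketched here in Python. The table names and the two sample entries are assumptions for illustration only and say nothing about the final file syntax.

```python
# Two-way entries: the code point has a ConTeXt name, usable for input
# regimes and for mapping to output encodings.
NAMED_CHARACTERS = {
    0x00E9: "eacute",
    0x010D: "ccaron",
}

# One-way entries: accepted on input and expanded, but never produced by
# mapping a ConTeXt command back to a code point.
INPUT_EXPANSIONS = {
    0x2211: r"$\sum$",
}


def expansion_for(codepoint):
    """What an input code point should expand to, or None if unknown."""
    if codepoint in NAMED_CHARACTERS:
        return "\\" + NAMED_CHARACTERS[codepoint]
    return INPUT_EXPANSIONS.get(codepoint)


def output_table():
    """Name -> code point table, built from the two-way entries only."""
    return {name: cp for cp, name in NAMED_CHARACTERS.items()}


if __name__ == "__main__":
    print(expansion_for(0x2211))
    print(output_table())
```

Keeping the one-way entries out of output_table() is what prevents $\sum$ from ever being mapped back to 0x2211.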



III. Now I need some help - someone should help me revise the file 
contextname.txt (I prepared a HTML version of it): correct mistakes (if 
any are spotted), add new definitions, help to prepare a list of 
synonyms, a list of expansions (\buildtextaccent), ...



Here are some points which I spotted but can't fix alone:

1. Characters missing (needed by some regimes):

0020-007F section

037A GREEK YPOGEGRAMMENI
0384 GREEK TONOS
0385 GREEK DIALYTIKA TONOS
2015 HORIZONTAL BAR
2017 DOUBLE LOW LINE
20AA NEW SHEQEL SIGN
20AB DONG SIGN
20AF DRACHMA SIGN
2116 NUMERO SIGN
200E LEFT-TO-RIGHT MARK
200F RIGHT-TO-LEFT MARK

1Exx section

2. Greek - there are some name inconsistencies when compared to the 
unic-031 vector, but I don't know anything about old greek. I didn't 
check Cyrillic at all.


3. Punctuation and accents - mostly names for quotes and language 
dependency (lowerleftuppersixquote in comparison to lftdblquote ... or 
whatever they are called) (+ tricks, I already asked about quotes  
hyphenation approximately a week ago).
I have problems understanding the difference between letter modifiers 
(U+02Cx) and usual accents (U+00Bx). Combining Diacritical Marks 
(U+03xx) should be supported somehow as well. I have no idea how to make 
U+0065 U+0301 (e + combining acute accent) into eacute.
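
Outside ConTeXt, this composition is what Unicode normalization form NFC does, so one option would be to normalize the input before it reaches TeX. The Python sketch below only demonstrates that preprocessing idea. It is not a ConTeXt mechanism, and the wrapper name compose is made up.

```python
import unicodedata


def compose(text):
    """Combine base letters and combining marks into precomposed
    characters wherever Unicode defines one (normalization form NFC)."""
    return unicodedata.normalize("NFC", text)


if __name__ == "__main__":
    for sample in ("e\u0301", "\u03c9\u0301", "x\u0301"):
        result = compose(sample)
        print([("U+%04X" % ord(c)) for c in sample], "->",
              [("U+%04X" % ord(c)) for c in result])
```

Sequences without a precomposed form (such as x + combining acute) stay as they are, so those would still need real support for combining marks.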


4. should hungarumlaut be doubleacute and hungarumlaut only its synonym 
or the other way round?


5. tbar vs. tstroke: compare 0166 and 023E

6. cedilla/commaaccent dilemma: there's a huge problem with t with 
cedilla (0162): t with comma below (021A) should be used instead (at 
least this is stated in Unicode reference), but most regimes map a 
character to t with cedilla (0162), which seems stupid to me. Adobe 
glyph list therefore uses tcommaaccent for t with cedilla, which looks 
like t with comma accent, but is on the wrong place. lmr have both 
tcommaaccent and tcedilla. \tcedilla should be t with cedilla in my 
opinion and \tcommaaccent t with comma accent. That currently isn't 
the 

Re: [NTG-context] towards some more consistency in regimes unicode support

2005-09-13 Thread Thomas A. Schmitz

Mojca,

I'm not sure I've understood all you're trying to do, but I feel kind  
of responsible for the Greek. I took the polytonic/ancient Greek  
basically from the Unicode names, but I left modern/monotonic Greek  
alone because the support was already there and I didn't want to mess  
up somebody else's work. As for the three slots you mention:


037A GREEK YPOGEGRAMMENI
0384 GREEK TONOS
0385 GREEK DIALYTIKA TONOS

These are characters that are never (?) used on their own, only to  
combine with vowels. But let me know if there are more  
inconsistencies, and I'll try and fix them for the 31-vector.


Best

Thomas

On Sep 13, 2005, at 5:12 PM, Mojca Miklavec wrote:

2. Greek - there are some name inconsistencies when compared to the  
unic-031 vector, but I don't know anything about old greek. I  
didn't check Cyrillic at all.




