Re: [NTG-context] Greek in luatex

Arthur Reutenauer Wed, 12 Sep 2007 18:15:44 -0700

        Hello Thomas,

  I was waiting for someone else to answer your questions because I
had no clue how to address them even if I was interested; but now I do,
thanks to Hans' reply:


  For your general problem you need to define a new regime that will
map each relevant character sequence to the corresponding Unicode
character.  That is, you inform ConTeXt that the character stream it sees
is actually a way of coding another set of characters and that it can
forget the original stream.  This treatment should be done before any sort
of font property intervenes, because it does not depend on the
appearance of the typeset text.  That's what regimes are for.

  Now I turn to Hans to give us guidelines on how to define an
advanced regime in Mark IV: Hans, what we need here is to replace
sequences of characters by other characters, so the mapping is not
one-to-one and it's more complicated than simple regimes defined by a
table lookup; but I guess all we have to do is write a lua function that
we could plug into the input stream reading routine (just like other
regimes work).
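
  To make the idea concrete, here is a minimal sketch of what such a
regime has to do.  It is written in Python rather than Lua purely for
illustration, it is not Mark IV code, and the names BABEL_TO_UNICODE and
decode_babel are made up for the example.  The table holds only a handful
of sequences; the point is that the longest sequence must be tried first,
which is what makes this more than a one-to-one table lookup.

```python
# Illustrative subset of the Babel transliteration.
# BABEL_TO_UNICODE and decode_babel are hypothetical names, not ConTeXt API.
BABEL_TO_UNICODE = {
    "<'w|": "\u1FA5",  # omega with dasia, oxia and ypogegrammeni
    ">a": "\u1F00",    # alpha with psili
    "<a": "\u1F01",    # alpha with dasia
    "'a": "\u1F71",    # alpha with oxia
    "a": "\u03B1",     # alpha
    "g": "\u03B3",     # gamma
    "l": "\u03BB",     # lambda
    "o": "\u03BF",     # omicron
    "s": "\u03C3",     # sigma (no final-sigma handling in this sketch)
    "w": "\u03C9",     # omega
}

# Try the longest sequences first so that "<'w|" wins over a bare "w".
_KEYS = sorted(BABEL_TO_UNICODE, key=len, reverse=True)


def decode_babel(text):
    """Replace Babel sequences by Unicode characters, longest match first."""
    out = []
    i = 0
    while i < len(text):
        for key in _KEYS:
            if text.startswith(key, i):
                out.append(BABEL_TO_UNICODE[key])
                i += len(key)
                break
        else:
            # No sequence starts here: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

  A real regime would do the same thing on the input stream, before any
font is consulted, so that everything downstream only ever sees Greek
characters.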

  As far as the rest of Hans' reply is concerned (Opentype features and
such), I would like to add that it is a very interesting and fascinating
thing to do, but definitely not what you want here, for a lot of
reasons: Opentype features can be used to alter the appearance of the
text, but not the nature of the characters themselves.  That is, if you did
the transformation of your input stream at the font level, you would
actually tell ConTeXt that you are handling Latin characters with a
special appearance (that the font takes care of), so for example, the
underlying text in a PDF would be a stream of Latin characters, and
copying-and-pasting would yield Latin characters, not Greek.  That is
not what you want here: you want your "a" to be understood as "alpha"
and your "less-than acute-sign w vertical-bar" to be considered an
"omega with dasia, oxia and subscribed iota".  Nor should you think of
these transformations as a collection of ligatures (which act at the
font level), but rather as a text encoding, just like UTF-8 is an
encoding of the Unicode characters: in UTF-8 the byte sequence
"hexadecimal byte E1, hexadecimal byte BC, hexadecimal byte 80" is the
coding for the Unicode character U+1F00 GREEK SMALL LETTER ALPHA WITH PSILI,
and in the Babel input scheme for Ancient Greek the same character is
encoded with the byte sequence "hexadecimal byte 3E [ASCII '>'],
hexadecimal byte 61 [ASCII 'a']".
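
  The two encodings of U+1F00 can be compared byte for byte.  This is a
small Python sketch for illustration only (nothing here is ConTeXt code);
the Babel spelling is taken from the rule "sub greater a by uni1F00" in
greek-babel.fea, where the psili is typed as '>'.

```python
import unicodedata

alpha_psili = "\u1F00"

# The same character under two different encodings.
utf8_bytes = alpha_psili.encode("utf-8")
babel_bytes = ">a".encode("ascii")  # Babel spelling, as in greek-babel.fea

print(unicodedata.name(alpha_psili))  # GREEK SMALL LETTER ALPHA WITH PSILI
print(utf8_bytes.hex(" "))            # e1 bc 80
print(babel_bytes.hex(" "))           # 3e 61
```

  In both cases a decoder turns the bytes back into the single character
U+1F00; no font is involved at that stage.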

  Of course in the past, these transformations were handled at the font
level and sequences like "< a" were actually ligatures, because that was
all we had (and copy-pasting from a PDF was, mostly, doomed to fail); but
we should not persist in that use now that we can treat them as real Unicode
characters.

  As for your other question in your original message from September 1st
(remapping single characters, for example U+03C3 to U+03F2), I have to
say first that I'm not very comfortable commenting on it since I'm not
quite sure what the issues are here; it may be that you have a simple
variant of some character, and this you should handle at font level
(some glyph being transformed into some other one); but if I am to judge
by the very example you gave, I would deem this should be a part of your
input regime: indeed, if every sigma is to be mapped to lunate sigma,
then it probably means that the lunate sigmas are part of your character
stream (even if you didn't input them directly).  But I really can't give
any general advice here, especially because I don't actually know what a
lunate sigma really is ;-)  You would have to decide for yourself as a
specialist of Greek if you're dealing with really different characters
or simple font variants; in the former case you should handle the
transformation as a part of your regime; in the latter, by defining a
font feature like Hans demonstrated.
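
  If it does turn out to be a matter of characters, the remapping is a
plain one-to-one substitution on the text.  A minimal sketch in Python,
for illustration only; SIGMA_TO_LUNATE and to_lunate are hypothetical
names, and only the single mapping from the original question (U+03C3 to
U+03F2) is included.

```python
# Character-level remapping, applied to the text before any font is involved.
# SIGMA_TO_LUNATE and to_lunate are hypothetical names for this sketch.
SIGMA_TO_LUNATE = str.maketrans({
    "\u03C3": "\u03F2",  # GREEK SMALL LETTER SIGMA -> GREEK LUNATE SIGMA SYMBOL
})


def to_lunate(text):
    """Return text with every small sigma replaced by a lunate sigma."""
    return text.translate(SIGMA_TO_LUNATE)
```

  With this approach the lunate sigmas end up in the character stream
itself, so they would also be what a reader gets when copying from the
PDF; the font-feature route would leave ordinary sigmas there.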

  But for now, as long as it is understood that font tricks aren't the
general solution for the problem at stake, I would like to demonstrate
that it is still possible to do everything at font level :-)

  If you have a look at the attached greek-babel.tex (and the features
definition file greek-babel.fea) you will see that (almost) everything
is taken care of using Opentype substitutions.  You need Bosporos and
GFS Baskerville to compile the file; by the way, the line with GFS
Baskerville is a further proof that you shouldn't handle the
transformation at font level: can you explain why it doesn't work here?
As a complement, I also attach the Perl script which I wrote to generate
the .fea file.

        Arthur

% For Thomas Schmitz.
% Define a new Opentype feature to replace the Babel input scheme and use it
% with some polytonic Greek fonts

% Not quite complete; some rhos with breathings and accents are missing from
% the .fea file (where are they?) and the final sigma isn't accounted for.
\installfontfeature[otf][grbl]

\definefontfeature
   [greek-babel]
   [mode=node,language=dflt,script=latn,
    grbl=yes,featurefile=greek-babel.fea]

\font\grbask=name:GFSBaskerville*greek-babel at 20pt
\font\bosphoros=name:BosporosU*greek-babel at 20pt

\starttext

\catcode`\~=11

\bosphoros
Peis'istratis m'en o>~un >egkateg'hrase t~h|
>arq~h| ka`i >ap'ejane nos'hsas >ep`i Fil'onew >'rqontos,
af' o<~ou m`en kat'esth t`o pr~wton t'urannos >'eth tri'akonta
ka`i tr'ia Bi'wsas, <`a d' >en t~h| >arq~h| di'emeinen
<enos d'eonta e>'ikosi; >'efeuge g`ap t`a loip`a.

% Don't do that!
\grbask
Peis'istratis m'en o>~un >egkateg'hrase t~h|
>arq~h| ka`i >ap'ejane nos'hsas >ep`i Fil'onew >'rqontos,
af' o<~ou m`en kat'esth t`o pr~wton t'urannos >'eth tri'akonta
ka`i tr'ia Bi'wsas, <`a d' >en t~h| >arq~h| di'emeinen
<enos d'eonta e>'ikosi; >'efeuge g`ap t`a loip`a.

\stoptext

# An Opentype feature to replace the Babel input scheme

# Not quite complete; some rhos with breathings and accents are missing (where
# are they?) and the final sigma isn't accounted for.

lookup GreekBabelLookupSimple {
    lookupflag 0 ;
        sub a   by alpha ;
        sub b   by beta ;
        sub g   by gamma ;
        sub d   by delta ;
        sub e   by epsilon ;
        sub z   by zeta ;
        sub h   by eta ;
        sub j   by theta ;
        sub i   by iota ;
        sub k   by kappa ;
        sub l   by lambda ;
        sub m   by mu ;
        sub n   by nu ;
        sub x   by xi ;
        sub o   by omicron ;
        sub p   by pi ;
        sub r   by rho ;
        sub c   by sigmafinal ;
        sub s   by sigma ;
        sub t   by tau ;
        sub u   by upsilon ;
        sub f   by phi ;
        sub q   by chi ;
        sub y   by psi ;
        sub w   by omega ;
        sub A   by Alpha ;
        sub B   by Beta ;
        sub G   by Gamma ;
        sub D   by Delta ;
        sub E   by Epsilon ;
        sub Z   by Zeta ;
        sub H   by Eta ;
        sub J   by Theta ;
        sub I   by Iota ;
        sub K   by Kappa ;
        sub L   by Lambda ;
        sub M   by Mu ;
        sub N   by Nu ;
        sub X   by Xi ;
        sub O   by Omicron ;
        sub P   by Pi ;
        sub R   by Rho ;
        sub C   by Uni03C2 ;
        sub S   by Sigma ;
        sub T   by Tau ;
        sub U   by Upsilon ;
        sub F   by Phi ;
        sub Q   by Chi ;
        sub Y   by Psi ;
        sub W   by Omega ;
        sub semicolon   by periodcentered ;
} GreekBabelLookupSimple ;

lookup GreekBabelLookupMultiple {
    lookupflag 1 ;
        # sub s 'space by sigmafinal ;
        sub greater  a by uni1F00 ;
        sub greater  A by uni1F08 ;
        sub greater  e by uni1F10 ;
        sub greater  E by uni1F18 ;
        sub greater  h by uni1F20 ;
        sub greater  H by uni1F28 ;
        sub greater  i by uni1F30 ;
        sub greater  I by uni1F38 ;
        sub greater  o by uni1F40 ;
        sub greater  O by uni1F48 ;
        sub greater  u by uni1F50 ;
        # sub greater  U by uni1F58 ;
        sub greater  w by uni1F60 ;
        sub greater  W by uni1F68 ;
        sub greater grave a by uni1F02 ;
        sub greater grave A by uni1F0A ;
        sub greater grave e by uni1F12 ;
        sub greater grave E by uni1F1A ;
        sub greater grave h by uni1F22 ;
        sub greater grave H by uni1F2A ;
        sub greater grave i by uni1F32 ;
        sub greater grave I by uni1F3A ;
        sub greater grave o by uni1F42 ;
        sub greater grave O by uni1F4A ;
        sub greater grave u by uni1F52 ;
        # sub greater grave U by uni1F5A ;
        sub greater grave w by uni1F62 ;
        sub greater grave W by uni1F6A ;
        sub greater quotesingle a by uni1F04 ;
        sub greater quotesingle A by uni1F0C ;
        sub greater quotesingle e by uni1F14 ;
        sub greater quotesingle E by uni1F1C ;
        sub greater quotesingle h by uni1F24 ;
        sub greater quotesingle H by uni1F2C ;
        sub greater quotesingle i by uni1F34 ;
        sub greater quotesingle I by uni1F3C ;
        sub greater quotesingle o by uni1F44 ;
        sub greater quotesingle O by uni1F4C ;
        sub greater quotesingle u by uni1F54 ;
        sub greater quotesingle U by uni1F5C ;
        sub greater quotesingle w by uni1F64 ;
        sub greater quotesingle W by uni1F6C ;
        sub greater asciitilde a by uni1F06 ;
        sub greater asciitilde A by uni1F0E ;
        sub greater asciitilde e by uni1F16 ;
        sub greater asciitilde E by uni1F1E ;
        sub greater asciitilde h by uni1F26 ;
        sub greater asciitilde H by uni1F2E ;
        sub greater asciitilde i by uni1F36 ;
        sub greater asciitilde I by uni1F3E ;
        sub greater asciitilde o by uni1F46 ;
        sub greater asciitilde O by uni1F4E ;
        sub greater asciitilde u by uni1F56 ;
        sub greater asciitilde U by uni1F5E ;
        sub greater asciitilde w by uni1F66 ;
        sub greater asciitilde W by uni1F6E ;
        sub less  a by uni1F01 ;
        sub less  A by uni1F09 ;
        sub less  e by uni1F11 ;
        sub less  E by uni1F19 ;
        sub less  h by uni1F21 ;
        sub less  H by uni1F29 ;
        sub less  i by uni1F31 ;
        sub less  I by uni1F39 ;
        sub less  o by uni1F41 ;
        sub less  O by uni1F49 ;
        sub less  u by uni1F51 ;
        sub less  U by uni1F59 ;
        sub less  w by uni1F61 ;
        sub less  W by uni1F69 ;
        sub less grave a by uni1F03 ;
        sub less grave A by uni1F0B ;
        sub less grave e by uni1F13 ;
        sub less grave E by uni1F1B ;
        sub less grave h by uni1F23 ;
        sub less grave H by uni1F2B ;
        sub less grave i by uni1F33 ;
        sub less grave I by uni1F3B ;
        sub less grave o by uni1F43 ;
        sub less grave O by uni1F4B ;
        sub less grave u by uni1F53 ;
        sub less grave U by uni1F5B ;
        sub less grave w by uni1F63 ;
        sub less grave W by uni1F6B ;
        sub less quotesingle a by uni1F05 ;
        sub less quotesingle A by uni1F0D ;
        sub less quotesingle e by uni1F15 ;
        sub less quotesingle E by uni1F1D ;
        sub less quotesingle h by uni1F25 ;
        sub less quotesingle H by uni1F2D ;
        sub less quotesingle i by uni1F35 ;
        sub less quotesingle I by uni1F3D ;
        sub less quotesingle o by uni1F45 ;
        sub less quotesingle O by uni1F4D ;
        sub less quotesingle u by uni1F55 ;
        sub less quotesingle U by uni1F5D ;
        sub less quotesingle w by uni1F65 ;
        sub less quotesingle W by uni1F6D ;
        sub less asciitilde a by uni1F07 ;
        sub less asciitilde A by uni1F0F ;
        sub less asciitilde e by uni1F17 ;
        sub less asciitilde E by uni1F1F ;
        sub less asciitilde h by uni1F27 ;
        sub less asciitilde H by uni1F2F ;
        sub less asciitilde i by uni1F37 ;
        sub less asciitilde I by uni1F3F ;
        sub less asciitilde o by uni1F47 ;
        sub less asciitilde O by uni1F4F ;
        sub less asciitilde u by uni1F57 ;
        sub less asciitilde U by uni1F5F ;
        sub less asciitilde w by uni1F67 ;
        sub less asciitilde W by uni1F6F ;
        sub grave a by uni1F70 ;
        sub quotesingle a by uni1F71 ;
        sub grave e by uni1F72 ;
        sub quotesingle e by uni1F73 ;
        sub grave h by uni1F74 ;
        sub quotesingle h by uni1F75 ;
        sub grave i by uni1F76 ;
        sub quotesingle i by uni1F77 ;
        sub grave o by uni1F78 ;
        sub quotesingle o by uni1F79 ;
        sub grave u by uni1F7A ;
        sub quotesingle u by uni1F7B ;
        sub grave w by uni1F7C ;
        sub quotesingle w by uni1F7D ;
        sub grave A by uni1FBA ;
        sub quotesingle A by uni1FBB ;
        sub grave E by uni1FC8 ;
        sub quotesingle E by uni1FC9 ;
        sub grave H by uni1FCA ;
        sub quotesingle H by uni1FCB ;
        sub grave I by uni1FDA ;
        sub quotesingle I by uni1FDB ;
        sub grave U by uni1FEA ;
        sub quotesingle U by uni1FEB ;
        sub grave W by uni1FFA ;
        sub quotesingle W by uni1FFB ;
        sub greater  a bar by uni1F80 ;
        sub greater  A bar by uni1F88 ;
        sub greater  h bar by uni1F90 ;
        sub greater  H bar by uni1F98 ;
        sub greater  w bar by uni1FA0 ;
        sub greater  W bar by uni1FA8 ;
        sub greater grave a bar by uni1F82 ;
        sub greater grave A bar by uni1F8A ;
        sub greater grave h bar by uni1F92 ;
        sub greater grave H bar by uni1F9A ;
        sub greater grave w bar by uni1FA2 ;
        sub greater grave W bar by uni1FAA ;
        sub greater quotesingle a bar by uni1F84 ;
        sub greater quotesingle A bar by uni1F8C ;
        sub greater quotesingle h bar by uni1F94 ;
        sub greater quotesingle H bar by uni1F9C ;
        sub greater quotesingle w bar by uni1FA4 ;
        sub greater quotesingle W bar by uni1FAC ;
        sub greater asciitilde a bar by uni1F86 ;
        sub greater asciitilde A bar by uni1F8E ;
        sub greater asciitilde h bar by uni1F96 ;
        sub greater asciitilde H bar by uni1F9E ;
        sub greater asciitilde w bar by uni1FA6 ;
        sub greater asciitilde W bar by uni1FAE ;
        sub less  a bar by uni1F81 ;
        sub less  A bar by uni1F89 ;
        sub less  h bar by uni1F91 ;
        sub less  H bar by uni1F99 ;
        sub less  w bar by uni1FA1 ;
        sub less  W bar by uni1FA9 ;
        sub less grave a bar by uni1F83 ;
        sub less grave A bar by uni1F8B ;
        sub less grave h bar by uni1F93 ;
        sub less grave H bar by uni1F9B ;
        sub less grave w bar by uni1FA3 ;
        sub less grave W bar by uni1FAB ;
        sub less quotesingle a bar by uni1F85 ;
        sub less quotesingle A bar by uni1F8D ;
        sub less quotesingle h bar by uni1F95 ;
        sub less quotesingle H bar by uni1F9D ;
        sub less quotesingle w bar by uni1FA5 ;
        sub less quotesingle W bar by uni1FAD ;
        sub less asciitilde a bar by uni1F87 ;
        sub less asciitilde A bar by uni1F8F ;
        sub less asciitilde h bar by uni1F97 ;
        sub less asciitilde H bar by uni1F9F ;
        sub less asciitilde w bar by uni1FA7 ;
        sub less asciitilde W bar by uni1FAF ;
        sub grave a bar by uni1FB2 ;
        sub a bar by uni1FB3 ;
        sub quotesingle a bar by uni1FB4 ;
        sub grave h bar by uni1FC2 ;
        sub h bar by uni1FC3 ;
        sub quotesingle h bar by uni1FC4 ;
        sub grave w bar by uni1FF2 ;
        sub w bar by uni1FF3 ;
        sub quotesingle w bar by uni1FF4 ;
        sub asciitilde a by uni1FB6 ;
        sub asciitilde a bar by uni1FB7 ;
        sub asciitilde h by uni1FC6 ;
        sub asciitilde h bar by uni1FC7 ;
        sub asciitilde w by uni1FF6 ;
        sub asciitilde w bar by uni1FF7 ;
        sub greater r by uni1FE4 ;
        sub less r by uni1FE5 ;
        sub less R by uni1FEC ;
} GreekBabelLookupMultiple ;

feature grbl {

    script DFLT ;
        language dflt ;
            lookup GreekBabelLookupMultiple ;
            lookup GreekBabelLookupSimple ;

    script latn;
        language dflt ;
            lookup GreekBabelLookupMultiple ;
            lookup GreekBabelLookupSimple ;
} grbl ;


#!/usr/bin/perl -W
# Outputs GSUB rules for replacing Babel-inputted greek characters with their
# Unicode value.
# In Adobe Feature Language, suitable for use in Fontlab's .fea files.

use strict ;
use utf8 ;

# Character types: breathings, accents, vowels
# The void string is considered an accent for convenience with breathings
my %charmask ;
my $charshift = 8 ;
my @breathings = ('greater', 'less') ;
my @accents = ('', 'grave', 'quotesingle', 'asciitilde') ;
my @vowels = ('a', 'e', 'h', 'i', 'o', 'u', 'w') ;

# Unicode masks for characters with breathings
$charmask{''} = 0 ;
$charmask{'greater'} = 0 ;
$charmask{'less'} = 1 ;
$charmask{'grave'} = 2 ;
$charmask{'quotesingle'} = 4 ;
$charmask{'asciitilde'} = 6 ;
$charmask{'a'} = 0x1F00 ;
$charmask{'e'} = 0x1F10 ;
$charmask{'h'} = 0x1F20 ;
$charmask{'i'} = 0x1F30 ;
$charmask{'o'} = 0x1F40 ;
$charmask{'u'} = 0x1F50 ;
$charmask{'w'} = 0x1F60 ;

# Local variables
my $breathing ; my $accent ; my $vowel ;
my $uchar ;

# First the U+1F00–U+1F6F sequence: breathing accent vowel
# We compile the Unicode code points by simply ORing the mask of each element
# Note that some of these characters actually don't exist!
# But it was easier this way (we can always edit the output afterward)
foreach $breathing (@breathings)
{
  foreach $accent (@accents)
  {
    foreach $vowel (@vowels)
    {
      # Space cadet input scheme ;-)
      $uchar = $charmask{$breathing} | $charmask{$accent} | $charmask{$vowel} ;
      printf "sub $breathing $accent $vowel by uni%04X ;\n", $uchar ;

      # Uppercase characters: the same shifted 8.
      $uchar = $charmask{$breathing} | $charmask{$accent}
        | $charmask{$vowel} | $charshift ;
      printf "sub $breathing $accent %s by uni%04X ;\n", uc($vowel), $uchar ;
    }
  }
}

# The U+1F7x range: lowercase vowels with only one accent.
# I have no idea why Unicode decided to put them there ... (especially seen as
# the uppercase vowels are somewhere else, and in an even more clumsy
# arrangement).

# We have to change the masks
$charmask{'grave'} = 0 ;
$charmask{'quotesingle'} = 1 ;
$charmask{'a'} = 0x1F70 ;
$charmask{'e'} = 0x1F72 ;
$charmask{'h'} = 0x1F74 ;
$charmask{'i'} = 0x1F76 ;
$charmask{'o'} = 0x1F78 ;
$charmask{'u'} = 0x1F7A ;
$charmask{'w'} = 0x1F7C ;

foreach $vowel (@vowels)
{
  foreach $accent ('grave', 'quotesingle')
  {
    $uchar = $charmask{$accent} | $charmask{$vowel} ;
    printf "sub $accent $vowel by uni%04X ;\n", $uchar ;
  }
}

# As announced before, the uppercase counterparts of these 14 characters are in
# a delightfully crappy mess. Simply output them one by one.
print "sub grave A by uni1FBA ;\n" ;
print "sub quotesingle A by uni1FBB ;\n" ;
print "sub grave E by uni1FC8 ;\n" ;
print "sub quotesingle E by uni1FC9 ;\n" ;
print "sub grave H by uni1FCA ;\n" ;
print "sub quotesingle H by uni1FCB ;\n" ;
print "sub grave I by uni1FDA ;\n" ;
print "sub quotesingle I by uni1FDB ;\n" ;
print "sub grave U by uni1FEA ;\n" ;
print "sub quotesingle U by uni1FEB ;\n" ;
print "sub grave W by uni1FFA ;\n" ;
print "sub quotesingle W by uni1FFB ;\n" ;

# U+1F80–U+1FAF: characters with subscribed iotas and breathings.
# We have to change the masks once again.
$charmask{'grave'} = 2 ;
$charmask{'quotesingle'} = 4 ;
$charmask{'a'} = 0x1F80 ;
$charmask{'h'} = 0x1F90 ;
$charmask{'w'} = 0x1FA0 ;
foreach $breathing (@breathings)
{
  foreach $accent (@accents)
  {
    foreach $vowel ('a', 'h', 'w') # Only these three vowels!
    {
      $uchar = $charmask{$breathing} | $charmask{$accent} | $charmask{$vowel} ;
      printf "sub $breathing $accent $vowel bar by uni%04X ;\n", $uchar ;

      # Uppercase counterparts
      $uchar = $charmask{$breathing} | $charmask{$accent}
        | $charmask{$vowel} | $charshift ;
      printf "sub $breathing $accent %s bar by uni%04X ;\n", uc($vowel), $uchar ;
    }
  }
}

# And finally, the characters with subscribed iotas but without breathings.
# Only nine of them, write them one by one.
print "sub grave a bar by uni1FB2 ;\n" ;
print "sub a bar by uni1FB3 ;\n" ;
print "sub quotesingle a bar by uni1FB4 ;\n" ;
print "sub grave h bar by uni1FC2 ;\n" ;
print "sub h bar by uni1FC3 ;\n" ;
print "sub quotesingle h bar by uni1FC4 ;\n" ;
print "sub grave w bar by uni1FF2 ;\n" ;
print "sub w bar by uni1FF3 ;\n" ;
print "sub quotesingle w bar by uni1FF4 ;\n" ;

# And some more with perispomeni ...
print "sub asciitilde a by uni1FB6 ;\n" ;
print "sub asciitilde a bar by uni1FB7 ;\n" ;
print "sub asciitilde h by uni1FC6 ;\n" ;
print "sub asciitilde h bar by uni1FC7 ;\n" ;
print "sub asciitilde w by uni1FF6 ;\n" ;
print "sub asciitilde w bar by uni1FF7 ;\n" ;

# Rhos
print "sub greater r by uni1FE4 ;\n" ;
print "sub less r by uni1FE5 ;\n" ;
print "sub less R by uni1FEC ;\n" ;

# We leave some over but that should already be useful. Enjoy!
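
The mask arithmetic for the U+1F00 to U+1F6F range can be checked
independently of the Perl script.  The following is a minimal sketch in
Python, for illustration only: it reuses the same mask values, and the
names code_point and is_assigned are hypothetical.  It also lists which of
the generated code points are not assigned in Unicode, so that the
corresponding rules can be edited out of the .fea file.  The result
depends on the Unicode data shipped with the Python version used.

```python
import unicodedata

# Same masks as in the Perl script (first loop, U+1F00 to U+1F6F).
BREATHINGS = {"greater": 0, "less": 1}
ACCENTS = {"": 0, "grave": 2, "quotesingle": 4, "asciitilde": 6}
VOWELS = {"a": 0x1F00, "e": 0x1F10, "h": 0x1F20, "i": 0x1F30,
          "o": 0x1F40, "u": 0x1F50, "w": 0x1F60}
UPPERCASE = 8


def code_point(breathing, accent, vowel, upper=False):
    """OR the masks together, as the Perl script does."""
    cp = BREATHINGS[breathing] | ACCENTS[accent] | VOWELS[vowel]
    return (cp | UPPERCASE) if upper else cp


def is_assigned(cp):
    """True if the code point has a character name in the Unicode database."""
    return unicodedata.name(chr(cp), None) is not None


# Code points the loops generate although no such character exists.
unassigned = sorted(
    code_point(b, a, v, up)
    for b in BREATHINGS
    for a in ACCENTS
    for v in VOWELS
    for up in (False, True)
    if not is_assigned(code_point(b, a, v, up))
)

print(" ".join("uni%04X" % cp for cp in unassigned))
```

For example, code_point("less", "quotesingle", "w") gives 0x1F65, which
matches the rule "sub less quotesingle w by uni1F65" in the .fea file.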

