Hi Arthur,

First of all: thank you so much for your time and your expertise!
Your reply and your scripts really make things a lot clearer for me;
this is a huge step forward! I'll have to experiment and think more
about it; here are just a few reactions to some of your remarks:

On Sep 13, 2007, at 3:15 AM, Arthur Reutenauer wrote:

>       Hello Thomas,
>
>   I was waiting for someone else to answer your questions because I
> had no clue how to address them even if I was interested; but now I do,
> thanks to Hans' reply:
>
>   For your general problem you need to define a new regime that will
> map each relevant character sequence to the corresponding Unicode
> character.  That is, you inform ConTeXt that the character stream it
> sees is actually a way of coding another set of characters and that it
> can forget the original stream.  This treatment should be done before
> any sort of font property intervenes, because it does not depend on the
> appearance of the typeset text.  That's what regimes are for.

I agree that this would probably be the cleanest solution: since
LuaTeX has Unicode support, map everything to the corresponding
Unicode characters. This would also make hyphenation easier to achieve.
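
To make the idea concrete, here is a minimal Python sketch of such a
sequence-to-character mapping. It is not ConTeXt's regime mechanism,
only an illustration of the greedy longest-match lookup that a
non-one-to-one mapping needs. The table is a tiny hypothetical subset,
the names are made up for the example, and I am assuming the usual
Babel conventions (`>` psili, `<` dasia, `'` oxia, `|` iota subscript).

```python
# Hypothetical subset of a Babel-style ASCII-to-Unicode table.
BABEL_TO_UNICODE = {
    "a": "\u03b1",      # GREEK SMALL LETTER ALPHA
    ">a": "\u1f00",     # alpha with psili
    "<a": "\u1f01",     # alpha with dasia
    "w": "\u03c9",      # GREEK SMALL LETTER OMEGA
    "<'w|": "\u1fa5",   # omega with dasia, oxia and ypogegrammeni
}

# Longest key length, so the scan never looks further ahead than needed.
_MAX_LEN = max(len(key) for key in BABEL_TO_UNICODE)


def babel_to_unicode(text):
    """Replace known sequences by Unicode characters, longest match first."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate at position i, then shorter ones.
        for length in range(min(_MAX_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in BABEL_TO_UNICODE:
                out.append(BABEL_TO_UNICODE[chunk])
                i += length
                break
        else:
            # No entry matches here: pass the character through unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(babel_to_unicode("<'w|"))
```

Trying the longest sequence first matters: otherwise the "w" inside
"<'w|" would be turned into a plain omega before the accents were seen.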

>
>   Now I turn to Hans to give us guidelines on how to define an
> advanced regime in Mark IV: Hans, what we need here is to replace
> sequences of characters by other characters, so the mapping is not
> one-to-one and it's more complicated than simple regimes defined by a
> table lookup; but I guess all we have to do is write a Lua function
> that we could plug into the input stream reading routine (just like
> other regimes work).
>
>   As far as the rest of Hans' reply is concerned (OpenType features and
> such), I would like to add that it is a very interesting and
> fascinating thing to do, but definitely not what you want here, for a
> lot of reasons: OpenType features can be used to alter the appearance
> of the text, but not the nature of the characters themselves.  That is,
> if you did the transformation of your input stream at the font level,
> you would actually tell ConTeXt that you are handling Latin characters
> with a special appearance (that the font takes care of), so for
> example, the underlying text in a PDF would be a stream of Latin
> characters, and copying-and-pasting would yield Latin characters, not
> Greek.

The question of copy-and-paste is one of the big mysteries, and I
have no clue why it works in some cases but not in others. Right
now, on my system (OS X 10.4), only Adobe Reader 8.0 does copy-paste
correctly, and it does so whether I use Babel or Unicode input.
Never touch a running system: I just take this as some sort of
divine favor and leave it at that...

> That is not what you want here: you want your "a" to be understood as
> "alpha" and your "less-than acute-sign w vertical-bar" to be considered
> an "omega with dasia, oxia and subscribed iota".  Nor should you think
> of these transformations as a collection of ligatures (which act at the
> font level), but rather as a text encoding, just like UTF-8 is an
> encoding of the Unicode characters: in UTF-8 the byte sequence
> "hexadecimal byte E1, hexadecimal byte BC, hexadecimal byte 80" is the
> coding for the Unicode character U+1F00 GREEK SMALL LETTER ALPHA WITH
> PSILI, and in the Babel input scheme for Ancient Greek the same
> character is encoded with the byte sequence "hexadecimal byte 3E [ASCII
> '>'], hexadecimal byte 61 [ASCII 'a']".

Yes, that's crystal clear. It would also take care of another  
problem: in the input stream, you know exactly which character  
sequence translates to what. On the font level, legacy fonts  
sometimes have their own ideas about where to put certain glyphs.
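
The byte-level correspondence is easy to check in Python. This is only
a small sketch; taking `>a` as the Babel sequence for alpha with psili
is my assumption, based on the usual convention that `>` marks the
smooth breathing and `<` the rough one.

```python
import unicodedata

alpha_psili = "\u1f00"

# Official Unicode name of the character.
name = unicodedata.name(alpha_psili)

# The same character in two encodings: UTF-8 and Babel-style ASCII.
utf8_hex = alpha_psili.encode("utf-8").hex()   # three bytes
babel_hex = ">a".encode("ascii").hex()         # two bytes

print(name, utf8_hex, babel_hex)
```

Both byte strings stand for the same abstract character; only the
encoding differs, which is why the conversion belongs in the input
stream and not in the font.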

>
>   Of course in the past, these transformations were handled at the font
> level and sequences like "< a" were actually ligatures, because that
> was all we had (and copy-pasting from a PDF was, mostly, doomed to
> fail); but we should not persist in that use now that we can treat
> them as real Unicode characters.

Well yes, but see above.

>
>   As for your other question in your original message from September
> 1st (remapping single characters, for example U+03C3 to U+03F2), I have
> to say first that I'm not very comfortable commenting on it since I'm
> not quite sure what the issues are here; it may be that you have a
> simple variant of some character, and this you should handle at font
> level (some glyph being transformed into some other one); but if I am
> to judge by the very example you gave, I would deem this should be a
> part of your input regime: indeed, if every sigma is to be mapped to
> lunate sigma, then it probably means that the lunate sigmas are part of
> your character stream (even if you didn't input them directly).  But I
> really can't give any general advice here, especially because I don't
> actually know what a lunate sigma really is ;-)  You would have to
> decide for yourself as a specialist of Greek if you're dealing with
> really different characters or simple font variants; in the former case
> you should handle the transformation as a part of your regime; in the
> latter, by defining a font feature like Hans demonstrated.

I guess that different sorts of users would respond differently. In  
Unicode, there's a different slot for some alternate characters, so  
the Unicode standard really considers them different characters. For  
the classicist, a sigma is a sigma, and the fact that it can be  
rendered as a "lunate" or a "normal" sigma is irrelevant. For me,  
this makes more sense, so I would support this on the font level.
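
For comparison, the character-level alternative (U+03C3 to U+03F2)
would amount to a plain one-to-one table. A minimal Python sketch,
where the names are hypothetical and the extra entries for final sigma
and capital sigma are my own assumption about what one would want:

```python
# One-to-one remapping of sigma forms to lunate sigma.
LUNATE_SIGMA = str.maketrans({
    "\u03c3": "\u03f2",  # small sigma       -> lunate sigma symbol
    "\u03c2": "\u03f2",  # small final sigma -> lunate sigma symbol
    "\u03a3": "\u03f9",  # capital sigma     -> capital lunate sigma symbol
})


def to_lunate(text):
    """Return text with every sigma replaced by its lunate form."""
    return text.translate(LUNATE_SIGMA)


# sigma, omicron, phi, omicron with tonos, final sigma
print(to_lunate("\u03c3\u03bf\u03c6\u03cc\u03c2"))
```

Doing it this way changes the stored text itself, so searching for an
ordinary sigma would no longer find these words; a font feature leaves
the characters alone and only swaps the glyph.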

>
>   But for now, as long as it is understood that font tricks aren't the
> general solution for the problem at stake, I would like to demonstrate
> that it is still possible to do everything at font level :-)
>
>   If you have a look at the attached greek-babel.tex (and the features
> definition file greek-babel.fea) you will see that (almost) everything
> is taken care of using OpenType substitutions.  You need Bosporos and
> GFS Baskerville to compile the file; by the way, the line with GFS
> Baskerville is a further proof that you shouldn't handle the
> transformation at font level: can you explain why it doesn't work here?
> As a complement, I also attach the Perl script which I wrote to
> generate the .fea file.

Wonderful! I will look carefully at these files. I was playing
with Perl and Python all day yesterday for another problem, so I'm
very much looking forward to studying your script.

Thanks so much, and all best

Thomas
