Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-18 Thread Egmont Koblinger via Unicode
On Sun, Feb 17, 2019 at 1:59 PM Philippe Verdy  wrote:

> Resist this idea, I've not been impolite.

I didn't say a word about you being impolite. I said I might be
impolite for not wishing to continue this discussion in that
direction.

> I just want to show you that terminals are legacy environments

You might have missed the thread's opening mail where I mentioned that
I've been developing a terminal emulator for five years. So I'm not
sure what exactly you want to show me about how much of a legacy
environment it is; I think I know that perfectly well.

> that are far behind what is needed for proper internationalization

For many languages (or should I say scripts) internationalization is
pretty well solved in terminals. For others, requiring LTR complex
rendering, so-so. For RTL scripts it's a straight disaster: an
application can't even count on the letters of a word showing up in
the expected order, no matter what it does.

My work fixes the latter only, within(!) the limitations of this
legacy environment. I don't find it feasible to get rid of this legacy
(the concept of strict grid), and I find it a waste of time to ponder
about it.

Not sure why after about 200 mails on the topic, I still have a hard
time getting this message through. Seems to me that folks here on the
Unicode list want everything to be perfect for all the scripts at once
and not compromise to the slightest bit; and don't really appreciate
work that only offers partial improvement due to a special context's
constraints. This is something I didn't expect when I posted to this
list.

At this point I think I've gathered all the actionable positive
feedback I could (two issues: one is that shaping needs to be done
differently, and the other one is that the paragraph direction should
be detected on larger chunks of data (at least optionally) – thanks
again for them, I'll rework my spec accordingly). For all the rest,
irrelevant and hopeless stuff, like switching to proportional fonts,
IMO it's high time we let this thread end here.


cheers,
egmont



Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-17 Thread Philippe Verdy via Unicode
On Fri, 8 Feb 2019 at 13:56, Egmont Koblinger wrote:

> Philippe, I hate to say it, but at the risk of being impolite, I just
> have to.
>

Resist this idea, I've not been impolite. I just want to show you that
terminals are legacy environments that are far behind what is needed for
proper internationalization. When I raised the problem of monospaced
fonts, I brought up "dualspace" fonts, which are already used in
legacy terminals to solve practical problems (there are even data in
the UCD about them). Dualspace is an excellent solution that should be
extended beyond CJK contexts (for example to emojis and various
South Asian scripts).


Re: Bidi paragraph direction in terminal emulators

2019-02-14 Thread Philippe Verdy via Unicode
On Tue, 12 Feb 2019 at 14:16, Egmont Koblinger via Unicode <
unicode@unicode.org> wrote:

> > There is nothing magic about the grid of cells, and once you introduce
> new escape sequences, you might as well truly modernise the terminal.
>
> The magic about the grid of cells is all the software that was built
> up with this assumption during the last couple of decades.
>

The minimum to support (which is already used in VT* terminals) needs to
include "dualspace" rendering (i.e. characters rendered in one or two
cells), widely used for CJK (half-width and fullwidth characters). If the
terminal has square cells, only one variant is needed (i.e. a monospace
cell), but common terminals today use rectangular cells.

Thankfully, Unicode has properties for that, allowing controls to select
the appropriate variant (plus legacy encodings for parts of
Latin/Greek/Cyrillic). But an extension would be needed for other
scripts, along with a control in the VT* protocol to select the variant
(which would take effect in terminals configured in dualspace rendering
mode, normally the default mode in East Asia). This should apply to other
South Asian scripts and most emojis; adding such a control would extend
dualspace rendering to cover the whole of Unicode (without having to use
the few compatibility characters specifically encoded at the end of the
BMP).

Unfortunately, Unicode still does not have any standard variation selector
(or other format control) to control that, at least at cluster level.
This would mean adding some custom escape sequence to the VT* protocol
(using the compatibility characters for half-width/fullwidth should be
deprecated). That would also be more efficient than having to use a
variation selector or format control after each character (a solution
that works for isolated characters), or having to configure the terminal
in ugly monospace mode (typically with 40 cells per line instead of 80),
which is only fine for CJK or for output to old analog TVs with very low
vertical resolution (below ~400 pixels, with cells of about 8x8 pixels at
most), such as old CGA, Teletext, and early 8-bit personal computers.


Re: Bidi paragraph direction in terminal emulators

2019-02-13 Thread Egmont Koblinger via Unicode
On Tue, Feb 12, 2019 at 9:35 PM Richard Wordingham via Unicode
 wrote:

> Bash already seems to handle proportional fonts quite well when run
> under Emacs 'M-x shell',

Having never used bash inside Emacs's shell, here's my experience
after about a minute of trying it:

Cursor keys allow you to walk back to the prompt, backspace allows you
to delete the prompt, typing letters lets you modify the prompt... Not
something that I consider sensible behavior.

If I do so, I have no idea what the executed command will be. Coloring
gives some clue, but isn't always reliable. My prompt is blue, the
text I type after that is black. I type one letter and then press
Ctrl-T to transpose the last two letters (the trailing space of my
prompt, and the newly typed letter). The newly typed letter is black.
I press Enter, this one-letter command isn't executed, and becomes
blue.

I feel magnitudes safer in standard bash where I know it doesn't allow
me to walk back to the prompt, only allows me to edit whatever I'm
trying to execute.

I have not studied how this behavior is implemented, but as per [1] as
well as the behavior I experience, it seems that a lot of bash's
behavior wrt. line editing is moved to Emacs itself. Pretty much none
of my preferred shortcuts work as they do in native bash, something
I'm not happy about either.

I've no idea how this (external editing) would be expected to be the
generic behavior when there's no Emacs (no external editor) in the
game, plus a whole bunch of other utilities are expected to run (ones
that fail big time in Emacs's M-x shell, or even refuse to start up).

[1] https://www.gnu.org/software/emacs/manual/html_node/emacs/Shell-Mode.html


Re: Bidi paragraph direction in terminal emulators

2019-02-12 Thread Richard Wordingham via Unicode
On Tue, 12 Feb 2019 13:50:00 +0100
Egmont Koblinger via Unicode  wrote:

> For
> starters, I'd love to see a shell with interactive line editing (like
> bash, zsh),...

Bash already seems to handle proportional fonts quite well when run
under Emacs 'M-x shell', which is more than can be said for bash on
Gnome-terminal or an Emacs terminal!  In the latter two, it cannot
synchronise text display and cursor position. 

Richard.


Re: Bidi paragraph direction in terminal emulators

2019-02-12 Thread Egmont Koblinger via Unicode
Hi Elias,

> For all the willingness to come up with ways to modernise the terminal, 
> you've only spoken about trying to shoehorn rtl text into the vt102 basic
> terminal.

Yes, addressing BiDi was the exact thing that I did now. What's wrong with that?

I can't address all the imperfections at once. If you take a look at
VTE's changelog, you'll see that I've done a lot more than this, and
chances are this won't be my last improvement either.

> What I mean is that if you're willing to go as far as introducing new escape
> codes to allow applications to better control the behaviour of this one 
> feature, why do you stop there? Why still limit yourself to the bonds of 
> vt102?

Did I say I'll stop here? No, I presented one step, without saying
anything about what might be the next one I tackle. (Okay, I drafted
out some ideas for continuing this work, and I said things about what
will definitely _not_ be the next step, as far as I'm concerned.)

> Once you take that first step towards the new control codes, why not simply 
> come up with a new scheme? Why not let me do:
>
> TERM=newfancything
>
> And then I'd have a system that supports everything I need: variable width
> fonts, proper rtl text, pixel-precise character positioning, all the colours, 
> inline graphics, etc.

Because this would create a brand new world where practically every
application has to be heavily adjusted, if not built up from scratch
(e.g. for ncurses, I'd expect that a new replacement would have to be
designed and created).

Because this is not solely an engineering kind of task, but rather
something that would need buy-in from a critical set of people (the
maintainers of all these libs and apps, and the other popular
terminals), which I find unlikely to get, given that for most of these
apps the current platform is good enough, and something new would add
a significant amount of extra burden for marginal benefits.

Because, even if everyone supported the idea, the required amount of
design and implementation work would be magnitudes bigger than for
BiDi.

Because I'm doing one thing at a time. And honestly, just because I
came here to announce my work that addresses _one_ thing, I really
don't find it a fair question to ask why I didn't suddenly address
magnitudes more than that.

Because I'm doing this as a hobby project, not as a paid job. If
someone offers me a job to do this, we can discuss it.

> There is nothing magic about the grid of cells, and once you introduce new 
> escape sequences, you might as well truly modernise the terminal.

The magic about the grid of cells is all the software that was built
up with this assumption during the last couple of decades.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators

2019-02-12 Thread Egmont Koblinger via Unicode
Hi Philippe,

> The monospace restriction is a strong limitation: but then I don't see why a
> "terminal" could not handle fonts with variable metrics, and why it must be 
> modeled only as a regular grid of rectangular cells (all of equal size) 
> containing only one "character" (or cluster?).

Because this is what a "terminal" currently is, this is one of the
basic assumptions around which gazillions of libraries and
applications were built up.

Just one example: A utility might query the width, let's say it's 80
columns. Then it can print either 81 "i"s, or 81 "w"s, and in both
cases it can be sure that the last one will be aligned exactly below
the first one.
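
The arithmetic behind this example can be sketched as follows. This is only an illustration of the strict-grid assumption, not code taken from any terminal or library: the name `cell_position` is made up here, and the sketch assumes that every character occupies exactly one cell and that output wraps automatically at the right margin.

```python
def cell_position(index, columns=80):
    """Return (row, column) of the index-th printed character (0-based).

    Assumes a strict grid: every character takes exactly one cell and
    the output wraps to the next row at the right margin.
    """
    return divmod(index, columns)


# The 81st character has index 80. It lands in row 1, column 0, that is,
# directly below the first character, no matter whether the text
# consists of "i"s or "w"s.
first = cell_position(0)
eighty_first = cell_position(80)
```

With a proportional font the column of the 81st character would depend on which letters were printed, which is exactly the guarantee that existing applications rely on.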

You sure can change this. But then you'll have to heavily adjust the
behavior of all the screen drawing libraries and all the applications
that use these libraries or do their own screen handling. It's out of
the scope of my work to do anything like this. If you feel like it, I
encourage you to go ahead, put your work in it, and present a proof of
concept.

> So using controls, you would try to mimic again what HTML already provides 
> you for free (and without complex specifications and redevelopment).

Show me that "without complex specifications and redevelopment"
because all I see is the need to heavily rewrite plenty of libs and
tools that were created and continuously developed during the last few
decades. I don't really see this approach feasible. Feel free to prove
me wrong by presenting software that works on top of the redefined
terminal emulator concept, at least on a proof of concept level. For
starters, I'd love to see a shell with interactive line editing (like
bash, zsh), and one application that uses vertical alignment heavily,
let's say "top" or anything similar, using proportional font in your
newly created world.


cheers,
egmont



Re: Bidi paragraph direction in terminal emulators

2019-02-10 Thread Elias Mårtenson via Unicode
On Sun, 10 Feb 2019, 18:39 Egmont Koblinger via Unicode wrote:

> On Sun, Feb 10, 2019 at 2:57 AM Richard Wordingham via Unicode
> wrote:
>
> > Which side do you align RTL cells on?
>
> It's out of the scope of my docs.
>
> In the current work-in-progress implementation I align them to the
> left, but there's a TODO entry to align them to the right instead (or
> maybe center all the glyphs).
>

For all the willingness to come up with ways to modernise the terminal,
you've only spoken about trying to shoehorn rtl text into the vt102 basic
terminal.

What I mean is that if you're willing to go as far as introducing new escape
codes to allow applications to better control the behaviour of this one
feature, why do you stop there? Why still limit yourself to the bonds of
vt102?

Once you take that first step towards the new control codes, why not simply
come up with a new scheme? Why not let me do:

TERM=newfancything

And then I'd have a system that supports everything I need: variable width
fonts, proper rtl text, pixel-precise character positioning, all the
colours, inline graphics, etc.

There is nothing magic about the grid of cells, and once you introduce new
escape sequences, you might as well truly modernise the terminal.

Regards,
Elias



Re: Bidi paragraph direction in terminal emulators

2019-02-10 Thread Richard Wordingham via Unicode
On Sun, 10 Feb 2019 14:54:39 +0100
Philippe Verdy via Unicode  wrote:

> On Sat, 9 Feb 2019 at 20:55, Egmont Koblinger via Unicode <
> unicode@unicode.org> wrote:
>
> > Hi Asmus,
> >
> > > On quick reading this appears to be a strong argument why such
> > > emulators will never be able to be used for certain scripts.
> > > Effectively, the model described works well with any scripts
> > > where characters are laid out (or can be laid out) in fixed
> > > width cells that are linearly adjacent.
> >
> > I'm wondering if you happen to know:
> >
> > Are there any (non-CJK) scripts for which a mechanical typewriter
> > does not exist due to the complexity of the script?
> >
> 
> Look into South Asian scripts (Lao, Khmer, Tibetan...) and...

The Khmer script is an interesting case - see
http://onkhmertype.com/the-cambodian-typewriter.  The problem there is
that deep cells are needed.  What's the VTE algorithm for the vertical
extent of the cell?

The only problem I can see for Lao is that there can be two marks below
a consonant. Otherwise, a straightforward adaptation of a Thai
typewriter should suffice.

There's a Tai Tham typewriter in the National Museum in Bangkok.
However, spelling may have been adapted to cope with any limitations.

>... large syllabaries (CANS, Ethiopian).

That's more a matter of extent than complexity.

Sesquidimensional Egyptian hieroglyphs could be tricky - they'll be like
producing 2-D renderings of ideographic description sequences.

There could be a problem with standardising cuneiform character widths.

Richard.



Re: Bidi paragraph direction in terminal emulators

2019-02-10 Thread Philippe Verdy via Unicode
On Sat, 9 Feb 2019 at 20:55, Egmont Koblinger via Unicode <
unicode@unicode.org> wrote:

> Hi Asmus,
>
> > On quick reading this appears to be a strong argument why such
> > emulators will never be able to be used for certain scripts.
> > Effectively, the model described works well with any scripts where
> > characters are laid out (or can be laid out) in fixed width cells
> > that are linearly adjacent.
>
> I'm wondering if you happen to know:
>
> Are there any (non-CJK) scripts for which a mechanical typewriter does
> not exist due to the complexity of the script?
>

Look into South Asian scripts (Lao, Khmer, Tibetan...) and large
syllabaries (CANS, Ethiopian).
Even Arabic is challenging and does not work very well (or is very ugly)
with typewriters or monospaced fonts, except if we use "simplified" Arabic.
Hebrew is a bit better but also has issues if you need to support all its
diacritics.

Finally, even Latin is not easy to fit, with its ligatures and multiple
diacritics, some of them with complex layouts and applicable to pairs of
letters (or sometimes larger groups).
The monospace restriction is a strong limitation: but then I don't see why a
"terminal" could not handle fonts with variable metrics, and why it must be
modeled only as a regular grid of rectangular cells (all of equal size)
containing only one "character" (or cluster?). It is perfectly possible to
have a terminal handling text as a collection of "logical lines", split
(horizontally?) into multiple spans covering one or more cells, each span
containing one or more characters (or a full cluster) rendered correctly.

But then you recreate the basic HTML standard: just discard the "document"
and "body" levels, which would be implicit in a terminal, keep the "block"
and "inline" elements, and flow the text (note that rendered lines could
as well have variable heights, depending on the height of their
unbreakable spans and their vertical alignment...). But then you need
specific controls to make proper vertical alignments (basically you need
a "tabulator" in the terminal with a way to define the start of a
tabulator scope and its end, and then reference tabulations by id when
defining them in the middle of the text; this tabulator would be more
powerful than just the TAB control, which only uses an
implicit/predefined tabulator).

Then for editors in terminals you need a way to query the position of some
items and make "logical" moves: the simple (line/column) coordinates on a
grid are not usable. In HTML we would do that with form input elements (the
form is flowed normally but is navigable and input elements will have
their own editable areas).

So using controls, you would try to mimic again what HTML already provides
you for free (and without complex specifications and redevelopment).

So my opinion is that all legacy terminal protocols will remain broken, and
it is more viable to work with the W3C to define a basic HTML profile
suitable for terminals, one that will benefit from all the improvements made
in HTML to support i18n, including required ones (BiDi, variable-width
fonts needed for complex scripts, accessibility...), but without the extra
elements that were added in HTML5 for semantic document structures (HTML5
still speaks about the "document" level, but there's little defined for
documents that are infinite streams that you can start reading from a random
position and that possibly never terminate).

All we need is a subset of HTML5 with only a few block elements without
terminator tags ("p" would be implicit) and the inline elements for all the
rest, and this becomes a viable "terminal protocol" which would deprecate
all the legacy VT-like protocols (and would put an end to the desire to
add many new controls or duplicate reencodings in Unicode for specific
styles).

The only block elements that would be useful on top of this are forms and
form inputs, to create editable fields, plus some attributes to allow or
disallow editing. Scripting would be an option (only for local data
validation or filtering some inputs that must not be sent to the server, or
to allow accessibility features, input methods and orthographic helpers).
Then with that we are no longer blocked by the old terminal limitations
(but it will still be possible for a terminal emulator to create a
reasonable layout to map it to a grid-based terminal, and then offer some
helper tools to show a selectable popup view for things that cannot be
rendered on the basic grid).


Re: Bidi paragraph direction in terminal emulators

2019-02-10 Thread Egmont Koblinger via Unicode
On Sun, Feb 10, 2019 at 2:57 AM Richard Wordingham via Unicode
 wrote:

> Which side do you align RTL cells on?

It's out of the scope of my docs.

In the current work-in-progress implementation I align them to the
left, but there's a TODO entry to align them to the right instead (or
maybe center all the glyphs).


e.


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Richard Wordingham via Unicode
On Sun, 10 Feb 2019 00:59:46 +0100
Egmont Koblinger via Unicode  wrote:

> Is there such a monospace font obeying wcwidth (that is: double wide
> character for when a spacing mark is combined) for Devanagari?

For CV, that would correspond to a Hindi typewriter, so the odds look
good. The Remington keyboard layout is taken from the typewriter
design.  However, the typewriter had non-spacing keys for repha
(roughly ) and vattu (), so you'll be out of luck
for consonant clusters.  On the other hand,  is two
key strokes - the cells would be for  and
!  There's an implementation of the keyboard in the M17N
database - hi-remington.mim.

> Is there a monospace font for Arabic,

Apart from wcwidth("لآ") = ‎2, Khaled has already said in this thread
that there are such fonts.

> for Syriac, etc.? (How much do these questions make sense at all?)

Perfect sense.

Richard.



Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Richard Wordingham via Unicode
On Sat, 9 Feb 2019 18:42:52 +0100
Egmont Koblinger via Unicode  wrote:


> The
> problem that I don't know how to address is: What if harfbuzz tells us
> that the overall width for rendering a particular grapheme cluster is
> significantly different from its designated area (the number of
> character cells [wcswidth()] multiplied by the width of each)?

You have to reduce the width of the glyph used.  The tricky bit is
where the glyph deliberately overhangs or underlies a neighbouring
glyph.  A good example of this is almost U+0E33 THAI CHARACTER SARA AM,
whose nikkhahit component can typically overhang the previous
character; however, ink beyond the left limit should not be a problem
for LTR scripts. Which side do you align RTL cells on?

Now, you might want to treat U+0E33 as interacting with its
predecessor, because it does. The test word is น้ำ 'water'.

Richard.



Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Richard Wordingham via Unicode
On Sat, 9 Feb 2019 22:29:31 +0100
Adam Borowski via Unicode  wrote:

> On Sat, Feb 09, 2019 at 10:01:21PM +0200, Eli Zaretskii via Unicode
> wrote:

> > I don't know.  Maybe it keeps a database of character combinations
> > that need shaping, each one with the maximum width on display the
> > result can occupy.  Or maybe it does something else.  If it cannot,
> > and the terminal cannot either, then what you say is that some
> > scripts can never be supported by text terminals.  
> 
> That's doable even within the current rules, where every codepoint
> bears a wcwidth of 0, 1 or 2.  A cluster made of codepoints a ' b c d
> " ^ (where a b c d have widths 1 while ' " ^ widths 0) needs to be
> rendered in exactly 4 cells.  This may force stretching or condensing
> the shaped cluster compared to what usual typography would demand but
> that's in no way different from stretching Latin "i" or condensing
> "W".

It would be helpful if overlong shapings were condensed automatically.
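
The additive rule quoted above can be sketched in a few lines. This is a rough illustration only: `approx_wcwidth` is a hypothetical stand-in built on Python's `unicodedata` module, not the C library's `wcwidth()`, whose tables are larger and differ between systems. Real combining marks replace the `'`, `"` and `^` placeholders of the quoted example.

```python
import unicodedata


def approx_wcwidth(ch):
    """Rough stand-in for wcwidth(): 0, 1 or 2 cells per code point.

    Assumption: combining marks are 0, East Asian Wide/Fullwidth
    characters are 2, everything else is 1. Real wcwidth() tables
    cover more cases than this.
    """
    if unicodedata.combining(ch) or unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def cluster_cells(cluster):
    """Additive rule: a cluster gets the sum of its code points' widths."""
    return sum(approx_wcwidth(ch) for ch in cluster)


# a + combining acute, b, c, d + combining diaeresis + combining circumflex:
# four width-1 letters and three width-0 marks.
cluster = "a\u0301bcd\u0308\u0302"
cells = cluster_cells(cluster)  # 4
```

Whatever the shaping engine produces for such a cluster then has to be fitted into exactly that many cells.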

The general principle that functions work better on strings applies
here.  There are two obvious situations where the additive formulae
break down.

(a) Emoji should, should they not, occupy at least 2 cells.  There are
a few problem sequences, such as  (or is
wcwidth(0x20E3) equal to 1?).

(b) Brahmi-like Indic scripts.  In many of these, the combination of a
virama or invisible stacker and a base consonant acts like a combining
mark, either causing no advance or as a mark with a very slight width.
Examples include Grantha, Myanmar, Tai Tham and Khmer.

Stretching a stack of 3 or 4 consonants to occupy 3 or 4 cells instead
of 1 would be worse than stretching 'i'.  If you do it, you want fonts
that adjust the glyphs accordingly, just as for 'i'.

Richard.


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Egmont Koblinger via Unicode
Hi,

On Sun, Feb 10, 2019 at 12:52 AM Richard Wordingham via Unicode
 wrote:

> This is an example of where one needs a font designed for terminal
> emulators.

Definitely, this is another approach I forgot to mention in my mail,
rather than VTE switching to harfbuzz and figuring out all the issues.
This approach would also make them usable in every decent terminal
emulator at once, not just VTE.

Is there such a monospace font obeying wcwidth (that is: double wide
character for when a spacing mark is combined) for Devanagari? Is
there a monospace font for Arabic, for Syriac, etc.? (How much do
these questions make sense at all?)

If there are such fonts, I'd be happy to use them for testing.

e.


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Richard Wordingham via Unicode
On Sat, 9 Feb 2019 22:31:37 +0100
Egmont Koblinger via Unicode  wrote:

> Let's take the Devanagari improvement of the other day. Until now,
> there were plenty of dotted circles shown, and combining spacing marks
> that should've been placed before the letter were placed after the
> letter, before a placeholder dotted circle. Now they are displayed as
> expected: the combining spacing mark shows up before the letter (if
> it's of that kind), and no dotted circle. The letter + spacing marks
> now shows up correctly. The entire word still doesn't, e.g. there are
> often spaces between letters where the upper line connecting them
> should be continuous.

This is an example of where one needs a font designed for terminal
emulators.

Richard.


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Richard Wordingham via Unicode
On Sat, 9 Feb 2019 13:02:55 -0800
"Asmus Freytag \(c\) via Unicode"  wrote:

> To force Hindi crosswords mode you need to segment the string into
> syllables, each having a variable number of characters, and then
> assign a single display position to them. Now some syllables are
> wider than others, so you could use the single/double width paradigm.
> The result may be somewhat legible for Devanagari, but even some of
> the closely related scripts may not fit that well.

It is also possible that whole syllables are used because there are
vertical words.

> To give you an idea, here is an Arabic crossword. It uses the isolated
> shape of all letters and writes all words unconnected. That's two
> things that may be acceptable for a puzzle, but not for text output.
> 
> http://www.everyday-arabic.com/2013/12/crossword1.html
> 
> (try typing 3 vertical as a word to see the difference - it's 4x
> U+062A)

Crosswords suffer from the need to be read vertically as well as
horizontally.  Can Arabic naturally be written vertically?

In any case, Arabic typewriters exist and, so far as I understand,
work.  The problem rather seems to be one of standardising the
Procrustean technique to be used.  It seems from what Khaled Hosny
wrote that monospace for letters is the usual solution already. 

The design difficulty for Arabic is rather that horizontal adjacency
may sometimes need to be treated as accidental rather than as an
invitation to cursively join.

Richard.


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Asmus Freytag (c) via Unicode

On 2/9/2019 1:40 PM, Egmont Koblinger wrote:

> On Sat, Feb 9, 2019 at 10:10 PM Asmus Freytag via Unicode
> wrote:
>
> > > I hope though that all the scripts can be supported with more or less
> > > compromises, e.g. like it would appear in a crossword. But maybe not.
> >
> > See other messages: not.
>
> For the crossword analogy, I can see why it's not good. But this
> doesn't mean there aren't any other ideas we could experiment with.


"all...scripts" is the issue.  We know how to handle text for all
scripts and what complexities one has to account for in order to do
that. You can back off some corner cases or (slightly) degrade things,
but even after you are done with that, there will be scripts where the
"more or less compromises" forced by the design parameters you gave will
mean an utterly unacceptable display.


That said, there are scripts that had "passable" typewriter 
implementations and it may be possible to tweak things to approach that 
level support. Don't know for sure, it depends on the details for each 
script.





> Or do you mean to say that because it can't be made perfect, there's
> no point at all in partially improving? I don't think I agree with
> that.



It's more a question of being upfront with your goal.

At this point I understand it as accepting some design parameters as 
fundamental and seeing whether there are some tweaks that allow more 
scripts to work with or to "survive" given the constraints.


That's not a totally useless effort, but it is a far cry from Unicode's 
universal support for ALL writing systems.


A./

PS: also we have been seriously hijacking a thread related to bidi




> e.





Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Egmont Koblinger via Unicode
On Sat, Feb 9, 2019 at 10:10 PM Asmus Freytag via Unicode
 wrote:

> > I hope though that all the scripts can be supported with more or less
> > compromises, e.g. like it would appear in a crossword. But maybe not.
>
> See other messages: not.

For the crossword analogy, I can see why it's not good. But this
doesn't mean there aren't any other ideas we could experiment with.

Or do you mean to say that because it can't be made perfect, there's
no point at all in partially improving? I don't think I agree with
that.



e.


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Egmont Koblinger via Unicode
Hi Asmus,

On Sat, Feb 9, 2019 at 10:02 PM Asmus Freytag (c)  wrote:

> are you excluding CJK because of the difficulty handling a large
> repertoire with mechanical means?

No, I excluded CJK because they're pretty well solved in terminals,
and nowhere near along the lines of how they work with typewriters.

I should've probably said "letter based" scripts or whatever, I'm not
familiar with the exact terminologies.

> To force Hindi crosswords mode you need to segment the string into syllables,
> each having a variable number of characters [...]

Thanks a lot to you too for your detailed explanation!

> Are you defining as your goal to have some kind of "line by line" display that
> can survive any Unicode text thrown at it, or are you trying to extend a given
> design with rather specific limitations, so that it survives / can be used
> with just a few more scripts than European + CJK?

I don't have a clearly defined goal. I find fun in developing VTE (and
slightly improving other terminal emulators too by spreading ideas,
knowledge, comments etc.), addressing various kinds of goals, whatever
happens to come next. At this point it's BiDi, with a bit of
Devanagari improvement sneaking in the other day.

What is clear to me: I cannot redefine the basics of terminal
emulation. I can only add incremental improvements to whatever it
already is, and I have to make sure that the ecosystem built around it
during decades (all the screen handling libraries and applications)
doesn't break. I'm limited by these constraints.

> The discrepancies would be more like throwing random blank spaces in the
> middle of every word, writing letters out of order, or overprinting. So, more
> fundamental, not just "not perfect".

Let's take the Devanagari improvement of the other day. Until now,
there were plenty of dotted circles shown, and combining spacing marks
that should've been placed before the letter were placed after the
letter, before a placeholder dotted circle. Now they are displayed as
expected: the combining spacing mark shows up before the letter (if
it's of that kind), and no dotted circle. The letter + spacing marks
now shows up correctly. The entire word still doesn't, e.g. there are
often spaces between letters where the upper line connecting them
should be continuous.

Eventually HarfBuzz could help, but it's just not yet clear how
exactly. I cannot essentially change the underlying model of fixed
width cells. On top of this model, though, we can experiment with
various ideas about displaying. For example, if a word occupies 7
columns in the model, then HarfBuzz renders it, and the rendered
version occupies the width of 8.6 columns, maybe we can squeeze it
using a trivial linear transformation? I'm not sure, but maybe it's an
idea worth investigating. Won't look perfect, but probably will look
better than what we do currently. We already have column spacing
implemented, to pull the columns further apart from each other by a
fixed amount (mostly for accessibility purposes), maybe a user can use
this feature to make more room for a nicely rendered, non-squeezed
Devanagari text.
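
To make the squeezing idea concrete, here is a minimal sketch of the
arithmetic only. The function names are made up for illustration, this is
not VTE code, and it assumes the shaped width is already expressed in
units of one cell width:

```python
def squeeze_factor(model_columns: int, shaped_width_in_cells: float) -> float:
    """Horizontal scale that fits a shaped run into its grid columns."""
    if model_columns <= 0 or shaped_width_in_cells <= 0:
        raise ValueError("widths must be positive")
    return model_columns / shaped_width_in_cells


def map_x(x_in_cells: float, factor: float) -> float:
    """Map an x offset inside the shaped run to an x offset in the grid."""
    return x_in_cells * factor


# The example from the text: 7 columns in the model, 8.6 after shaping.
factor = squeeze_factor(7, 8.6)
print(f"scale = {factor:.3f}")  # scale = 0.814
print(f"right edge lands at column {map_x(8.6, factor):.1f}")
```

A factor below 1 condenses the run and a factor above 1 (e.g. for a run
shaped to 5.3 cells) stretches it; either way the run ends exactly where
the grid model says it should.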

> To give you an idea, here is an Arabic crossword. It uses the isolated shape of
> all letters and writes all words unconnected. That's two things that may be
> acceptable for a puzzle, but not for text output.

You can't get nice Arabic without first making sure the order of the
letters is the correct one, not reversed. :-) That's what my current
work is about.

As per Richard's feedback, I also see that shaping needs to be done
differently than I had thought. Mind you, my visual inspection of what
the non-preferred shaping approach gave to me vs. what a proper
HarfBuzz rendering gave (for Arabic) were extremely close to each
other, something that I'd probably consider "good enough" if I spoke
the language and were aware of the terminal's constraints. Well,
definitely a major improvement over what we have.

> You may begin to see the limitations and that they may well prevent you from
> reaching even your limited goal for speakers of at least three of the top ten
> languages worldwide.

If the goal is to have perfect rendering without compromises: sure I
won't reach that. (It's not a goal for me. For perfect rendering,
users should get away from terminals.) If the goal is to have
something reasonably good, better than what we have currently, I can't
see why not.


cheers,
e.


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Adam Borowski via Unicode
On Sat, Feb 09, 2019 at 10:01:21PM +0200, Eli Zaretskii via Unicode wrote:
> > From: Egmont Koblinger 
> > Date: Sat, 9 Feb 2019 20:36:50 +0100
> > Cc: Richard Wordingham , 
> > unicode Unicode Discussion 
> > 
> > On Sat, Feb 9, 2019 at 8:13 PM Eli Zaretskii  wrote:
> > 
> > > That's the application's problem, not the terminal's.  An application
> > > that wants its column to line up _and_ wants to support complex text
> > > scripts will need to move cursor to certain coordinates, not to assume
> > > that 7 codepoints always take 7 columns on display.

It must know that those particular 7 codepoints take, say, 5 columns when
written together in a sequence.  And it can't possibly ask the terminal,
either -- it might be on a link that doesn't allow metadata to pass, it
might be broadcast, its output might be recorded many years prior to being
displayed.  A good part of the time the program is even run on a different
distribution/release/OS.

Obviously, a program running with system libraries might suffer misalignment
and thus visual corruption if those libraries don't know beyond, say,
Unicode 13 yet the terminal expects Unicode 17 -- but that's no different
from any other property incompatibly changing.  Property changes for
established characters are pretty rare thus no significant loss of
interoperability can be expected over time.

> > In order to do that, an application needs to know how wide a text will
> > appear, which depends on the font. How will it know it?
> 
> I don't know.  Maybe it keeps a database of character combinations
> that need shaping, each one with the maximum width on display the
> result can occupy.  Or maybe it does something else.  If it cannot,
> and the terminal cannot either, then what you say is that some scripts
> can never be supported by text terminals.

That's doable even within the current rules, where every codepoint bears a
wcwidth of 0, 1 or 2.  A cluster made of codepoints a ' b c d " ^ (where a b
c d have widths 1 while ' " ^ widths 0) needs to be rendered in exactly 4
cells.  This may force stretching or condensing the shaped cluster compared
to what usual typography would demand but that's in no way different from
stretching Latin "i" or condensing "W".
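
A small sketch of that rule, assuming a simplified stand-in for wcwidth()
built on Python's unicodedata module (the real libc tables differ in
details, and approx_wcwidth / cells_for are hypothetical names):

```python
import unicodedata


def approx_wcwidth(ch: str) -> int:
    """Rough per-codepoint width: 0 for nonspacing marks, 2 for wide, else 1."""
    if unicodedata.category(ch) in ("Mn", "Me"):
        return 0
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2
    return 1


def cells_for(cluster: str) -> int:
    """Number of grid cells a cluster must be rendered into."""
    return sum(approx_wcwidth(ch) for ch in cluster)


# a ' b c d " ^  with three combining accents standing in for ' " ^
cluster = "a\u0301bcd\u0308\u0302"
print(len(cluster), "codepoints ->", cells_for(cluster), "cells")
```

Whatever width the shaper would prefer for the cluster, the renderer then
stretches or condenses it to that number of cells.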


Meow!
-- 
⢀⣴⠾⠻⢶⣦⠀
⣾⠁⢠⠒⠀⣿⡁ Remember, the S in "IoT" stands for Security, while P stands
⢿⡄⠘⠷⠚⠋⠀ for Privacy.
⠈⠳⣄


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Asmus Freytag (c) via Unicode

On 2/9/2019 11:48 AM, Egmont Koblinger wrote:

> Hi Asmus,
>
> > On quick reading this appears to be a strong argument why such emulators will
> > never be able to be used for certain scripts. Effectively, the model
> > described works well with any scripts where characters are laid out (or can
> > be laid out) in fixed width cells that are linearly adjacent.
>
> I'm wondering if you happen to know:
>
> Are there any (non-CJK) scripts for which a mechanical typewriter does
> not exist due to the complexity of the script?


Egmont,

are you excluding CJK because of the difficulty handling a large
repertoire with mechanical means? However, see:

https://en.wikipedia.org/wiki/Chinese_typewriter




> Are there any (non-CJK) scripts for which crossword puzzles don't exist?
>
> For scripts where these do exist, is it perhaps an acceptable tradeoff
> to keep their limitations in the terminal emulator world as well, to
> combine the terminal emulator's power with these scripts?



I agree with you that crossword puzzles and scrabble have a similar
limitation to the design that you sketched for us. However, take a script
that is written in syllables (each composed of 1-5 characters, say).

In a "crossword" I could write this script so that each syllable occupies
a cell. It would be possible to read such a puzzle, but trying to use such a
draconian technique for running text would be painful, to say the least. (We
are not even talking about pretty, here).

Here's an example for Hindi:
https://vargapaheli.blogspot.com/2017/
I don't read Hindi, but 5 vertical in the top puzzle, cell 2, looks like it
contains both a consonant and a vowel.

To force Hindi crosswords mode you need to segment the string into syllables,
each having a variable number of characters, and then assign a single display
position to them. Now some syllables are wider than others, so you could use
the single/double width paradigm. The result may be somewhat legible for
Devanagari, but even some of the closely related scripts may not fit that
well.

Now there are some scripts where the same syllable can be written in more
than one form; the forms differing by how the elements are fused (or sometimes
not fused) into a single shape. Sometimes, these differences are more
"stylistic", more like an 'fi' ligature in English, sometimes they really
indicate different words, or one of the forms is simply not correct (like
trying to spell lam-alif in Arabic using two separate letters).

I'm sure there are scripts that work rather poorly (effectively not at all)
in crossword mode. The question then becomes one of goals.

Are you defining as your goal to have some kind of "line by line" display that
can survive any Unicode text thrown at it, or are you trying to extend a given
design with rather specific limitations, so that it survives / can be used
with, just a few more scripts than European + CJK?



> Honestly, even with English, all I have to do is "cat some_text_file",
> and chances are that a word is split in half at some random place
> where it hits the right margin. Even with just English, a terminal
> emulator isn't something that gives me a grammatically and
> typographically super pleasing or correct environment. It gives me
> something that I personally find grammatically and typographically
> "good enough", and in the mean time a powerful tool to get my work
> done.



The discrepancies would be more like throwing random blank spaces in the
middle of every word, writing letters out of order, or overprinting. So, more
fundamental, not just "not perfect".

To give you an idea, here is an Arabic crossword. It uses the isolated shape of
all letters and writes all words unconnected. That's two things that may be
acceptable for a puzzle, but not for text output.

http://www.everyday-arabic.com/2013/12/crossword1.html

(try typing 3 vertical as a word to see the difference - it's 4x U+062A)



> Obviously the more complex the script, the more tradeoffs there will
> be. I think it's a call each user has to make whether they prefer a
> terminal emulator or a graphical app for a certain kind of task. And
> if terminal emulators have a lower usage rate in these scripts, that's
> not necessarily a problem. If we can improve by small incremental
> changes, sure, let's do. If we'd need to heavily redesign plenty of
> fundamentals in order to improve, it most likely won't happen.

You may begin to see the limitations and that they may well prevent you from
reaching even your limited goal for speakers of at least three of the top ten
languages worldwide.

A./



Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Asmus Freytag via Unicode

  
  
On 2/9/2019 12:07 PM, Egmont Koblinger via Unicode wrote:

> On Sat, Feb 9, 2019 at 9:01 PM Eli Zaretskii  wrote:
>
> > then what you say is that some scripts
> > can never be supported by text terminals.
>
> I'm not familiar at all with all the scripts and their requirements,
> but yes, basically this is what I'm saying. I'm afraid some scripts
> can never be perfectly supported by text terminals.



This includes the scripts used for up to four of the world's top ten
languages.

And it's more than "not perfect"; effectively some scripts cannot be
shoehorned into the fundamental design.

That design was created to work with European scripts, and proved somewhat
adaptable to other scripts that lend themselves to fixed-width cell display.
But beyond that is where you hit the proverbial brick wall.


  

> I hope though that all the scripts can be supported with more or less
> compromises, e.g. like it would appear in a crossword. But maybe not.



See other messages: not.



  

> Maybe one day some new, modern platform will arise with the goal of
> replacing terminal emulators, which I wouldn't necessarily mind. It's
> gonna take an enormous amount of work, though.


A./



  



Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Ken Whistler via Unicode

Egmont,

On 2/9/2019 11:48 AM, Egmont Koblinger via Unicode wrote:

Are there any (non-CJK) scripts for which crossword puzzles don't exist?


There are crossword puzzles for Hindi (in the Devanagari script). Just 
do an image search for "Hindi crossword puzzle".


But the conventions for these break up words into syllables fitting into 
the boxes, and the rules for that are complex. You have to allow for the 
placement of dependent vowels, which may take up extra space left or 
right, as well as consonant clusters, which would be expressed often as 
conjuncts in Sanskrit, but which in Hindi are more commonly rendered as 
dead consonant sequences. So the "stuff in a box" is:


1. Inherently proportional width.

2. Inherently multi-character in content. (underlying 1 to 3 or more 
characters per cell)


This is the kind of compromise you would have to make for almost
any Indic script, to enable a rational approach to building crossword 
puzzles that make sense.


And in a terminal context, you probably would not get acceptable 
behavior for Hindi if you tried to just take all the "stuff in a box" 
chunks and tried to lay them out directly in a line, as if the script 
behaved more like CJK.


The existence proof of techniques to cut up text into syllables that 
enable crossword puzzle building, is not the same as a determination 
that the script, ipso facto, would work in a terminal context without 
dealing with additional complex script issues.


At any rate, this is once again straying over into the issue of whether 
terminals can be adapted for the requirements of shaping rules for
complex scripts -- rather than the nominal subject of the thread, which 
has to do with bidi text layout in terminals.


--Ken




Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Egmont Koblinger via Unicode
Hi Ken,

> There are crossword puzzles for Hindi (in the Devanagari script). Just
> do an image search for "Hindi crossword puzzle".

It's easy to confirm the existence by an image search, it's hard to
confirm the non-existence ;)

> The existence proof of techniques to cut up text into syllables that
> enable crossword puzzle building, is not the same as a determination
> that the script, ipso facto, would work in a terminal context without
> dealing with additional complex script issues.

Thanks a lot for your detailed explanation; this possibility indeed
didn't occur to me.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Egmont Koblinger via Unicode
On Sat, Feb 9, 2019 at 9:01 PM Eli Zaretskii  wrote:

> then what you say is that some scripts
> can never be supported by text terminals.

I'm not familiar at all with all the scripts and their requirements,
but yes, basically this is what I'm saying. I'm afraid some scripts
can never be perfectly supported by text terminals.

I hope though that all the scripts can be supported with more or less
compromises, e.g. like it would appear in a crossword. But maybe not.

Maybe one day some new, modern platform will arise with the goal of
replacing terminal emulators, which I wouldn't necessarily mind. It's
gonna take an enormous amount of work, though.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Sat, 9 Feb 2019 20:36:50 +0100
> Cc: Richard Wordingham , 
>   unicode Unicode Discussion 
> 
> On Sat, Feb 9, 2019 at 8:13 PM Eli Zaretskii  wrote:
> 
> > That's the application's problem, not the terminal's.  An application
> > that wants its column to line up _and_ wants to support complex text
> > scripts will need to move cursor to certain coordinates, not to assume
> > that 7 codepoints always take 7 columns on display.
> 
> In order to do that, an application needs to know how wide a text will
> appear, which depends on the font. How will it know it?

I don't know.  Maybe it keeps a database of character combinations
that need shaping, each one with the maximum width on display the
result can occupy.  Or maybe it does something else.  If it cannot,
and the terminal cannot either, then what you say is that some scripts
can never be supported by text terminals.


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Egmont Koblinger via Unicode
Hi Asmus,

> On quick reading this appears to be a strong argument why such emulators will
> never be able to be used for certain scripts. Effectively, the model 
> described works
> well with any scripts where characters are laid out (or can be laid out) in 
> fixed
> width cells that are linearly adjacent.

I'm wondering if you happen to know:

Are there any (non-CJK) scripts for which a mechanical typewriter does
not exist due to the complexity of the script?

Are there any (non-CJK) scripts for which crossword puzzles don't exist?

For scripts where these do exist, is it perhaps an acceptable tradeoff
to keep their limitations in the terminal emulator world as well, to
combine the terminal emulator's power with these scripts?

Honestly, even with English, all I have to do is "cat some_text_file",
and chances are that a word is split in half at some random place
where it hits the right margin. Even with just English, a terminal
emulator isn't something that gives me a grammatically and
typographically super pleasing or correct environment. It gives me
something that I personally find grammatically and typographically
"good enough", and in the mean time a powerful tool to get my work
done.

Obviously the more complex the script, the more tradeoffs there will
be. I think it's a call each user has to make whether they prefer a
terminal emulator or a graphical app for a certain kind of task. And
if terminal emulators have a lower usage rate in these scripts, that's
not necessarily a problem. If we can improve by small incremental
changes, sure, let's do. If we'd need to heavily redesign plenty of
fundamentals in order to improve, it most likely won't happen.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Egmont Koblinger via Unicode
On Sat, Feb 9, 2019 at 8:13 PM Eli Zaretskii  wrote:

> That's the application's problem, not the terminal's.  An application
> that wants its column to line up _and_ wants to support complex text
> scripts will need to move cursor to certain coordinates, not to assume
> that 7 codepoints always take 7 columns on display.

In order to do that, an application needs to know how wide a text will
appear, which depends on the font. How will it know it?

Will it by some means know the font and the rendering engine the
terminal uses (even across ssh) and will it have to measure it itself?

Or will it be able to ask the terminal? If so, how? Maybe a new
extension, an asynchronous escape sequence that responds back with the
measured width? What about the latency caused by the bunch of
asynchronous roundtrips, especially over ssh? What about the utter pain
and intrinsic unreliability of handling asynchronous responses, as
I've outlined in a section of
https://gitlab.freedesktop.org/terminal-wg/specifications/issues/8 ?

What if there's no font? What if there are multiple fonts at the same
time? What if the font is changed later on, is it okay then for the
display of existing stuff to fall apart and only newly printed stuff
to appear correctly?

How do you define the "width of the terminal in characters", get/set
by ioctl(..., TIOC[GS]WINSZ, ...) that many apps rely on?
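
For reference, this is roughly how an app reads that size today; a sketch
for Unix-like systems using Python's fcntl and termios modules, where
parse_winsize and terminal_size are names made up for this example:

```python
import fcntl
import struct
import sys
import termios


def parse_winsize(raw: bytes) -> tuple:
    """Decode struct winsize: ws_row, ws_col, ws_xpixel, ws_ypixel."""
    rows, cols, _xpixel, _ypixel = struct.unpack("HHHH", raw)
    return rows, cols


def terminal_size(fd: int = 1) -> tuple:
    """Ask the tty driver for the grid size in character cells."""
    raw = fcntl.ioctl(fd, termios.TIOCGWINSZ, b"\0" * 8)
    return parse_winsize(raw)


if __name__ == "__main__" and sys.stdout.isatty():
    rows, cols = terminal_size()
    print(f"{rows} rows x {cols} columns")
```

The answer is a pair of integers counted in cells. Nothing in this
interface can express "as many columns as the current font happens to
fit".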

If you define it by any means, what if placing the maximum number
of "i"s in a row doesn't fill up the entire width? Will that area be
inaccessible, then? Or despite having a definition of terminal width,
will there be new cells beyond this width to write to?

What if filling a row with all "w"s overflows? I take it that an app
shouldn't print there, but what if it still does, will that piece of
text just not be shown?

How much more complicated do you think implementing something like
"zip -h" would become?

> How is this different from using variable-pitch fonts?

Do you mean variable-pitch font where the terminal still places each
glyph in its designated area? The font is the private business of the
terminal emulator, then; it'll just appear ugly, as in a screenshot I've
already linked, but the emulation behavior wouldn't care.

Or do you mean variable-pitch font where each letter is placed after
each other, as you'd expect in document editors? That is, way more
"i"s than "w"s fitting in a line? It's not different, it's practically
the same. And this is something that none of the terminal emulators
I'm aware of does; and having some clue about terminal emulators, I
can't imagine how one could do (see all the questions above for a
start).

This is why I'm saying: Sure you can take this path, but then we're
talking about something new, not terminal emulators as we currently
know them. You can take this path, but then you'll have to rebuild
many of the already existing apps, and beware, they'll get way more
complex.


e.


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Sat, 9 Feb 2019 20:03:21 +0100
> Cc: Richard Wordingham , 
>   unicode Unicode Discussion 
> 
> Let's suppose a utility outputs these two lines of text:
> abcdefg|
> complex|
> 
> whereas "abcdefg" are these English letters themselves, but "complex"
> is a word of some language requiring complex script rendering, taking
> up 7 logical cells (because that's what wcwidth() says). Also, "|" is
> the pipe symbol, or a vertical box drawing line, whatever.
> 
> Now let's assume that harfbuzz tells you that the desired width for
> rendering this "complex" word is 5.3 times the width of the character
> cell. Or 8.6 times it. How to proceed? How will the "|" bars align up,
> and thus mc's two-panel layout, tmux's vertical split etc. not fall
> apart?  In the latter case, when the width requested by harfbuzz is
> bigger than the designated width, what to do with characters that "fall
> off" at the right edge of the terminal?

That's the application's problem, not the terminal's.  An application
that wants its column to line up _and_ wants to support complex text
scripts will need to move cursor to certain coordinates, not to assume
that 7 codepoints always take 7 columns on display.  Or it will have
to tell the users to use specific fonts, which are known to provide
guarantees that this happens.

How is this different from using variable-pitch fonts?


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Asmus Freytag via Unicode

  
  
On quick reading this appears to be a strong argument why such emulators will
never be able to be used for certain scripts. Effectively, the model described
works well with any scripts where characters are laid out (or can be laid out)
in fixed width cells that are linearly adjacent.

There are some crude techniques that allow an extension to cover scripts that
require half-width or double-width cells, and perhaps even zero-width.

However, scripts, where rendering involves complicated ligatures or other
typographical interactions that often are specific to a given font, would
simply be out of scope because for those scripts the fixed width model with an
underlying buffer mimicking the display simply cannot be made to work.

And indeed, by up-front accepting the limitation of a particular design
approach it would be surprising if such emulators proved flexible enough to
handle the rather wide variety of writing systems supported by Unicode.

At best, the discussion could yield a few further approximations of correct
rendering that can be retrofitted to the particular design restrictions
outlined below, but that with luck extend the envelope somewhat so that a few
more writing systems can be shoehorned into it.

However, it appears quite hopeless to attempt to cover all of Unicode's
scripts on that premise.


A./









On 2/9/2019 10:25 AM, Egmont Koblinger via Unicode wrote:

> On Sat, Feb 9, 2019 at 7:07 PM Eli Zaretskii  wrote:
>
> > You need to use what HarfBuzz tells you _instead_ of wcswidth.  It is
> > in general wrong to use wcswidth or anything similar when you use a
> > shaping engine and support complex script shaping.
>
> This approach is not viable at all.
>
> Terminal emulators have an internal data structure that they maintain,
> a matrix of character cells. Every operation is performed here, every
> escape sequence is defined on this layer what it does, the cursor
> position is tracked on this layer, etc. You can move the cursor to
> integer coordinates, overwrite the letter in that cell, and do plenty
> of other operations (like push the rest to the right by one cell). If
> you change these fundamentals, most of the terminal-based applications
> will fall apart big time.
>
> This behavior has to be absolutely independent from the font. The
> application running inside the terminal doesn't and cannot know what
> font you use, let alone how harfbuzz is about to render it. (You can
> even have no font at all, such as with the libvterm headless emulator
> library, or a detached screen or tmux session; or have multiple fonts
> at the same time if a screen or tmux session is attached from multiple
> graphical emulators.)
>
> So one part of a terminal emulator's code is responsible for
> maintaining this matrix of characters according to the input it
> receives. Another part of their code is responsible for presenting
> this matrix of characters on the UI, doing the best it can.
>
> If you say that the font should determine the logical width, you need
> to start building up something brand new from scratch. You need to
> have something that doesn't have concepts like "width in characters".
> You need to redefine cursor movement and many other escape sequences.
> You need to heavily adjust the behavior of a gazillion of software,
> e.g. zip's two-column output, anything that aligns in columns (e.g.
> midnight commander, tmux's vertical split etc.), the shell's (or
> readline's) command editing and wrapping to multiple lines, ncurses,
> and so on, all the way to e.g. fullscreen text editors like Emacs.
>
> And then we're not talking about terminal emulators anymore, as we
> know them now, but something new, something pretty different.
>
> Terminal emulators do have strong limitations. Complex text rendering
> can only work to the extent we can squeeze it into these limitations.
>
>
> cheers,
> egmont





  



Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Egmont Koblinger via Unicode
On Sat, Feb 9, 2019 at 7:56 PM Eli Zaretskii  wrote:

> I'm probably missing something, because I don't see the grave problems
> you hint at.  Any width provided back by a shaper can be rounded to
> the nearest integral character cell, so your canvas can still remain
> rectangular.

Let's suppose a utility outputs these two lines of text:
abcdefg|
complex|

whereas "abcdefg" are these English letters themselves, but "complex"
is a word of some language requiring complex script rendering, taking
up 7 logical cells (because that's what wcwidth() says). Also, "|" is
the pipe symbol, or a vertical box drawing line, whatever.

Now let's assume that harfbuzz tells you that the desired width for
rendering this "complex" word is 5.3 times the width of the character
cell. Or 8.6 times it. How to proceed? How will the "|" bars align up,
and thus mc's two-panel layout, tmux's vertical split etc. not fall
apart? In the latter case, when the width requested by harfbuzz is
bigger than the designated width, what to do with characters that "fall
off" at the right edge of the terminal?
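
A tiny sketch of the arithmetic, under the assumption that the terminal
would advance by the shaper's width rounded to whole cells (bar_column is
a hypothetical helper, not part of any emulator):

```python
from typing import Optional


def bar_column(logical_cells: int, shaped_width: Optional[float]) -> int:
    """0-based column where the '|' following a word would land.

    With shaped_width None the terminal advances by the logical cell
    count; otherwise by the shaper's width rounded to the nearest cell.
    """
    if shaped_width is None:
        return logical_cells
    return round(shaped_width)


rows = [
    ("abcdefg", None),
    ("complex, shaped to 5.3 cells", 5.3),
    ("complex, shaped to 8.6 cells", 8.6),
]
for label, shaped in rows:
    print(f"{label:30} '|' at column {bar_column(7, shaped)}")
```

Both words are 7 logical cells, yet the bars would land at columns 7, 5
and 9 respectively, so rounding alone does not keep them aligned.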



e.


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Sat, 9 Feb 2019 19:25:08 +0100
> Cc: Richard Wordingham , 
>   unicode Unicode Discussion 
> 
> > You need to use what HarfBuzz tells you _instead_ of wcswidth.  It is
> > in general wrong to use wcswidth or anything similar when you use a
> > shaping engine and support complex script shaping.
> 
> This approach is not viable at all.
> [...]

I'm probably missing something, because I don't see the grave problems
you hint at.  Any width provided back by a shaper can be rounded to
the nearest integral character cell, so your canvas can still remain
rectangular.  And I see no reason why an application should be
bothered by the actual number of character cells occupied by the text
it wrote on display.  So what exactly is not viable in using the width
reported back by the shaper?

> If you say that the font should determine the logical width, you need
> to start building up something brand new from scratch.

Are you saying that a terminal cannot work with variable-pitch fonts?

> Terminal emulators do have strong limitations. Complex text rendering
> can only work to the extent we can squeeze it into these limitations.

No one said anything to the contrary, AFAICT.


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Egmont Koblinger via Unicode
On Sat, Feb 9, 2019 at 7:07 PM Eli Zaretskii  wrote:

> You need to use what HarfBuzz tells you _instead_ of wcswidth.  It is
> in general wrong to use wcswidth or anything similar when you use a
> shaping engine and support complex script shaping.

This approach is not viable at all.

Terminal emulators have an internal data structure that they maintain,
a matrix of character cells. Every operation is performed here, every
escape sequence is defined on this layer what it does, the cursor
position is tracked on this layer, etc. You can move the cursor to
integer coordinates, overwrite the letter in that cell, and do plenty
of other operations (like push the rest to the right by one cell). If
you change these fundamentals, most of the terminal-based applications
will fall apart big time.

This behavior has to be absolutely independent from the font. The
application running inside the terminal doesn't and cannot know what
font you use, let alone how harfbuzz is about to render it. (You can
even have no font at all, such as with the libvterm headless emulator
library, or a detached screen or tmux session; or have multiple fonts
at the same time if a screen or tmux session is attached from multiple
graphical emulators.)

So one part of a terminal emulator's code is responsible for
maintaining this matrix of characters according to the input it
receives. Another part of their code is responsible for presenting
this matrix of characters on the UI, doing the best it can.
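
A heavily simplified sketch of the first part, the model. The Grid class
is hypothetical and ignores wrapping, scrolling, attributes and wide
characters; it only shows that every operation is defined on integer
cell coordinates, with no font in sight:

```python
class Grid:
    """Toy model layer: a matrix of character cells plus a cursor."""

    def __init__(self, rows: int, cols: int) -> None:
        self.rows, self.cols = rows, cols
        self.cells = [[" "] * cols for _ in range(rows)]
        self.row = self.col = 0

    def move_to(self, row: int, col: int) -> None:
        """Move the cursor to integer coordinates, clamped to the grid."""
        self.row = max(0, min(self.rows - 1, row))
        self.col = max(0, min(self.cols - 1, col))

    def put(self, ch: str) -> None:
        """Overwrite the cell under the cursor and advance by one cell."""
        self.cells[self.row][self.col] = ch
        if self.col < self.cols - 1:
            self.col += 1

    def insert_blank(self) -> None:
        """Push the rest of the line right by one cell, dropping the last."""
        line = self.cells[self.row]
        line.insert(self.col, " ")
        line.pop()

    def render(self) -> list:
        """Stand-in for the presentation layer: one string per row."""
        return ["".join(line) for line in self.cells]


grid = Grid(2, 8)
for ch in "abcdefg|":
    grid.put(ch)
grid.move_to(0, 2)
grid.put("X")
print(grid.render()[0])  # abXdefg|
```

The second part would take the same matrix and draw it with whatever
font, or fonts, or no font at all, happens to be attached.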

If you say that the font should determine the logical width, you need
to start building up something brand new from scratch. You need to
have something that doesn't have concepts like "width in characters".
You need to redefine cursor movement and many other escape sequences.
You need to heavily adjust the behavior of a gazillion of software,
e.g. zip's two-column output, anything that aligns in columns (e.g.
midnight commander, tmux's vertical split etc.), the shell's (or
readline's) command editing and wrapping to multiple lines, ncurses,
and so on, all the way to e.g. fullscreen text editors like Emacs.

And then we're not talking about terminal emulators anymore, as we
know them now, but something new, something pretty different.

Terminal emulators do have strong limitations. Complex text rendering
can only work to the extent we can squeeze it into these limitations.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Eli Zaretskii via Unicode
> Date: Sat, 9 Feb 2019 18:42:52 +0100
> Cc: unicode Unicode Discussion 
> From: Egmont Koblinger via Unicode 
> 
> What if harfbuzz tells us that the overall width for rendering a
> particular grapheme cluster is significantly different from its
> designated area (the number of character cells [wcswidth()]
> multiplied by the width of each)?

You need to use what HarfBuzz tells you _instead_ of wcswidth.  It is
in general wrong to use wcswidth or anything similar when you use a
shaping engine and support complex script shaping.



Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Egmont Koblinger via Unicode
Hi Richard,

On Sat, Feb 9, 2019 at 3:08 PM Richard Wordingham via Unicode
 wrote:

> It would be good to be able to access a maintained statement of the
> VTE rules for allocating characters to a cell, or group of cells, as
> appropriate.

What VTE did, up to a couple of days ago:

It opens the font, and measures the ASCII 33-126 or so characters,
takes their average size (well, in case of monospace font, they should
all have the same size), this determines the cell size.

Then every character cell is rendered individually, using Pango or
Cairo or I'm not sure what exactly – there are like three paths in the
source, the details are unclear to me. A cell might contain a base
character + nonspacing combining accents, these are passed together to
Pango and friends, so they render it as one unit. The glyph is aligned
to the left of its designated cell area, overflowing on the right (and
thus potentially overlapping with the next glyph) if it's wider than
its designated area.

As a special case, two adjacent cells might contain a double wide
(typically CJK) character, but it's not that special after all: it's
also displayed aligned to the left edge of its first cell.

What I improved a couple of days ago (to be released in vte-0.56), for
Devanagari and friends, although I know there's more than this to
address these scripts properly:

If a cell contains a regular letter, and the next cell contains a
spacing combining mark, then these two are passed to Pango in a single
step, that is, the spacing combining mark is applied around its base
letter by Pango as expected. (Previously the spacing combining mark
was rendered on its own, around a dotted circle, which was obviously
pretty bad.)
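
The two grouping rules described above can be sketched roughly as follows. This is only an illustration in Python, not VTE's actual code: the function names are made up, double-wide characters are ignored, and the classification relies on the Unicode general categories (Mn/Me for nonspacing marks, Mc for spacing marks) as an assumption about how the marks are told apart. The sample is the Khmer string discussed elsewhere in this thread.

```python
import unicodedata

def split_into_cells(text):
    """Older rule: nonspacing marks (Mn, Me) join the preceding cell;
    everything else, including spacing marks (Mc), starts a new cell."""
    cells = []
    for ch in text:
        if cells and unicodedata.category(ch) in ("Mn", "Me"):
            cells[-1] += ch
        else:
            cells.append(ch)
    return cells

def group_for_rendering(cells):
    """Newer rule: a cell that starts with a spacing combining mark (Mc)
    is handed to the renderer together with the cell before it."""
    runs = []
    for cell in cells:
        if runs and unicodedata.category(cell[0]) == "Mc":
            runs[-1].append(cell)
        else:
            runs.append([cell])
    return runs

khmer = "\u1780\u17d2\u1780\u17c1\u1780"  # KA, COENG, KA, SIGN E, KA
cells = split_into_cells(khmer)
runs = group_for_rendering(cells)
print(len(cells), "cells,", len(runs), "render runs")
```

With this sketch the string still occupies four cells, but SIGN E is rendered together with the KA before it instead of on its own around a dotted circle.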

What I'm working on currently, as you all know by now, is
BiDi-shuffling the cells before rendering them (hopefully for
vte-0.58).

This is how VTE works now, but it's by no means a specification, and
tailoring a font to this behavior is probably not the right approach.
Instead, VTE's behavior should be improved. We have a pending feature
request (which I've already linked) to use HarfBuzz for rendering the
glyphs, which would then render grapheme clusters beautifully. The
problem that I don't know how to address is: What if harfbuzz tells us
that the overall width for rendering a particular grapheme cluster is
significantly different from its designated area (the number of
character cells [wcswidth()] multiplied by the width of each)?
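
To make the question concrete, here is a small Python sketch of the arithmetic involved. It assumes the cell width is the average advance of ASCII 33..126 as described above; the advance values and the helper names are hypothetical, and nothing here calls HarfBuzz or VTE.

```python
import math

def cell_width(advances):
    """Average advance of the printable ASCII glyphs 33..126."""
    widths = [advances[cp] for cp in range(33, 127)]
    return sum(widths) / len(widths)

def cells_needed(cluster_advance, cell_w):
    """Smallest number of whole cells that could hold the shaped cluster."""
    return max(1, math.ceil(cluster_advance / cell_w))

def overflow(cluster_advance, allotted_cells, cell_w):
    """Amount by which the cluster spills past its designated area (0 if it fits)."""
    return max(0.0, cluster_advance - allotted_cells * cell_w)

# Hypothetical numbers: a monospace font whose ASCII glyphs are 10 units wide,
# and a shaped cluster that the shaper reports as 17 units wide.
ascii_advances = {cp: 10.0 for cp in range(33, 127)}
w = cell_width(ascii_advances)
print(cells_needed(17.0, w), overflow(17.0, 1, w))
```

The mismatch is exactly the gap between `cells_needed()` and the cell count the application assumed via wcswidth(); the sketch only measures it and does not decide what the terminal should do about it.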


cheers,
egmont







Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Richard Wordingham via Unicode
On Sat, 09 Feb 2019 09:42:09 +0200
Eli Zaretskii via Unicode  wrote:

> > Date: Sat, 9 Feb 2019 00:18:14 +
> > From: Richard Wordingham via Unicode 
> >   
> > > For character composition, you must have a shaping engine to talk
> > > to, and the shaper should tell you the width of each grapheme
> > > cluster it returns.  
> > 
> > (a) What defines the grapheme clusters?  The definition might be
> > terminal-specific.  
> 
> Well, the "you" above alluded to the terminal emulator, of course.
> The grapheme clusters are determined by the shaping engine that the
> emulator must call when appropriate (or always).

I find it very hard to believe that that is how it works with GNOME
Terminal (Version 3.18.3, using VTE Version 0.42.5).  At the command
line I typed in the Khmer script string ក្កេក (KA, COENG, KA, SIGN E,
KA), and saw the string split into four columns (KA, COENG), (KA),
(SIGN E), (KA), with each column given the same width. When written
correctly, SIGN E is first in visual order.  The fourth column was
displayed on top of the third column, which contained a dotted circle
to show that SIGN E on its own was not grammatically correct.  If I
were writing a Khmer font for use with Gnome terminal, I would attempt
to ensure that the display for SIGN E fitted in a single cell.

Of course, the renderer's grapheme cluster boundaries don't always
match appearances.  To get the traditional placement of U+1A58 TAI THAM
SIGN MAI KANG LAI, I end up with it being a mark glyph one cluster
later than HarfBuzz indicates it to be.

It would be good to be able to access a maintained statement of the
VTE rules for allocating characters to a cell, or group of cells, as
appropriate. 

> > (b) With a terminal that expects a fixed width font, surely the
> > terminal decides how many cells it allocates to a group of
> > characters, and the font designer has to come up with a suitable
> > value based on that.   
> 
> Yes.  A terminal emulator that works with a shaper should probably
> post-process the width information returned by the shaper for these
> purposes.

Perhaps it should base the number of cells on the width of the
clusters.  However, continuing with my example, U+1789 KHMER LETTER NYO
as a base character is too wide to fit in a cell, and the next
character will overwrite its right-hand part. From this I deduce that it
is allocated just one cell.  Gnome terminal is not alone in doing this,
but it does better than some, in my opinion, in that the overflow of the
foreground of one cell is not obliterated by the background of the
next cell.  U+1789 has an East Asian width property of 'Neutral', which
is distinctly unhelpful.
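
The property value can be checked with Python's unicodedata module. The width function below is only a rough approximation of what a wcwidth-style lookup does (wide and fullwidth characters take two cells, nonspacing marks take none, everything else one); it is an assumption for illustration, not glibc's or any terminal's actual table.

```python
import unicodedata

def approx_cell_count(text):
    """Rough wcswidth-like estimate based on character properties only."""
    total = 0
    for ch in text:
        if unicodedata.category(ch) in ("Mn", "Me"):
            continue  # nonspacing marks share the cell of their base
        total += 2 if unicodedata.east_asian_width(ch) in ("W", "F") else 1
    return total

nyo = "\u1789"
print(unicodedata.name(nyo),
      unicodedata.east_asian_width(nyo),
      approx_cell_count(nyo))
```

Under such a rule a 'Neutral' character always gets one cell, regardless of how wide its glyph is in any given font.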

What I would like is a specification of what a font must do to avoid
such problems.

> > >  I don't see how you can expect wcwidth, or any other
> > > interface that was designed to work with _characters_, to be
> > > useful when you need to display grapheme clusters.  

It, or something similar but worse, gets used, especially when moving
the cursor for editing.

> > Well I can envisage a decision being made that a grapheme cluster
> > str (as decreed by the terminal) shall occupy wcswidth(str) cells -
> > "The wcswidth() function returns the number of column positions for
> > the wide-character string s, truncated to at most length n".  
> 
> AFAIU, the shaping engine returns its output in terms of font glyph
> numbers, not character codepoints, so you cannot in general call
> wcswidth on them.  The shaper also returns the advance information,
> which serves instead of wcwidth and related APIs for determining the
> actual width on display.

Unfortunately, when the rectangular grid is being preserved,
typographical advance width is generally ignored when determining the
placement of characters.  Now, this is not always true; one can have
the situation where the positioning of characters respects the
advance widths, but the positioning of the cursor assumes a fixed-width
rectangular grid.  I have found working with that to be extremely
confusing.

Richard.



Re: Bidi paragraph direction in terminal emulators

2019-02-09 Thread Eli Zaretskii via Unicode
> From: Elias Mårtenson 
> Date: Sat, 9 Feb 2019 13:33:49 +0800
> Cc: Egmont Koblinger , unicode 
> 
>  Moreover, emitting the control sequences that set the mode is in
>  itself a complication, because if the terminal doesn't support them,
>  the result could be corrupted display.  You will need methods of
>  detecting the support, and those detection methods usually involve
>  sending another control sequence to the terminal and waiting for
>  response, something that complicates applications and causes delays in
>  displaying output.
> 
> That's what the TERM environment variable is for though.

That's not indicative enough when some version of a terminal starts to
support a feature not supported by previous versions of the same
terminal.  Happens a lot with terminal emulators such as xterm, which
are under active development, and add features all the time.
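
For illustration, a query-and-wait probe of the kind described above might look like the Python sketch below. It uses the DEC "request mode" form (CSI Ps $ p, answered by CSI Ps ; Pm $ y) for ANSI mode 8, which ECMA-48 assigns to BDSM. Whether a particular terminal answers this query for mode 8 is an assumption, and the function names and the timeout value are made up for the example.

```python
import os
import re
import select

CSI = b"\x1b["

def parse_decrpm(reply, mode):
    """Return the status digit from a 'CSI mode ; status $ y' reply, else None."""
    m = re.search(rb"\x1b\[" + str(mode).encode() + rb";(\d)\$y", reply)
    return int(m.group(1)) if m else None

def query_mode(read_fd, write_fd, mode, timeout=0.2):
    """Send a mode request and wait up to `timeout` seconds for the report.
    None means no usable answer arrived, i.e. support could not be confirmed."""
    os.write(write_fd, CSI + str(mode).encode() + b"$p")
    ready, _, _ = select.select([read_fd], [], [], timeout)
    if not ready:
        return None
    return parse_decrpm(os.read(read_fd, 64), mode)
```

The timeout is where the delay comes from: a terminal that does not know the query stays silent, so the application has to wait out the whole interval before it can conclude anything.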


Re: Bidi paragraph direction in terminal emulators

2019-02-08 Thread Eli Zaretskii via Unicode
> Date: Sat, 9 Feb 2019 00:18:14 +
> From: Richard Wordingham via Unicode 
> 
> > For character composition, you must have a shaping engine to talk to,
> > and the shaper should tell you the width of each grapheme cluster it
> > returns.
> 
> (a) What defines the grapheme clusters?  The definition might be
> terminal-specific.

Well, the "you" above alluded to the terminal emulator, of course.
The grapheme clusters are determined by the shaping engine that the
emulator must call when appropriate (or always).

> (b) With a terminal that expects a fixed width font, surely the
> terminal decides how many cells it allocates to a group of characters,
> and the font designer has to come up with a suitable value based on
> that. 

Yes.  A terminal emulator that works with a shaper should probably
post-process the width information returned by the shaper for these
purposes.

> >  I don't see how you can expect wcwidth, or any other
> > interface that was designed to work with _characters_, to be useful
> > when you need to display grapheme clusters.
> 
> Well I can envisage a decision being made that a grapheme cluster str
> (as decreed by the terminal) shall occupy wcswidth(str) cells - "The
> wcswidth() function returns the number of column positions for the
> wide-character string s, truncated to at most length n".

AFAIU, the shaping engine returns its output in terms of font glyph
numbers, not character codepoints, so you cannot in general call
wcswidth on them.  The shaper also returns the advance information,
which serves instead of wcwidth and related APIs for determining the
actual width on display.


Re: Bidi paragraph direction in terminal emulators

2019-02-08 Thread Elias Mårtenson via Unicode
On Wed, 6 Feb 2019, 00:09 Eli Zaretskii via Unicode 
> Moreover, emitting the control sequences that set the mode is in
> itself a complication, because if the terminal doesn't support them,
> the result could be corrupted display.  You will need methods of
> detecting the support, and those detection methods usually involve
> sending another control sequence to the terminal and waiting for
> response, something that complicates applications and causes delays in
> displaying output.
>

That's what the TERM environment variable is for though.

Regards,
Elias


Re: Bidi paragraph direction in terminal emulators

2019-02-08 Thread Richard Wordingham via Unicode
On Sat, 09 Feb 2019 00:16:30 +0200
Eli Zaretskii via Unicode  wrote:

> > Date: Fri, 8 Feb 2019 21:55:58 +
> > From: Richard Wordingham via Unicode 

> > I will give a concrete application.  If I want to make a font that
> > is interpretable for Tai Tham and maximally usable with VTE, what
> > are the VTE-specific constraints for me to be able to use it for
> > Tai Tham when using basic text utilities?  For example, if VTE
> > decides that for  as two clusters
> >  and , can I nevertheless position
> > the above-matra above the ?  

> For character composition, you must have a shaping engine to talk to,
> and the shaper should tell you the width of each grapheme cluster it
> returns.

(a) What defines the grapheme clusters?  The definition might be
terminal-specific.

(b) With a terminal that expects a fixed width font, surely the
terminal decides how many cells it allocates to a group of characters,
and the font designer has to come up with a suitable value based on
that. 

>  I don't see how you can expect wcwidth, or any other
> interface that was designed to work with _characters_, to be useful
> when you need to display grapheme clusters.

Well I can envisage a decision being made that a grapheme cluster str
(as decreed by the terminal) shall occupy wcswidth(str) cells - "The
wcswidth() function returns the number of column positions for the
wide-character string s, truncated to at most length n".

Richard.


Re: Bidi paragraph direction in terminal emulators

2019-02-08 Thread Eli Zaretskii via Unicode
> Date: Fri, 8 Feb 2019 21:55:58 +
> From: Richard Wordingham via Unicode 
> 
> > > What's the sledgehammer for Windows?  
> 
> > Not sure what you meant.  "M-x term" doesn't work on Windows.
> 
> So my question is, 'What do I use on Windows?'  The application may be
> disproportionate to the function I use it for.

Try "M-x shell".  Most of "M-x term" is not needed on Windows anyway,
because the Windows console doesn't support SGR escapes and other
curses-like functionalities, at least not yet.

> I will give a concrete application.  If I want to make a font that is
> interpretable for Tai Tham and maximally usable with VTE, what are the
> VTE-specific constraints for me to be able to use it for Tai Tham when
> using basic text utilities?  For example, if VTE decides that for
>  as two clusters  and  above-matra>, can I nevertheless position the above-matra above the
> ?

For character composition, you must have a shaping engine to talk to,
and the shaper should tell you the width of each grapheme cluster it
returns.  I don't see how you can expect wcwidth, or any other
interface that was designed to work with _characters_, to be useful
when you need to display grapheme clusters.


Re: Bidi paragraph direction in terminal emulators

2019-02-08 Thread Richard Wordingham via Unicode
On Fri, 08 Feb 2019 11:34:29 +0200
Eli Zaretskii via Unicode  wrote:

> > Date: Fri, 8 Feb 2019 06:40:44 +
> > From: Richard Wordingham via Unicode 
> >   
> > > I, for one, am not to the slightest bit interested in abandoning
> > > the character grid and allowing for proportional fonts. This
> > > would just break a gazillion of things.  
> > 
> > The message I take from that and this thread in general is that
> > Emacs and 'M-x term' are the route to take if one only has
> > proportional fonts.  
> 
> Not sure why.  There are terminal emulators out there which support
> proportional fonts.  Emacs is perhaps the only one whose terminal
> emulator currently supports bidi more or less in full, but is that
> related to proportional fonts?

Emacs is the one I know that can be made to support Indic fonts.  It's
rather a big tool for such a relatively minor task, which is why I
implicitly called it a sledgehammer.

> > What's the sledgehammer for Windows?  

> Not sure what you meant.  "M-x term" doesn't work on Windows.

So my question is, 'What do I use on Windows?'  The application may be
disproportionate to the function I use it for.

> > Where do I find the specification for fixed-width fonts (is
> > wcswidth() the core?) and how do I select the set of fonts to use?
> > Do I need to use fontconfig where available?  

> That depends on the underlying C library and other facilities;
> basically on your OS.  AFAIK wcwidth will give the results consistent
> with the UCD only if you use glibc.  In Emacs, you have the functions
> char-width and string-width that take their data from
> EastAsianWidth.txt.  Not sure about other facilities, and I don't
> really understand what environment are you asking about -- are you
> talking about C/C++ programs?

I will give a concrete application.  If I want to make a font that is
interpretable for Tai Tham and maximally usable with VTE, what are the
VTE-specific constraints for me to be able to use it for Tai Tham when
using basic text utilities?  For example, if VTE decides that for
 as two clusters  and , can I nevertheless position the above-matra above the
?

Richard. 


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-08 Thread Egmont Koblinger via Unicode
On Fri, Feb 8, 2019 at 10:36 PM Eli Zaretskii  wrote:

> No one in their right minds will run Emacs inside the Emacs terminal
> emulator.  And even for other applications, disabling bidi will almost
> always be needed only for full-screen programs, which use curses-like
> libraries to address the entire screen.  So you'd switch off
> reordering for the entire time you are running such an app, then
> switch it back on after exiting.

Exactly.

But the question is: should it be the user to manually switch it
on/off, or should it happen for them automatically under the hood? If
the latter, how? My BiDi proposal answers this. Do you have another
possible answer?

> Are there any terminal emulators that support these sequences?

Prior to my specs: Not that I'm aware of. As of my work being
available: at least VTE and Mintty are working on it, and I know that
iTerm2 was also waiting for some specification. I'm sincerely hoping
for even more to follow.


e.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-08 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Fri, 8 Feb 2019 17:44:53 +0100
> Cc: Richard Wordingham , 
>   unicode Unicode Discussion 
> 
> For certain apps, one of the modes is required (e.g. for cat it's the
> implicit mode). For other tasks it's the other mode (e.g. for emacs
> the explicit mode).

No one in their right minds will run Emacs inside the Emacs terminal
emulator.  And even for other applications, disabling bidi will almost
always be needed only for full-screen programs, which use curses-like
libraries to address the entire screen.  So you'd switch off
reordering for the entire time you are running such an app, then
switch it back on after exiting.

The other, simpler text applications will always need reordering to be
active.

> > You can hardly expect Emacs (or any other application) to support
> > control sequences that are not yet defined, let alone standardized.
> 
> The most essential sequence, BDSM to switch between implicit and
> explicit modes, has been defined for like 28 years now. Sure I bring
> slight changes and clarifications to it, as well as introduce new
> ones. As of my recommendation which I've announced, these new ones are
> defined as well.

Are there any terminal emulators that support these sequences?

> > When they become sufficiently widely available, I'm sure someone will
> > add them to Emacs.
> 
> There's always a chicken and egg problem with this attitude. At the
> very least, I'm kindly asking Emacs to emit BDSM so that when it's
> fired up on a gnome-terminal, it'll have the terminal's BiDi
> automatically disabled.

Feel free to file a feature request with the Emacs bug tracker about
this.  Somebody, maybe even myself, is likely to act on that at some
point.


Columns in Terminal Emulators (was: Bidi paragraph direction in terminal emulators)

2019-02-08 Thread Richard Wordingham via Unicode
On Fri, 08 Feb 2019 15:45:15 +0200
Eli Zaretskii via Unicode  wrote:

> > From: Egmont Koblinger 
> > Date: Fri, 8 Feb 2019 13:30:42 +0100
> > Cc: Richard Wordingham , 
> > unicode Unicode Discussion 
> > 
> > Hi Eli,
> >   
> > > Not sure why.  There are terminal emulators out there which
> > > support proportional fonts.  
> > 
> > Well, of course, a terminal emulator can load any font, even
> > proportional, but as it places them in the grid, it will look ugly
> > as hell  
> 
> Maybe so, but the original text was this:
> 
>   Emacs and 'M-x term' are the route to take if one only has
>   proportional fonts.
> 
> Which I don't understand, since the terminal emulator in Emacs doesn't
> do anything special about proportional fonts, AFAIK.

As a terminal emulator, it does.  It abandons straight columns to
honour the spacing glyphs' widths.  It neither inappropriately
truncates nor inappropriately overlaps glyphs.  These avoided
treatments don't just make text ugly; they can make it unreadable.

Richard.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-08 Thread Egmont Koblinger via Unicode
Hi Eli,

> Why would they want to toggle it back and forth?  What are the use
> cases where it makes sense to mix both modes?  IME, you either need
> one or the other, never both.

(Back to the basics, which are mentioned pretty clearly in my
specification, I believe, and I've also described here multiple
times... sigh.)

For certain apps, one of the modes is required (e.g. for cat it's the
implicit mode). For other tasks it's the other mode (e.g. for emacs
the explicit mode).

In a typical terminal session, you don't just use one of these kinds
of commands. You use various commands in a sequence, e.g. a cat
followed by an emacs, then a zip, then whatnot, then emacs again, then
a cat and a grep, etc...

The very last thing I would want to do as a user is to toggle some
setting back and forth, let alone remember which command needs which
mode.

> You can hardly expect Emacs (or any other application) to support
> control sequences that are not yet defined, let alone standardized.

The most essential sequence, BDSM to switch between implicit and
explicit modes, has been defined for like 28 years now. Sure I bring
slight changes and clarifications to it, as well as introduce new
ones. As of my recommendation which I've announced, these new ones are
defined as well.

It's probably never going to be a de jure standard, adopted by ECMA or
whatever "authority", but that's not what happens anywhere else in
terminal emulators nowadays. An "authority" which doesn't keep up to
date with innovations, doesn't have a feedback forum, and hasn't
released a new version for 28 years, is clearly not suitable for
making progress.

We have just announced a public forum called "Terminal WG" for
terminal emulator developers to collaborate and join their efforts
wrt. new extensions, rather than ad-hoc collaborations or each going
their own separate ways. We'd like its work to be widely accepted as a
basis for the desired behavior. My BiDi work is one of the works
hosted there. It'll probably never be an "authority" like ECMA, but
hopefully will be some kind of well-respected place of specs to adhere
to.

> When they become sufficiently widely available, I'm sure someone will
> add them to Emacs.

There's always a chicken and egg problem with this attitude. At the
very least, I'm kindly asking Emacs to emit BDSM so that when it's
fired up on a gnome-terminal, it'll have the terminal's BiDi
automatically disabled. This has nothing to do yet with Emacs's
built-in terminal emulator. Addressing that is sure a much bigger
chunk of work; I hope it'll happen if my BiDi proposal indeed turns
out to be successful.
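
What is being asked of a full-screen application can be sketched in a few lines of Python. ECMA-48 defines BDSM as mode 8, with the set state (CSI 8 h) meaning implicit and the reset state (CSI 8 l) meaning explicit. The helper name is hypothetical, and restoring implicit mode on exit is an assumption about what the session had before the application started.

```python
import contextlib
import io
import sys

BDSM_EXPLICIT = "\x1b[8l"  # reset mode 8: the terminal must not reorder
BDSM_IMPLICIT = "\x1b[8h"  # set mode 8: the terminal runs the BiDi algorithm

@contextlib.contextmanager
def explicit_bidi(out=sys.stdout):
    """Switch the terminal to explicit mode for the lifetime of a full-screen app."""
    out.write(BDSM_EXPLICIT)
    out.flush()
    try:
        yield
    finally:
        # Assumption: the surrounding shell session wants implicit mode back.
        out.write(BDSM_IMPLICIT)
        out.flush()

# Usage, with a buffer standing in for the terminal:
buf = io.StringIO()
with explicit_bidi(buf):
    buf.write("full-screen UI drawn here")
print(repr(buf.getvalue()))
```

A terminal that does not implement mode 8 would normally just ignore both sequences, which is why emitting them is a small step compared with implementing the mode in an emulator.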


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-08 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Fri, 8 Feb 2019 15:42:51 +0100
> Cc: Richard Wordingham , 
>   unicode Unicode Discussion 
> 
> On Fri, Feb 8, 2019 at 3:28 PM Eli Zaretskii  wrote:
> 
> > You can have what you call the "explicit mode" if you set the variable
> > bidi-display-reordering to nil.
> 
> So, if someone is running a mixture of applications requiring implicit
> vs. explicit modes, they'll have to continuously toggle the setting of
> their terminal back and forth.

Why would they want to toggle it back and forth?  What are the use
cases where it makes sense to mix both modes?  IME, you either need
one or the other, never both.

In any case, I'm just trying to help you map your requirements into
existing Emacs features.  If this is not helpful, feel free to
disregard.

> Now, I, as a user, want BiDi to work as seamlessly as possible,
> definitely without me having to repeatedly switch a setting back and
> forth if the applications could just as well do it automatically. One
> of the basics of my spec.
> 
> Whether Emacs will adopt this, or will keep requiring users to toggle
> this setting back and forth depending on the particular app they wish
> to run, is not my call.

You can hardly expect Emacs (or any other application) to support
control sequences that are not yet defined, let alone standardized.
When they become sufficiently widely available, I'm sure someone will
add them to Emacs.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-08 Thread Egmont Koblinger via Unicode
On Fri, Feb 8, 2019 at 3:28 PM Eli Zaretskii  wrote:

> You can have what you call the "explicit mode" if you set the variable
> bidi-display-reordering to nil.

So, if someone is running a mixture of applications requiring implicit
vs. explicit modes, they'll have to continuously toggle the setting of
their terminal back and forth. Just as for Konsole and friends there's
a graphical setting, correspondingly for Emacs's terminal there's this
bidi-display-reordering setting.

Now, I, as a user, want BiDi to work as seamlessly as possible,
definitely without me having to repeatedly switch a setting back and
forth if the applications could just as well do it automatically. One
of the basics of my spec.

Whether Emacs will adopt this, or will keep requiring users to toggle
this setting back and forth depending on the particular app they wish
to run, is not my call.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-08 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Fri, 8 Feb 2019 14:57:56 +0100
> Cc: Richard Wordingham , 
>   unicode Unicode Discussion 
> 
> According to the description you give, Emacs's terminal always applies
> the BiDi algorithm, therefore by its design only implements what I
> call "implicit mode", and not the "explicit mode".

You can have what you call the "explicit mode" if you set the variable
bidi-display-reordering to nil.  This only supports the LTR explicit
mode, though.  Personally, I don't see when would the RTL explicit
mode be useful: there's no RTL-only text in real life, so some
reordering is always required.  But maybe I'm missing something.

> I'm making the strong claim that by running the UBA a terminal
> emulator doesn't become BiDi aware, there's much more it needs to do.

Like I said, you are welcome to test the rest of your requirements and
ask questions if you think something is not supported or isn't working
as expected.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-08 Thread Egmont Koblinger via Unicode
Hi Eli,

> Emacs implements the latest UBA from Unicode 11; and the Emacs
> terminal emulator inserts all the text into a "normal" Emacs buffer,
> and displays that buffer as any other buffer.  So yes, you have there
> full UBA support.

One of the essentials of my work is that there's much more to BiDi in
terminal emulators than running the UBA. If one takes a step backwards
to look at the big picture, it becomes clear that in some cases the
UBA needs to be run, while in other cases it mustn't. And then of
course there needs to be some means of switching, and so on...

According to the description you give, Emacs's terminal always applies
the BiDi algorithm, therefore by its design only implements what I
call "implicit mode", and not the "explicit mode".

On the other hand, in order to run Emacs inside a terminal emulator,
you need to set that terminal emulator to explicit mode, so that it
doesn't reshuffle the characters. The behavior it expects from the
outer terminal doesn't match the behavior it provides in its inner
one. As an interesting consequence, if you open Emacs, then inside it
a terminal emulator, and then inside it an Emacs, it will display BiDi
incorrectly, in reversed order.
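
The double reordering can be shown with a toy model in Python. The function below is not the UBA: it merely reverses each run of strong right-to-left characters, which is enough to imitate the algorithm for a simple LTR paragraph. The name and the sample string are made up for the illustration.

```python
import itertools
import unicodedata

def toy_reorder(line):
    """Reverse each maximal run of strong RTL characters (toy stand-in for the UBA)."""
    out = []
    for is_rtl, run in itertools.groupby(
            line, key=lambda ch: unicodedata.bidirectional(ch) in ("R", "AL")):
        chars = list(run)
        out.extend(reversed(chars) if is_rtl else chars)
    return "".join(out)

logical = "abc \u05d0\u05d1\u05d2 def"   # stored (logical) order
once = toy_reorder(logical)              # what the inner application emits
twice = toy_reorder(once)                # what an implicit-mode terminal then shows
print(twice == logical)
```

Reordering twice brings the Hebrew run back to logical order, which on screen reads backwards; that is the effect of an application that already emits visual order running inside a terminal that reorders again.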

I'm making the strong claim that by running the UBA a terminal
emulator doesn't become BiDi aware, there's much more it needs to do.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-08 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Fri, 8 Feb 2019 13:30:42 +0100
> Cc: Richard Wordingham , 
>   unicode Unicode Discussion 
> 
> Hi Eli,
> 
> > Not sure why.  There are terminal emulators out there which support
> > proportional fonts.
> 
> Well, of course, a terminal emulator can load any font, even
> proportional, but as it places them in the grid, it will look ugly as
> hell

Maybe so, but the original text was this:

  Emacs and 'M-x term' are the route to take if one only has
  proportional fonts.

Which I don't understand, since the terminal emulator in Emacs doesn't
do anything special about proportional fonts, AFAIK.

> In Emacs-25.2's terminal emulator I executed "cat TUTORIAL.he". For
> the entire contents, LTR paragraph direction was used and was aligned
> to the left. Maybe something has changed for 26.x, I don't know.

I told you what changed: Emacs 25 forces LTR paragraph direction,
whereas Emacs 26 and later does not.  You can get dynamic paragraph
direction in your Emacs 25 as well if you set bidi-paragraph-direction
to nil in the *term* buffer.

> And now you suddenly tell that Emacs's terminal supports BiDi more or
> less in full???

Emacs implements the latest UBA from Unicode 11; and the Emacs
terminal emulator inserts all the text into a "normal" Emacs buffer,
and displays that buffer as any other buffer.  So yes, you have there
full UBA support.  I thought this was clear, sorry if it wasn't.  One
caveat with this is that the Emacs emulator works only on Posix
platforms, it doesn't work on MS-Windows.

> Sorry, I just don't buy it. If you retain this claim, I'd pretty
> please like to see a specification of its behavior

The specification is the latest version of the UBA, augmented with
three deviations, two of them allowed by the UBA, the third isn't:

  . Emacs uses HLA1 for determining base paragraph direction: it
    decides on base direction only once for every chunk of text
    delimited by empty lines;

  . Emacs doesn't by default remove bidi formatting controls from
    display;

  . Emacs wraps long lines _after_ reordering, not before.

I think that's it.  If I forget something, please forgive me: I
implemented this 10 years ago, so maybe something evades me at the
moment.
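
The first deviation, one base direction per chunk of text delimited by empty lines, can be sketched as follows. This is not Emacs's implementation: it applies the UBA's "first strong character" rule to each chunk, ignores isolates and embeddings, and the function names and the LTR fallback are assumptions made for the example.

```python
import re
import unicodedata

def base_direction(chunk, default="L"):
    """Direction of the first strong character in the chunk, else `default`."""
    for ch in chunk:
        bidi = unicodedata.bidirectional(ch)
        if bidi == "L":
            return "L"
        if bidi in ("R", "AL"):
            return "R"
    return default

def chunk_directions(text):
    """Split on empty lines and decide the direction once per chunk."""
    chunks = [c for c in re.split(r"\n\s*\n", text) if c.strip()]
    return [(base_direction(c), c) for c in chunks]

sample = "\u05e9\u05dc\u05d5\u05dd abc\nsecond line\n\nhello \u05e9\u05dc\u05d5\u05dd"
for direction, chunk in chunk_directions(sample):
    print(direction, repr(chunk))
```

Note that "second line" inherits the RTL direction of its chunk even though it contains only Latin letters, which is the visible difference from deciding the direction line by line.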

> one which addresses at least all the major issues I address in
> my work, one which I could replace my work with, one which I'd be
> happy to implement in gnome-terminal in the solid belief that it's
> about as good as my proposal, and would wholeheartedly recommend for
> other terminal emulators to adopt.
> 
> Or maybe, by any chance, when you said Emacs's terminal supported BiDi
> more or less in full, did you perhaps go with your own idea of what a
> BiDi-aware terminal emulator needs to support; ignoring all those
> things I detail in my work, such as the inevitable need for explicit
> mode, the need for deciding the scope of implicit vs. explicit mode,
> and much more?

Sorry, I cannot afford testing everything you wrote in your
specification.  I think most, if not all, of that is covered, but I
certainly didn't test that, so maybe I'm wrong.  Please feel free to
test the relevant aspects and ask questions if you need more "inside
information".  I do hope that my impression about "most everything
being supported" is correct, because that would give you a working
implementation/prototype of most of the features you want to see in
terminal emulators, so you could actually try the behavior to see if
it's convenient, causes problems, etc.

One other feature you may find interesting (something that I don't
think you covered in your document, at least not explicitly) is that
Emacs supports visual-order cursor motion, in addition to the "usual"
logical-order.  The latter is, of course, the default, but you can
switch to the former if you set the visual-order-cursor-movement
option to a non-nil value.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-08 Thread Egmont Koblinger via Unicode
Hi Philippe,

> Adding a single bit of protection in cell attributes to indicate they are 
> either protected or become transparent (and the rest of the 
> attributes/character field indicates the id of another terminal grid or 
> rendering plugin creating its own layer and having its own scrolling state
> and dimensions) can allow convenient things, including the possibility of
> managing a grid-based system of stackable windows.
> You can design one of the layers to allow input (managed directly in the
> terminal, with local echo without transmission delays and without risks of
> overwriting surrounding contents).

At this point you're already touching much more the core of terminal
emulator behavior than e.g. my BiDi work does, it's a way more
essential, way more complex change – with much less clear goal to me,
like, why should emulators implement it, why would applications start
using it etc. If you wish to go for this direction, good luck!

(If anything, what I do see as somewhat feasible is building up
something from scratch that looks much more like a proportional-font
text editing widget, or even a rich text editor, rather than terminal
emulator, and figure out step by step how to get a shell and simple
utilities and later more complex utilities run in that. This could be
a new platform which, by putting decades of hard work in it – which I
cannot do voluntarily –, could eventually replace terminal emulators.)

Philippe, I hate to say it, but at the risk of being impolite, I just
have to. Your ideas would take terminal emulators extremely far from
what they are now, with no clear goals or feasibility that I can see,
and they are no longer relevant to BiDi at all. All I see is that
we're wasting each other's time on utterly irrelevant topics. Since I
see exactly zero chance of any worthwhile takeaway coming out of this,
I unfortunately can no longer devote my limited free time to it and
have to quit this conversation between the two of us. I'm really
sorry.


best regards,
egmont



Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-08 Thread Egmont Koblinger via Unicode
Hi Eli,

> Not sure why.  There are terminal emulators out there which support
> proportional fonts.

Well, of course, a terminal emulator can load any font, even
proportional, but as it places them in the grid, it will look ugly as
hell (like this one: https://askubuntu.com/q/781327/398785 ). Sure you
could apply some tricks to make it look a bit less terrible (e.g. by
centering each glyph in its cell rather than aligning to the left),
but it still won't look great.

In the world of terminal emulation, many applications expect things to
align properly according to the wcwidth() of the string they emit. You
abandon this (start placing the glyphs one after the other in a row,
no matter how wide they are), and plenty of applications suddenly fall
apart big time (let alone questions like how you define the terminal's
width in characters).
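
To illustrate what "align according to wcwidth()" means on a
character grid, here is a rough Python sketch. The helper names are
hypothetical, and the width rule (0 for combining marks, 2 for East
Asian Wide/Fullwidth, 1 otherwise) is only an approximation of what a
libc wcwidth() returns; control characters and zero-width characters
with combining class 0 are not handled.

```python
import unicodedata


def cell_width(ch):
    """Approximate number of grid cells one character occupies."""
    if unicodedata.combining(ch):
        return 0  # combining marks share the cell of their base
    if unicodedata.east_asian_width(ch) in ("W", "F"):
        return 2  # wide and fullwidth characters take two cells
    return 1


def string_width(text):
    """Approximate width of text in grid cells."""
    return sum(cell_width(ch) for ch in text)


def pad_to_columns(text, columns):
    """Right-pad text with spaces so that it fills `columns` cells."""
    return text + " " * max(0, columns - string_width(text))


# A two-column table only lines up if padding counts cells rather than
# code points: the CJK name is 2 characters long but 4 cells wide.
for name, value in [("abc", "1"), ("\u65e5\u672c", "2"), ("e\u0301", "3")]:
    print(pad_to_columns(name, 8) + value)
```

An application that pads like this relies on the terminal advancing
by exactly that many cells; a terminal that lays glyphs out by their
proportional advance widths breaks this assumption.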

> Emacs is perhaps the only one whose terminal
> emulator currently supports bidi more or less in full

Let's not get started from here, please.

In Emacs-25.2's terminal emulator I executed "cat TUTORIAL.he". For
the entire contents, LTR paragraph direction was used and was aligned
to the left. Maybe something has changed for 26.x, I don't know.

In my work I carefully evaluated 4 other "BiDi-aware" terminal
emulators, as well as an ancient specification for BiDi which I had to
read about twenty times to pretty much understand what it's
talking about. I identified substantial issues with both the standard
and all the independent implementations (which didn't care about
this standard at all). I show that existing terminal emulators are
incompatible to the extent that an app cannot reliably print any RTL
text by any means at all. At this point I firmly believe it should be
clear that BiDi in terminals is not a topic where one can just go
ahead and do something without having a specification first. I lay
down principles which I believe a proper BiDi-supporting platform
needs to meet, argue why multiple modes (explicit and implicit) are
inevitable, examine what to do with paragraph direction, cursor
location and tons of other issues, and come up with concrete
suggestions for how (partially based on that ancient specification)
all of these should be addressed.

Then, after putting literally months of work in it, I come here to
announce my work and ask for feedback. So far, from a thread of 100+
mails, I take away two pieces of worthwhile feedback: one is that
shaping should be done differently, and the other one is that – for
some use cases – a bigger scope of data should be used for
autodetecting the "paragraph direction" (as per UBA's terminology).

And now you suddenly tell that Emacs's terminal supports BiDi more or
less in full???

Sorry, I just don't buy it. If you retain this claim, I'd pretty
please like to see a specification of its behavior, one which
addresses at least all the major issues I address in my work, one
which I could replace my work with, one which I'd be happy to
implement in gnome-terminal in the solid belief that it's about as
good as my proposal, and would wholeheartedly recommend for other
terminal emulators to adopt.

Or maybe, by any chance, when you said Emacs's terminal supported BiDi
more or less in full, did you perhaps go with your own idea of what a
BiDi-aware terminal emulator needs to support; ignoring all those
things I detail in my work, such as the inevitable need for explicit
mode, the need for deciding the scope of implicit vs. explicit mode,
and much more?


thanks a lot,
egmont



Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-08 Thread Eli Zaretskii via Unicode
> Date: Fri, 8 Feb 2019 06:40:44 +
> From: Richard Wordingham via Unicode 
> 
> > I, for one, am not to the slightest bit interested in abandoning the
> > character grid and allowing for proportional fonts. This would just
> > break a gazillion of things.
> 
> The message I take from that and this thread in general is that Emacs
> and 'M-x term' are the route to take if one only has proportional fonts.

Not sure why.  There are terminal emulators out there which support
proportional fonts.  Emacs is perhaps the only one whose terminal
emulator currently supports bidi more or less in full, but is that
related to proportional fonts?

> What's the sledgehammer for Windows?

Not sure what you meant.  "M-x term" doesn't work on Windows.

> Where do I find the specification for fixed-width fonts (is
> wcswidth() the core?) and how do I select the set of fonts to use?  Do I
> need to use fontconfig where available?

That depends on the underlying C library and other facilities;
basically on your OS.  AFAIK wcwidth will give results consistent
with the UCD only if you use glibc.  In Emacs, you have the functions
char-width and string-width that take their data from
EastAsianWidth.txt.  Not sure about other facilities, and I don't
really understand what environment you are asking about -- are you
talking about C/C++ programs?


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Eli Zaretskii via Unicode
> Date: Thu, 7 Feb 2019 22:35:23 +
> From: Richard Wordingham via Unicode 
> 
> > > Do you mean you aim to maintain a regex that matches everyone's
> > > prompt in the world, without a significant amount of false positive
> > > matches on non-prompt lines?  
> 
> > Yes.
> 
> Wow!  You'll do well to match a prompt such as '2p ', which I used for
> a while.

Like I said: for any reasonable prompt that doesn't match, you can
report a bug, and have the Emacs maintainers deliberate whether your
case is important enough to be supported by default.  Failing that,
you can set the regexp to a suitable value in a mode hook defined in
your init file.
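
To make the trade-off concrete, here is a small Python sketch of
prompt detection by regular expression.  The generic pattern is only
an illustrative shell-prompt regexp, not necessarily the value Emacs
ships in term-prompt-regexp, and is_prompt_line and both pattern
names are hypothetical.

```python
import re

# Illustrative generic pattern: anything without prompt characters,
# followed by one of # $ % > and optional spaces.
GENERIC_PROMPT = re.compile(r"^[^#$%>\n]*[#$%>] *")

# The same pattern extended with a user-specific alternative for
# prompts of the form "<digits>p ", such as "2p ".
CUSTOM_PROMPT = re.compile(r"^(?:[^#$%>\n]*[#$%>] *|\d+p )")


def is_prompt_line(line, pattern):
    """Return True if the line starts with something matching pattern."""
    return pattern.match(line) is not None


samples = [
    "user@host:~$ ls",   # ordinary shell prompt
    "2p ls",             # unusual prompt: missed by the generic pattern
    "100% done",         # program output: a false positive
    "compiling foo.c",   # program output: correctly rejected
]
for line in samples:
    print(repr(line),
          is_prompt_line(line, GENERIC_PROMPT),
          is_prompt_line(line, CUSTOM_PROMPT))
```

The sketch shows both failure modes discussed in this thread: an
unusual prompt that needs a customized regexp, and a non-prompt line
that a permissive regexp matches anyway.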


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-07 Thread Richard Wordingham via Unicode
On Fri, 8 Feb 2019 00:38:24 +0100
Egmont Koblinger via Unicode  wrote:

> I, for one, am not to the slightest bit interested in abandoning the
> character grid and allowing for proportional fonts. This would just
> break a gazillion of things.

The message I take from that and this thread in general is that Emacs
and 'M-x term' are the route to take if one only has proportional fonts.
What's the sledgehammer for Windows?

Where do I find the specification for fixed-width fonts (is
wcswidth() the core?) and how do I select the set of fonts to use?  Do I
need to use fontconfig where available?

Richard.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-07 Thread Philippe Verdy via Unicode
Adding a single bit of protection in cell attributes to indicate they are
either protected or become transparent (and the rest of the
attributes/character field indicates the id of another terminal grid or
rendering plugin creating its own layer and having its own scrolling state
and dimensions) can allow convenient things, including the possibility of
managing a grid-based system of stackable windows.
You can design one of the layers to allow input (managed directly in the
terminal, with local echo without transmission delays and without risks of
overwriting surrounding contents).
Asynchronous behavior can be defined as well between the remote
application/OS and the local processing in the terminal.
The protocol can also support an extension to provide alternate streams
(take an example on MIME multipart). This can even be used to transport the
inputs and outputs for each layer, and additional streams to support
(java)scripts, or the content of an image, or a link to a video stream.
And just like with classic graphical interfaces, you can have more than just
solid RGB colors and add an alpha layer. The single-rectangular-flat grid
design is not the only option. Layered approaches can then even be rendered
on hardware easily by mapping these virtual layers and flattening them
internally in the terminal emulator to the single flat grid supported by
the hardware. The result is more or less equivalent to graphic RGB frames,
except that the unit is not a single pixel but a whole cell with not just
one color but a pair of colors and an encoded character and a font selected
for that cell, or if a single font is supported, using a dynamic font and
storing glyph ids in that font (prescaled for the cell size). The hardware
then does the rest to build the pixels of the frame, but it can be easily
accelerated.
The layered approach could also be used to link together the cells that
use the same script and font settings, in order to use proportional fonts
when monospaced fonts are not usable, and justify their text in the field
(which may turn to be scrollable itself when needed for input). Having
multiple communication streams between the terminal emulator and the remote
application allows the application to query the properties and behave in a
smarter way than with just static "termcaps" not taking into account the
actual state of the remote terminal.
All this requires some extension to VT-like protocols (using specific
escape sequences, just like with the Xterm extensions for X11).

You can also reconsider how "old" mainframe terminals worked: the user in
fact never submitted characters one by one to the remote application. The
application sent a full screen and an input form, and the user could fill
in the form on the terminal and press a "submit/send" button when he
had finished inputting the data. While the user was inputting data, there
was absolutely no need to communicate each typed keystroke to the
application; everything was handled by the terminal itself, which was
instructed accordingly (and could even perform form data validation with
input formats and some conditions, possibly with a script as well). In
other words, they worked mostly like an HTML input form with a submit button.

Such a mode is very useful for small devices because they don't have to react
interactively to the user: the transmission delays (which may be long)
are no longer a problem, the user can enter and correct data easily, and the
editing facilities don't need to be handled by the remote application
(which today could be a very tiny device with in fact much less processing
power than the terminal emulator, and would have in fact no knowledge at
all of the fonts needed). A terminal emulator can do a lot of things
itself and locally. This would also be useful on many modern
application servers that need to serve lots of remote clients, possibly over
very slow internet links with long roundtrip times.

The idea behind this is to allow distributing the workload and deciding
which side will handle part or all of the I/O. Of course it will transport
text (preferably in a Unicode UTF), but text is not the only content to
transport. There are also audio/video/images, security items (certificates,
signatures, personal data that should remain private and be encrypted, or
only sent to the application in a one-way-hashed form), plus some
states/flags that could provide visual/audio hints to the user when working
in the rendered input/output form with his local terminal emulator.

I spoke about HTML because terminal-based browsers have existed for a long
time, and some of them are still maintained in 2019 (w3m, still used as a
W3C-sponsored demo; Lynx, the best known on Linux; or elinks):
  https://www.slant.co/topics/4702/~web-browsers-that-run-in-a-terminal
This gives a good idea of what is needed, what a good terminal protocol can
do, and what the many legacy VT-like protocol variants have never tried to
unify. These browsers don't reinvent the wheel: HTML 

Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-07 Thread Egmont Koblinger via Unicode
Hi Philippe,

> I have never said anything about your work because I don't know where you 
> spoke about it or where you made some proposals. I must have missed one of 
> your messages (did it reach this list?).

This entire conversation started by me announcing here my work, aiming
to bring usable BiDi to terminal emulators.

> Terminals are not displaying plain text, they create their own upper layer 
> protocol which requires and enforces the 2D layout [...] Bidi does not 
> specify the 2D layout completely, it is purely 1D and speaks about left and 
> right direction

That's one of the reasons why it's not as simple as "let's just run
the UBA inside the terminal", one of the reasons why gluing the two
worlds together requires a substantial amount of design work.

> For now terminal protocols, and emulators trying to implement them; that must 
> mix the desynchronized input and output (especially when they have to do 
> "local echo" of the input [...]

I assume by "local echo" you're talking about the Send/Receive Mode
(SRM) of terminals, and not the "stty echo" line discipline setting of
the kernel, because as far as the terminal emulator is concerned, the
kernel is already remote, and it's utterly irrelevant for us whether
it's the kernel or the application sending back the character.

SRM is only supported by a few terminal emulators, and we're about to
drop it from VTE, too (https://gitlab.gnome.org/GNOME/vte/issues/69).

> If you look at historic "terminal" protocols,

I'm mostly interested in the present and future. In the past, only for
curiosity, and to the extent necessary to understand the present and
to plan for the future.

> Some older terminal protocols for mainframes notably were better than today's 
> VT-like protocols: you did not transmit just what would be displayed, but you 
> also described the screen area where user input is allowed and the position 
> of fields and navigation between them:

This is not seen in today's graphical terminal emulators.

> Today these links are better used with real protocols made for 2D and 
> allowing a web application to manage the input with a presentation layer (HTML)
> and with javascript helpers (that avoid the roundtrip time).

Sure, if you need another tool, let's say a dynamic webpage in your
browser, rather than a terminal emulator to perform your tasks
effectively, so be it. I'm not claiming terminal emulators are great
for everything, I'm not claiming terminal emulators should be used for
everything.

> But basic text terminals have never evolved and have lagged behind today's 
> need.

I disagree with the former part. There are quite a few terminal
emulators out there, and many have added plenty of new great features
recently.

Whether they're up to today's needs, depends on what your needs are.
If you need something utterly different, go ahead and use whatever
that is, such as maybe a web browser. If you're good with terminals,
that's fine too. And there's a slim area where terminal emulators are
mostly good for you, you'd just need a tiny little bit more from them.
And maybe for some people this tiny little bit more happens to be
BiDi.

> Most of them were never tested for internationalization needs:

Terminal emulators weren't created with internationalization in mind.
I18n goals are added one by one. Nowadays combining accents and CJK
are supported by most emulators. Time to stretch it further with BiDi,
shaping, spacing combining marks for Devanagari, etc.

> [...] delimit input fields in input forms for mainframes, something that was 
> completely forgotten and remains forgotten today with today's VT-* protocols, 
> to indicate which side of the communication link controls the content of
> specific areas

Something that was completely forgotten, probably for good reasons,
and I don't see why it should be brought back.

> As well today's VT-* protocols have no possibility to be scriptable: 
> implementing a way to transport fragments of javascripts would be fine.

I have absolutely no incentive to work in this direction.

> Text-only terminals are now aging but no longer needed for user-friendly 
> interaction, they are used for technical needs where the only need is to be 
> able to render static documents without interacting with it, except
> scrolling it down, and only if they provide help in the user's language.

Text-only terminals are no longer needed??? Well, strictly speaking,
computers aren't needed either, people lived absolutely fine lives
before they were invented :)

If you get to do some work, depending on the kind of work, terminal
emulators may or may not be a necessary or a useful tool for you. For
certain tasks you don't really have anything else, or at least
terminals are way more effective than other approaches. For other
tasks (e.g. text editing) it's mostly a matter of taste whether you
use a terminal or a graphical app. For yet other tasks, terminal
emulators take you nowhere.

My work aims to bring BiDi into 

Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Richard Wordingham via Unicode
On Thu, 07 Feb 2019 22:00:20 +0200
Eli Zaretskii via Unicode  wrote:

> > From: Egmont Koblinger 
> > Date: Thu, 7 Feb 2019 19:01:33 +0100

> > On Thu, Feb 7, 2019 at 6:53 PM Eli Zaretskii  wrote:

> > > No, it needs no interaction.  Unless the regexp doesn't work for
> > > you, which you should then report as a bug in Emacs.  

> > Do you mean you aim to maintain a regex that matches everyone's
> > prompt in the world, without a significant amount of false positive
> > matches on non-prompt lines?  

> Yes.

Wow!  You'll do well to match a prompt such as '2p ', which I used for
a while.

Richard.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-07 Thread Philippe Verdy via Unicode
Le jeu. 7 févr. 2019 à 19:38, Egmont Koblinger  a écrit :

> As you can see from previous discussions, there's a whole lot of
> confusion about the terminology.


And it was exactly the subject of my first message sent to this thread!
You probably missed it.


> Philippe, with all due respect, I have the feeling that you have some
> fundamental problems with my work (and I'm tempted to ask back: have
> you read it at all?), but your message what your problem is just
> doesn't come across to me. Could you please avoid all those irrelevant
> stories with baud rate and font size and Asian scripts and whatnot,
> and clearly get to your point?
>

I have never said anything about your work because I don't know where you
spoke about it or where you made some proposals. I must have missed one of
your messages (did it reach this list?). So don't take that as a personal
attack: this only started with a reply I made, the one specifically
about the various ambiguities of encoded newlines in terminal
protocols. These do not match the basic plain text definition (similar to
MIME), which was made only for static documents and never tuned for
interactive bidirectional use. Such use includes, for example, text editors,
which also require a model of the 2D layout and make some assumptions about
"characters" visible in a single cell of a regularly spaced grid, with a
known number of lines and columns, independent of the lines of the text
rendered and read on it.

Terminals are not displaying plain text, they create their own upper layer
protocol which requires and enforces the 2D layout (whereas Unicode is a
purely linear protocol with only relations between one character and the
next one in a 1D stream, and no assumption at all about their display
width, which cannot be monospaced in all scripts and are definitely not
encoded in logical order). Try adding characters at the end of a logical line:
with a Bidi text you do not just replace the content of one cell, you have
to scroll the content of surrounding cells, and your input caret position
does not necessarily change, or you'll reach a point where a visual line
will be split in two parts, but not at the rest position, and some parts
are moved up or down.

Bidi does not specify the 2D layout completely, it is purely 1D and speaks
about left and right direction and does not specify what happens when
contents do not fit on the visual line for the text which is already
present there before inserting new text or even what will be replaced if
you are in replace mode and not in insert mode: The Bidi algorithm is not
designed to handle overwrites, and neither is the Unicode standard as a whole,
which is written as if all text was inserted only at the end of lines and
never replaced anything.

For now, terminal protocols and the emulators trying to implement them must
mix desynchronized input and output, and they have no easy way to do
that. This is especially so when they have to do "local echo" of the input for
performance reasons over slow serial links, where there's no synchronization
between the local buffer of the terminal and the remote virtual buffer of the
terminal emulator in the emitting app, even for those using the best "termcap"
definitions. The logical encoding of Unicode does not play well with this, and
the time to resynchronize the local and remote buffers is a limiting factor.
Over a 9.6kbps link, refreshing the whole screen takes too long and cannot
be done on every keystroke of input: either user input would have to be
dramatically slow if local echoing is also enabled, or most user inputs
that are too fast would have to be discarded, which makes user input
very unreliable and requires constant correction. These protocols are
definitely not human-friendly, as they depend on strict timing, which is not
the way humans enter text. This timing is also unpredictable and very
variable over serial links, and the protocols do not have any specification
for timing requirements. In fact time is constantly ignored, even if it
plays an evident role.

If you look at historic "terminal" protocols, techniques were used to control
time: notably the XON/XOFF protocols, or mechanical constraints, especially
when the output was a printer (with a daisywheel or matrix head). But timing
was just a control between one machine and another; a human could not really
interact asynchronously. And it was at a time when full-screen text
editors did not even exist (at most, users were typing "on the flow" and text
layout was completely forgotten). This changed radically when the output
became a screen, with the assumption that the output was instantaneous, but
the mechanical restrictions were removed.

Some older terminal protocols for mainframes notably were better than
today's VT-like protocols: you did not transmit just what would be
displayed, but you also described the screen area where user input is
allowed and the position of fields and navigation between them: the
terminal had then no difficulty to avoid breaking the output when 

Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Thu, 7 Feb 2019 19:01:33 +0100
> Cc: Richard Wordingham , 
>   unicode Unicode Discussion 
> 
> On Thu, Feb 7, 2019 at 6:53 PM Eli Zaretskii  wrote:
> 
> > No, it needs no interaction.  Unless the regexp doesn't work for you,
> > which you should then report as a bug in Emacs.
> 
> Do you mean you aim to maintain a regex that matches everyone's prompt
> in the world, without a significant amount of false positive matches
> on non-prompt lines?

Yes.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-07 Thread Egmont Koblinger via Unicode
Hi Philippe,

On Thu, Feb 7, 2019 at 3:21 PM Philippe Verdy  wrote:

> "Rules" are not formally written, they are just a sense of best practices.

When it comes to BiDi in terminals, I haven't seen anything that I
consider reasonably okay, let alone "best practice". It's a mess.
That's why I decided to come up with something.

> Bidi plays very badly on terminals

Agreed. There are essentially two ways from here: just leave it as bad
as it is (or even see various terminal emulators coming up with not
well-thought-out hacks that just make it even worse) or try to
improve. I picked the latter.

> [...] refreshing a typical 80x25 screen takes about one half second, which is 
> much longer than typical user input, so full screen refresh does not work for 
> data input and editing, and terminals implement themselves the echo of user 
> input, ignoring how and when the receiving application will handle the input, 
> and also ignoring if the application is already sending output to the terminal.

I'm really unsure where you're trying to get with it.

For one, adding BiDi doesn't introduce the need for significantly
larger updates. Whenever a partial repaint of the screen was
sufficient, even with BiDi in the game it will remain sufficient.

Another thing: I'm not sure that 9.6kbps is a bottleneck to worry
about. It's present if you connect to a device via serial port, but
will you really do this in combination with BiDi? The use case I much
more have in mind is running a terminal emulator locally, or ssh'ing
to a remote machine, for getting various kinds of productive work
done (e.g. writing a text file in someone's native RTL script in a
text editor). These are orders of magnitude faster.

> It's hard or impossible to synchronize this and local echoes on the terminal
> causes havoc.

If input mixes with output (e.g. you press some keys while you're
waiting for make/gcc to compile your app, and these letters appear
onscreen), the visual result is broken even without BiDi. I cannot
eliminate this kind of breakage by introducing BiDi, nor can I build up
something from scratch that somewhat resembles the current terminal
emulator world but fixes all of its oddities.

> But the concept of "line" or "paragraph" in a terminal protocols is extremely 
> fuzzy. It's then very difficult to take into account the additional Bidi
> constraints as it's impossible to conciliate BOTH the logical ordering (what
> is encoded in the transmitted data or kept in history buffers) and the visual 
> ordering.

I don't try to conciliate logical and visual ordering within the same
paragraph, I agree it's impossible, it's a semantical nonsense. But I
try to conciliate them in the sense that sometimes the visual order is
the desired one, sometimes the logical order, so let's make it
possible to use one for one paragraph, and the other one for another
paragraph.

> That's why there are terminal protocols that absolutely don't want to play 
> with the logical ordering and require all their data to be transmitted in 
> visual order (in which case, there's no bidi handling at all).

This is one of the modes in my recommendation. If your application
requires this mode (as e.g. Emacs does), use this mode and you're
good.

> In fact most terminal protocols are very defective and were never designed to
> handle Bidi input

Maybe it's high time someone fixed this defect, then? :)

> And here your unit (logical lines) is not even defined in the terminal
> protocol and not known from the emitting application, which has no input
> about the final output terminal properties. So the terminal must perform
> guesses. As it can insert additional linebreaks itself, and scroll out some
> portion of it, there's no way to delimit the effect of "bidi controls". The
> basic requirement for correctly handling bidi controls is to make sure that
> paragraph delimitations are known and stable. If additional breaks can occur
> anywhere on what you think is a "logical line" but which is different from
> the emitting application (or static text document which is output "as is"
> without any change to reformat it), these bidi controls just make things worse
> and it becomes impossible to make reasonable guesses about paragraph
> delimitations in the terminal. The result becomes unpredictable and most often
> will not even make any sense as the terminal uses visual ordering always but
> loses track of the logical ordering (and things get worse when there are
> complex clusters or characters that cannot even fit in a monospaced grid).

If an exact definition of hard vs. soft wrapped lines is what you miss
from the specification, okay, I'll add it to a future version.

I don't know how terminals performing guesses occurred to you, they
sure don't (as for hard vs. soft newlines).

> The basic requirement for correctly handling bidi controls is to make sure 
> that paragraph delimitations are known and stable.

Since we're talking about bidi controls being emitted, 

Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Egmont Koblinger via Unicode
On Thu, Feb 7, 2019 at 6:53 PM Eli Zaretskii  wrote:

> No, it needs no interaction.  Unless the regexp doesn't work for you,
> which you should then report as a bug in Emacs.

Do you mean you aim to maintain a regex that matches everyone's prompt
in the world, without a significant amount of false positive matches
on non-prompt lines?

(It's getting damn off-topic though.)


e.


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Thu, 7 Feb 2019 18:20:02 +0100
> Cc: Richard Wordingham , 
>   unicode Unicode Discussion 
> 
> > It uses a regular expression, see term-prompt-regexp.
> 
> So, it's not automatic, needs user interaction

No, it needs no interaction.  Unless the regexp doesn't work for you,
which you should then report as a bug in Emacs.


Re: Bidi paragraph direction in terminal emulators

2019-02-07 Thread Egmont Koblinger via Unicode
On Thu, Feb 7, 2019 at 6:33 PM Eli Zaretskii  wrote:

> Well, let's just say that Emacs uses the HL1 rule, and determines the
> base direction for the entire chunk of text between empty lines.

Exactly!

Now it's my turn to figure out how to add this behavior to terminals,
preferably stopping before/after prompts too.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators

2019-02-07 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Thu, 7 Feb 2019 18:12:37 +0100
> Cc: Richard Wordingham , 
>   unicode Unicode Discussion 
> 
> I believe it's not my mental model that's weird, but your use of
> terminology that doesn't match UBA's that confused me.

Well, let's just say that Emacs uses the HL1 rule, and determines the
base direction for the entire chunk of text between empty lines.
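
The behavior described here (one base direction for each emptyline-delimited chunk, applied to every line inside it) can be sketched as follows. This is only a minimal illustration, not Emacs's code: the function names are made up for the example, and the first-strong detection is a simplified form of UBA rules P2/P3 that ignores isolates.

```python
import unicodedata


def first_strong_direction(text, default="ltr"):
    """Simplified UBA P2/P3: the first strong character decides the direction.

    Assumption for this sketch: characters inside isolates are not skipped.
    """
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default


def directions_by_block(text):
    """Detect the direction once per emptyline-delimited block.

    Every line of a block gets the block's direction, even a line that
    contains only Latin letters.
    """
    result = []
    for block in text.split("\n\n"):
        direction = first_strong_direction(block)
        for line in block.split("\n"):
            result.append((direction, line))
    return result


# The second block starts with Hebrew, so its Latin-only line is RTL too.
sample = "abc \u05d0\u05d1\nxyz\n\n\u05d0\u05d1 abc\nhello"
for direction, line in directions_by_block(sample):
    print(direction, repr(line))
```

Each line is still what the UBA would call a "paragraph" for reordering purposes; only the direction is shared across the block.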


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Egmont Koblinger via Unicode
Hi,

On Thu, Feb 7, 2019 at 3:27 PM Eli Zaretskii  wrote:

> It uses a regular expression, see term-prompt-regexp.

So, it's not automatic, needs user interaction, and for that reason,
may not have worked for me. (I have other weird things in my prompt,
like 256-color sequences that Emacs didn't recognize; perhaps this
made the regexp matching fail. Never mind.)

> > Whatever it does to know where the prompt is, can it be made into a
> > standard, cross-terminal feature?
>
> Not sure.  It's a kind of heuristic, which is why the regexp is
> customizable on user level, so that users could adapt it to their
> needs, should that be necessary.

iTerm2 has a "shell integration" where the prompt contains explicit
markers so that no heuristics or user configuration is needed from the
terminal. We're trying to somewhat standardize it at
https://gitlab.freedesktop.org/terminal-wg/specifications/issues/4 and
get more terminals to support it. Not sure where this attempt will take
us, we'll see.

> In what version of Emacs is that?  In the latest version 26 I have
> here, the tutorial displays with most paragraphs in RTL direction.

25.2 here; it might obviously have changed in a newer version, glad to hear it.

My distro will upgrade in about 2 months. Since I'm not an Emacs user
myself, I hope you don't mind if I don't make extra rounds in
upgrading now to verify this.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators

2019-02-07 Thread Egmont Koblinger via Unicode
On Thu, Feb 7, 2019 at 3:14 PM Eli Zaretskii  wrote:

> Not a bug, a feature.  Emacs doesn't remove the bidi controls from
> display (that's another deviation allowed by the UBA, see section
> 5.2).  On GUI displays, these controls are displayed as thin 1-pixel
> spaces, but on text-mode terminals they are shown as space.

Thanks for the clarification!

> Why?  As I said, the tutorial was written in part to demonstrate the
> UBA implementation, including the dynamic detection of base paragraph
> direction, and this is exactly one example of how it works in
> practice.

Fair enough, then.

> > To recap: The _paragraph direction_ is determined in Emacs for
> > emptyline-delimited segments of data, which I honestly find a great
> > thing, and would love to do in terminals too, alas at this point it's
> > blocked by some really nontrivial technical issues. But once you have
> > decided on a direction, each _line_ within that data is passed
> > separately to the BiDi algorithm to get reshuffled
>
> Yes and no.  You could keep your mental model if you like, but
> actually the UBA explicitly says that each line is to be reordered for
> display separately, see section 3.4 of UAX#9.

The very first step of the BiDi algorithm is to split at "paragraphs",
however that's defined, and then do the rest for each paragraph.

For one particular paragraph, there's a lot going on: determining
embedded levels and such. At one point, at the very beginning of 3.4,
a caller may split a paragraph into lines. Then the rest (actual
reordering) happens on lines.

This is _not_ the same as splitting into lines upfront (that is,
define UBA's "paragraphs" as the text file's "lines"), and then
determining embedded levels and reshuffling on these smaller units.

Emacs does the latter, and so does my specification.

I believe it's not my mental model that's weird, but your use of
terminology that doesn't match UBA's that confused me. It's pretty
confusing and obviously hard to use the proper terminology, since
Emacs's definition and the user-perceived notion of a "paragraph"
differ from what becomes a "paragraph" according to UBA's definition.

Both in Emacs and in my spec, a "line" of the text file maps to a
"paragraph" according to UBA's phrasing. (Except when determining the
paragraph direction, where Emacs uses its own human-perceived
emptyline-separated paragraph, rather than lines. Which is a nice
thing to do.)

Anyways, I'm glad it turned out we're on the same page, it's just the
terminology that's truly confusing.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Eli Zaretskii via Unicode
> Date: Thu, 7 Feb 2019 00:45:55 +0100
> Cc: unicode Unicode Discussion 
> From: Egmont Koblinger via Unicode 
> 
> > Not necessarily.  One could allow the first strong character in the
> > prompt to determine the paragraph directions
> 
> How does Emacs know what's a prompt? How can it tell it from the
> previous and next command's output?

It uses a regular expression, see term-prompt-regexp.

> Whatever it does to know where the prompt is, can it be made into a
> standard, cross-terminal feature?

Not sure.  It's a kind of heuristic, which is why the regexp is
customizable on user level, so that users could adapt it to their
needs, should that be necessary.
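
To make the heuristic nature concrete, here is a minimal sketch of regexp-based prompt detection. The pattern below is an assumption chosen for illustration (some text without prompt characters, then one of # $ % >, then a space); it is not claimed to be the value Emacs ships for term-prompt-regexp, and looks_like_prompt is a hypothetical helper.

```python
import re

# Hypothetical prompt pattern for this sketch only.
PROMPT_RE = re.compile(r"^[^#$%>\n]*[#$%>] ")


def looks_like_prompt(line):
    """Return True if the line starts with something shaped like a prompt."""
    return PROMPT_RE.match(line) is not None


for line in (
    "egmont@host:~$ cat TUTORIAL.he",  # a real prompt: matches
    "plain output line",               # ordinary output: no match
    "> quoted mail text",              # ordinary output, but it matches too
):
    print(looks_like_prompt(line), repr(line))
```

The third line shows why such a pattern has to stay user-customizable: any output that happens to be shaped like a prompt is matched as well.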

> > That's what the Emacs
> > terminal (invoked by M-x term; top level definition in term.el) does.
> 
> I tried it. Executed my default shell, and inside that, a "cat
> TUTORIAL.he". All the paragraphs are rendered as LTR ones,
> left-aligned. Not the way the file is opened in Emacs.

In what version of Emacs is that?  In the latest version 26 I have
here, the tutorial displays with most paragraphs in RTL direction.

> If you claim Emacs's built-in terminal emulator supports BiDi, I'm
> kindly asking you to present a documentation of its behavior, in
> similar spirit to my BiDi proposal.

The Emacs terminal emulator displays text as any other text in any
other Emacs buffer, so it supports the same bidi reordering as
elsewhere.  You could make it emulate other terminals by setting the
variable bidi-paragraph-direction to either left-to-right or
right-to-left, then all the paragraphs will have the base direction
you specify.  But the default value of this variable in term buffers
is nil, which invokes dynamic determination of base paragraph
direction.


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Eli Zaretskii via Unicode
> Date: Wed, 6 Feb 2019 23:32:43 +
> From: Richard Wordingham via Unicode 
> 
> > You define paragraphs as emptyline-separated blocks on which you
> > perform autodetection of the paragraph direction. This is great! As
> > I've mentioned, I'd love to have such a mode in terminals, but it's
> > subject to underlying improvements, like knowing when a prompt starts
> > and ends, because prompts also have to be paragraph delimiters.
> 
> Not necessarily.  One could allow the first strong character in the
> prompt to determine the paragraph directions.  That's what the Emacs
> terminal (invoked by M-x term; top level definition in term.el) does.

Emacs's built-in terminal emulator does that only because no one
bothered to do something about this behavior.  I personally don't
consider this the correct behavior (but then I don't use M-x term in
Emacs except for testing).  Emacs does know where the prompt is, so it
could implement the rule that whatever follows the prompt starts a new
paragraph.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-07 Thread Philippe Verdy via Unicode
Le jeu. 7 févr. 2019 à 13:29, Egmont Koblinger  a écrit :

> Hi Philippe,
>
> > There's some rules for correct display including with Bidi:
>
> In what sense are these "rules"? Where are these written, in what kind
> of specification or existing practice?
>

"Rules" are not formally written, they are just a sense of best practices.
Bidi plays very badly on terminals, even enhanced terminals like VT-* or
ANSI that expose capabilities. Most of the time these capabilities are not
even accessible: further modifications of the terminal properties (notably
its display size) can never be taken into account, because it is too late.
The output has already been generated, and all the terminal can do is play
with what is in its history buffers. Even on dual-channel protocols (input
and output), terminal protocols do not synchronize the input and the
output, and these asynchronous channels ignore the transmission time
between the terminal and the aware application, so the terminal protocol
must include a function that allows flushing and redrawing the screen
completely (but this requires long delays). With a common 9.6kbps serial
link, refreshing a typical 80x25 screen takes about one half second, which
is much longer than typical user input, so full screen refresh does not
work for data input and editing, and terminals implement the echo of user
input themselves, ignoring how and when the receiving application will
handle the input, and also ignoring whether the application is already
sending output to the terminal.
It's hard or impossible to synchronize this, and local echo on the
terminal causes havoc.
I've not seen any way for a terminal to handle all these constraints. So
the only way is to support only basic plain-text documents, formatted
reasonably, with layout "hints" inserted in the format of the output so
that the terminal can perform reasonable guesses and adapt.
But the concept of "line" or "paragraph" in a terminal protocol is
extremely fuzzy. It's then very difficult to take into account the
additional Bidi constraints, as it's impossible to reconcile BOTH the
logical ordering (what is encoded in the transmitted data or kept in history
buffers) and the visual ordering. That's why there are terminal protocols
that absolutely don't want to play with the logical ordering and require
all their data to be transmitted in visual order (in which case, there's no
bidi handling at all). Then terminals will attempt to reconcile the visual
line delimitations (in the transmitted data) with the local-only
capabilities of the rendered frame. Many terminals will also not allow
changing the display width, will not allow changing the display cell size,
will force constraints on cell sizes and fonts, and then won't be able to
correctly output many Asian scripts.
In fact most terminal protocols are very defective and were never designed
to handle Bidi input, or Asian scripts with complex clusters and the
variable fonts that are needed for them (even CJK scripts, which use a mix
of "half-width" and "full-width" characters).

> - Separate paragraphs that need a different default Bidi by double
> newlines (to force a hard break)
>
> There is currently no terminal emulator I'm aware of that uses empty
> lines as boundaries of BiDi treatment.
>

These are hints in the absence of something else, and they play a role when
the terminal display width is unpredictable by the application making the
output and having no access to any return input channel.
Take the example of terminal emulators in resizable windows: the display
width is undefined, but there's no document level and no buffering,
scrolling text will flush the output partially, and history is limited. A
terminal emulator then needs hints about where paragraphs are delimited, and
most often doesn't have any other distinction available, even in its
limited history, that allows distinguishing the 3 main kinds of line breaks.


> While my recommendation uses a one smaller unit (logical lines), and I
>

And here your unit (logical lines) is not even defined in the terminal
protocol and not known to the emitting applications, which have no input
about the final output terminal properties. So the terminal must perform
guesses. As it can insert additional line breaks itself, and scroll out some
portion of it, there's no way to delimit the effect of "bidi controls". The
basic requirement for correctly handling bidi controls is to make sure that
paragraph delimitations are known and stable. If additional breaks can
occur anywhere on what you think is a "logical line" but which differs
from what the emitting application produced (or from a static text document
that is output "as is" without any reformatting), these bidi controls just
make things worse and it becomes impossible to make reasonable guesses about
paragraph delimitations in the terminal. The result becomes unpredictable
and most often will not even make any sense as the terminal uses visual
ordering

Re: Bidi paragraph direction in terminal emulators

2019-02-07 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Wed, 6 Feb 2019 22:01:59 +0100
> Cc: Richard Wordingham , unicode@unicode.org
> 
> - Emacs running in a terminal shows an underscore wherever there's a
> BiDi control in the source file – while the graphical one doesn't.
> This looks like a simple bug to me, right?

Not a bug, a feature.  Emacs doesn't remove the bidi controls from
display (that's another deviation allowed by the UBA, see section
5.2).  On GUI displays, these controls are displayed as thin 1-pixel
spaces, but on text-mode terminals they are shown as space.  The
underscore you see is a special typeface used to indicate that this is
not really a space.  (This is the default; Emacs being Emacs, it
allows to customize how these characters are displayed, and in
particular not to display them at all.)

> - Line 1007, the copyright line of this file uses visual indentation,
> and Emacs detects LTR paragraph for that line. I think it should
> rather use BiDi controls to have an overall RTL paragraph direction
> detected, and within that BiDi controls to force LTR for the text.

Why?  As I said, the tutorial was written in part to demonstrate the
UBA implementation, including the dynamic detection of base paragraph
direction, and this is exactly one example of how it works in
practice.

> To recap: The _paragraph direction_ is determined in Emacs for
> emptyline-delimited segments of data, which I honestly find a great
> thing, and would love to do in terminals too, alas at this point it's
> blocked by some really nontrivial technical issues. But once you have
> decided on a direction, each _line_ within that data is passed
> separately to the BiDi algorithm to get reshuffled

Yes and no.  You could keep your mental model if you like, but
actually the UBA explicitly says that each line is to be reordered for
display separately, see section 3.4 of UAX#9.

> Let's make a thought experiment. Let's assume that for running the
> BiDi algorithm, we'd still stick to the emptyline-delimited paragraph
> definition. This is not what you do, this is not what I do, but I
> misunderstood that this is what you did, and I also thought this was a
> good idea as a potential extension for the BiDi specs – I no longer
> think so. This definition is truly problematic, as I'll show below.

Which is why it is not what the UBA says one should do.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-07 Thread Egmont Koblinger via Unicode
Hi Philippe,

> There's some rules for correct display including with Bidi:

In what sense are these "rules"? Where are these written, in what kind
of specification or existing practice?

> - Separate paragraphs that need a different default Bidi by double newlines 
> (to force a hard break)

There is currently no terminal emulator I'm aware of that uses empty
lines as boundaries of BiDi treatment.

While my recommendation uses a unit one step smaller (logical lines), and I
understand as per Eli's request that it would be desirable to go with
emptyline-delimited boundaries, what in fact all the current
self-proclaimed BiDi-aware terminal emulators that I came across do is
use a unit two steps smaller than yours: they do BiDi on physical
lines of the terminal, no matter how a logical line of the output had
to wrap into physical ones because it didn't fit in the width. (It's a
terrible behavior.)
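
A small sketch of why the choice of unit matters. It assumes one character per cell and, purely for illustration, a terminal that auto-detects the direction from the first strong character of whatever unit it runs BiDi on; the helper names are made up for the example.

```python
import unicodedata


def first_strong_direction(text, default="ltr"):
    """Simplified first-strong detection (UBA P2/P3 without isolate handling)."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "ltr"
        if bidi_class in ("R", "AL"):
            return "rtl"
    return default


def wrap(line, width):
    """Cut a logical line into physical rows of at most `width` cells."""
    return [line[i:i + width] for i in range(0, len(line), width)] or [""]


# One logical line: a Hebrew word followed by Latin text.
logical = "\u05e9\u05dc\u05d5\u05dd abc def"
rows = wrap(logical, 6)

per_row = [first_strong_direction(row) for row in rows]
per_logical = first_strong_direction(logical)

print(per_row)      # direction decided separately for each physical row
print(per_logical)  # direction decided once for the whole logical line
```

With per-row treatment the second row, which happens to start with Latin letters, ends up with a different direction from the first row of the same logical line, and the outcome changes whenever the window is resized.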

The current behavior of terminal emulators is very far from what you describe.

> - use a single newline on continuation

Continuation of what exactly?

But let's take a step back: Should the output be pre-formatted by some
means, or do we rely on the terminal emulator wrapping overlong lines?
(If pre-formatted then for what width? 80 columns, so that I waste
precious real estate if my terminal is wider? Or is it a requirement
for any app that produces output to implement a decent dynamic
wrapping engine for nice formatting according to the actual width?)

There's precedent for both of these different approaches. I don't
think it's feasible to pick one, and claim that the other approach is
discouraged/invalid/whatever.

> - if technical items are untranslatable, make sure they are at the beginning 
> of lines and indented by some leading spaces, before translated ones.

I firmly disagree. There shouldn't be any restriction on how a
translator wishes to translate a sentence. The computer world has to
adapt to the requirements of human languages, not the other way
around!

> - Don't use any Bidi control !

Why not? They exist for a reason: so that any logical translation
that a translator might want to write (see my previous point) can be
presented in a visually correct way. Use them for that, whenever needed.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-07 Thread Richard Wordingham via Unicode
On Thu, 7 Feb 2019 00:45:55 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Richard,
> 
> > Not necessarily.  One could allow the first strong character in the
> > prompt to determine the paragraph directions  
> 
> How does Emacs know what's a prompt? How can it tell it from the
> previous and next command's output?

I don't believe the Emacs terminal does either.  What's special about
the prompt is that it starts a line, so most paragraphs start with a
prompt.  Not all prompts contain a strong character.  To let a file's
contents control directionality, instead of issuing the command 'cat
file1' one would have to issue a shell command '(echo; cat file1)' or
similar to terminate the paragraph containing the prompt.  The 'echo'
inserts an empty line.

> > That's what the Emacs
> > terminal (invoked by M-x term; top level definition in term.el)
> > does.  
> 
> I tried it. Executed my default shell, and inside that, a "cat
> TUTORIAL.he". All the paragraphs are rendered as LTR ones,
> left-aligned. Not the way the file is opened in Emacs.

See above.  I don't know what your shell is.

> If you claim Emacs's built-in terminal emulator supports BiDi, I'm
> kindly asking you to present a documentation of its behavior, in
> similar spirit to my BiDi proposal.

I've a feeling it has emergent behaviour, and may require a lot of
experimentation to elucidate.

> Does this logic also apply to single newline characters? If not, why
> not, what's the conceptual difference? If it does, why do text files
> end in a newline?

I don't like the convention that removing the newline from the end of a
non-empty line changes it into a binary file.  The short answer is that
some editors allow a text file not to have a final newline; such files
are not handled well in the Unix environment.

Some things are just untidy messes.  Compare C, where a semicolon
*terminates* statements, but some are terminated by '}', and a
semicolon *separates* the expression within the control part of a for
statement, and a comma *separates* the constant definitions in an enum
declaration - for a long time, a trailing comma inside the braces was
illegal.

Richard. 


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-06 Thread Philippe Verdy via Unicode
I read your email, you spoke for example about how a typical Unix/Linux
tool shows its usage option (e.g. "anycommand --help") with a leading line
then syntaxes and tabulated lists of options followed by translated help on
the same line.

There's some rules for correct display including with Bidi:

- Separate paragraphs that need a different default Bidi by double newlines
(to force a hard break)
- use a single newline on continuation
- if technical items are untranslatable, make sure they are at the beginning
of lines and indented by some leading spaces, before translated ones.
- avoid breaking lists
- try to separate as much as possible text in natural languages from
technical texts.
- Be careful about correct usage of leading punctuation (notably for list
items)
- Be consistent about indentation
- Normalize spaces,
- Don't assume that TAB controls have the same width (ban TABS except at
the beginning of lines)
- In column output, separate columns always with at least two spaces, don't
glue them as if they were sentences.
- Don't use "soft line breaks" in the middle of short lines (less than 72
base characters)
- Don't use any Bidi control !

With some care, you can perfectly translate Linux/Unix tools into languages
needing Bidi and get consistent output, but be careful if your text
contains placeholders or technical untranslated terms (make sure to
surround them with paired punctuation, or don't translate them at all). And
avoid paragraphs that would mix natural and technical untranslatable terms
(such as command names or command-line options).

Make sure to test the output so that it will also work with variable fonts
(don't assume monospaced fonts are used, they do not exist for various
scripts and don't work reliably for Arabic and most Asian scripts, and not
even for Chinese or Japanese even if these don't need Bidi support).

But the difficulty is not really in the terminal emulators but in the
source texts given to translators, when they don't know the context in
which the text will be used and have no hint about which terms should not
be translated, because the result can become inconsistent. There are many
examples, even in Windows 10, where some of the command line tools are
completely unusable with the translated UI, with examples of syntax
that do not even work because some terms were randomly and inconsistently
translated or confused, or because tools assumed an LTR-only layout of the
output and monospaced fonts with one character per display cell, or
required specific fonts that do not contain the characters in their
monospaced variants. This is challenging notably for Asian scripts needing
complex clusters if you made these Latin-based assumptions.


Le mer. 6 févr. 2019 à 22:30, Egmont Koblinger  a écrit :

> Hi Philippe,
>
> Thanks a lot for your input!
>
> Another fundamental difficulty with terminal emulators is: These
> controls (CR, LF...) are control instructions that move the cursor in
> some ways, and then are forgotten. You cannot do BiDi on the
> instructions the terminal receives. You can only do BiDi on the
> result, the contents of the canvas after these instructions are
> executed. Here these controls are either lost, or you have to give a
> specification how exactly they need to be remembered, i.e. converted
> to being part of the canvas's data.
>
> Let's also mention that trying to get apps into using them is quite
> hopeless. The best you can do is design BiDi around what you already
> have, which pretty much means hard vs. soft line endings, and
> hopefully forthcoming semantical marks around shell prompts. (To
> overcomplicate the story, a received LF doesn't convert the line
> ending to hard wrapped in most terminal emulators. In some it does. I
> don't think there's an exact specification anywhere. Maybe the BiDi
> spec needs to create one. Lines are hard wrapped by default, turned to
> soft wrapped when the text gets wrapped at the end of the line, and a
> few random control functions turn them back to hard one, but in most
> terminals, a newline is not such a control function.)
>
> Anyway, please also see my previous email; I hope that clarifies a lot
> for you, too.
>
>
> cheers,
> egmont
>
> On Tue, Feb 5, 2019 at 5:53 PM Philippe Verdy via Unicode
>  wrote:
> >
> > I think that before making any decision we must make some decision about
> what we mean by "newlines". There are in fact 3 different functions:
> > - (1) soft line breaks (which are used to enforce a maximum display
> width between paragraph margins): these are equivalent to breakable and
> compressible whitespaces, and do not change the logical paragraph
> direction, they don't insert any additional vertical gap between lines, so
> the logical line-height is preserved and continues uninterrupted. If text
> justification applies, this whitespace will be entirely collapsed into the
> end margin, and any text before it will still be justified to match the
> end margin (until the maximum 

Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-06 Thread Egmont Koblinger via Unicode
Hi Richard,

> Not necessarily.  One could allow the first strong character in the
> prompt to determine the paragraph directions

How does Emacs know what's a prompt? How can it tell it from the
previous and next command's output?

Whatever it does to know where the prompt is, can it be made into a
standard, cross-terminal feature?

> That's what the Emacs
> terminal (invoked by M-x term; top level definition in term.el) does.

I tried it. Executed my default shell, and inside that, a "cat
TUTORIAL.he". All the paragraphs are rendered as LTR ones,
left-aligned. Not the way the file is opened in Emacs.

If you claim Emacs's built-in terminal emulator supports BiDi, I'm
kindly asking you to present a documentation of its behavior, in
similar spirit to my BiDi proposal.

> Not necessarily.  One might use cat to glue together files that had
> split into 1400k chunks, in which case it is not even reasonable to
> expect the end of file to be at a character boundary.  (Yes, floppy
> disks still have their uses.)

I did not say anything about changing cat's behavior. I recommended to
change the convention for such paragraph-oriented text files to end
with two newlines.

> But the white space between paragraphs is a separator, not a
> terminator.  One doesn't require it at the end when formatting
> paragraphs within the cell of a table.

Does this logic also apply to single newline characters? If not, why
not, what's the conceptual difference? If it does, why do text files
end in a newline?


e.


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-06 Thread Richard Wordingham via Unicode
On Wed, 6 Feb 2019 22:01:59 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Eli,
> 
> (I'm getting lost where to reply, and how the subject gets mangled and
> the thread split into different ones.)
> 
> 
> I've thought about it a lot, experimented with Emacs's behavior, and
> I've arrived at the conclusion that we are actually much closer to
> each other than I had thought. Probably there's a lot of
> misunderstanding due to different terminology we used.
> 
> I've set my terminal to RTL paragraph direction (via the relevant
> escape sequence), then did a "cat TUTORIAL.he" (the file taken from
> 26.1), and compared to what I see in Emacs 25.2.2 – both the graphical
> one, and the one running in a terminal of no BiDi.
> 
> Apart from a few minor irrelevant differences, they look the same!
> Hooray!!!
> 
> (The differences are:
> 
> - I had to slightly modify TUTORIAL.he to make sure none of the lines
> start with a BiDi control (I added a preceding character) because
> currently VTE doesn't support them, there's no character cell to store
> this data. This definitely needs to be fixed in the second version of
> my proposal.
> 
> - Emacs running in a terminal shows an underscore wherever there's a
> BiDi control in the source file – while the graphical one doesn't.
> This looks like a simple bug to me, right?
> 
> - Line 1007, the copyright line of this file uses visual indentation,
> and Emacs detects LTR paragraph for that line. I think it should
> rather use BiDi controls to have an overall RTL paragraph direction
> detected, and within that BiDi controls to force LTR for the text. The
> terminal shows it with RTL direction, as I manually set it.
> 
> Again, all these three details are irrelevant to my point, namely that
> in WIP gnome-terminal it looks the same as in Emacs.)
> 
> 
> You define paragraphs as emptyline-separated blocks on which you
> perform autodetection of the paragraph direction. This is great! As
> I've mentioned, I'd love to have such a mode in terminals, but it's
> subject to underlying improvements, like knowing when a prompt starts
> and ends, because prompts also have to be paragraph delimiters.

Not necessarily.  One could allow the first strong character in the
prompt to determine the paragraph directions.  That's what the Emacs
terminal (invoked by M-x term; top level definition in term.el) does.

> On a nitpicking side note:
> 
> It's damn ugly not to terminate a text file with a newline. Newline is
> much better thought of as a "terminator" than a "delimiter". For example,
> if you do a "cat file1 file2", you expect file2 to start on its own
> line.

Not necessarily.  One might use cat to glue together files that had
split into 1400k chunks, in which case it is not even reasonable to
expect the end of file to be at a character boundary.  (Yes, floppy
disks still have their uses.)

> Shouldn't this apply to paragraphs, too, especially when BiDi is in
> the game? I'd argue that an empty line (double newline) shouldn't be a
> delimiter, it should be a terminator for a paragraph.

But the white space between paragraphs is a separator, not a
terminator.  One doesn't require it at the end when formatting
paragraphs within the cell of a table. 

Richard.



Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-06 Thread Egmont Koblinger via Unicode
Hi,

I was loose with my terminology once again, which is not a wise thing
when you're trying to clarify misunderstandings :)

> But once you have
> decided on a direction, each _line_ within that data is passed
> separately to the BiDi algorithm to get reshuffled; this is what Emacs
> does, this is what my specification says, and this is the right thing.
> That is, for this step, the definition of "paragraph", as the BiDi
> algorithm uses this term, is a line of the text file.

I keep thinking of the BiDi algorithm as one that takes a single
paragraph, because that's how I use it in VTE. But in fact, the BiDi
algorithm starts by splitting into paragraphs. I keep forgetting about
this outermost "for loop" of the BiDi algo.

And with that proper definition, you can of course pass the entire
emptyline-delimited segment into the BiDi algorithm in a single step.
In its first phase, the BiDi algorithm will split it at newlines,
because for the BiDi algorithm (but not when detecting the paragraph
direction in Emacs), newline is the paragraph delimiter. Then it will
execute the rest of the algorithm for each paragraph (that is: line)
separately.

This is exactly the same as splitting manually, and then for each line
invoking the BiDi algorithm.
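
To make the equivalence concrete, here is a minimal Python sketch. It
is an illustration only: the function names are made up, and
"per_paragraph" is a placeholder for the remaining steps of the BiDi
algorithm, which are not implemented here. The splitting relies on
Python's unicodedata module reporting bidi class "B" for paragraph
separators such as LF.

```python
import unicodedata


def split_paragraphs(text):
    """Split text at characters of bidi class B (paragraph separators).

    CR followed by LF counts as a single separator. The separators
    themselves are dropped from the returned paragraphs.
    """
    paragraphs, current = [], []
    i = 0
    while i < len(text):
        ch = text[i]
        if unicodedata.bidirectional(ch) == "B":
            if ch == "\r" and text[i + 1:i + 2] == "\n":
                i += 1  # swallow the LF of a CRLF pair
            paragraphs.append("".join(current))
            current = []
        else:
            current.append(ch)
        i += 1
    if current:  # text after the last separator, if any
        paragraphs.append("".join(current))
    return paragraphs


def run_on_segment(segment, per_paragraph):
    """Hand over the whole segment; the outer loop does the splitting."""
    return [per_paragraph(p) for p in split_paragraphs(segment)]


def run_per_line(segment, per_paragraph):
    """Split manually at LF, then process each line separately."""
    return [per_paragraph(line) for line in segment.split("\n")]


# An emptyline-delimited segment: three lines, no empty line inside.
segment = "first line\nsecond \u05d0\u05d1\u05d2 line\nthird line"
placeholder = lambda p: (len(p), p)  # stands in for the rest of the UBA
print(run_on_segment(segment, placeholder) == run_per_line(segment, placeholder))
```

Both routes see the same three paragraphs, so whatever is done per
paragraph afterwards gives the same result.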


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-06 Thread Egmont Koblinger via Unicode
Hi Philippe,

Thanks a lot for your input!

Another fundamental difficulty with terminal emulators is: These
controls (CR, LF...) are control instructions that move the cursor in
some ways, and then are forgotten. You cannot do BiDi on the
instructions the terminal receives. You can only do BiDi on the
result, the contents of the canvas after these instructions are
executed. Here these controls are either lost, or you have to give a
specification how exactly they need to be remembered, i.e. converted
to being part of the canvas's data.

Let's also mention that trying to get apps into using them is quite
hopeless. The best you can do is design BiDi around what you already
have, which pretty much means hard vs. soft line endings, and
hopefully forthcoming semantical marks around shell prompts. (To
overcomplicate the story, a received LF doesn't convert the line
ending to hard wrapped in most terminal emulators. In some it does. I
don't think there's an exact specification anywhere. Maybe the BiDi
spec needs to create one. Lines are hard wrapped by default, turned to
soft wrapped when the text gets wrapped at the end of the line, and a
few random control functions turn them back to hard one, but in most
terminals, a newline is not such a control function.)
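
A toy model may help to picture this. The Python sketch below is an
illustration under strong assumptions, not how VTE or any other
emulator is implemented: the names Canvas and feed are hypothetical,
the cursor only ever sits on the last row, and LF is treated as
"go to the start of the next row". The point is that the LF itself is
not stored anywhere; only the per-row soft/hard flag survives on the
canvas, and it is set by overflow, not by the newline.

```python
class Canvas:
    """A grid that remembers, per row, whether its ending is soft."""

    def __init__(self, width):
        self.width = width
        self.rows = [[]]     # each row is a list of characters
        self.soft = [False]  # soft[i]: row i was wrapped into row i + 1
        self.col = 0

    def _new_row(self):
        self.rows.append([])
        self.soft.append(False)  # rows start out hard wrapped
        self.col = 0

    def feed(self, data):
        for ch in data:
            if ch == "\n":
                # Moves the cursor and is then forgotten; it does not
                # touch the soft/hard flag of the row being left.
                self._new_row()
            else:
                if self.col == self.width:
                    # Text ran past the right edge: mark the row ending
                    # as soft and continue on the next row.
                    self.soft[-1] = True
                    self._new_row()
                self.rows[-1].append(ch)
                self.col += 1


canvas = Canvas(width=4)
canvas.feed("abcdef\ngh")
for row, soft in zip(canvas.rows, canvas.soft):
    print("".join(row), "soft" if soft else "hard")
```

Any BiDi processing can only start from "rows" and "soft", since
that is all that remains once the instructions have been executed.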

Anyway, please also see my previous email; I hope that clarifies a lot
for you, too.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-06 Thread Egmont Koblinger via Unicode
Hi Eli,

(I'm getting lost where to reply, and how the subject gets mangled and
the thread split into different ones.)


I've thought about it a lot, experimented with Emacs's behavior, and
I've arrived at the conclusion that we are actually much closer to
each other than I had thought. Probably there's a lot of
misunderstanding due to different terminology we used.

I've set my terminal to RTL paragraph direction (via the relevant
escape sequence), then did a "cat TUTORIAL.he" (the file taken from
Emacs 26.1), and compared it to what I see in Emacs 25.2.2 – both the
graphical one, and the one running in a terminal without BiDi support.

Apart from a few minor irrelevant differences, they look the same! Hooray!!!

(The differences are:

- I had to slightly modify TUTORIAL.he to make sure none of the lines
start with a BiDi control (I added a preceding character) because
currently VTE doesn't support them, there's no character cell to store
this data. This definitely needs to be fixed in the second version of
my proposal.

- Emacs running in a terminal shows an underscore wherever there's a
BiDi control in the source file – while the graphical one doesn't.
This looks like a simple bug to me, right?

- Line 1007, the copyright line of this file uses visual indentation,
and Emacs detects LTR paragraph for that line. I think it should
rather use BiDi controls to have an overall RTL paragraph direction
detected, and within that BiDi controls to force LTR for the text. The
terminal shows it with RTL direction, as I manually set it.

Again, all these three details are irrelevant to my point, namely that
in WIP gnome-terminal it looks the same as in Emacs.)


You define paragraphs as emptyline-separated blocks on which you
perform autodetection of the paragraph direction. This is great! As
I've mentioned, I'd love to have such a mode in terminals, but it's
subject to underlying improvements, like knowing when a prompt starts
and ends, because prompts also have to be paragraph delimiters. You
convinced me that it's much more important than I thought, thanks a
lot for that! I will try to see if I can push for addressing the
prerequisite issues sooner. Indeed I had to manually set RTL paragraph
direction; with manual LTR or with per-line autodetection (as VTE can
do now) the result would be much worse.


Here's how the story continues from here. Here is where we
misunderstood each other (or at the very least I misunderstood you),
although we are talking about the same, doing things the same way:

The BiDi algorithm takes a paragraph of text at a time, and somehow
reshuffles its letters. UAX#9 section 3 starts by saying that the
first main phase is separation into "paragraphs". What are those
"paragraphs" that we're talking about _now_?

The thing is, both in Emacs as well as in my specification, it's a
logical line of the text (that is: delimited by single newlines). No,
in these steps, when UBA is run, the paragraph is no longer defined as
emptyline-delimited segments, it's defined as lines of the text.

To recap: The _paragraph direction_ is determined in Emacs for
emptyline-delimited segments of data, which I honestly find a great
thing, and would love to do in terminals too, alas at this point it's
blocked by some really nontrivial technical issues. But once you have
decided on a direction, each _line_ within that data is passed
separately to the BiDi algorithm to get reshuffled; this is what Emacs
does, this is what my specification says, and this is the right thing.
That is, for this step, the definition of "paragraph", as the BiDi
algorithm uses this term, is a line of the text file. This is where I
thought we had a disagreement, but we don't, we just misunderstood
each other.
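
The two stages can be sketched in a few lines of Python. This is only
an illustration with hypothetical function names. It assumes that the
direction of a segment is guessed from its first strong character (a
simplified form of rules P2/P3 of UAX#9 that ignores isolates), which
is not necessarily the exact heuristic Emacs uses, and it stops short
of the actual reordering.

```python
import re
import unicodedata


def first_strong_direction(text, default="LTR"):
    """Guess a direction from the first strong character in text."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return default  # no strong character found


def directions_per_line(text):
    """Return (direction, line) pairs.

    Stage 1: the direction is detected once per emptyline-delimited
    segment. Stage 2: every line of the segment becomes its own
    "paragraph" for the BiDi algorithm, with that direction.
    """
    result = []
    for segment in re.split(r"\n(?:[ \t]*\n)+", text.strip("\n")):
        direction = first_strong_direction(segment)
        for line in segment.split("\n"):
            result.append((direction, line))
    return result


sample = (
    "\u05d0\u05d1\u05d2 abc\n"       # starts with Hebrew letters
    "abc \u05d0\u05d1\u05d2\n"       # starts with Latin letters
    "\n"
    "hello\n"
    "\u05e9\u05dc\u05d5\u05dd\n"
)
for direction, line in directions_per_line(sample):
    print(direction, repr(line))
```

The second line starts with Latin letters but still gets RTL, because
the decision was made for the whole segment; with per-line
autodetection it would flip to LTR.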

-

On a nitpicking side note:

It's damn ugly not to terminate a text file with a newline. Newline is
much better thought of as a "terminator" than a "delimiter". For example,
if you do a "cat file1 file2", you expect file2 to start on its own
line.

Shouldn't this apply to paragraphs, too, especially when BiDi is in
the game? I'd argue that an empty line (double newline) shouldn't be a
delimiter, it should be a terminator for a paragraph. I think "cat
file1 file2" should make sure that the last paragraph of file1 and the
first paragraph of file2 are printed as separate paragraphs
(potentially with different paragraph direction), shouldn't it? I'd
argue that if a text file is formatted like TUTORIAL.he, with empty
lines denoting paragraph boundaries, then it should also end in an
empty line (that is: two newline characters).
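
A small Python sketch of what happens to paragraph boundaries under
"cat", assuming paragraphs are the chunks between empty lines. The
helper names (blocks, cat) and the sample contents are made up for
the illustration.

```python
import re


def blocks(text):
    """Split text into chunks separated by one or more empty lines."""
    chunks = re.split(r"\n(?:[ \t]*\n)+", text.strip("\n"))
    return [chunk for chunk in chunks if chunk]


def cat(*files):
    """Concatenate file contents the way cat does: byte after byte."""
    return "".join(files)


file2 = "three\n\nfour\n"
delimited = "one\n\ntwo\n"      # last paragraph ends with one newline
terminated = "one\n\ntwo\n\n"   # last paragraph ends with an empty line

print(blocks(cat(delimited, file2)))   # "two" and "three" merge
print(blocks(cat(terminated, file2)))  # four separate paragraphs
```

With the empty line used as a terminator, the boundary between the
two files survives the concatenation; with it used only as a
delimiter, the last paragraph of the first file and the first
paragraph of the second one are glued together.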

-

Feel free to skip the rest :)

Let's make a thought experiment. Let's assume that for running the
BiDi algorithm, we'd still stick to the emptyline-delimited paragraph
definition. This is not what you do, this is not what I do, but I
misunderstood that this is what you did, and I also thought this was a
good idea as a potential extension for the BiDi specs – I no longer
think so. This definition is truly problematic, as I'll 

Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-05 Thread Philippe Verdy via Unicode
I think that before making any decision we must make some decision about
what we mean by "newlines". There are in fact 3 different functions:
- (1) soft line breaks (which are used to enforce a maximum display width
between paragraph margins): these are equivalent to breakable and
compressible whitespaces, and do not change the logical paragraph
direction, they don't insert any additional vertical gap between lines, so
the logical line-height is preserved and continues uninterrupted. If text
justification applies, this whitespace will be entirely collapsed into the
end margin, and any text before it will still be justified to match the
end margin (until the maximum expansion of other whitespaces in the middle
is reached, and the maximum intercharacter gap is also reached (in which
case, that line will no longer be expanded)), but this does not apply
to terminal emulators that normally never use text justification, so the
text will just be aligned to the start margin and whitespaces before it on
the same line are preserved, and collapsed only at end of the line (just
before the soft line break itself)
- (2) hard line breaks: they break to a new line but continue the paragraph
within its same logical direction, but they are not compressible
whitespaces (and do not depend on the logical end margin of the paragraph).
- (3) paragraph breaks: generally they introduce an additional vertical gap
with top and bottom margins

The problem in terminals is that they usually cannot distinguish types (1)
and (2), they are simply encoded by a single CR, or LF, or CR+LF, or NEL.
Type (1) is only existing within the framework of a higher level protocol
which gives additional interpretation to these "newlines". The special
control LS is almost never used but may be used for type (1) i.e. soft
line-breaks, and will fallback to type (2) which is represented by the
legacy "simple" newlines (single CR, or single LF, or single CR+LF, or
single NEL). I have seen very little or no use of the LS (line separator)
special control.

Type (3) may be encoded with PS (paragraph separator), but in terminals
(and common protocols like MIME) it is usually encoded using a couple of
newlines (CR+CR, or LF+LF, or CR+LF+CR+LF, or NL+NL) possibly with
additional whitespaces (and additional presentation characters such as ">"
in quotations inserted in mail responses) between them (needed for MIME and
HTTP) which may be collapsed when rendering or interpreting them.

Some terminal protocols can also use other legacy ASCII separators such as
FS, GS, RS, US for grouping units containing multiple paragraphs, or
STX/EOT pairs for encapsulating whole text documents in a
protocol-specific envelope format (and will also use some escaping
mechanism for special controls found in the middle, such as DLE+control to
escape the control, or DLE+0 to escape a NUL, or DLE+# to escape a DEL, or
DEL+x+NN where N are a fixed number of hexadecimal, decimal or octal
digits). There's a wide variety of escaping mechanisms used by various
higher-layer protocols (including transport protocols or encoding syntaxes
used just below the plain-text layer, in a lower layer than the transport
protocol layer).

Le lun. 4 févr. 2019 à 21:46, Eli Zaretskii via Unicode 
a écrit :

> > Date: Mon, 4 Feb 2019 19:45:13 +
> > From: Richard Wordingham via Unicode 
> >
> > Yes.  If one has a text composed of LTR and RTL paragraphs, one has to
> > choose how far apart their starting margins are.  I think that could
> > get complicated for plain text if the terminal has unbounded width.
>
> But no real-life terminal does.  The width is always bounded.
>


Re: Bidi paragraph direction in terminal emulators

2019-02-05 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Tue, 5 Feb 2019 02:28:50 +0100
> Cc: unicode@unicode.org
> 
> I have to admit, I'm not an Emacs user, I only have some vague ideas
> how powerful a tool it is. But in its very core I still believe it's a
> text editor – is it fair to say this? It could be used for example to
> conveniently create TUTORIAL.he.

It is a text editing/processing environment which has a lot of
text-based applications built on top of it.  It could be (and was) used
to create TUTORIAL.he, but it can be and is used for much more.

> There are plenty of line-oriented tools.
> [...]

Actually, for every utility you mention, Emacs has a command that
either invokes the utility and presents its output, or does the same
job by using built-in features.  So most/all of the jobs you mention
are routinely done in Emacs.  After all, Emacs is a programmer's
editor at its core, so every job programmers routinely do from the
shell prompt has an equivalent feature in Emacs.  You can even run
shells inside Emacs, with Emacs serving as a terminal emulator (which
then supports bidi ;-).

> There are just sooo many use cases, it's impossible to perfectly
> address all of them at once.

I don't think you need to look for a perfect solution.  You need to
look for one that works reasonably well in practice.  It is my
experience in Emacs that the empty line as paragraph delimiter
produces much better results than if you treat each line as a separate
paragraph.  We do have features in Emacs that allow overriding the
default paragraph direction, but experience shows that they are used
relatively rarely.

> I'm confident that my specification which says that it should be
> preserved as a 100 character long paragraph and passed to BiDi
> accordingly is already a significant step forward.

I agree, but I urge you to make one more step, which IME is really
necessary.


Re: Bidi paragraph direction in terminal emulators

2019-02-05 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Tue, 5 Feb 2019 01:32:34 +0100
> Cc: unicode@unicode.org
> 
> On the other hand, it's not unreasonable for higher level stuff (e.g.
> shell scripts, or tools like "zip") to use such control characters.

Yes, but most of them won't ever do that.

> > No, this simple case must work reasonably well with the application
> > _completely_ oblivious to the bidi aspects. If this can't work
> > reasonably well, I submit that the entire concept of having a
> > bidi-aware terminal emulator doesn't "hold water".
> 
> There isn't a magic wand. I can't magically fix every BiDi stuff by
> changing the terminal emulator's source code.

I didn't say "magically fix", I said "work reasonably well".  I think
it would be a mistake to demand that any alternative to the default
each-line-is-a-new-paragraph method must be perfect.  It should be
enough if an alternative is better.

> What my specification essentially modifies is that with this
> specification, you at least will have a chance to get the mode right.

My experience is that this is an important feature to have, but it
will (maybe even should) be used rather rarely.  In most cases you
will just have plain text.

Moreover, emitting the control sequences that set the mode is in
itself a complication, because if the terminal doesn't support them,
the result could be corrupted display.  You will need methods of
detecting the support, and those detection methods usually involve
sending another control sequence to the terminal and waiting for
response, something that complicates applications and causes delays in
displaying output.

> In case of "zip", the creators of that software know exactly what the
> output should look like

Not necessarily true.  The translations are normally prepared by
people who are experts only in translating messages, they don't
necessarily consider layout issues, because for that you'd need to
look at the code or even run the program, something translators are
unlikely to do.

> If you're about to internationalize your software, this layout is a
> pretty bad choice.

Tell me about that!

But the reality is that this is what you get, and IMO the solution for
displaying this on a terminal should work reasonably well with that.

> This kind of formatting also ignores that English is a pretty dense
> language, in other languages the strings tend to become longer.

Actually, some/many RTL scripts tend to produce shorter text, because
vowels are not written, and because many words have very short roots.
But this is a tangent.


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators

2019-02-05 Thread Eli Zaretskii via Unicode
> Date: Tue, 5 Feb 2019 00:05:47 +
> From: Richard Wordingham via Unicode 
> 
> > > Actually, UAX#9 defines "paragraph" as the chunk of text delimited
> > > by paragraph separator characters. This means characters whose bidi
> > > category is B, which includes Newline, the CR-LF pair on Windows,
> > > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.
> 
> It actually gives two different definitions. UAX#9 Table 4 restricts
> the type B to *appropriate newline functions; not all newlines are
> paragraph separators.

For what exactly an "appropriate newline function" is, one should read the
Unicode Standard, section 5.8.  My conclusions from that are different
from yours; see below.

> > Indeed, this was an oversight on my side. So, with this definition,
> > every single newline character starts a new paragraph. The result of
> > printf "Hello\nWorld\n" > world.txt
> > is a text file consisting of two paragraphs, with 5 characters in
> > each. Correct?
> 
> No, it depends on when a newline function is 'appropriate'. TUS 5.8
> Rule R2b applies - 'In simple text editors, interpret any NLF the same
> as LS'.

That's not all of what the Standard says.  Just a couple of paragraphs
above Rule R2b, there's this text:

  Note that even if an implementer knows which characters represent
  NLF on a particular platform, CR, LF, CRLF, and NEL should be
  treated the same on input and in interpretation. Only on output is
  it necessary to distinguish between them.

So in practice, IMO the above example does constitute 2 paragraphs,
regardless of the underlying platform's conventions.
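
As an illustration of that reading, here is a short Python sketch
that treats CR, LF, CRLF and NEL alike on input. The function name is
made up, and the choice to drop the empty string after a trailing NLF
is an assumption of this sketch (the NLF is taken to end the last
paragraph rather than to start an empty one).

```python
import re

# CRLF must be tried first so that it counts as one newline function.
NLF = re.compile(r"\r\n|[\r\n\x85]")


def nlf_paragraphs(text):
    """Split text into paragraphs at every newline function (NLF)."""
    parts = NLF.split(text)
    if parts and parts[-1] == "":
        parts.pop()  # a trailing NLF ends the last paragraph
    return parts


for newline in ("\n", "\r", "\r\n", "\x85"):
    data = "Hello" + newline + "World" + newline
    print(repr(newline), nlf_paragraphs(data))
```

Each of the four encodings yields the same two paragraphs of five
characters, "Hello" and "World".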


Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators)

2019-02-05 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Tue, 5 Feb 2019 00:08:10 +0100
> Cc: unicode@unicode.org
> 
> every single newline character starts a new paragraph. The result of
> printf "Hello\nWorld\n" > world.txt
> is a text file consisting of two paragraphs, with 5 characters in each. 
> Correct?

Yes.

> > Actually, Emacs implements the rule that paragraphs are separated by
> > empty lines. This is documented in the Emacs manuals.
> 
> That is, Emacs overrides UAX#9 and comes up with a different
> definition?

Yes, Emacs uses the "higher-level protocols" clause in HL1, when the
paragraph direction is to be determined from the text.  (There's also
a way for the user or a Lisp program to force a certain base paragraph
direction on all paragraphs in a window that displays some text.)

> Furthermore, you argue that in terminals I should follow
> Emacs's definition rather than Unicode's?

IME, what Emacs uses gives much better results, yes.

> I believe I understand your concerns with the per-line paragraph
> definition, but this interpretation that I've just shown most likely
> leads to even more broken behavior.

I don't see how the result could be more broken, when the decisions
about base paragraph direction are made much more rarely.  The places
in text where the paragraph direction will be determined under my
proposal is a small subset of the places where it will be determined
by the default UBA rules.  So it will make the same mistakes as the
each-line-is-a-new-paragraph method, but there will be far fewer of
such mistakes.

In addition to this theoretical argument, I have 10 years of using
this in Emacs to back me up.  The only difference between Emacs and
your example is the very first paragraph.

> It's a really nontrivial technical problem to let the terminal
> emulator know where each prompt, and/or each command's output begins
> and ends. There's work going on for letting the terminal emulator
> recognize the prompts, but even if it's successful, it'll probably
> take 5-10 years to reach the majority of the users. And it probably
> still wouldn't solve the case of knowing the boundary between the two
> outputs if a "cat file1.txt; cat file2.txt" is executed, let alone if
> they're concatenated with "cat file1.txt file2.txt".

I think you are trying to find a perfect solution, and because it
probably doesn't exist, or at least is hard to come by, you conclude
that a solution that is imperfect should be rejected.

But I'm not saying my proposal is the perfect solution, just that it
is better (sometimes, way better) than the default of considering each
line a paragraph.

> So, what you're arguing for, is that the default behavior should be
> something that's:
> - currently not implementable in a semantically correct way (to stop
> around shell prompts) due to technical limitations, and
> - isn't what Unicode says.

The first point has to do with the search for a perfect solution.  My
advice is to settle for something reasonable even if it is not
perfect.

The second point is incorrect: the UBA explicitly allows the
implementation to apply higher-level protocols for paragraph
direction, see HL1 in UAX#9.

> You have not convinced me that the pros outweigh the cons.

There are no cons in my proposal that aren't already present in the
default each-line-is-a-new-paragraph rule.  So even if the pros don't
outweigh the cons, the balance should be better than under the default.

> That being said, I'm more than open to see such a behavior as a
> future extension, subject of course to the semantic prompt stuff
> being available.

I think the default should provide reasonably good display, and
each-line-is-a-new-paragraph doesn't.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Richard Wordingham via Unicode
On Mon, 4 Feb 2019 22:27:39 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Richard,
> 
> > The concept appears to exist in the form of the fields of the
> > fifth edition of ECMA-48.  Have you digested this ambitious
> > standard?  
> 
> To be honest: No, I haven't. And I have no idea what those "fields"
> are.

(Taken out of order)

> That being said, I'd really, honestly love to see if someone evaluated
> ECMA's "fields" and created a feasibility study for current terminal
> emulators, similarly to how I did it with TR/53.

They mostly seem to be security, protection and checking features.
They seem to make sense for a captive system used as a till or for stock
look-up by customers.  For example, fields can be restricted as to how
they are overwritten, e.g. not at all, or only with numbers, and some
fields cannot be copied from the terminal.  HTML forms seem to provide
most of this functionality nowadays.

Fields are persistent attributes.

On reading further, the pane boundary functionality seems to be
provided by the 'line home position' and 'line limit position'.  These
would have to be re-established whenever a pane became the active pane,
but they seem to support the notion of writing a paragraph into a
pane, with the terminal sorting out the splitting into lines.  I'm not
sure that this would be portable between ECMA-48 terminals; I get
the impression that there would be a reliance on unstandardised
behaviour being appropriate.  I could be wrong; the specification may
be there.

> I spent (read: wasted) way too much time studying ECMA TR/53 to get to
> understand what it's talking about, to realize that the good parts
> were already obvious to me, and to be able to argue why I firmly
> believe that the bad parts are bad. Remember: These documents were
> created in 1991, that is, 28 years ago. (I'm emphasizing it because I
> did the math wrong for a long time, I thought it was 18 years ago :-D.)
> Things have changed a lot since then.

It took me a while to work out that the recommendations of ECMA TR/53
had been implemented in Issue 5 of ECMA-48.

> As for the BiDi docs, I found that the current state of the art,
> current best practices, existing BiDi algorithm differ so much from
> ECMA's approach (which no one I'm aware of cared to implement for 28
> years) that the standard is of pretty little use. Only a few good
> parts could be kept (but needed tiny corrections), and plenty of other
> things needed to be built up anew. This is the only reasonable way to
> move forward.

The relationship between the data store and the presentation store
doesn't seem to be very well defined.  There may be room for the BiDi
algorithm there.

> If you designed a house 2 or 3 years ago, and finally have the money
> to get it built, you can reasonably start building it. If you designed
> a house 28 years ago and finally have the chance to build it
> (including the exact same heating technologies, electrical system
> etc.), you wouldn't, would you? I'm sure you looked at those plans,
> and started at the very least heavily updating them, or started to
> design a brand new one, perhaps somewhat based on your old ideas.

But a scheme may be more persuasive if it can be said to conform to
ECMA-48.

One thing that is very unclear in ECMA-48 is how characters are
allocated to cells in 'implicit' mode.  As the Arabic encoding
considered contained harakat, it looks as though the allocation is
defined by 'unspecified protocols'. I note that in the scheme
apparently given most consideration, forced Arabic presentation forms
are selected by a combination of escape sequences and Arabic letters.
The 'unspecified protocols' could be interpreted as one grapheme
cluster* per group of cells.  The typical groups would be one cell and
the two cells for a CJK character.

*Grapheme cluster is a customisable concept.
 
> I don't expect it to be any different with "fields" of ECMA-48. I'm
> not aware of any terminal emulator implementing anything like them,
> whatever they are. Probably there's a good reason for that. Whatever
> purpose they aimed to serve apparently wasn't important enough for
> such a long time. By now, if they're found important, they should
> probably be solved by some new design (or at the very least, just like
> I did with TR/53, the work should begin by evaluating that standard to
> see if it's still feasible).

> Instead of spending a huge amount of work on my BiDi proposal, I could
> have just said: "guys, let's go with ECMA for BiDi handling". The
> thing is, I'm pretty sure it wouldn't have taken us anywhere. I don't
> expect it to be different with "fields" either.

Your interpretation document would have explored the issues.

> The starting point for my work was the current state of terminal
> emulators and the surrounding ecosystem, plus the current BiDi
> algorithm; not some ancient plan that was buried deep in some drawer
> for almost three decades. I hope this makes sense.

You're assuming that the 

Re: Bidi paragraph direction in terminal emulators BiDi in terminal emulators)

2019-02-04 Thread Egmont Koblinger via Unicode
Hi Eli,

> IME, this is a grave mistake.  I hope I explained why; it is now up to
> you to decide what to do about that.

Let me share one more thought.

I have to admit, I'm not an Emacs user, I only have some vague ideas
how powerful a tool it is. But in its very core I still believe it's a
text editor – is it fair to say this? It could be used for example to
conveniently create TUTORIAL.he.

I'm not aware of all the kinds of work you can do in Emacs, but I
have a feeling that the kind of work you do in a terminal emulator is
potentially more diverse. (Let's not nitpick that a terminal can run
emacs and emacs has a terminal inside so mathematically speaking it's
all the same...)

"cat TUTORIAL.he" is indeed one of the commands you can execute in a
terminal, and unfortunately, given what terminals currently understand
from their contents, I just cannot make it display as you would prefer
(and I agree would make a lot of sense). But it's just one use case.

There are plenty of line-oriented tools.

Think of "head" and "tail". They operate on lines of files, which end
up being paragraphs in the terminal according to my definition.
According to your definition, they could cut a paragraph in half, they
could render differently than as if the entire file was printed.
According to my definition, you'll always get the same visual
representation, just on the given fragment of the file.

Think of "grep", possibly combined with "-r" to process files
recursively, and "-C" to print context lines. Not only can it cut
paragraphs (of your definition) in half when it displays the matching
line (plus context), but also how would you locate in its output when
it switches from one match's context to the next match's context
within the same file, or to a match in another file? How would you
define a paragraph, and how would you define the bigger unit on which
the paragraph direction is guessed? I think it's again a use case
where my definition of paragraph is less problematic than yours.
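
The effect on "head", "tail" and "grep" can be shown with a small
Python sketch. It is an illustration only: the function names are
hypothetical, direction detection is reduced to the first strong
character (a simplification of UAX#9 P2/P3), and the sample lines are
invented.

```python
import unicodedata


def first_strong_direction(text, default="LTR"):
    """Guess a direction from the first strong character in text."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":
            return "LTR"
        if bidi_class in ("R", "AL"):
            return "RTL"
    return default


def per_line(lines):
    """Every line is its own paragraph and gets its own direction."""
    return [first_strong_direction(line) for line in lines]


def per_block(lines):
    """One direction per run of non-empty lines, applied to each line."""
    out, block = [], []

    def flush():
        direction = first_strong_direction("\n".join(block))
        out.extend([direction] * len(block))
        block.clear()

    for line in lines:
        if line.strip():
            block.append(line)
        else:
            flush()
            out.append("LTR")  # empty line: direction does not matter
    flush()
    return out


lines = ["See the file:", "\u05e7\u05d5\u05d1\u05e5 readme"]
tail = lines[-1:]  # what "tail -n 1" would print

print(per_line(lines)[-1], per_line(tail)[0])    # same direction
print(per_block(lines)[-1], per_block(tail)[0])  # direction changes
```

With the per-line definition the last line is displayed the same way
whether the whole file or only its tail is printed. With the
block-based definition the same line flips from LTR to RTL once the
line above it has been cut away.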

Think of ad-hoc shell scripts that use "echo"/"printf" to inform the
user, "read" to read data etc. Or utilities written in C or whatever
that don't care about terminals at all, just print output. In these
cases there's no formatting / wrapping at 80 columns performed by
the app. A logical segment is typically printed as a single line,
which will be wrapped by the terminal if it doesn't fit in the current
width (and in some terminals rewrapped when the terminal is resized);
this matches my definition of paragraph. There's rarely an empty line
injected in these cases; if there is, it is most likely to separate
some even bigger semantical units.

There are just sooo many use cases, it's impossible to perfectly
address all of them at once. "cat TUTORIAL.he" is just one of them,
not necessarily the most typical, not necessarily the one that should
drive the BiDi design.

Let's note that the four "BiDi-aware" terminals that I could test all
define paragraphs as lines – I mean visual lines on their own canvas.
If the terminal is 80 characters wide, and a utility prints a line of
100 characters, it'll obviously wrap into 80+20 characters. And then
these terminals treat them as two separate paragraphs, one with 80
characters and one with 20, and run BiDi separately on them. I'm
confident that my specification which says that it should be preserved
as a 100 character long paragraph and passed to BiDi accordingly is
already a significant step forward.
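
To make the difference concrete, here is a minimal sketch in Python of the
two ways of choosing the units handed to the BiDi algorithm. The function
names are made up for illustration, and the sketch assumes every character
occupies exactly one cell (no wide or combining characters).

```python
def visual_rows(line: str, width: int) -> list[str]:
    """Cut one logical line into the rows a width-column grid would show."""
    return [line[i:i + width] for i in range(0, len(line), width)] or [""]


def bidi_units_per_row(line: str, width: int) -> list[str]:
    # Behavior described above for the tested terminals:
    # every visual row is treated as a separate paragraph.
    return visual_rows(line, width)


def bidi_units_per_logical_line(line: str, width: int) -> list[str]:
    # Behavior described for the specification: the whole logical line
    # stays one paragraph; wrapping happens only after BiDi has run.
    return [line]


line = "x" * 100
print([len(u) for u in bidi_units_per_row(line, 80)])
print([len(u) for u in bidi_units_per_logical_line(line, 80)])
```

The first call yields two units of 80 and 20 characters, the second a
single unit of 100 characters.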


cheers,
egmont



Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Egmont Koblinger via Unicode
Hi Eli,

> I think it's unreasonable and impractical to expect 'echo', 'cat', and
> its ilk to emit bidi controls (or any other controls) to force
> paragraph direction.  For starters, they won't know what direction to
> force, because they don't understand the text they are processing.

I agree, it is unreasonable for 'echo', 'cat' etc. to emit BiDi controls.

There could be some higher level helper utilities though, let's say a
"bidi-cat" that examines the file, makes a guess, emits the
corresponding escape sequences and cats the file. It's not necessarily
a good approach, but a possible one (at least temporarily until
terminals implement a better one).
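
A rough sketch of what such a hypothetical "bidi-cat" might look like in
Python. Everything here is an assumption made for illustration: the guess
is a simplified first-strong-character scan (it ignores the isolate
handling of UAX#9 rules P2/P3), and the escape sequences are ECMA-48's
SCP (Select Character Path), which a given terminal may or may not
interpret as a paragraph direction switch.

```python
import sys
import unicodedata

CSI = "\x1b["
# Assumption: ECMA-48 SCP, "CSI 1 SP k" for LTR and "CSI 2 SP k" for RTL.
# Whether a terminal honors these is not guaranteed.
SCP_LTR = CSI + "1 k"
SCP_RTL = CSI + "2 k"


def guess_direction(text: str) -> str:
    """Return "rtl" if the first strong character is R or AL, else "ltr"."""
    for ch in text:
        cls = unicodedata.bidirectional(ch)
        if cls in ("R", "AL"):
            return "rtl"
        if cls == "L":
            return "ltr"
    return "ltr"  # no strong character at all: fall back to LTR


def bidi_cat(text: str) -> str:
    """Prefix the text with the escape sequence matching the guess."""
    prefix = SCP_RTL if guess_direction(text) == "rtl" else SCP_LTR
    return prefix + text


if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], encoding="utf-8") as f:
        sys.stdout.write(bidi_cat(f.read()))
```

The sketch makes one guess for the whole file and does not restore the
previous direction afterwards; a real helper would have to decide both.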

On the other hand, it's not unreasonable for higher level stuff (e.g.
shell scripts, or tools like "zip") to use such control characters.

> No, this simple case must work reasonably well with the application
> _completely_ oblivious to the bidi aspects.  If this can't work
> reasonably well, I submit that the entire concept of having a
> bidi-aware terminal emulator doesn't "hold water".

There isn't a magic wand. I can't magically fix every BiDi issue by
changing the terminal emulator's source code. Not because I'm clumsy,
but because it just can't be done. If it was possible, I wouldn't have
written a long specification, I would have just done it. (Actually, if
it was possible, others would surely have done it long before I joined
terminal emulator development.)

There need to be multiple modes, some of them due to the technical
particularities of terminal emulation that aren't seen elsewhere (e.g.
explicit vs. implicit), and some of them because they are present
everywhere where it comes to BiDi (e.g. paragraph direction). And if
the mode is not set correctly, things might break, there's nothing new
in it.

What my specification essentially changes is that with it, you will
at least have a chance to get the mode right.

Currently there are perhaps like 4 different behaviors implemented
across terminal emulators when it comes to BiDi. An application cannot
control and cannot query the behavior. In order to get Emacs to behave
properly, you have to ask your users to adjust a setting (and I cannot
repeat enough times that I find this an unacceptable user experience).
If the settings of the terminal aren't what Emacs expects, the result
could be broken (RTL words might even show up in reverse, LTR order).

The same goes for the random example of "zip -h", assuming that they
add Hebrew translation. Given the current set of popular terminal
emulators, there's no way zip could emit some Hebrew text in a
reliably readable way. Whatever it does, there will be terminal
emulators (and settings thereof) where the result is totally broken
(reversed), or at least unpleasant (wrong paragraph direction used).
Moreover, if "zip" emits the Hebrew text in the semantically correct
logical order (e.g. they use whatever existing framework, like gettext
and a popular .po editor), as opposed to the visual LTR order seen in
some legacy systems, it will need different terminal emulator settings
than Emacs, so if someone uses both zip and Emacs regularly, they'll
have to continuously toggle their terminal's settings back and forth –
have I mentioned how unacceptable I find this as a user? :)

One of the key points of my specification is that applications will be
able to automatically set the mode. Emacs will be able to switch to
the mode it requires, and so will be zip. They will have the
opportunity.

If they don't take this opportunity, it's not my problem, and
there's nothing I could do about it. Let's say hypothetically that zip
adds Hebrew translations, but refuses to emit the escape sequence that
switches to RTL paragraph direction, and thus its result doesn't look
perfect. Can terminal emulators, can my specification, can I be
blamed in this case? I don't think so. If zip knows exactly what it
wants to print (as with the help page it knows for sure), and is given
all the technical infrastructure to reliably achieve that, it'd be
solely them to blame if they refused to properly use it. It's
absolutely out of the scope of my work to try to fix this case.

"cat" is substantially different. In case of "zip", the creators of
that software know exactly what the output should look like, and
according to my specification (assuming a conforming terminal
emulator, of course) nothing stops them from achieving it. "cat"
doesn't know, cannot know the desired look, since the file itself
lacks this information.

Paragraph direction is a concept that sucks big time. (I have no idea
how Unicode could have got it better, though.) It's a piece of
information that needs to be carried externally along with the text,
in order to make sure it'll be displayed correctly. It's a pain in the
butt, just as much as carrying the encoding in the pre-Unicode days was,
which hardly anyone cared about, resulting in incorrect accented letters
way too often. Practically everyone's lazy and 

Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Richard Wordingham via Unicode
On Tue, 5 Feb 2019 00:08:10 +0100
Egmont Koblinger via Unicode  wrote:

> Hi Eli,
> 
> > Actually, UAX#9 defines "paragraph" as the chunk of text delimited
> > by paragraph separator characters.  This means characters whose bidi
> > category is B, which includes Newline, the CR-LF pair on Windows,
> > U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.  

It actually gives two different definitions.  UAX#9 Table 4 restricts
type B to *appropriate* newline functions; not all newlines are
paragraph separators.

> Indeed, this was an oversight on my side. So, with this definition,
> every single newline character starts a new paragraph. The result of
> printf "Hello\nWorld\n" > world.txt
> is a text file consisting of two paragraphs, with 5 characters in
> each. Correct?

No, it depends on when a newline function is 'appropriate'.  TUS 5.8
Rule R2b applies - 'In simple text editors, interpret any NLF the same
as LS'.

> > Actually, Emacs implements the rule that paragraphs are separated by
> > empty lines.  This is documented in the Emacs manuals.  
> 
> That is, Emacs overrides UAX#9 and comes up with a different
> definition? Furthermore, you argue that in terminals I should follow
> Emacs's definition rather than Unicode's? Or please clarify if I
> misunderstood you here.

He's deriving 'B' from a protocol.

Richard.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Egmont Koblinger via Unicode
Hi Eli,

> Actually, UAX#9 defines "paragraph" as the chunk of text delimited by
> paragraph separator characters.  This means characters whose bidi
> category is B, which includes Newline, the CR-LF pair on Windows,
> U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.

Indeed, this was an oversight on my side. So, with this definition,
every single newline character starts a new paragraph. The result of
printf "Hello\nWorld\n" > world.txt
is a text file consisting of two paragraphs, with 5 characters in each. Correct?
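
A minimal sketch in Python of this reading, under which every character
of bidi class B ends a paragraph. The function name is made up for
illustration; the only special case handled is that a CR-LF pair counts
as one separator.

```python
import unicodedata


def split_on_class_b(text: str) -> list[str]:
    """Split text at every character whose bidi class is B."""
    paragraphs = []
    current = []
    prev = ""
    for ch in text:
        if ch == "\n" and prev == "\r":
            # CR already closed the paragraph; treat CR-LF as one separator.
            prev = ch
            continue
        if unicodedata.bidirectional(ch) == "B":
            paragraphs.append("".join(current))
            current = []
        else:
            current.append(ch)
        prev = ch
    if current:
        paragraphs.append("".join(current))
    return paragraphs


paragraphs = split_on_class_b("Hello\nWorld\n")
print(paragraphs, [len(p) for p in paragraphs])
```

Under this reading the file from the printf example splits into "Hello"
and "World", five characters each.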

> Actually, Emacs implements the rule that paragraphs are separated by
> empty lines.  This is documented in the Emacs manuals.

That is, Emacs overrides UAX#9 and comes up with a different
definition? Furthermore, you argue that in terminals I should follow
Emacs's definition rather than Unicode's? Or please clarify if I
misunderstood you here.

> > while Emacs itself is a viewer that treats runs between single
> > newlines as paragraphs. That is, Emacs is inconsistent with itself.
>
> Incorrect.  Emacs always treats a run of text between empty lines as a
> single paragraph, in TUTORIAL.he and everywhere else.  There's nothing
> special about TUTORIAL.he, it is just a plain text file with a few
> dozen of bidi formatting controls, needed to show the key sequences
> with weak and neutral characters in correct visual order.  [...]

Thanks for the clarification, I believe it's clear to me now.

> At least with Emacs, it is not the same.  I think considering each
> line as a separate paragraph makes writing bidi plain-text documents
> that look right almost impossible, if each line ends in a newline [...]

> My personal recommendation is to adopt the empty line rule.  It's
> simple enough and gives good results IME. [...]

> I'm surprised that you describe this as such a complex problem.  I
> think you explained up-thread that terminal emulators should cope with
> lines of text arriving piecemeal, which I interpreted as meaning that
> text is stored in the emulator's memory.  Modern emulators running on
> windowed desktops also provide scroll-back buffers, and react to
> expose events.  So I think the text that is currently in the viewport,
> and also some text previously shown, are stored in memory, and can be
> consulted.

The problem is not the memory management.

Let's look at the following session:

---snip---
prompt$ cat file1.txt
This is the
first human-perceived paragraph.

And this is the
second.
prompt$ cat file2.txt
Here this is the
third paragraph.

And this one is
the fourth.
prompt$
---snip---

If you load the files to Emacs, it is perfectly aware of the contents
of the two files. It can define paragraphs however it wants to, and
BiDi the files accordingly.

The terminal emulator doesn't know what's a shell prompt, what's a
command that the user types, what's the output of that command. (You
don't know either from this snippet. Maybe I only cat'ed file1.txt,
and "prompt$ cat file2.txt" is just the sixth line of this eleven-line
file.)

In the terminal emulator's eyes, with Emacs's definition (empty line
delimited), this is one paragraph:

prompt$ cat file1.txt
This is the
first human-perceived paragraph.

and this is another paragraph:

And this is the
second.
prompt$ cat file2.txt
Here this is the
third paragraph.

and similarly for the third one.

I believe I understand your concerns with the per-line paragraph
definition, but this interpretation that I've just shown most likely
leads to even more broken behavior.
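
The grouping above can be reproduced with a few lines of Python. This is
only a sketch of the empty-line rule applied blindly to what the
terminal receives; the function name is made up for illustration.

```python
SESSION = "\n".join([
    "prompt$ cat file1.txt",
    "This is the",
    "first human-perceived paragraph.",
    "",
    "And this is the",
    "second.",
    "prompt$ cat file2.txt",
    "Here this is the",
    "third paragraph.",
    "",
    "And this one is",
    "the fourth.",
    "prompt$",
])


def split_on_empty_lines(text: str) -> list[list[str]]:
    """Group lines into paragraphs delimited by empty lines."""
    paragraphs = []
    current = []
    for line in text.split("\n"):
        if line == "":
            if current:
                paragraphs.append(current)
                current = []
        else:
            current.append(line)
    if current:
        paragraphs.append(current)
    return paragraphs


for number, paragraph in enumerate(split_on_empty_lines(SESSION), 1):
    print(number, paragraph)
```

The second group contains the end of file1.txt, the next prompt with its
command, and the beginning of file2.txt, all as one paragraph.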

It's a really nontrivial technical problem to let the terminal
emulator know where each prompt, and/or each command's output begins
and ends. There's work going on for letting the terminal emulator
recognize the prompts, but even if it's successful, it'll probably
take 5-10 years to reach the majority of the users. And it probably
still wouldn't solve the case of knowing the boundary between the two
outputs if a "cat file1.txt; cat file2.txt" is executed, let alone if
they're concatenated with "cat file1.txt file2.txt".

So, what you're arguing for, is that the default behavior should be
something that's:
- currently not implementable in a semantically correct way (to stop
around shell prompts) due to technical limitations, and
- isn't what Unicode says.

You have not convinced me that the pros outweigh the cons. That being
said, I'm more than open to see such a behavior as a future extension,
subject of course to the semantic prompt stuff being available.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Richard Wordingham via Unicode
On Mon, 04 Feb 2019 22:39:07 +0200
Eli Zaretskii via Unicode  wrote:

> > Date: Mon, 4 Feb 2019 19:45:13 +
> > From: Richard Wordingham via Unicode 
> > 
> > Yes.  If one has a text composed of LTR and RTL paragraphs, one has
> > to choose how far apart their starting margins are.  I think that
> > could get complicated for plain text if the terminal has unbounded
> > width.  
> 
> But no real-life terminal does.  The width is always bounded.

The Emacs terminal (M-x term) seems to be a reasonable approximation,
with the scroll-left and scroll-right commands changing the margins'
separations.  This is an example of a terminal that has lines with
left-to-right character paths and lines with right-to-left
character paths.  (Such lines are necessarily separated by blank
lines.)  Geometrically, column positions on left-to-right and
right-to-left character paths are incomparable - resizing the window
and scrolling move them differently.

Richard.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Egmont Koblinger via Unicode
> > Yes.  If one has a text composed of LTR and RTL paragraphs, one has to
> > choose how far apart their starting margins are.  I think that could
> > get complicated for plain text if the terminal has unbounded width.
>
> But no real-life terminal does.  The width is always bounded.

Allegedly the no longer maintained FinalTerm, and maybe another one or
two not so popular terminal emulators experimented with this.

VTE and a few other emulators have also received such a feature
request; VTE has rejected it. See
https://bugzilla.gnome.org/show_bug.cgi?id=769440 if you're curious.

Indeed BiDi becomes problematic in the sense that Richard pointed out:
how far should the starting margins be from each other? Since terminal
emulators reject the idea of unbounded width, this is not a problem
for them.

It might still be a problem for BiDi aware text viewers/editors,
though. I mean one possible, obvious approach could be to adjust them
according to the terminal's width. Another is to take it from the
file's contents (e.g. longest line). But maybe there's demand for
other options, e.g. to have those margins 80 characters away from each
other even when the file is viewed on a mobile phone where the
viewport is narrower and the user wishes to scroll horizontally. This
is up for text viewers/editors to decide.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Egmont Koblinger via Unicode
Hi Richard,

> That split is wrong if you want the non-HTML text to lay out reasonably
> well in anything but a higher order protocol forcing RTL.  You need to
> split it as:
>
> lorem ipsum ABC
> <[ DEF foobar

Okay, so you should use LRMs or other similar tricks when wrapping a
human-perceived paragraph of text.

I take it as:

- The expected definition of "paragraph", for the technical sake of
running the BiDi algorithm, is lines of the text file (that is,
between a newline and the next one).

- On top of this technical definition, the document is crafted so that
lines are not longer than a certain threshold, and the human-perceived
paragraphs are usually delimited by empty lines (sometimes by other
means, like bullets of a list).

Sounds like a reasonable approach to me, probably the best to have.
And, by the way, it aligns with my BiDi proposal if the higher level
protocol (escape sequences) sets the paragraph direction correctly and
disables autodetection.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Egmont Koblinger via Unicode
Hi Richard,

> The concept appears to exist in the form of the fields of the
> fifth edition of ECMA-48.  Have you digested this ambitious standard?

To be honest: No, I haven't. And I have no idea what those "fields" are.

I spent (read: wasted) way too much time studying ECMA TR/53 to get to
understand what it's talking about, to realize that the good parts
were already obvious to me, and to be able to argue why I firmly
believe that the bad parts are bad. Remember: These documents were
created in 1991, that is, 28 years ago. (I'm emphasizing it because I
did the math wrong for a long time, I thought it was 18 years ago :-D.)
Things have changed a lot since then.

As for the BiDi docs, I found that the current state of the art,
current best practices, and the existing BiDi algorithm differ so much from
ECMA's approach (which no one I'm aware of cared to implement for 28
years) that the standard is of pretty little use. Only a few good
parts could be kept (but needed tiny corrections), and plenty of other
things needed to be built up anew. This is the only reasonable way to
move forward.

If you designed a house 2 or 3 years ago, and finally have the money
to get it built, you can reasonably start building it. If you designed
a house 28 years ago and finally have the chance to build it
(including the exact same heating technologies, electrical system
etc.), you wouldn't, would you? I'm sure you'd look at those plans,
and at the very least heavily update them, or start to
design a brand new one, perhaps somewhat based on your old ideas.

I don't expect it to be any different with "fields" of ECMA-48. I'm
not aware of any terminal emulator implementing anything like them,
whatever they are. Probably there's a good reason for that. Whatever
purpose they aimed to serve apparently wasn't important enough for
such a long time. By now, if they're found important, they should
probably be solved by some new design (or at the very least, just like
I did with TR/53, the work should begin by evaluating that standard to
see if it's still feasible).

Instead of spending a huge amount of work on my BiDi proposal, I could
have just said: "guys, let's go with ECMA for BiDi handling". The
thing is, I'm pretty sure it wouldn't have taken us anywhere. I don't
expect it to be different with "fields" either.

The starting point for my work was the current state of terminal
emulators and the surrounding ecosystem, plus the current BiDi
algorithm; not some ancient plan that was buried deep in some drawer
for almost three decades. I hope this makes sense.

That being said, I'd really, honestly love to see if someone evaluated
ECMA's "fields" and created a feasibility study for current terminal
emulators, similarly to how I did it with TR/53.


cheers,
egmont


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Eli Zaretskii via Unicode
> Date: Mon, 4 Feb 2019 19:45:13 +
> From: Richard Wordingham via Unicode 
> 
> Yes.  If one has a text composed of LTR and RTL paragraphs, one has to
> choose how far apart their starting margins are.  I think that could
> get complicated for plain text if the terminal has unbounded width.

But no real-life terminal does.  The width is always bounded.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Richard Wordingham via Unicode
On Mon, 04 Feb 2019 18:53:22 +0200
Eli Zaretskii via Unicode  wrote:

> Date: Mon, 4 Feb 2019 01:19:21 +
> From: Richard Wordingham via Unicode 

>> If you look at it in Notepad, all
>> lines will be LTR or all lines will be RTL.  
 
> That's because Notepad implements _only_ the higher-level protocol for
> base paragraph direction: there's no way to make Notepad determine the
> direction by looking at the text.

Yes.  If one has a text composed of LTR and RTL paragraphs, one has to
choose how far apart their starting margins are.  I think that could
get complicated for plain text if the terminal has unbounded width.

Richard.


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Eli Zaretskii via Unicode
> Date: Mon, 4 Feb 2019 01:19:21 +
> From: Richard Wordingham via Unicode 
> 
> On Sun, 03 Feb 2019 19:50:50 +0200
> Eli Zaretskii via Unicode  wrote:
> 
> > Do you see how this is carefully formatted to avoid overflowing an
> > 80-column line of a typical terminal?  Now suppose this is translated
> > into a RTL language, which causes the Copyright line to start with a
> > strong R letter (because "Copyright" is translated).  You will see the
> > first line flushed to the right margin, then the next line flushed to
> > the left margin (because it's a separate paragraph, and starts with a
> > strong L letter).  Then the line which says "The default action..."
> > will again start at the right.  And so on and so forth -- the result
> > is extremely ugly.
> 
> Depending on the environment.  If you look at it in Notepad, all lines
> will be LTR or all lines will be RTL.

That's because Notepad implements _only_ the higher-level protocol for
base paragraph direction: there's no way to make Notepad determine the
direction by looking at the text.

> Would not a careful translator either ensure that each non-blank
> line had a strong character and that all first strong characters
> were (a) L, (b) R or (c) AL?

This is very hard in practice, and is a tremendous annoyance when
translating message catalogs to RTL languages.  Translation is a hard
enough job even without this complication.
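
For reference, the condition in the quoted question can be written down
as a mechanical check. This is a minimal Python sketch with made-up
function names, using a plain first-strong scan that ignores explicit
bidi formatting controls; the sample string is invented for illustration.

```python
import unicodedata

STRONG = {"L", "R", "AL"}


def first_strong(line: str):
    """Return the bidi class of the first strong character, or None."""
    for ch in line:
        cls = unicodedata.bidirectional(ch)
        if cls in STRONG:
            return cls
    return None


def check_message(message: str):
    """Check that all non-blank lines share one first-strong class.

    Returns (ok, per_line), where per_line lists (line number, class).
    """
    per_line = [
        (number, first_strong(line))
        for number, line in enumerate(message.split("\n"), 1)
        if line.strip()
    ]
    kinds = {cls for _, cls in per_line}
    ok = len(kinds) <= 1 and None not in kinds
    return ok, per_line


# Invented two-line message: a Hebrew line followed by a Latin line.
sample = "\u05d6\u05db\u05d5\u05d9\u05d5\u05ea 2019\nGNU Emacs\n"
print(check_message(sample))
```

Such a check only reports the mismatch; fixing each offending line is
still left to the translator.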


Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-04 Thread Eli Zaretskii via Unicode
> From: Egmont Koblinger 
> Date: Mon, 4 Feb 2019 00:36:23 +0100
> Cc: unicode@unicode.org
> 
> The Unicode BiDi algorithm states that it operates on paragraphs of
> text, and leaves it up to a higher protocol to define what a paragraph
> exactly is.
> 
> What's the definition of "paragraph" in the context of plain text files?
> 
> I don't think there's a single well-established practice.

Actually, UAX#9 defines "paragraph" as the chunk of text delimited by
paragraph separator characters.  This means characters whose bidi
category is B, which includes Newline, the CR-LF pair on Windows,
U+0085 NEL, and U+2029 PARAGRAPH SEPARATOR.

> In some, e.g. in Emacs's TUTORIAL.he, or markdown files, it's way
> more complicated, probably there isn't a well-defined grammar for
> how exactly bullet list entries and alike should become new
> paragraphs.

Actually, Emacs implements the rule that paragraphs are separated by
empty lines.  This is documented in the Emacs manuals.  (That's by
default, users and Lisp programs can control that to some extent.)
This rule is global, and applied to any file or buffer, including
TUTORIAL.he.

> lorem ipsum FED ]> CBA foobar
> 
> The visual representation, in a narrower viewport, might wrap for
> example like this:
> 
> lorem ipsum CBA
> FED ]> foobar

I suggest leaving line wrapping alone for the moment: it is a further
complication.  Let's first talk about text whose every line ends in a
hard newline -- this is what you see in most "simple" text-mode
utilities which we are talking about.  If/when we solve the problems
there, we can then look at the issues with wrapping.

> Here comes the twist. Let's view this latter file with a viewer that
> uses a _different_ definition for paragraph. Let's view it in Gedit,
> Emacs, or the work-in-progress BiDi-aware VTE by "cat"ing it, where
> every newline begins a new paragraph – that's how these viewers define
> the notion of "paragraph" for the sake of BiDi.
> 
> The visual layout in these viewers becomes:
> 
> lorem ipsum CBA
> <[ FED foobar
> 
> which is just not correct. Since here BiDi is run on the two lines
> separately, the initial "<[" is treated as LTR, placed at the wrong
> location in the wrong order, and the glyphs aren't mirrored.

This kind of problem happens all the time, and you cannot avoid it.
Different programs display bidi text differently.  I propose not to
try to solve this problem, because IME it cannot be solved in general.
Let's focus on the terminal emulators that should comply with your
guidelines, and let's try to decide what they should do about base
paragraph direction of text emitted by "simple" text utilities.
If they all make decisions by the same rule, they all will show the
same text identically.

> Now, Emacs ships a TUTORIAL.he which, for most of its contents (but
> not everywhere) seems to treat runs between empty lines as paragraphs,

Correct.

> while Emacs itself is a viewer that treats runs between single
> newlines as paragraphs. That is, Emacs is inconsistent with itself.

Incorrect.  Emacs always treats a run of text between empty lines as a
single paragraph, in TUTORIAL.he and everywhere else.  There's nothing
special about TUTORIAL.he, it is just a plain text file with a few
dozen of bidi formatting controls, needed to show the key sequences
with weak and neutral characters in correct visual order.  (Some of
those controls can probably be removed nowadays, since we now have the
BPA of Unicode 6.3 -- the file was written before Unicode 6.3 was
released.)  In fact, I wrote that tutorial as an exercise, to prove to
myself that Emacs can be useful for editing non-trivial bidi text.

> In case you think I got something wrong with Emacs: Could you please
> give exact definitions:
> - What are the exact units (so-called "paragraphs" by UAX9) that it
> runs BiDi on when it loads and displays a file?

See above: for the purpose of the Emacs UBA implementation, paragraphs
are separated by empty lines.  That is the only rule in Emacs
regarding paragraph determination.

> - What are the exact units (so-called "paragraphs" by UAX9) in
> TUTORIAL.he on which BiDi needs to be run in order to get the desired
> readable version?

The same.  There's nothing special about that file.

> What most likely happens is that in order to see a difference, you'd
> need to have more special symbols, or at least a more special
> constellation of them. Probably TUTORIAL.he is just luckily simple
> enough that such a difference isn't hit.

No, TUTORIAL.he is neither "lucky" nor "simple".  I deliberately used
there almost every bidi formatting control there is, where
appropriate, to make sure this stuff works as intended in an otherwise
plain text file.

> Another possibility is (and I cannot check because I can't speak
> Hebrew) that somewhere TUTORIAL.he "cheats" with the logical order to
> get the desired visual one.

There's no cheating there, I assure you.

> This definition of paragraph (stuff between a newline and 

Re: Bidi paragraph direction in terminal emulators (was: Proposal for BiDi in terminal emulators)

2019-02-03 Thread Richard Wordingham via Unicode
On Mon, 4 Feb 2019 00:36:23 +0100
Egmont Koblinger via Unicode  wrote:

> Now, back to terminals.
> 
> The smallest possible viable definition of a "paragraph" in terminal
> emulators is stuff between one newline and the next one.
> 
> It would require a hell lot of work, redesigning (overcomplicating)
> plenty of basics of terminal emulation to be able to come up with
> smaller units, e.g. cells of a table – a concept that doesn't
> currently exist in this world –, I don't find any such approach
> feasible at all.

The concept appears to exist in the form of the fields of the
fifth edition of ECMA-48.  Have you digested this ambitious standard?
ECMA-48 has the concept of hyphenation and wrapping! (Well, in Appendix
C it does.  I haven't fully tied it in with the receipt of characters.)

Richard.


