#2 is very near what you get today by looking at source code in an environment that attempts to run the Unicode BiDi algorithm on your source, for example a browser. In my experience, this can be very confusing.

If I'm understanding your proposal correctly, in #2 you still want to fix the crazy cases of flipped capture parentheses etc., but that implies your editor must understand regular expressions. At best it will do so when something is a regexp literal, but it will fail when the programmer is constructing a regexp piecemeal from strings. Indeed, your solution doesn't attempt to handle simple strings, so the user can get confused by inconsistency issues: why $foo = "SHALOM" but $foo = qr/MOLAHS/? Also, there's the tokenization problem you already mention: NATBA"G would come out very badly from this transformation.

So if these are the two options, I'd vote for #1.

(I'd also recommend representing invisible override characters as entities in source code: if the actual data has RLM/LRM/etc. marks, it's better, where possible, not to have those as literals in the source. That's a recommendation for the end programmer, not the editor, I think.)

Perhaps one day we can have an alternate surface syntax for strings and regular expressions that is designed to be RTL friendly, with Hebrew or Arabic metacharacters introduced by slashes instead of backslashes. This would be used in conjunction with an editor hint that makes the whole expression RTL (in Unicode terms, sets the paragraph directionality). If used on whole lines of code, it could also change the alignment to be right-justified. Perl 6 has many quotelike operators, with room for extensions like this.

The problem with these things is that they're hard to get right, and pesky problems will still appear no matter how hard you try if the data really does contain mixed-directionality characters. We can ask Larry for his opinion, though.
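
To make the two display options concrete, here is a rough simulation of them, using the convention from the quoted message that capitals stand for RTL letters. It is only a sketch under loud assumptions: display_run is a crude stand-in for the Unicode BiDi algorithm (it just reverses each run of capitals, ignoring digits, neutrals, and explicit marks), the SPECIAL set is an illustrative guess at "special" characters rather than a real regexp grammar, and all function names are made up for this example.

```python
import re

# Characters treated as regexp "special" characters in this sketch.
# This is an assumption for illustration, not a complete list for any dialect.
SPECIAL = set(r"\.^$*+?()[]{}|/")

def display_run(token):
    """Crude stand-in for BiDi reordering in an LTR paragraph:
    reverse each maximal run of capitals (our pretend-RTL letters)."""
    return re.sub(r"[A-Z]+", lambda m: m.group(0)[::-1], token)

def tokenize(pattern):
    """Split into maximal runs of ordinary characters and single special characters."""
    tokens, run = [], ""
    for ch in pattern:
        if ch in SPECIAL:
            if run:
                tokens.append(run)
                run = ""
            tokens.append(ch)
        else:
            run += ch
    if run:
        tokens.append(run)
    return tokens

def display_option_1(pattern):
    """Option (1): every pattern character is its own atom, so force LTR."""
    return pattern

def display_option_2(pattern):
    """Option (2): reorder inside each literal run, then concatenate
    tokens and separators strictly left to right."""
    return "".join(display_run(t) for t in tokenize(pattern))

for pat in ("/SHALOM/", "/YADAII?M/"):
    print(pat, "->", display_option_1(pat), "|", display_option_2(pat))
```

Under option (2) the quantifier in /YADAII?M/ ends up visually attached to a different letter than the one it applies to, which is the confusing case from the quoted message. A plain string literal is not tokenized by this scheme at all, which is where the "SHALOM" versus qr/MOLAHS/ inconsistency comes from.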

On Sun, Feb 1, 2009 at 5:08 AM, Amit Aronovitch <[email protected]> wrote:
> Hi,
>
> Following a discussion I took part in about standardization of the
> display of Hebrew text in structured expressions and source code, I
> would be happy to hear some opinions about how we would like regular
> expressions containing bidi chars to be displayed (in an "ideal
> editor" that is fully syntax aware).
>
> In the examples below, caps represent RTL characters and lowercase LTR chars.
>
> The basic principle that was proposed (for structured expressions) is
> that text should be split into "separators" and "tokens" according to
> the relevant syntax, the general-purpose Bidi rules be applied within
> each token only, and then tokens and separators should be concatenated
> left to right always.
>
> Applied to regular expressions, I thought that since in RE each
> pattern character is an atom (1), then this effectively means to force
> LTR everywhere (except maybe stuff like named captures
> (?<NAME>...) etc.).
> However, it was suggested instead that any sequence of pattern
> characters (not containing "special" characters) should be treated as
> a token (2).
> This would make simple searches easier to read,
>
> e.g. /SHALOM/ would be displayed /SHALOM/ by (1),
> but /MOLAHS/ (much more readable if it was actual Hebrew) by (2).
> On the other hand, /YADAII?M/ in (2) would show as /IIADAY?M/ ,
> which is very confusing, so I thought the simplification was not worth it.
>
> However, I am used to languages where simple searches are commonly
> done by other means, whereas in Perl using RE for simple text search
> might be more common because of the specialized syntax. What do you
> think?
>
> Amit
> _______________________________________________
> Perl mailing list
> [email protected]
> http://perl.org.il/mailman/listinfo/perl
>

-- 
Gaal Yahas <[email protected]>
http://gaal.livejournal.com/
