Re: The .bytes/.codepoints/.graphemes methods

2004-07-13 Thread David Green
In article [EMAIL PROTECTED], [EMAIL PROTECTED] (Larry Wall) wrote: On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote: : :u0 # use bytes (. is byte) : :u1 # level 1 support (. is codepoint) : :u2 # level 2 support (. is

RE: The .bytes/.codepoints/.graphemes methods

2004-07-12 Thread Austin Hastings
-Original Message- From: Jonadab the Unsightly One [mailto:[EMAIL PROTECTED] Austin Hastings [EMAIL PROTECTED] writes: I think this is something that we'll want as a mode, a la case-insensitivity. Think of it as mark insensitivity. Makes sense to me, but... Maybe it can just

Re: The .bytes/.codepoints/.graphemes methods

2004-07-12 Thread Jonadab the Unsightly One
Luke Palmer [EMAIL PROTECTED] writes: Or, god forbid, a word? m:base/que mas/ We're not mathematicians: we're allowed to use more than one letter in a row to designate something :-) Well, if it were *me*, *I* would have voted for keeping the core language 100% pure ASCII, untainted by

Re: The .bytes/.codepoints/.graphemes methods

2004-07-10 Thread Jonadab the Unsightly One
Austin Hastings [EMAIL PROTECTED] writes: I think this is something that we'll want as a mode, a la case-insensitivity. Think of it as mark insensitivity. Makes sense to me, but... Maybe it can just roll into :i? It will probably get used in _conjunction_ with case-insensitivity quite a

Re: The .bytes/.codepoints/.graphemes methods

2004-07-10 Thread Luke Palmer
Jonadab the Unsightly One writes: Austin Hastings [EMAIL PROTECTED] writes: I think this is something that we'll want as a mode, a la case-insensitivity. Think of it as mark insensitivity. Makes sense to me, but... Maybe it can just roll into :i? It will probably get used in

Re: The .bytes/.codepoints/.graphemes methods

2004-07-08 Thread Austin Hastings
--- Larry Wall [EMAIL PROTECTED] wrote: On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote: : Or was that to imply that a literal a in the RE would be : interpreted as a grapheme a when :u2 is active? I don't know what you mean by grapheme a there. If you mean, Does it

Re: The .bytes/.codepoints/.graphemes methods

2004-07-07 Thread Larry Wall
On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote: : On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote: : This has no direct bearing on p6l, since performance is a p6i issue. : But perhaps in the interests of performance as well as hackery we : should explicitly

Re: The .bytes/.codepoints/.graphemes methods

2004-07-07 Thread Larry Wall
On Wed, Jul 07, 2004 at 08:09:51PM -0700, Larry Wall wrote: : On Tue, Jun 29, 2004 at 10:52:34AM -0500, Jonathan Scott Duff wrote: : : On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote: : : This has no direct bearing on p6l, since performance is a p6i issue. : : But perhaps in the

Re: The .bytes/.codepoints/.graphemes methods

2004-07-03 Thread Brent 'Dax' Royal-Gordon
Aaron Sherman wrote: On Tue, 2004-06-29 at 11:34, Austin Hastings wrote: (2) Perl6 should equitably support all its target locales; (3) we should set out to make sure the performance is damn fast no matter what locale we're using. Well, that's a nice theory, but you can prove that low-level

Re: The .bytes/.codepoints/.graphemes methods

2004-07-02 Thread Aaron Sherman
On Tue, 2004-06-29 at 11:34, Austin Hastings wrote: [...] when you switch to LC_ALL= pick your favorite language, you just get really slow performance: Apparently the 'C' locale is such a totally special case that the performance of LC_ALL=C is one or more orders of magnitude better than

Re: The .bytes/.codepoints/.graphemes methods

2004-07-01 Thread Matt Diephouse
Larry Wall wrote: On Sat, Jun 26, 2004 at 12:27:38PM -0700, Brent 'Dax' Royal-Gordon wrote: : Issues: : * Limits lvalue substr (doesn't allow it to be a different size) : unless splice is used (or a substr method is also provided). That all has to be looked at anyway. What does 5 mean when

Re: The .bytes/.codepoints/.graphemes methods

2004-07-01 Thread Juerd
Matt Diephouse skribis 2004-06-30 20:51 (-0400):
    my $string = "Hello, World!";
    say $string[0..4];          # prints "Hello\n"
    $string[7...] = "Larry!";
    say $string;                # prints "Hello, Larry!\n"
And that array is one of bytes? graphemes? In general, I like the idea. In [EMAIL PROTECTED], almost the same was

Re: The .bytes/.codepoints/.graphemes methods

2004-07-01 Thread Matt Diephouse
Juerd wrote: Matt Diephouse skribis 2004-06-30 20:51 (-0400):
    my $string = "Hello, World!";
    say $string[0..4];          # prints "Hello\n"
    $string[7...] = "Larry!";
    say $string;                # prints "Hello, Larry!\n"
And that array is one of bytes? graphemes? I'm not really up on my unicode, but I think .chars is what I have

Re: The .bytes/.codepoints/.graphemes methods

2004-07-01 Thread John Williams
On Thu, 1 Jul 2004, Juerd wrote: Matt Diephouse skribis 2004-06-30 20:51 (-0400):
    my $string = "Hello, World!";
    say $string[0..4];          # prints "Hello\n"
    $string[7...] = "Larry!";
    say $string;                # prints "Hello, Larry!\n"
And that array is one of bytes? graphemes? In general, I like the idea.

Re: The .bytes/.codepoints/.graphemes methods

2004-06-29 Thread Jonadab the Unsightly One
Dan Sugalski [EMAIL PROTECTED] writes: Hmm. Suppose that I have a system that is friendly to 80 byte records. I want to output meaningful strings, so I want to partition a buffer into 80-ish byte substrings, but preserve any graphemes (i.e., store the data in a legible format). How would I

Re: The .bytes/.codepoints/.graphemes methods

2004-06-29 Thread Jonadab the Unsightly One
Austin Hastings [EMAIL PROTECTED] writes: A couple of alternatives: substr.bytes($string, 2, 4) = $substitute; Well, that's arguably better than bsubstr. substr($string.bytes, 2, 4) = $substitute; I could live with that, although it doesn't allow mixing units. (Someone will pop in here

Re: The .bytes/.codepoints/.graphemes methods

2004-06-29 Thread Austin Hastings
--- Jonadab the Unsightly One [EMAIL PROTECTED] wrote: Have the implications of the bytes/codepoints/graphemes/woohickies distinction for the regular expression engine been discussed already? Not enough. One of my current clients just rolled on to redhat 9, and what a steaming pile of

Re: The .bytes/.codepoints/.graphemes methods

2004-06-29 Thread Jonadab the Unsightly One
Juerd [EMAIL PROTECTED] writes: substr($string, 2 but graphemes, 4 but bytes); I think but even makes sense, if substr defaults to something. That could be combined with a smart substr that only needs the units once (err, only needs a position object for one of the args) and knows how to

Re: The .bytes/.codepoints/.graphemes methods

2004-06-29 Thread Jonathan Scott Duff
On Tue, Jun 29, 2004 at 08:34:16AM -0700, Austin Hastings wrote: This has no direct bearing on p6l, since performance is a p6i issue. But perhaps in the interests of performance as well as hackery we should explicitly provide some sort of variant regex behavior: /a./ :bytes /a./

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Jonadab the Unsightly One
Larry Wall [EMAIL PROTECTED] writes: That all has to be looked at anyway. What does 5 mean when you pass it to substr, anyway? I was just going to ask about substrings, and then didn't because I figured that had been hashed out already and I'd missed it... (I've been trying to make it

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Larry Wall
On Mon, Jun 28, 2004 at 11:26:32AM -0400, Jonadab the Unsightly One wrote: : You could coin the abbreviation ligs, for Language Independent : Graphemes. Then some ingenious rascal can create a pragma or whatever : that allows $str.b, $str.c, $str.g, and $str.l for fans of terseness. Except

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Dave Whipp
Jonadab The Unsightly One [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] It would be possible to have right-associative operators (that bind at least more tightly than comma and possibly very tightly) and convert a number to one of these objects, so that we can do stuff like this:

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Dan Sugalski
On Mon, 28 Jun 2004, Larry Wall wrote: On Mon, Jun 28, 2004 at 11:26:32AM -0400, Jonadab the Unsightly One wrote: : You could coin the abbreviation ligs, for Language Independent : Graphemes. Then some ingenious rascal can create a pragma or whatever : that allows $str.b, $str.c, $str.g, and

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Juerd
Dave Whipp skribis 2004-06-28 9:55 (-0700): substr($string, 2 bytes, 4 bytes) = $substitute; substr($string, 2, 4 :bytes) substr($string, 2 but graphemes, 4 but bytes); I think but even makes sense, if substr defaults to something. Juerd

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Dan Sugalski
On Mon, 28 Jun 2004, Juerd wrote: Dave Whipp skribis 2004-06-28 9:55 (-0700): substr($string, 2 bytes, 4 bytes) = $substitute; substr($string, 2, 4 :bytes) substr($string, 2 but graphemes, 4 but bytes); I think but even makes sense, if substr defaults to something. I think mixing

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Austin Hastings
--- Dan Sugalski [EMAIL PROTECTED] wrote: On Mon, 28 Jun 2004, Juerd wrote: Dave Whipp skribis 2004-06-28 9:55 (-0700): substr($string, 2 bytes, 4 bytes) = $substitute; substr($string, 2, 4 :bytes) substr($string, 2 but graphemes, 4 but bytes); I think but even makes sense, if

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Dan Sugalski
On Mon, 28 Jun 2004, Austin Hastings wrote: --- Dan Sugalski [EMAIL PROTECTED] wrote: On Mon, 28 Jun 2004, Juerd wrote: Dave Whipp skribis 2004-06-28 9:55 (-0700): substr($string, 2 bytes, 4 bytes) = $substitute; substr($string, 2, 4 :bytes) substr($string, 2 but

Re: The .bytes/.codepoints/.graphemes methods

2004-06-28 Thread Austin Hastings
--- Jonadab the Unsightly One [EMAIL PROTECTED] wrote: Larry Wall [EMAIL PROTECTED] writes: (I've been trying to make it assume some implicit unit based on the current lexical scope's Unicode level, but issues remain.) We have magical string positions that have different numeric values

The .bytes/.codepoints/.graphemes methods

2004-06-26 Thread Brent 'Dax' Royal-Gordon
As currently designed, the String::bytes, String::codepoints, and String::graphemes methods return the number of bytes, codepoints, and graphemes, respectively, in the string they were called on. I would like to suggest that, when called in list context, these methods return an array of
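For readers following the thread, the distinction between the three units can be sketched outside Perl 6. This is a rough Python illustration, not the proposed String API: the function names are hypothetical, UTF-8 is an assumed encoding, and the grapheme grouping is a simplification that only attaches combining marks to the preceding character rather than implementing full Unicode grapheme-cluster segmentation.

```python
import unicodedata


def byte_units(s, encoding="utf-8"):
    """Return the string as a list of byte values (encoding is an assumption)."""
    return list(s.encode(encoding))


def codepoint_units(s):
    """Return the string as a list of single-codepoint strings."""
    return list(s)


def grapheme_units(s):
    """Approximate graphemes: attach each combining mark to the preceding base.

    Simplified stand-in for real grapheme-cluster segmentation.
    """
    clusters = []
    for ch in s:
        if clusters and unicodedata.combining(ch):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# "e" followed by COMBINING ACUTE ACCENT (U+0301), then "a"
sample = "e\u0301a"
counts = (
    len(byte_units(sample)),
    len(codepoint_units(sample)),
    len(grapheme_units(sample)),
)
print(counts)
```

The same string yields a different length at each level, which is why the thread keeps returning to the question of what a bare number such as 5 means when passed to substr.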