Re: [9fans] Octets regexp

2013-05-03 Thread tlaronde
On Thu, May 02, 2013 at 04:17:11PM -0400, 9p...@imu.li wrote:
 
 if we're talking about xd, i'll suggest 'tcs -f 8859-1' again in which case:
 

My question was _not_ related to text, and _not_ related to French, i.e.
8859-1. I know how to deal with this.

-- 
Thierry Laronde tlaronde +AT+ polynum +dot+ com
  http://www.kergis.com/
Key fingerprint = 0FF7 E906 FBAF FE95 FD89  250D 52B1 AE95 6006 F40C



Re: [9fans] Octets regexp

2013-05-03 Thread Tristan
  if we're talking about xd, i'll suggest 'tcs -f 8859-1' again in which case:

 My question was _not_ related to text, and _not_ related to French, i.e.
 8859-1. I know how to deal with this.

tcs -f 8859-1

will take your _binary_ files, and replace the bytes 0x80-0xff with the
unicode points U0080-U00ff, so you can use the standard regexps and tools
on them. and just convert back afterwards.

maybe it's not meant to be used that way, but it _works_. try it.
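
The round trip described above can be sketched in Python, standing in for tcs. This is only an illustration: `sub_bytes` and the sample blob are made-up names, and Python's "latin-1" codec is assumed to play the role of `tcs -f 8859-1` (it maps every byte 0x00-0xff to the code point of the same value).

```python
import re

def sub_bytes(pattern, repl, data):
    """Apply an ordinary text regexp to arbitrary bytes via a Latin-1 round trip."""
    # Latin-1 maps each byte to the code point with the same value,
    # so decoding never fails and loses nothing.
    text = data.decode("latin-1")
    result = re.sub(pattern, repl, text)
    # Converting back is exact as long as the replacement stays in U+0000-U+00FF.
    return result.encode("latin-1")

# Hypothetical binary data containing bytes that are not valid UTF-8.
blob = b"\x00\xfe\xff\x80junk\xfe\xff\x00"

# A character range over what were originally the bytes 0xfe-0xff.
patched = sub_bytes("[\u00fe\u00ff]+", "!", blob)
```

Ranges and '.' then behave per original byte, since each byte has become exactly one character; the cost is the conversion in both directions.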

have fun!
tristan

-- 
All original matter is hereby placed immediately under the public domain.



Re: [9fans] Octets regexp

2013-05-03 Thread tlaronde
On Fri, May 03, 2013 at 09:15:27AM -0400, Tristan wrote:
 
 tcs -f 8859-1
 
 will take your _binary_ files, and replace the bytes 0x80-0xff with the
 unicode points U0080-U00ff, so you can use the standard regexps and tools
 on them. and just convert back afterwards.
 

OK, mea culpa... since I'm French, I focused on the Latin-1 part, thinking
this had something to do with my language and the custom of dealing with
Latin-1 on other systems.

I guess I could create a keyboard that produces not UTF-8 but raw bytes,
so as to have a means to input bytes (without resorting to printf or
whatever). The problem of rendering remains (one could create a
special font that displays octal, hexadecimal or whatever by playing
with the index of the glyphs; but this would work only for octets, would
be more difficult if one wants to deal with wydes, and impossible with
tetras and octas).




Re: [9fans] Octets regexp

2013-05-02 Thread erik quanstrom
 Regexp(6) handles characters that are runes.

perhaps the man page is misleading.  rune in this context means utf-8.
see regexp(2).  all the functions take char*s.

 I wonder if Plan9 developers, when trying to design a way towards some
 localization, have ever thought of bytes (octets) regexp, that is using
 regexp with not rune but octets strings (maybe UTF-8 as is) allowing to
 use regexp with binary too, not only newline terminated chunks etc.?

one of the points of plan 9 was to standardize on one character set,
utf-8.  imho, localization and character set aren't related unless one
is dealing with 8859-x overlays or some other character set insufficient
to represent the range of languages.

however, sam and acme allow for structured regular expressions,
and are generally not line oriented:

http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf

and iirc, cinap has written a cifs bit that uses a bit of binary matching.

- erik



Re: [9fans] Octets regexp

2013-05-02 Thread tlaronde
On Thu, May 02, 2013 at 08:48:06AM -0400, erik quanstrom wrote:
  Regexp(6) handles characters that are runes.
 
 perhaps the man page is misleading.  rune in this context means utf-8.
 see regexp(2).  all the functions take char*s.

But the source files deal with runes...

 
 one of the points of plan 9 was to standardize on one character set,
 utf-8.  imho, localization and character set aren't related unless one
 is dealing with 8859-x overlays or some other character set insufficient
 to represent the range of languages.
 

Localization (as handled in POSIX, for example) is a mess. So the Plan9
solution, which still uses octets (UTF-8), makes far more sense, since it
lets the user extend the set of characters that can be used in
naming computer objects; but this is just for nicknames: the system
still speaks C/9P.

So it is better, except perhaps for one thing: for me, the system
speaks C or even, obviously, Plan9 (well: 9P). It does not have
to speak French, Hebrew, etc. or even English! So it takes or gives
bytes, and this is good.  UTF-8 encoding is the main convention
for the user interface, but can it be unset? I mean, can one use a
raw window, entering uninterpreted bytes and rendering bytes (with
a special font having either ASCII + 0xdd glyphs or whatever,
using fonts to do what is done with vis(1) on Unices or od(1)/xd(1)),
without imposing the assumption that the octet string is UTF-8? Can
one make a file by entering bytes, i.e. binary values that yield
incorrect UTF-8 sequences?

This is a remark made to me by a developer who can use, when
needed, regexps (ed(1) or sed(1)) on a Unix where they still deal
with chars (bytes) to search for a string of bytes in a binary.

And after some thought, I don't see an obvious reason why regexps
could not be used with byte strings (so UTF-8 is OK) without trying to
match runes (since not every byte string is a correct UTF-8 sequence).
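
As a rough illustration of matching on octets rather than runes, here is a sketch using Python's `re` module with bytes patterns (not Plan 9's libregexp; the blob and the "marker" format are invented for the example). The data is not valid UTF-8, yet the byte-oriented match works without any decoding:

```python
import re

# Hypothetical binary blob: 0xff and 0xc0 never occur in well-formed UTF-8.
blob = b"\xff\xc0HDR\x00\x01\x02payload\xff\xc0END"

def is_valid_utf8(data):
    """Return True if the bytes decode as UTF-8, False otherwise."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# A bytes pattern matches octets directly; nothing is interpreted as a rune.
marker = re.compile(rb"\xff\xc0([A-Z]{3})")
tags = [m.group(1) for m in marker.finditer(blob)]
offsets = [m.start() for m in marker.finditer(blob)]
```

Here `tags` holds the three-letter names following each marker and `offsets` their byte positions in the blob.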

Corollary: I don't know if there is a UTF-8 sequence that can tell:
stop interpreting as UTF-8, take as is (except every incorrect
sequence, the problem being to come back from there: if everything is OK as
is, what can be interpreted as: stop raw, restart
UTF-8? Solution: this is at user level, not low level, and this is done in
the shell by explicitly delimiting chunks, like ' is the only delimiter,
and every embedded ' has to be escaped by doubling it).



Re: [9fans] Octets regexp

2013-05-02 Thread erik quanstrom
 This is a remark made to me by a developer who can use, when
 needed, regexps (ed(1) or sed(1)) on a Unix where they still deal
 with chars (bytes) to search for a string of bytes in a binary.

i have never needed to do this.  could you provide some motivation
for grepping for a weird byte in an executable?  surely the debugger
is better suited for this.

 And after some thought, I don't see an obvious reason why regexps
 could not be used with byte strings (so UTF-8 is OK) without trying to
 match runes (since not every byte string is a correct UTF-8 sequence).

because it makes things more complicated and probably worse for the
common case, while not providing any new functionality not already in
other tools.

 Corollary: I don't know if there is a UTF-8 sequence that can tell:
 stop interpreting as UTF-8, take as is (except every incorrect
 sequence, the problem being to come back from there: if everything is OK as
 is, what can be interpreted as: stop raw, restart
 UTF-8? Solution: this is at user level, not low level, and this is done in
 the shell by explicitly delimiting chunks, like ' is the only delimiter,
 and every embedded ' has to be escaped by doubling it).

i think you've missed the point of making utf-8 *the* character set.
it's not sometimes the character set.  or only on tuesday.  it's always
the character set.

- erik



Re: [9fans] Octets regexp

2013-05-02 Thread Tristan
putting a little more thought into your actual problem, use tcs:

tcs -f 8859-1

which (as i remember) will map 0x80-ff to U0080-00ff and you can use
normal utf8 regular expressions.

tristan




Re: [9fans] Octets regexp

2013-05-02 Thread tlaronde
On Thu, May 02, 2013 at 09:44:38AM -0400, erik quanstrom wrote:
  This is a remark made to me by a developer who can use, when
  needed, regexps (ed(1) or sed(1)) on a Unix where they still deal
  with chars (bytes) to search for a string of bytes in a binary.
 
 i have never needed to do this.  could you provide some motivation
 for grepping for a weird byte in an executable?  surely the debugger
 is better suited for this.
 

Because not everything is a program? Some of it is data. For example, the
TeX (or METAFONT etc.) predigested dumps are binary, but not programs.

  And after some thought, I don't see an obvious reason why regexps
  could not be used with byte strings (so UTF-8 is OK) without trying to
  match runes (since not every byte string is a correct UTF-8 sequence).
 
 because it makes things more complicated and probably worse for the
 common case, while not providing any new functionality not already in
 other tools.
 

Ah? I thought the purpose was to avoid duplicated tools... And I'm
not quite sure it would be more complicated for common cases, since the already
defined functions could be wrappers calling lower-level functions
that take the size of the entity (byte, wyde, tetra,
octa; while I'm at it: endianness too) or UTF-8.

 
 i think you've missed the point of making utf-8 *the* character set.
 it's not sometimes the character set.  or only on tuesday.  it's always
 the character set.
 
No: I have understood this. What I'm not totally sure about is this: the
system deals with octet strings (as it does), and this UTF-8, i.e.
Unicode, is on the user interface; but is there a means to not have the
interface interpret the strings as UTF-8? Because not everything is
text.




Re: [9fans] Octets regexp

2013-05-02 Thread tlaronde
On Thu, May 02, 2013 at 09:43:10AM -0400, Tristan wrote:
  And after some thought, I don't see an obvious reason why regexps
  could not be used with byte strings (so UTF-8 is OK) without trying to
  match runes (since not every byte string is a correct UTF-8 sequence).
 
 with octet based regexps, [Þþ] doesn't match þ, but 0xc3, 0xbe and 0x9e
 independently.
 

Regexp knows subexpressions. So it could be achieved, and one could even
have the present functions be higher level ones, calling more basic ones
dealing with bytes (a rune specified by a UTF-8 sequence being replaced
by a subexpression) or even dealing with various sizes of element
(character; but one fixed size for the processing).

Or even a specification à la C: by adding a leading 'L' meaning:
treat the string as UTF-8, that is, match runes. And if not, leave
it alone.
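
The difference between the two readings of [Þþ] can be sketched with Python bytes patterns (an illustration only, not how libregexp works; the variable names are invented). A naive octet class matches the individual bytes 0xc3, 0x9e and 0xbe, while rewriting each rune as a subexpression, here an alternation of whole byte sequences, matches the letter as a unit:

```python
import re

thorn_upper = "\u00de".encode("utf-8")   # UTF-8 bytes c3 9e
thorn_lower = "\u00fe".encode("utf-8")   # UTF-8 bytes c3 be

# Naive octet class: the UTF-8 bytes of both letters dropped into one bracket.
naive = re.compile(b"[" + thorn_upper + thorn_lower + b"]")

# Each rune rewritten as a subexpression: an alternation of byte sequences.
as_subexpr = re.compile(
    b"(?:" + re.escape(thorn_upper) + b"|" + re.escape(thorn_lower) + b")"
)

data = "a\u00feb".encode("utf-8")        # bytes 61 c3 be 62

naive_hits = naive.findall(data)         # two separate one-byte matches
subexpr_hits = as_subexpr.findall(data)  # one two-byte match
```

The naive class also matches a stray 0x9e on its own, which is never a complete character in UTF-8.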




Re: [9fans] Octets regexp

2013-05-02 Thread a
your exact problem still isn't clear to me, but certainly there've
been times when I wanted to search for some array of characters
in a binary blob. i don't believe i've needed anything beyond a
literal string of bytes, but i could imagine from there the utility of
something regexp-like.

i think the answer is just no, there's no way to do that today.
and i'd strongly advise keeping that tool as far away from any
discussion of localization or character sets or runes or the like.
there oughtn't be any mode switching or the like: it's utf-8
encoded unicode runes, or it's binary, not characters at all.

hex editors are useful sometimes. being able to do more
complicated searches/edits/replaces/whatever could be
similarly useful sometimes. but don't go anywhere near the
character set or localization discussions with it.

anthony




Re: [9fans] Octets regexp

2013-05-02 Thread Kurt H Maier
Why does this functionality have to be overloaded into existing tools
that are already in common use?  

khm



Re: [9fans] Octets regexp

2013-05-02 Thread erik quanstrom
 But that is exactly my point: to have localization far from regexp.
 Regexp taking simply a string of bytes and matching strings of bytes.

the plan 9 model is that all text is utf-8, with the exception of
internal encodings which may be Runes.

is your proposal 
- to change programs that take regular expressions to be exceptions to
the plan 9 text model, or
- to change the plan 9 text model
?

either way, i think the bar should be high to change the text model
for plan 9, and higher to make exceptions.

- erik



Re: [9fans] Octets regexp

2013-05-02 Thread tlaronde
On Thu, May 02, 2013 at 05:02:45PM +0200, Bence Fábián wrote:
 you want to change default behaviour and make the usual usecase special?
 

For the moment, I don't want to change anything; I'm trying to convince
myself where the border has to be: characters (for me, user
level) on the one side, octet strings on the other, system and
library, side (on a distributed system, it makes sense that filenames,
being user-level nicknames, be UTF-8, i.e. supposed to be UTF-8 without any
per-filename codepage or whatever).

The usual behavior could perfectly well stay the same (the
leading L was just an example; it could be reversed; and octet
matching could simply be provided by new functions, with new names, the
historical ones calling these new character-agnostic ones). The
problem is not there. The problem is: are regexps only useful with
text, implying characters, or more widely useful? My feeling is
that they are more generally useful.




Re: [9fans] Octets regexp

2013-05-02 Thread tlaronde
On Thu, May 02, 2013 at 11:10:34AM -0400, Kurt H Maier wrote:
 Why does this functionality have to be overloaded into existing tools
 that are already in common use?  
 

I'm speaking about libregexp, not about the use existing tools make
of it.



Re: [9fans] Octets regexp

2013-05-02 Thread erik quanstrom
 For the moment, I don't want to change anything; I'm trying to convince
 myself where the border has to be: characters (for me, user
 level) on the one side, octet strings on the other, system and
 library, side (on a distributed system, it makes sense that filenames,
 being user-level nicknames, be UTF-8, i.e. supposed to be UTF-8 without any
 per-filename codepage or whatever).

there is currently no such distinction between user and library.
this eliminates context.  one never is confronted with, oh, i can't
call that because that's a user function, not a library function.

- erik



Re: [9fans] Octets regexp

2013-05-02 Thread tlaronde
On Thu, May 02, 2013 at 11:19:38AM -0400, erik quanstrom wrote:
 
 the plan 9 model is that all text is utf-8, with the exception of
 internal encodings which may be Runes.
 
 is your proposal 
 - to change programs that take regular expressions to be exceptions to
 the plan 9 text model, or

No: to have a libregexp that is agnostic about any encoding. The tools can
stay the same for the user; libregexp would simply not be text based but
octet based.

 - to change the plan 9 text model

Neither. The text model is a user interface. My question is simply how
difficult it is to have an alternative, special purpose user interface
that does not have the UTF-8 filter for input from and output to the
user.




Re: [9fans] Octets regexp

2013-05-02 Thread tlaronde
On Thu, May 02, 2013 at 02:38:25PM +0200, tlaro...@polynum.com wrote:
 Regexp(6) handles characters that are runes.
 

Answering myself: regexp deals with entities called characters.
Some regexp specifications ('.', ranges, classes etc.) apply to
characters.

This means that the size of the character has to be known, and one
cannot deal directly with UTF-8, for example, while ignoring that it is
UTF-8, since '.', for example, is a variable size sequence, whose start
depends on what came before.

So a libregexp dealing with more than runes would be possible, but would
need to specify the fixed size of the characters, i.e. the encoding
of the input (this has nothing to do with localization, but with what
an elementary entity is).
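
To make the variable-size point concrete, here is a sketch of what '.' would have to expand to if a purely octet-based engine were asked to match one UTF-8 character. It uses Python bytes patterns; `UTF8_CHAR` and `split_runes` are invented names, and the byte ranges are those of well-formed UTF-8 as defined by the Unicode standard:

```python
import re

# One well-formed UTF-8 sequence, expressed purely in terms of octets.
UTF8_CHAR = (
    rb"(?:[\x00-\x7f]"                        # 1 byte: ASCII
    rb"|[\xc2-\xdf][\x80-\xbf]"               # 2 bytes
    rb"|\xe0[\xa0-\xbf][\x80-\xbf]"           # 3 bytes, no overlong forms
    rb"|[\xe1-\xec\xee\xef][\x80-\xbf]{2}"    # 3 bytes
    rb"|\xed[\x80-\x9f][\x80-\xbf]"           # 3 bytes, no surrogates
    rb"|\xf0[\x90-\xbf][\x80-\xbf]{2}"        # 4 bytes, no overlong forms
    rb"|[\xf1-\xf3][\x80-\xbf]{3}"            # 4 bytes
    rb"|\xf4[\x80-\x8f][\x80-\xbf]{2})"       # 4 bytes, up to U+10FFFF
)

def split_runes(data):
    """Cut a byte string into the well-formed UTF-8 sequences it contains."""
    return re.findall(UTF8_CHAR, data)

encoded = "a\u00e9\u20ac\U0001f600".encode("utf-8")
pieces = split_runes(encoded)
sizes = [len(p) for p in pieces]

# The octet-level equivalent of the rune regexp x.y
x_any_y = re.compile(b"x" + UTF8_CHAR + b"y")
```

A single '.' thus becomes an eight-way alternation over one to four octets, which is the kind of subexpression a byte-level library would have to generate for every rune-level construct.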




Re: [9fans] Octets regexp

2013-05-02 Thread erik quanstrom
  is your proposal 
  - to change programs that take regular expressions to be exceptions to
  the plan 9 text model, or
 
 No: to have a libregexp that is agnostic about any encoding. The tools can
 stay the same for the user; libregexp would simply not be text based but
 octet based.

there's always an encoding.

  - to change the plan 9 text model
 
 Neither. The text model is a user interface. My question is simply how
 difficult it is to have an alternative, special purpose user interface
 that does not have the UTF-8 filter for input from and output to the
 user.

i see we're at an impasse, since i don't agree that utf-8 is a user
interface thing.  it's more entrenched than that.

why don't you code something up?

- erik



Re: [9fans] Octets regexp

2013-05-02 Thread dexen deVries
please pardon the silly question, but... how about piping the binary data 
through xd(1) before sending it to regexp(3)?

-- 
dexen deVries

[[[↓][→]]]

I have seen the Great Pretender and he is not what he seems.




Re: [9fans] Octets regexp

2013-05-02 Thread tlaronde
On Thu, May 02, 2013 at 12:53:19PM -0400, erik quanstrom wrote:
 
 i see we're at an impasse, since i don't agree that utf-8 is a user
 interface thing.  it's more entrenched than that.
 
 why don't you code something up?

Because I have started sketching (this was for kerTeX/RISK) basys, i.e.
basic system tools, but I'm trying to decide whether to start mainly from
BSD tools (ash, libregex, sed, ed and the small set of utilities
used by RISK or by the kerTeX package framework), or from Plan9 ones
(rc has some features that make it worthwhile). But I want basys to
be a C language system: the system speaks C, and that's all.
A non-integer number is written with a '.' and not a ',' as for the French,
and so on (this is an example of POSIX hell: the *printf() and
*scanf() functions use the locale to decide how to interpret or render
numbers, and even when they are used to read files, not to interact
with the user, whatever value is in the user environment spoils things if you
have not protected against it in the code...). It deals with octet
strings (for user language, let them be UTF-8; but the system strictly
doesn't care: these are octet strings), and for libregex(p) the rune
thing does not appeal to me (correction: the rune-only thing, even
if it does make sense for a definition of character).

I might as well end up with a modified sh or rc that deals with C
strings (with an L---for hell?---for UTF-8, nothing for octets, W for
wydes, T for tetras and O for octas and even a modifier for endianness).

But contrary to current practice, I take a long time to study and
make things clear (to myself... YMMV), and after that I press on with
implementing in the direction I have chosen (it may take calendar
time, but this is simply because of limited slots of time; during
these slots I don't wonder about what has to be done: it is already
decided...). Until I have made the choice...

I have already decided that I will implement a bar(1) that only packs
the data, with a volume listing, in text, whatever attributes (in the form
attribute=value) are linked to the data (this is, in some sense, what RISK
already does with rkinstall(1), except that it uses tar(1) to pack data).
That is, bar(1) will be a pure C89 program without any system dependent
part (this will allow doing whatever one wants with the data, for example
changing names to fit local conventions such as the man hierarchy,
compressing man pages, caching the rendering, adding extensions etc.).



Re: [9fans] Octets regexp

2013-05-02 Thread tlaronde
On Thu, May 02, 2013 at 08:45:28PM +0200, dexen deVries wrote:
 please pardon the silly question, but... how about piping the binary data 
 through xd(1) before sending it to regexp(3)?
 

Because it will work only for some cases, since newlines and formatting
come into the picture, and it still requires the regexp to be rune
compatible, i.e. not every sequence is allowed: it has to be a UTF-8
compatible one.




Re: [9fans] Octets regexp

2013-05-02 Thread erik quanstrom
 On Thu, May 02, 2013 at 08:45:28PM +0200, dexen deVries wrote:
  please pardon the silly question, but... how about piping the binary data 
  through xd(1) before sending it to regexp(3)?
  
 
 Because it will work only for some cases, since newlines and formatting
 come into the picture, and it still requires the regexp to be rune
 compatible, i.e. not every sequence is allowed: it has to be a UTF-8
 compatible one.

can you give an example of xd outputting something that's not a rune?

- erik



Re: [9fans] Octets regexp

2013-05-02 Thread tlaronde
On Thu, May 02, 2013 at 03:22:21PM -0400, erik quanstrom wrote:
 
 can you give an example of xd outputting something that's not a rune?
 

Indeed, if the regexp is an ASCII representation matching xd output,
there is not _this_ problem. But this is a limited regexp: one
cannot use character ranges (it depends on the size), nor '.';
the conversion has to be done; and there is still the newline
problem (the newline is added; it is not something in the original data) (if
functions have been added to not deal with the newline, it is because
the newline is a problem, and because regexps have a wider use than
text).
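
The hex-dump approach and the limits listed above can be sketched as follows. This is a Python stand-in for piping through xd, under the assumption of an unformatted dump (two hex digits per byte, no offsets, no added newlines); `find_hex` is an invented helper. The regexp is written over hex digits, so a byte range has to be spelled digit by digit, and matches that start in the middle of a byte must be thrown away:

```python
import re

def find_hex(pattern, data):
    """Match a regexp written over hex digit pairs; return byte offsets."""
    hexed = data.hex()  # two lowercase hex digits per byte, nothing else
    # A zero-width lookahead reports every position where the pattern could
    # start, so a misaligned hit cannot hide an aligned one.
    scanner = re.compile("(?=(?:" + pattern + "))")
    return [m.start() // 2 for m in scanner.finditer(hexed)
            if m.start() % 2 == 0]  # odd positions straddle two bytes

sample = bytes([0x0f, 0xff, 0x00, 0xff])

# The literal byte 0xff: the dump "0fff00ff" also contains "ff" at an odd
# position (second digit of 0x0f plus first digit of 0xff), which is skipped.
ff_offsets = find_hex("ff", sample)

# "a NUL followed by any byte in 0x80-0xff" needs the range spelled per digit.
high_after_nul = find_hex("00[89a-f][0-9a-f]", sample)
```

The added newline problem disappears only because the dump here is unformatted; with real xd output the line breaks and offsets would have to be stripped or allowed for in the pattern.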




Re: [9fans] Octets regexp

2013-05-02 Thread erik quanstrom
 Indeed, if the regexp is an ASCII representation matching xd output,
 there is not _this_ problem. But this is a limited regexp: one
 cannot use character ranges (it depends on the size), nor '.'; because 

now you're at both ends.  the whole reason for this approach is to
match bytes that aren't valid runes.  so why complain that it does
what you want?

- erik



Re: [9fans] Octets regexp

2013-05-02 Thread 9p-st
 On Thu, May 02, 2013 at 03:22:21PM -0400, erik quanstrom wrote:
  can you give an example of xd outputting something that's not a rune?

if we're talking about xd, i'll suggest 'tcs -f 8859-1' again in which case:

 Indeed, if the regexp is an ASCII representation matching xd output,
 there is not _this_ problem. But this is a limited regexp: one
 cannot use character ranges (it depends on the size), nor '.';

these problems go away

 because the conversion has to be done;

this remains

 because there is still the newline problem (that is added; not something in
 the original data) (if functions have been added to not deal with the
 newline, it is because the newline is a problem, and because regexps have a
 wider use than text).

and this problem goes away.

i imagine you'll still have problems with embedded NULs, but that's C
strings for you...

if you want a library function, use rregexec(2) and rregsub(2) with only
the low byte of each Rune filled...

(and yes, your data does quadruple itself)
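
The size cost mentioned above can be illustrated in Python. This is not the rregexec(2) API, only an analogy: it assumes a 32-bit Rune (which is what "quadruple" implies), and `widen` is an invented helper that places each byte in the low byte of a little-endian 32-bit unit. Python's own `str` plays the part of the Rune string for matching:

```python
import re
import struct

def widen(data):
    """Store each byte in the low byte of a 32-bit little-endian unit."""
    return struct.pack("<%dI" % len(data), *data)

# Hypothetical blob with embedded NULs and bytes above 0x7f.
blob = b"\x00\x80\xffAB\x00"

wide = widen(blob)               # four bytes of storage per original byte
runes = blob.decode("latin-1")   # one code point per original byte

# Ranges and '.' now apply per original byte.
m = re.search("[\x80-\xff]+", runes)
span = m.span()
```

Python strings carry an explicit length, so the embedded NULs cause no trouble here; with NUL-terminated C strings they would still end the data early, as noted above.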

tristan
