Re: [9fans] Octets regexp
On Thu, May 02, 2013 at 04:17:11PM -0400, 9p...@imu.li wrote:
> if we're talking about xd, i'll suggest 'tcs -f 8859-1' again in which case:

My question was _not_ related to text, and _not_ related to French, i.e. 8859-1. I know how to deal with this.

-- Thierry Laronde tlaronde +AT+ polynum +dot+ com http://www.kergis.com/ Key fingerprint = 0FF7 E906 FBAF FE95 FD89 250D 52B1 AE95 6006 F40C
Re: [9fans] Octets regexp
>> if we're talking about xd, i'll suggest 'tcs -f 8859-1' again in which case:
> My question was _not_ related to text, and _not_ related to french i.e. 8859-1. I know how to deal with this.

tcs -f 8859-1 will take your _binary_ files, and replace the bytes 0x80-0xff with the unicode code points U+0080-U+00FF, so you can use the standard regexps and tools on them, and just convert back afterwards.

maybe it's not meant to be used that way, but it _works_. try it.

have fun!
tristan
-- All original matter is hereby placed immediately under the public domain.
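The round trip described here can be tried outside Plan 9 as well. The following is only a sketch under an assumption: Python's `latin-1` codec stands in for `tcs -f 8859-1` (both map byte 0xNN to code point U+00NN), and the helper names are made up for illustration.

```python
import re

def search_bytes_via_latin1(data: bytes, pattern: str):
    """Widen each byte to the code point of the same value, then run an
    ordinary text regexp. Returns (start, end) byte offsets of the matches."""
    text = data.decode("latin-1")      # byte 0xNN -> U+00NN, never fails
    return [m.span() for m in re.finditer(pattern, text, re.DOTALL)]

def sub_bytes_via_latin1(data: bytes, pattern: str, repl: str) -> bytes:
    """Substitute on the widened text, then convert back to bytes."""
    text = data.decode("latin-1")
    out = re.sub(pattern, repl, text, flags=re.DOTALL)
    # Converting back fails if repl introduced code points above U+00FF.
    return out.encode("latin-1")

blob = bytes([0x00, 0xff, 0xfe, 0x41, 0x0a, 0x80, 0xff, 0xfe])

# A literal byte sequence, a byte range, and a substitution.
print(search_bytes_via_latin1(blob, "\xff\xfe"))
print(search_bytes_via_latin1(blob, "[\x80-\xff]+"))
print(sub_bytes_via_latin1(blob, "\xff\xfe", "\xfe\xff").hex())
```

Because one byte becomes exactly one character, the offsets reported by the regexp are byte offsets in the original data, and ranges and '.' behave as single-octet operators.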
Re: [9fans] Octets regexp
On Fri, May 03, 2013 at 09:15:27AM -0400, Tristan wrote:
> tcs -f 8859-1 will take your _binary_ files, and replace the bytes 0x80-0xff with the unicode points U0080-U00ff, so you can use the standard regexps and tools on them. and just convert back afterwards.

OK, mea culpa... Since I'm French, I focused on the latin1, thinking this had something to do with my language and the custom of dealing with latin1 on other systems.

I guess I could create a keyboard that produces not UTF-8 but bytes, so as to have a means to input bytes (without resorting to printf or whatever). There remains the problem of the rendering (or creating a special font that displays octal, hexadecimal or whatever by playing with the index of the glyphs; but this will work for octets, will be more difficult if one wants to deal with wydes, and is impossible with tetras and octas).

-- Thierry Laronde
Re: [9fans] Octets regexp
> Regexp(6) handles characters that are runes.

perhaps the man page is misleading. rune in this context means utf-8. see regexp(2). all the functions take char*s.

> I wonder if Plan9 developers, when trying to design a way towards some localization, have ever thought of bytes (octets) regexp, that is using regexp with not rune but octets strings (maybe UTF-8 as is) allowing to use regexp with binary too, not only newline terminated chunks etc.?

one of the points of plan 9 was to standardize on one character set, utf-8. imho, localization and character set aren't related unless one is dealing with 8859-x overlays or some other character set insufficient to represent the range of languages.

however, sam and acme allow for structured regular expressions, and are generally not line oriented: http://doc.cat-v.org/bell_labs/structural_regexps/se.pdf

and iirc, cinap has written a cifs bit that uses a bit of binary matching.

- erik
Re: [9fans] Octets regexp
On Thu, May 02, 2013 at 08:48:06AM -0400, erik quanstrom wrote:
>> Regexp(6) handles characters that are runes.
> perhaps the man page is misleading. rune in this context means utf-8. see regexp(2). all the functions take char*s.

But the source files deal with runes...

> one of the points of plan 9 was to standardize on one character set, utf-8. imho, localization and character set aren't related unless one is dealing with 8859-x overlays or some other character set insufficient to represent the range of languages.

Localization (as handled in POSIX for example) is a mess. So the Plan9 solution, still with octets (UTF-8), makes far more sense, since it lets the user extend the characters that can be used in naming computer objects. But this is just for nicknames: the system still speaks C/9P.

So it is better, except perhaps for one thing: for me, the system speaks C or even, obviously, Plan9 (well: 9P). It does not have to speak French, Hebrew, etc. or even English! So it takes or gives bytes, and this is good. The UTF-8 encoding is the main convention for the user interface, but can it be unset? I mean, can one use a raw window, putting in uninterpreted bytes and rendering bytes (with a special ASCII font with either ASCII + 0xdd glyphs or whatever, using fonts to do what is done with vis(1) on Unices or od(1)/xd(1)), without imposing the assumption that the octet string is UTF-8? Can one create a file by entering bytes, i.e. binary values that yield invalid UTF-8 sequences?

This is a remark made to me by a developer who can use, when needed, regexps (ed(1) or sed(1)) on a Unix where they still deal with char (bytes) to search for a string of bytes in a binary. And after some thought, I don't see an obvious reason why the regexp could not be used with byte strings (so UTF-8 is OK) without trying to match runes (since not every byte string is a valid UTF-8 sequence).

Corollary: I don't know whether there is a UTF-8 sequence that can say: stop interpreting as UTF-8, take what follows as is (apart from any invalid sequence). The problem is coming back from there: if every byte is accepted as is, what can be interpreted as "stop raw, restart UTF-8"? A solution: handle this at user level, not low level, with the shell explicitly delimiting chunks, in the same way that ' is the only delimiter and every embedded ' has to be escaped by doubling it.

-- Thierry Laronde
Re: [9fans] Octets regexp
> This is a remark made to me by a developer who can use, when needed, regexps (ed(1) or sed(1)) on a Unix where they still deal with char (bytes) to search for a string of bytes in a binary.

i have never needed to do this. could you provide some motivation for grepping for a weird byte in an executable? surely the debugger is better suited for this.

> And after some thought, I don't see an obvious reason why the regexp could not be used with byte strings (so UTF-8 is OK) without trying to match runes (since not every byte string is a valid UTF-8 sequence).

because it makes things more complicated and probably worse for the common case, while not providing any new functionality that isn't already in other tools.

> Corollary: I don't know if there is a UTF-8 sequence that can tell: stop interpreting as UTF-8, take as is [...]

i think you've missed the point of making utf-8 *the* character set. it's not sometimes the character set. or only on tuesday. it's always the character set.

- erik
Re: [9fans] Octets regexp
putting a little more thought into your actual problem, use tcs:

tcs -f 8859-1

which (as i remember) will map 0x80-0xff to U+0080-U+00FF and you can use normal utf8 regular expressions.

tristan
Re: [9fans] Octets regexp
On Thu, May 02, 2013 at 09:44:38AM -0400, erik quanstrom wrote:
>> This is a remark made to me by a developer who can use, when needed, regexps (ed(1) or sed(1)) on a Unix where they still deal with char (bytes) to search for a string of bytes in a binary.
> i have never needed to do this. could you provide some motivation for grepping for a weird byte in an executable? surely the debugger is better suited for this.

Because not everything is a program? It may be data. For example, the TeX (or METAFONT etc.) predigested dumps are binary, but they are not programs.

>> And after some thought, I don't see an obvious reason why the regexp could not be used with byte strings (so UTF-8 is OK) without trying to match runes (since not every byte string is a valid UTF-8 sequence).
> because it makes things more complicated and probably worse for the common case, while not providing any new functionality that isn't already in other tools.

Ah? I thought the purpose was not to have duplicated tools... And I'm not quite sure it would be more complicated for the common cases, since the already defined functions could be wrappers calling lower-level functions that take the size of the entity: byte, wyde, tetra, octa (and while I'm at it, endianness too) or UTF-8.

> i think you've missed the point of making utf-8 *the* character set. it's not sometimes the character set. or only on tuesday. it's always the character set.

No: I have understood this. What I'm not totally sure about is this: the system deals with octet strings (as it does), and UTF-8, i.e. Unicode, is on the user interface; but is there a means to keep the interface from interpreting the strings as UTF-8? Because not everything is text.

-- Thierry Laronde
Re: [9fans] Octets regexp
On Thu, May 02, 2013 at 09:43:10AM -0400, Tristan wrote:
>> And after some thought, I don't see an obvious reason why the regexp could not be used with byte strings (so UTF-8 is OK) without trying to match runes (since not every byte string is a valid UTF-8 sequence).
> with octet based regexps, [Þþ] doesn't match þ, but 0xc3, 0xbe and 0x9e independently.

Regexp knows subexpressions. So it could be achieved, and one could even have the present functions be higher-level ones, calling more basic ones that deal with bytes (a rune specified by a UTF-8 sequence being replaced by a subexpression) or even with elements of various sizes (characters, but with one fixed size for the processing). Or even a specification à la C: a leading 'L' meaning "treat the string as UTF-8, that is, as runes". And if not, leave it alone.

-- Thierry Laronde
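Both halves of this exchange can be checked with any byte-oriented engine. A minimal sketch, assuming Python's `re` module applied to `bytes` as a stand-in for an octet-based libregexp (this is not Plan 9 code, and `rune_class` is a hypothetical helper):

```python
import re

thorn_upper = "Þ".encode("utf-8")    # b'\xc3\x9e'
thorn_lower = "þ".encode("utf-8")    # b'\xc3\xbe'

# Written as raw UTF-8 bytes, the class [Þþ] is really a class of the
# three distinct octets 0xc3, 0x9e and 0xbe.
naive = re.compile(b"[" + thorn_upper + thorn_lower + b"]")
print(naive.findall(thorn_lower))    # two one-octet matches, not one rune

def rune_class(chars: str) -> bytes:
    """Rewrite a class of runes as an alternation of UTF-8 byte sequences,
    i.e. each rune becomes a subexpression."""
    alternatives = b"|".join(re.escape(c.encode("utf-8")) for c in chars)
    return b"(?:" + alternatives + b")"

fixed = re.compile(rune_class("Þþ"))
print(fixed.findall(thorn_lower))    # one match covering both octets
print(fixed.findall(b"\x9e"))        # a stray continuation byte no longer matches
```

The rewrite handles explicit classes; negated classes and '.' would still need a notion of where a rune starts, which is the element-size question raised later in the thread.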
Re: [9fans] Octets regexp
your exact problem still isn't clear to me, but certainly there've been times when I want to search for some array of characters in a binary blob. i don't believe i've needed anything beyond a literal string of bytes, but i could imagine from there the utility of something regexp-like.

i think the answer is just no, there's no way to do that today. and i'd strongly advise keeping that tool as far away from any discussion of localization or character sets or runes or the like. there oughtn't be any mode switching or the like: it's utf-8 encoded unicode runes, or it's binary, not characters at all.

hex editors are useful sometimes. being able to do more complicated searches/edits/replaces/whatever could be similarly useful sometimes. but don't go anywhere near the character set or localization discussions with it.

anthony
Re: [9fans] Octets regexp
Why does this functionality have to be overloaded into existing tools that are already in common use? khm
Re: [9fans] Octets regexp
> But that is exactly my point: to have localization far from regexp. Regexp taking simply a string of bytes and matching strings of bytes.

the plan 9 model is that all text is utf-8, with the exception of internal encodings which may be Runes. is your proposal

- to change programs that take regular expressions to be exceptions to the plan 9 text model, or
- to change the plan 9 text model?

either way, i think the bar should be high to change the text model for plan 9, and higher to make exceptions.

- erik
Re: [9fans] Octets regexp
On Thu, May 02, 2013 at 05:02:45PM +0200, Bence Fábián wrote:
> you want to change default behaviour and make the usual usecase special?

For the moment, I don't want to change anything. I'm trying to work out where the border has to be: characters (for me, user level) on the one side, octet strings on the other, system and library, side. (On a distributed system, it makes sense that filenames, being user-level nicknames, be UTF-8, that is, assumed to be UTF-8 without any per-filename codepage or whatever.)

The usual behavior could perfectly well stay the same (the leading L was just an example; it could be reversed; and octet matching could simply be provided by new functions, with new names, the historical ones calling these new character-agnostic ones). The problem is not there. The problem is: are regexps only useful with text, which implies characters, or are they more widely useful? My feeling is that they are more generally useful.

-- Thierry Laronde
Re: [9fans] Octets regexp
On Thu, May 02, 2013 at 11:10:34AM -0400, Kurt H Maier wrote:
> Why does this functionality have to be overloaded into existing tools that are already in common use?

I'm speaking about libregexp, not about the use existing tools make of it.

-- Thierry Laronde
Re: [9fans] Octets regexp
> For the moment, I don't want to change anything, I'm trying to be convinced where the border has to be: characters (for me user level) on the one side, octets strings on the other system and library side (on a distributed system, it makes sense that filenames, being userlevel nicknames be UTF-8---supposed to be UTF-8 without any per filename codepage or whatever).

there is currently no such distinction between user and library. this eliminates context. one never is confronted with, oh, i can't call that because that's a user function, not a library function.

- erik
Re: [9fans] Octets regexp
On Thu, May 02, 2013 at 11:19:38AM -0400, erik quanstrom wrote:
> the plan 9 model is that all text is utf-8, with the exception of internal encodings which may be Runes. is your proposal
> - to change programs that take regular expressions to be exceptions to the plan 9 text model, or

No: to have a libregexp that is agnostic about any encoding. The tools can stay the same for the user; libregexp would simply be octet based rather than text based.

> - to change the plan 9 text model

Neither. The text model is a user interface. My question is simply how difficult it is to have an alternative, special-purpose user interface that does not have the UTF-8 filter for input from and output to the user.

-- Thierry Laronde
Re: [9fans] Octets regexp
On Thu, May 02, 2013 at 02:38:25PM +0200, tlaro...@polynum.com wrote:
> Regexp(6) handles characters that are runes.

Answering myself: regexp deals with entities called characters. Some regexp constructs ('.', ranges, classes etc.) apply to characters. This means that the size of the character has to be known, and one cannot deal directly with UTF-8, for example, while ignoring that it is UTF-8, since '.' for example then stands for a variable-size sequence whose start depends on what came before.

So a libregexp dealing with more than runes would be possible, but it would need to be told the fixed size of the characters, i.e. the encoding of the input (this has nothing to do with localization, but with what an elementary entity is).

-- Thierry Laronde
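The dependence of '.' on the element size is easy to observe. The sketch below assumes Python's `re` as the engine; `widen` is a hypothetical helper that maps fixed-size elements (bytes, wydes, ...) to code points of the same value so that an ordinary text regexp can be reused.

```python
import re

text = "aé€"                    # 1-, 2- and 3-byte UTF-8 sequences
raw = text.encode("utf-8")      # 6 octets in total

# Element = rune: '.' consumes one code point, whatever its encoded length.
print(len(re.findall(".", text)))

# Element = octet: '.' consumes exactly one byte and splits the sequences.
print(len(re.findall(b".", raw, re.DOTALL)))

def widen(data: bytes, size: int, byteorder: str) -> str:
    """Map each fixed-size element to the code point of the same value.

    Limitation of this sketch: chr() raises ValueError for values above
    0x10FFFF, so it covers bytes and wydes but not arbitrary tetras or octas.
    """
    if len(data) % size:
        raise ValueError("data is not a whole number of elements")
    return "".join(
        chr(int.from_bytes(data[i:i + size], byteorder))
        for i in range(0, len(data), size)
    )

# Three big-endian wydes: 0x0001, 0xffff, 0x0002. '.' now means "one wyde".
wydes = bytes.fromhex("0001 ffff 0002")
match = re.search("\x01.\x02", widen(wydes, 2, "big"), re.DOTALL)
print(match.span())             # offsets are counted in elements, not bytes
```

The element size and byte order have to be supplied by the caller, which is exactly the extra parameter the message says such a library would need.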
Re: [9fans] Octets regexp
>> is your proposal
>> - to change programs that take regular expressions to be exceptions to the plan 9 text model, or
> No: to have a libregexp being agnostic about any encoding. The tools can stay, for user, the same, simply libregexp would not be text based but octets based.

there's always an encoding.

>> - to change the plan 9 text model
> Neither. The text model is a user interface. My question is simply how is it difficult to have an alternative, special purpose, user interface, that do not have the UTF-8 filter for input from and output to the user interface.

i see we're at an impasse, since i don't agree that utf-8 is a user interface thing. it's more entrenched than that.

why don't you code something up?

- erik
Re: [9fans] Octets regexp
please pardon the silly question, but... how about piping the binary data through xd(1) before sending it to regexp(3)? -- dexen deVries [[[↓][→]]] I have seen the Great Pretender and he is not what he seems.
Re: [9fans] Octets regexp
On Thu, May 02, 2013 at 12:53:19PM -0400, erik quanstrom wrote:
> i see we're at an impass. since i don't agree that utf-8 is a user interface thing. it's more entrenched than that. why don't you code something up?

Because I have started sketching (this was for kerTeX/RISK) basys, i.e. basic system tools, but I'm trying to decide whether to start mainly from BSD tools (ash, libregex, sed, ed and the small set of utilities used by RISK or by the kerTeX package framework), or from Plan9 ones (rc has some features that make it worthwhile).

But I want basys to be a C language system: the system speaks Cee, and that's all. A non-integer number is written with a '.' and not a ',' as the French do, and so on. (This is an example of POSIX hell: the *printf() and *scanf() functions use the locale to decide how to interpret or render numbers, and even when they are used to read files rather than to interact with the user, any user environment value spoils things if the code has not protected against it...) The system deals with octet strings (for user language, let them be UTF-8; but the system strictly doesn't care: these are octet strings), and for libregex(p) the rune thing does not appeal to me (correction: the rune-only thing, even if it does make sense as a definition of character). I might as well end up with a modified sh or rc that deals with C strings (with an L---for hell?---for UTF-8, nothing for octets, W for wydes, T for tetras and O for octas, and even a modifier for endianness).

But contrary to what is state of the art, I take a long time to study and make things clear (to myself... YMMV), and after that I press on with implementing in the direction I have chosen (it may take calendar time, but only because of limited slots of time; during these slots I don't wonder about what has to be done: it is already decided...). Until I have made the choice...

I have already decided that I will implement a bar(1) that only packs the data, with a volume listing, in text, whatever attributes (in the form attribute=value) are linked to the data (this is, in some sense, what RISK already does with rkinstall(1), except that it uses tar(1) to pack data). That is, bar(1) will be a pure C89 program without any system-dependent part (this will allow doing whatever one wants with the data, for example changing names to fit local conventions---the man hierarchy; compressing man pages; caching the rendering; adding extensions etc.).

-- Thierry Laronde
Re: [9fans] Octets regexp
On Thu, May 02, 2013 at 08:45:28PM +0200, dexen deVries wrote:
> please pardon the silly question, but... how about piping the binary data through xd(1) before sending it to regexp(3)?

Because it will work only in some cases, since newlines and formatting come into the picture, and it still requires the regexp to be rune compatible, i.e. not every sequence is allowed: it has to be a UTF-8 compatible one.

-- Thierry Laronde
Re: [9fans] Octets regexp
>> On Thu, May 02, 2013 at 08:45:28PM +0200, dexen deVries wrote:
>> please pardon the silly question, but... how about piping the binary data through xd(1) before sending it to regexp(3)?
> Because it will work only for some cases, since newlines and formatting come in the picture and it still imposes to have regexp rune compatible, i.e. not every sequence is allowed it has to be an UTF-8 compatible one.

can you give an example of xd outputting something that's not a rune?

- erik
Re: [9fans] Octets regexp
On Thu, May 02, 2013 at 03:22:21PM -0400, erik quanstrom wrote:
> can you give an example of xd outputting something that's not a rune?

Indeed, if the regexp is an ASCII representation matching xd output, there is not _this_ problem. But this is a limited regexp, since one cannot use character ranges (it depends on the size), nor '.'; because the conversion has to be done; and because there is still the newline problem (the newlines are added; they are not in the original data). If functions have been added to avoid dealing with the newline, it is because the newline is a problem, and because regexps have a wider use than text.

-- Thierry Laronde
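The limitations listed here show up as soon as one tries it. A rough sketch, assuming a newline-free hex dump (Python's `bytes.hex()` stands in for xd output with offsets and line breaks already stripped) and a hypothetical helper `find_hex`:

```python
import re

def find_hex(data: bytes, hex_pattern: str):
    """Match a regexp over lowercase hex digits (two per octet) against a
    newline-free hex dump of data; return (start, end) byte offsets.

    Only matches that start and end on an octet boundary are kept. A greedy
    match ending mid-octet is simply dropped, which is one more limitation
    of working on the dump rather than on the bytes.
    """
    dump = data.hex()
    pattern = re.compile(hex_pattern)
    hits = []
    for pos in range(0, len(dump), 2):       # try every octet boundary
        m = pattern.match(dump, pos)
        if m and m.end() % 2 == 0:
            hits.append((pos // 2, m.end() // 2))
    return hits

blob = bytes.fromhex("0a bc de 0a bc")       # dump is "0abcde0abc"

print(find_hex(blob, "de..bc"))       # "any octet" has to be spelled as two dots
print(find_hex(blob, "b[0-9a-f]"))    # the range 0xb0-0xbf, spelled digit by digit
print(find_hex(blob, "ab"))           # no octet 0xab, although "ab" occurs in the dump
print(re.search("ab", blob.hex()))    # a plain search reports a false hit at digit 1
```

Ranges and '.' have to be rewritten per hex digit, and alignment has to be checked separately, which is the sense in which the dump only supports a limited regexp.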
Re: [9fans] Octets regexp
> Indeed, if the regexp is an ASCII representation matching xd outputs there is not _this_ problem. But this is limited regexp, since one can not use character ranges (it depends on the size); not '.'; because

now you're at both ends. the whole reason for this approach is to match bytes that aren't valid runes. so why complain that it does what you want?

- erik
Re: [9fans] Octets regexp
> On Thu, May 02, 2013 at 03:22:21PM -0400, erik quanstrom wrote:
>> can you give an example of xd outputting something that's not a rune?

if we're talking about xd, i'll suggest 'tcs -f 8859-1' again in which case:

> Indeed, if the regexp is an ASCII representation matching xd outputs there is not _this_ problem. But this is limited regexp, since one can not use character ranges (it depends on the size); not '.';

these problems go away

> because the conversion has to be done;

this remains

> because there is still the newline problem (that is added; not something in the original data) (if functions have been added to not deal with the newline, it is because the newline is a problem, and because regexp have a more wider use than text).

and this problem goes away.

i imagine you'll still have problems with embedded NULs, but that's C strings for you...

if you want a library function, use rregexec(2) and rregsub(2) with only the low byte of each Rune filled... (and yes, your data does quadruple itself)

tristan