On Thu, Apr 26, 2012 at 11:53 PM, David Korn <[email protected]> wrote:
> cc: [email protected]
> Subject: Re: Re: [ast-users] Re: read -d command not supporting
> non-ASCII/Unicode chars
> --------
>
>> What exactly is the difficult part? ksh93 already supports one byte
>> delimiters. Non-ASCII characters can both be represented by a wchar_t
>> or a multibyte sequence. The multibyte sequence could be used as C
>> string and this C string could be used as delimiter, i.e. you search
>> for a C string as delimiter instead of a single byte.
>
> The problem is that ksh93 uses the sfio library function sfgetr() to read
> a record and this has to be modified to handle multi-byte.
>
> Secondly, when reading from a terminal, you can't just set the VEOL character
> to be the delimiter since it is now multi-byte so it may require
> a raw mode read.
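
For illustration, the "search for a C string as delimiter instead of a
single byte" idea quoted above could look roughly like the sketch below.
This is not sfio or ksh93 code: find_delim() is a hypothetical helper
name, and the sketch assumes the whole record is already in one buffer,
so it ignores both buffer refills and the terminal (VEOL) case. It also
matches bytes only, which is guaranteed to land on a character boundary
only in self-synchronizing encodings such as UTF-8.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper (not part of sfio or ksh93): return a pointer to the
 * first occurrence of the dlen-byte delimiter in buf[0..len), or NULL if
 * the delimiter is empty or does not occur.
 */
const char *find_delim(const char *buf, size_t len,
                       const char *delim, size_t dlen)
{
    const char *p = buf;
    const char *end = buf + len;

    if (dlen == 0 || dlen > len)
        return NULL;

    /* Stop once fewer than dlen bytes remain. */
    while ((size_t)(end - p) >= dlen) {
        /* Jump to the next byte equal to the delimiter's first byte,
         * looking only at positions where a full match still fits. */
        p = memchr(p, (unsigned char)delim[0], (size_t)(end - p) - dlen + 1);
        if (p == NULL)
            return NULL;
        /* Candidate found: compare the complete byte sequence. */
        if (memcmp(p, delim, dlen) == 0)
            return p;
        p++;
    }
    return NULL;
}
```

With buf holding the UTF-8 bytes of "one", the Euro sign (E2 82 AC) and
"two", and delim set to those three Euro sign bytes, the call returns
buf + 3, so the record would be the first three bytes.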

3rd issue: Not all multibyte encodings allow "recovering" (sometimes
called "self-synchronizing") when the file offset points into the middle
of a multibyte character's byte sequence. In that case the remainder of
the text will read as gibberish (exception: all encodings I know about
allow recovering at the '\n' (=newline) character).
UTF-8 was explicitly designed (credits go to Ken Thompson for that idea
(and implementation)) so that consumers _can_ recover, but IMO the sfio
code shouldn't be UTF-8 specific. There are other modern
encodings/character sets/standards like GB18030 which are not UTF-8
encoded but are still very important (in GB18030's case the Chinese
government makes GB18030 support mandatory, i.e. you can't sell your
software there without GB18030 support).

> We plan to make these changes after the next update which we hope to release
> at the end of this week.

Uhm... how are you and kpv intending to fix it? IMO it would be
sufficient to assume that the file position is always at the start of
the next valid multibyte character (and any caller who does a |seek()|
is responsible for guaranteeing this (or gets eaten by the resulting
mess)).

----

Bye,
Roland

-- 
  __ .  . __
 (o.\ \/ /.o) [email protected]
  \__\/\/__/  MPEG specialist, C&&JAVA&&Sun&&Unix programmer
  /O /==\ O\  TEL +49 641 3992797
 (;O/ \/ \O;)
_______________________________________________
ast-users mailing list
[email protected]
https://mailman.research.att.com/mailman/listinfo/ast-users
