Re: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

2014-06-06 Thread Steffen Nurpmeso
Doug Ewell d...@ewellic.org wrote: Philippe Verdy verdy underscore p at wanadoo dot fr wrote: Not necessarily true. [602 words] This has nothing to do with the scenario I described, which involved removing a BOM from the start of an arbitrary fragment of data, thereby

Re: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

2014-06-06 Thread Doug Ewell
Philippe Verdy verdy underscore p at wanadoo dot fr wrote: If you have an arbitrary fragment of data, don't fiddle with it. This is your scenario. The simple concept of a unique start of text does not exist in live streams that can start anywhere. So you cannot always expect that U+FEFF or

Re: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

2014-06-05 Thread Philippe Verdy
2014-06-05 0:48 GMT+02:00 Doug Ewell d...@ewellic.org: If you are processing arbitrary fragments of a stream, without knowledge of preceding fragments, as in this example, then you have no business making *any* changes to that fragment based on interpretation of that fragment as Unicode text.

Re: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

2014-06-05 Thread Richard Wordingham
On Thu, 5 Jun 2014 09:41:07 +0200 Philippe Verdy verd...@wanadoo.fr wrote: You'll probably want to sync on the first newline control and then proceed from that point. But now if you have those devices configured heterogeneously and generating their own output encoding you won't necessarily

Re: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

2014-06-05 Thread Doug Ewell
Philippe Verdy verdy underscore p at wanadoo dot fr wrote: Not necessarily true. [602 words] This has nothing to do with the scenario I described, which involved removing a BOM from the start of an arbitrary fragment of data, thereby corrupting the data because the BOM was actually a ZWNBSP.
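
The scenario described above can be sketched in a few lines of Python. This is only an illustration, not code from the thread: the function names `strip_signature` and `naive_strip` and the `at_stream_start` flag are hypothetical, and the sketch assumes the caller knows whether a fragment really is the first one of the stream.

```python
BOM = "\ufeff"  # U+FEFF: a signature at stream start, ZWNBSP anywhere else


def naive_strip(fragment: str) -> str:
    # Removes a leading U+FEFF from any fragment, whether or not the
    # fragment is the true start of the text. This is the risky behaviour.
    return fragment[1:] if fragment.startswith(BOM) else fragment


def strip_signature(fragment: str, at_stream_start: bool) -> str:
    # Removes a leading U+FEFF only when the caller asserts (hypothetical
    # flag) that this fragment is the beginning of the whole stream.
    if at_stream_start and fragment.startswith(BOM):
        return fragment[1:]
    return fragment


# A text with a signature, followed by a ZWNBSP in the middle.
original = BOM + "ab" + BOM + "cd"
# Slice it so the second fragment happens to begin with the ZWNBSP.
fragments = [original[:3], original[3:]]

naive = "".join(naive_strip(f) for f in fragments)
careful = "".join(
    strip_signature(f, at_stream_start=(i == 0))
    for i, f in enumerate(fragments)
)
```

With the naive version the ZWNBSP between "ab" and "cd" is lost after reassembly; the flag-based version removes only the signature and leaves the interior U+FEFF in place.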

Re: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

2014-06-05 Thread Philippe Verdy
2014-06-05 21:46 GMT+02:00 Doug Ewell d...@ewellic.org: Philippe Verdy verdy underscore p at wanadoo dot fr wrote: Not necessarily true. [602 words] This has nothing to do with the scenario I described, which involved removing a BOM from the start of an arbitrary fragment of data,

Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

2014-06-04 Thread Doug Ewell
How common is it to see any of the following in real-world Unicode text, as opposed to code charts and test suites and the like? 1. Unpaired surrogates 2. Noncharacters (besides CLDR data) 3. U+FEFF at the beginning of a stream (note: not packet or arbitrary cutoff point) I'm not asking whether
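
One way to look for the three cases listed above in a body of text is a small scan like the following. It is a rough sketch rather than anything proposed in the thread: `scan_corner_cases` and its helper names are made up for illustration, and it assumes the text has already been decoded to a Python `str`, where any surrogate code point that survives decoding is by definition not part of a well-formed pair.

```python
def is_surrogate(cp: int) -> bool:
    # Surrogate code points occupy U+D800..U+DFFF.
    return 0xD800 <= cp <= 0xDFFF


def is_noncharacter(cp: int) -> bool:
    # The 66 noncharacters: U+FDD0..U+FDEF, plus the last two code
    # points of every plane (U+FFFE, U+FFFF, U+1FFFE, ... U+10FFFF).
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE


def scan_corner_cases(text: str) -> dict:
    # Counts surrogate code points and noncharacters, and reports whether
    # the text begins with U+FEFF. It cannot tell a signature from a
    # ZWNBSP; that distinction depends on context outside the text.
    counts = {
        "surrogates": 0,
        "noncharacters": 0,
        "leading_feff": text.startswith("\ufeff"),
    }
    for ch in text:
        cp = ord(ch)
        if is_surrogate(cp):
            counts["surrogates"] += 1
        elif is_noncharacter(cp):
            counts["noncharacters"] += 1
    return counts


sample = "\ufeffa\ud800b\ufffe\U0001ffff\ufdd0"
report = scan_corner_cases(sample)
```

Running this over a real corpus would give a rough answer to the "how common" question for that corpus only, and only after deciding how to decode input that is not well-formed in the first place.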

RE: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

2014-06-04 Thread Shawn Steele
a text transport or something. Usually that bites them sooner or later. -Shawn -Original Message- From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Doug Ewell Sent: Wednesday, June 4, 2014 11:01 AM To: unicode@unicode.org Subject: Corner cases (was: Re: UTF-16 Encoding

RE: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

2014-06-04 Thread Doug Ewell
Sorry, I left out an important detail. I wrote: 3. U+FEFF at the beginning of a stream (note: not packet or arbitrary cutoff point) I meant U+FEFF as a zero-width no-break space. Obviously it is very common to see U+FEFF as a signature or BOM. My underlying question here is, how common is

Re: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

2014-06-04 Thread Asmus Freytag
On 6/4/2014 11:26 AM, Doug Ewell wrote: Sorry, I left out an important detail. I wrote: 3. U+FEFF at the beginning of a stream (note: not packet or arbitrary cutoff point) I meant U+FEFF as a zero-width no-break space. Obviously it is very common to see U+FEFF as a signature or BOM. My

Re: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

2014-06-04 Thread Richard Wordingham
On Wed, 04 Jun 2014 11:40:11 -0700 Asmus Freytag asm...@ix.netcom.com wrote: On 6/4/2014 11:26 AM, Doug Ewell wrote: I meant U+FEFF as a zero-width no-break space. Obviously it is very common to see U+FEFF as a signature or BOM. The semantics of it were chosen at the time to make no sense

Re: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

2014-06-04 Thread Asmus Freytag
On 6/4/2014 12:21 PM, Richard Wordingham wrote: On Wed, 04 Jun 2014 11:40:11 -0700 Asmus Freytag asm...@ix.netcom.com wrote: On 6/4/2014 11:26 AM, Doug Ewell wrote: I meant U+FEFF as a zero-width no-break space. Obviously it is very common to see U+FEFF as a signature or BOM. The semantics

Re: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

2014-06-04 Thread Doug Ewell
Richard Wordingham richard dot wordingham at ntlworld dot com wrote: The example that's usually given [of U+FEFF at the start of a stream] is that of a text file sliced into segments to avoid file size limits. In these cases, there is the risk that U+FEFF as ZWNBSP will wind up at the start