Re: Unicode "no-op" Character?
Hello again, everyone. Though I initially took the shoo-away, there have been some comments made since then that I feel compelled to rebut. To avoid spamming the list, I’ve combined my responses into a single message.

Before that, I will say, again, for the record: I know this NOOP idea is unlikely to ever happen. Certainly not with the responses I've gotten. I haven't submitted it, nor even looked into how to. I know it would be rejected. This is a thought experiment, nothing more. If that doesn't interest you, please disregard this message. And again, the hypothetical NOOP is a character whose canonical equivalent is the absence of a character. The logical consequences of that statement apply fully.

On Wed, Jul 3, 2019 at 8:00 PM Shawn Steele via Unicode wrote:
> Even more complicated is that, as pointed out by others, it's pretty much
> impossible to say "these n codepoints should be ignored and have no meaning"
> because some process would try to use codepoints 1-3 for some private
> meaning. Another would use codepoint 1 for their own thing, and there'd be a
> conflict.

This is so utterly, completely, and severely missing the point that I'm starting to feel like a madman screaming to the heavens, "Why can't they just understand?!" Yes, a different process will have a different private meaning for the codepoint. That is not a bug; it is a feature. A conflict is always resolved by the current process saying, "I'm holding the string now. The old NOOPs are gone, canonically decomposed to nothing. The new ones mean what I want them to mean, as long as I or my buddies hold the string. If you didn't want that, you shouldn't have given the string to me!"

This conflict-resolution mechanism is the special sauce. If a process needs a private marker that will be preserved in interchange, there are plenty of PUA characters to use, and even a couple of private control characters.
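The "decompose on ownership, then mark" rule can be sketched in a few lines. This is a hypothetical illustration only: U+000F stands in for the proposed NOOP, per this thread, and the helper names are mine; no actual Unicode character has these properties.

```python
NOOP = "\u000f"  # hypothetical stand-in for the proposed NOOP character

def take_ownership(text: str) -> str:
    """Canonically decompose any inherited NOOPs to the absence of a character."""
    return text.replace(NOOP, "")

def mark_words(text: str) -> str:
    """Example private use: tag the end of each word with a NOOP."""
    return " ".join(word + NOOP for word in text.split(" "))

# A process taking over a string discards old markers, then adds its own:
owned = take_ownership("packet\u000fized text")   # old NOOPs gone
marked = mark_words(owned)                        # our NOOPs inserted
```

Any conflict with a previous process's markers is resolved by the first step: the old NOOPs simply cease to exist.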
> I also think that the conversation has pretty much proven that such a system
> is mathematically impossible. (You can't have a "private" no-meaning
> codepoint that won't conflict with other "private" uses in a public space).

No such thing has been proven in the slightest. Any conflict is resolved, in the default case, by normalizing all NOOPs to nothing.

On Wed, Jul 3, 2019 at 5:46 PM Mark E. Shoulson via Unicode wrote:
> Um... How could you be sure that process X would get the no-ops that process
> W wrote? After all, it's *discardable*, like you said, and the database
> programs and libraries aren't in on the secret.

Yes, there is a requirement that W and X communicate via some "NOOP-preserving path" (call it a NOOPPP). Such paths would generally be very short and direct, because NOOPs are intended to be ephemeral, not archival! They wouldn't be hard to come by: memory mappings or pipes, direct inter-process comms, anything that operates at byte level. Even simple persisting mechanisms like file storage or databases can preserve NOOP by doing... nothing. "Discardable" doesn't mean it must be discarded, merely that it can be. Where there are no security implications or other need, strings containing NOOP can simply be passed through and stored as-is. Where any interface, library, or process does not preserve NOOP, it cannot be part of a NOOPPP. Tough luck.

> Moreover, as you say, what about when Process Z (or its companions) comes
> along and is using THE SAME MECHANISM for something utterly different? How
> does it know that process W wasn't writing no-ops for it, but was writing
> them for Process X?

It is the responsibility of Process Z (and any process that interprets NOOPs non-trivially) to be aware of the context/source of what it's receiving: prior agreement or advertised contract.

On Wed, Jul 3, 2019 at 2:06 PM Rebecca Bettencourt wrote:
> And the database driver filters out the U+000F completely as a matter of best
> practice and security-in-depth.
I'm struggling to see the security implication of "store this string, verbatim, in your regular VARCHAR (or whatever) text field". I can store the string "DROP TABLE [STUDENTS];" in a text field, and unless the database is horribly broken it will store that without issue. A database could strip NOOP out of text fields and still claim to be Unicode conformant, but I wonder why it would bother to do that. And even then, you could just store the string in a VARBINARY field or whatever just accepts bytes.

> You can't say "this character should be ignored everywhere" and "this
> character should be preserved everywhere" at the same time. That's the
> contradiction.

I have not said "this character should be preserved everywhere". That statement is completely false. Unfortunately, that means what I said is still not being understood at all. Forgive me for being frustrated.

Finally, a general comment: I think people are getting hung up on this idea because they’re still thinking in terms of what is being guaranteed, while this is explicitly about
RE: Unicode "no-op" Character?
Shawn Steele wrote: > Even more complicated is that, as pointed out by others, it's pretty > much impossible to say "these n codepoints should be ignored and have > no meaning" because some process would try to use codepoints 1-3 for > some private meaning. Another would use codepoint 1 for their own > thing, and there'd be a conflict. That's pretty much what happened with NUL. It was originally intended (long, long before Unicode) to be ignorable and have no meaning, but then other processes were designed that gave it specific meaning, and that was pretty much that. While the Unix/C "end of string" convention was not the only case in which NUL was hijacked, it is certainly the best-known, and the greatest impediment to any current attempt to use it with its original meaning. -- Doug Ewell | Thornton, CO, US | ewellic.org
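The NUL hijacking Doug describes is easy to demonstrate. The sketch below mimics what a C-style string API (strlen, strcpy, printf "%s") does with a buffer whose author intended NUL to be an ignorable, meaningless filler:

```python
# The Unix/C convention treats the first NUL byte as end-of-string, so any
# data placed after a NUL with its "original" ignorable meaning is silently
# lost the moment the buffer passes through a C-style string API.

def c_string(buf: bytes) -> bytes:
    """Interpret a buffer the way C's strlen/strcpy would: stop at NUL."""
    return buf.split(b"\x00", 1)[0]

payload = b"hello\x00world"      # NUL used "ignorably" mid-stream
print(c_string(payload))         # b'hello' -- everything after NUL is gone
```

Once one widely deployed convention gives the codepoint meaning, the "no meaning" property is unrecoverable.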
RE: Unicode "no-op" Character?
I think you're overstating my concern :) I meant that those things tend to be particular to a certain context and often aren't interesting for interchange. A text editor might find it convenient to place word boundaries in the middle of something another part of the system thinks is a single unit to be rendered. At the same time, a rendering engine might find it interesting that there's an ff together and want to mark it to be shown as a ligature, though that text editor wouldn't be keen on that at all.

As has been said, these are private mechanisms for things that individual processes find interesting. It's not useful to mark those for interchange, as the text editor's word-breaking marks would interfere with the graphics engine's glyph-breaking marks. Not to mention the transmission buffer size marks originally mentioned, which could be anywhere. The "right" thing to do here is to use an internal higher-level mechanism to keep track of these things however the component needs. That can even be interchanged with another component designed to the same principles, via mechanisms like the PUA. However, those components can't expect that their private mechanisms are useful or harmless to other processes.

Even more complicated is that, as pointed out by others, it's pretty much impossible to say "these n codepoints should be ignored and have no meaning" because some process would try to use codepoints 1-3 for some private meaning. Another would use codepoint 1 for their own thing, and there'd be a conflict.

As a thought experiment, I think it's certainly decent to ask the question "could such a mechanism be useful?" It's an intriguing thought and a decent hypothesis that this kind of system could be privately useful to an application. I also think that the conversation has pretty much proven that such a system is mathematically impossible. (You can't have a "private" no-meaning codepoint that won't conflict with other "private" uses in a public space.)
It might be worth noting that this kind of thing used to be fairly common in early computing. Word processors would inject a "CTRL-I" token to toggle italics on or off. Old printers used to use sequences to define the start of bold or italic or underlined or whatever sequences. Those were private and pseudo-private mechanisms that were used internally and/or documented for others that wanted to interoperate with their systems. (The printer folks would tell the word processors how to make italics happen, then other printer folks would use the same or similar mechanisms for compatibility - except for the dude that didn't get the memo and made their own scheme.)

Unicode was explicitly intended *not* to encode any of that kind of markup and, instead, to be "plain text," leaving other interesting metadata to other higher-level protocols, whether those be word breaking, sentence parsing, formatting, buffer sizing or whatever.

-Shawn

-----Original Message-----
From: Unicode On Behalf Of Richard Wordingham via Unicode
Sent: Wednesday, July 3, 2019 4:20 PM
To: unicode@unicode.org
Subject: Re: Unicode "no-op" Character?

On Wed, 3 Jul 2019 17:51:29 -0400 "Mark E. Shoulson via Unicode" wrote:
> I think the idea being considered at the outset was not so complex as
> these (and indeed, the point of the character was to avoid making
> these kinds of decisions).

Shawn Steele appeared to be claiming that there was no good, interesting reason for separating base character and combining mark. I was refuting that notion. Natural text boundaries can get very messy - some languages have word boundaries that can be *within* an indecomposable combining mark.

Richard.
Re: Unicode "no-op" Character?
On Wed, 3 Jul 2019 17:51:29 -0400 "Mark E. Shoulson via Unicode" wrote: > I think the idea being considered at the outset was not so complex as > these (and indeed, the point of the character was to avoid making > these kinds of decisions). Shawn Steele appeared to be claiming that there was no good, interesting reason for separating base character and combining mark. I was refuting that notion. Natural text boundaries can get very messy - some languages have word boundaries that can be *within* an indecomposable combining mark. Richard.
Re: Unicode "no-op" Character?
What you're asking for, then, is completely possible and achievable—but not in the Unicode Standard. It's out of scope for Unicode, it sounds like. You've said you realize it won't happen in Unicode, but it still can happen. Go forth and implement it, then: make your higher-level protocol and show its usefulness and get the industry to use and honor it because of how handy it is, and best of luck with that. ~mark On 7/3/19 2:22 PM, Ken Whistler via Unicode wrote: On 7/3/2019 10:47 AM, Sławomir Osipiuk via Unicode wrote: Is my idea impossible, useless, or contradictory? Not at all. What you are proposing is in the realm of higher-level protocols. You could develop such a protocol, and then write processes that honored it, or try to convince others to write processes to honor it. You could use PUA characters, or non-characters, or existing control codes -- the implications for use of any of those would be slightly different, in practice, but in any case would be an HLP. But your idea is not a feasible part of the Unicode Standard. There are no "discardable" characters in Unicode -- *by definition*. The discussion of "ignorable" characters in the standard is nuanced and complicated, because there are some characters which are carefully designed to be transparent to some, well-specified processes, but not to others. But no characters in the standard are (or can be) ignorable by *all* processes, nor can a "discardable" character ever be defined as part of the standard. The fact that there are a myriad of processes implemented (and distributed who knows where) that do 7-bit ASCII (or 8-bit 8859-1) conversion to/from UTF-16 by integral type conversion is a simple existence proof that U+000F is never, ever, ever, ever going to be defined to be "discardable" in the Unicode Standard. --Ken
Re: Unicode "no-op" Character?
I think the idea being considered at the outset was not so complex as these (and indeed, the point of the character was to avoid making these kinds of decisions). There was a desire for some reason to be able to chop up a string into equal-length pieces or something, and some of those divisions might wind up between bases and diacritics or who knows where else. Rather than have to work out acceptable places to place the characters, the request was for a no-op character that could safely be plopped *anywhere*, even in the middle of combinations like that.

~mark

On 6/23/19 4:24 AM, Richard Wordingham via Unicode wrote:

On Sat, 22 Jun 2019 23:56:50 + Shawn Steele via Unicode wrote:

+ the list. For some reason the list's reply header is confusing.

From: Shawn Steele
Sent: Saturday, June 22, 2019 4:55 PM
To: Sławomir Osipiuk
Subject: RE: Unicode "no-op" Character?

The original comment about putting it between the base character and the combining diacritic seems peculiar. I'm having a hard time visualizing how that kind of markup could be interesting?

There are a number of possible interesting scenarios:

1) Chopping the string into user-perceived characters. For example, the Khmer sequences of COENG plus letter are named sequences. Akin to this is identifying resting places for a simple cursor, e.g. allowing it to be positioned between a base character and a spacing, unreordered subscript. (This last possibility overlaps with rendering.)

2) Chopping the string into collating elements. (This can require renormalisation, and may raise a rendering issue with HarfBuzz, where renormalisation is required to get marks into a suitable order for shaping. I suspect no-op characters would disrupt this renormalisation; CGJ may legitimately be used to affect rendering this way, even though it is supposed to have no other effect* on rendering.)

3) Chopping the string into default grapheme clusters. That separates a coeng from the following character with which it interacts.
*Is a Unicode-compliant *renderer* allowed to distinguish diaeresis from the umlaut mark? Richard.
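The normalization disruption Richard suspects can be shown directly: canonical composition is blocked by an intervening character with combining class 0, so a no-op "plopped anywhere" would not be transparent to NFC. Here U+000F stands in for the hypothetical no-op, as elsewhere in this thread:

```python
import unicodedata

# 'e' followed by COMBINING ACUTE ACCENT composes under NFC:
composed = unicodedata.normalize("NFC", "e\u0301")
print(composed)                     # 'é' (U+00E9)

# Insert the hypothetical no-op between base and mark. U+000F has
# combining class 0, so it blocks canonical composition entirely:
interrupted = "e\u000f\u0301"
result = unicodedata.normalize("NFC", interrupted)
print(result == interrupted)        # True -- the sequence no longer composes
```

So the two strings, identical except for a supposedly meaning-free character, normalize to different results, which is exactly the kind of interaction a "safe anywhere" character would need to avoid.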
Re: Unicode "no-op" Character?
Um... How could you be sure that process X would get the no-ops that process W wrote? After all, it's *discardable*, like you said, and the database programs and libraries aren't in on the secret. The database API functions might well strip it out, because it carries no meaning to them. Unless you can count on _certain_ programs not discarding it, and then you'd need either specialty libraries or some kind of registry or terminology for "this program does NOT strip no-ops" vs ones that do... But then they wouldn't be discardable, would they? Not by non-discarding programs. Which would have to have ways to pass them around between themselves. Moreover, as you say, what about when Process Z (or its companions) comes along and is using THE SAME MECHANISM for something utterly different? How does it know that process W wasn't writing no-ops for it, but was writing them for Process X? And of course, Z will trash them and insert its own there, and when process X comes to read it, they won't be there. You'd need to make sure that NOBODY is allowed to touch the string between *pairs* of generators and consumers of no-ops, specifically designated for each other. Yes, this is about consensual acts between responsible processes W and X, but that's exactly what the PUA is for: being assigned meaning between consenting processes. And they are not discardable by non-consenting processes, precisely because they mean something to someone. If your no-ops carry meaning, they are going to need to be preserved and passed around and not thrown away. If they carry no meaning, why are you dealing with them? Yes, PUA characters are annoying and break up grapheme clusters and stuff. But they're the only way to do what you're trying to do. ~mark On 7/3/19 11:44 AM, Sławomir Osipiuk via Unicode wrote: A process, let’s call it Process W, adds a bunch of U+000F to a string it received, or built, or a user entered via keyboard. Maybe it’s to packetize. 
Maybe to mark every word that is an anagram of the name of a famous 19th-century painter, or that represents a pizza topping. Maybe something else. This is a versatile character. Process W is done adding U+000F to the string. It stores it in a database UTF-8 encoded field. Encoding isn’t a problem. The database is happy.

Now Process X runs. Process X is meant to work with Process W and it’s well-aware of how U+000F is used. It reads the string from the database. It sees U+000F and interprets it. It chops the string into packets, or does a websearch for each famous painter, or it orders pizza. The private meaning of U+000F is known to both Process X and Process W. There is useful information encoded in-band, within a limited private context.

But now we have Process Y. Process Y doesn’t care about packets or painters or pizza. Process Y runs outside of the private context that X and W had. Process Y translates strings into Morse code for transmission. As part of that, it replaces common words with abbreviations. Process Y doesn’t interpret U+000F. Why would it? It has no semantic value to Process Y. Process Y reads the string from the database. Internally, it clears all instances of U+000F from the string. They’re just taking up space. They’re meaningless to Y. It compiles the Morse code sequence into an audio file.

But now we have Process Z. Process Z wants to take a string and mark every instance of five contiguous Latin consonants. It scrapes the database looking for text strings. It finds the string Process W created and marked. Z has no obligation to W. It’s not part of that private context. Process Z clears all instances of U+000F it finds, then inserts its own wherever it finds five-consonant clusters. It stores its results in a UTF-16LE text file. It’s allowed to do that.

Nothing impossible happened here. Let’s summarize: Processes W and X established a private meaning for U+000F by agreement and interacted based on that meaning.
Process Y ignored U+000F completely because it assigned no meaning to it. Process Z assigned a completely new meaning to U+000F. That’s permitted because U+000F is special and is guaranteed to have no semantics without private agreement and doesn’t need to be preserved.

There is no need to escape anything. Escaping is used when a character must have more than one meaning (i.e. it is overloaded, as when it is both text and markup). U+000F only gets one meaning in any context. In a new context, the meaning gets overridden, not overloaded. That’s what makes it special. I don’t expect to see any of this in official Unicode. But I take exception to the idea that I’m suggesting something impossible.

From: Philippe Verdy [mailto:verd...@wanadoo.fr]
Sent: Wednesday, July 03, 2019 04:49
To: Sławomir Osipiuk
Cc: unicode Unicode Discussion
Subject: Re: Unicode "no-op" Character?

Your goal is **impossible** to reach with Unicode. As
Re: Unicode "no-op" Character?
On 7/3/2019 10:47 AM, Sławomir Osipiuk via Unicode wrote: Is my idea impossible, useless, or contradictory? Not at all. What you are proposing is in the realm of higher-level protocols. You could develop such a protocol, and then write processes that honored it, or try to convince others to write processes to honor it. You could use PUA characters, or non-characters, or existing control codes -- the implications for use of any of those would be slightly different, in practice, but in any case would be an HLP. But your idea is not a feasible part of the Unicode Standard. There are no "discardable" characters in Unicode -- *by definition*. The discussion of "ignorable" characters in the standard is nuanced and complicated, because there are some characters which are carefully designed to be transparent to some, well-specified processes, but not to others. But no characters in the standard are (or can be) ignorable by *all* processes, nor can a "discardable" character ever be defined as part of the standard. The fact that there are a myriad of processes implemented (and distributed who knows where) that do 7-bit ASCII (or 8-bit 8859-1) conversion to/from UTF-16 by integral type conversion is a simple existence proof that U+000F is never, ever, ever, ever going to be defined to be "discardable" in the Unicode Standard. --Ken
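Ken's existence proof is worth making concrete: converters that widen 7-bit ASCII or 8-bit 8859-1 to UTF-16 by plain integral conversion map every byte value to the same code unit, with no table lookup and therefore no opportunity to discard anything. A minimal sketch of such a converter:

```python
# Naive 8859-1 -> UTF-16 conversion by integral widening: each byte is
# zero-extended to a 16-bit code unit. There is no branch at which a
# "discardable" U+000F could ever be dropped.

def latin1_to_utf16_code_units(data: bytes) -> list[int]:
    return [b for b in data]   # byte value == code unit value for 0x00-0xFF

units = latin1_to_utf16_code_units(b"ab\x0fcd")
print(units)   # [97, 98, 15, 99, 100] -- U+000F passes straight through
```

Any definition of "discardable" would silently be violated by every such deployed converter, which is why the property cannot be standardized after the fact.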
Re: Unicode "no-op" Character?
On Wed, Jul 3, 2019 at 8:47 AM Sławomir Osipiuk via Unicode < unicode@unicode.org> wrote: > Security gateways filter it out completely, as a matter of best practice > and security-in-depth. > > > > A process, let’s call it Process W, adds a bunch of U+000F to a string it > received, or built, or a user entered via keyboard. ... > It stores in it a database UTF-8 encoded field... > And the database driver filters out the U+000F completely as a matter of best practice and security-in-depth. You can't say "this character should be ignored everywhere" and "this character should be preserved everywhere" at the same time. That's the contradiction.
RE: Unicode "no-op" Character?
The fact that this would require a change that is unlikely to occur is a fact I have stated repeatedly. It is pointless to tell me that. The rest of the thread, after my initial question was answered, was a thought experiment, and while I strongly disagree that such posts are “pointless” (actually, reading through the archives of this mailing list, it is those ideas that have fascinated me the most and that I found most engaging and enlightening) I admit I’m new here, so I will defer. Is my idea unrealistic at this point in time? Yes. I have admitted so. Is my idea impossible, useless, or contradictory? Not at all.

From: Mark Davis ☕️ [mailto:m...@macchiato.com]
Sent: Wednesday, July 03, 2019 13:33
To: Sławomir Osipiuk
Cc: verdy_p; unicode Unicode Discussion
Subject: Re: Unicode "no-op" Character?

Your goal is not achievable. We can't wave a magic wand and suddenly (or even within decades) have all processes everywhere ignore U+000F in all processing. This thread is pointless and should be terminated.
Re: Unicode "no-op" Character?
Your goal is not achievable. We can't wave a magic wand and suddenly (or even within decades) have all processes everywhere ignore U+000F in all processing. This thread is pointless and should be terminated.

Mark

On Wed, Jul 3, 2019 at 5:48 PM Sławomir Osipiuk via Unicode < unicode@unicode.org> wrote:

> I’m frustrated at how badly you seem to be missing the point. There is
> nothing impossible nor self-contradictory here. There is only the matter
> that Unicode requires all scalar values to be preserved during interchange.
> This is in many ways a good idea, and I don’t expect it to change, but
> something else would be possible if this requirement were explicitly
> dropped for a well-defined small subset of characters (even just one
> character). A modern-day SYN.
>
> Let’s say it’s U+000F. The standard takes my proposal and makes it a
> discardable, null-displayable character. What does this mean?
>
> U+000F may appear in any text. It has no (external) semantic value. But it
> may appear. It may appear a lot.
>
> Display routines (which are already dealing with combining, ligaturing,
> non-/joiners, variations, initial/medial/final forms) understand that
> U+000F is to be processed as a no-op. Do nothing with this. Drop it. Move
> to the next character. Simple.
>
> Security gateways filter it out completely, as a matter of best practice
> and security-in-depth.
>
> A process, let’s call it Process W, adds a bunch of U+000F to a string it
> received, or built, or a user entered via keyboard. Maybe it’s to
> packetize. Maybe to mark every word that is an anagram of the name of a
> famous 19th-century painter, or that represents a pizza topping. Maybe
> something else. This is a versatile character. Process W is done adding
> U+000F to the string. It stores it in a database UTF-8 encoded field.
> Encoding isn’t a problem. The database is happy.
>
> Now Process X runs.
Process X is meant to work with Process W and it’s > well-aware of how U+000F is used. It reads the string from the database. It > sees U+000F and interprets it. It chops the string into packets, or does a > websearch for each famous painter, or it orders pizza. The private meaning > of U+000F is known to both Process X and Process W. There is useful > information encoded in-band, within a limited private context. > > > > But now we have Process Y. Process Y doesn’t care about packets or > painters or pizza. Process Y runs outside of the private context that X and > W had. Process Y translates strings into Morse code for transmission. As > part of that, it replaces common words with abbreviations. Process Y > doesn’t interpret U+000F. Why would it? It has no semantic value to Process > Y. > > > > Process Y reads the string from the database. Internally, it clears all > instances of U+000F from the string. They’re just taking up space. They’re > meaningless to Y. It compiles the Morse code sequence into an audio file. > > > > But now we have Process Z. Process Z wants to take a string and mark every > instance of five contiguous Latin consonants. It scrapes the database > looking for text strings. It finds the string Process W created and marked. > Z has no obligation to W. It’s not part of that private context. Process Z > clears all instances of U+000F it finds, then inserts its own wherever it > finds five-consonant clusters. It stores its results in a UTF-16LE text > file. It’s allowed to do that. > > > > Nothing impossible happened here. Let’s summarize: > > > > Processes W and X established a private meaning for U+000F by agreement > and interacted based on that meaning. > > > > Process Y ignored U+000F completely because it assigned no meaning to it. > > > > Process Z assigned a completely new meaning to U+000F. That’s permitted > because U+000F is special and is guaranteed to have no semantics without > private agreement and doesn’t need to be preserved. 
> There is no need to escape anything. Escaping is used when a character
> must have more than one meaning (i.e. it is overloaded, as when it is both
> text and markup). U+000F only gets one meaning in any context. In a new
> context, the meaning gets overridden, not overloaded. That’s what makes it
> special.
>
> I don’t expect to see any of this in official Unicode. But I take
> exception to the idea that I’m suggesting something impossible.
>
> *From:* Philippe Verdy [mailto:verd...@wanadoo.fr]
> *Sent:* Wednesday, July 03, 2019 04:49
> *To:* Sławomir Osipiuk
> *Cc:* unicode Unicode Discussion
> *Subject:* Re: Unicode "no-op" Character?
>
> Your goal is **impossible** to reach with Unicode. Assume such a character
> is "added" to the UCS; then it can appear in the text. Your goal being that
> it should be "guaranteed" not to be used in any text means that your
> "character" cannot be encoded at all.
RE: Unicode "no-op" Character?
I’m frustrated at how badly you seem to be missing the point. There is nothing impossible nor self-contradictory here. There is only the matter that Unicode requires all scalar values to be preserved during interchange. This is in many ways a good idea, and I don’t expect it to change, but something else would be possible if this requirement were explicitly dropped for a well-defined small subset of characters (even just one character). A modern-day SYN.

Let’s say it’s U+000F. The standard takes my proposal and makes it a discardable, null-displayable character. What does this mean?

U+000F may appear in any text. It has no (external) semantic value. But it may appear. It may appear a lot.

Display routines (which are already dealing with combining, ligaturing, non-/joiners, variations, initial/medial/final forms) understand that U+000F is to be processed as a no-op. Do nothing with this. Drop it. Move to the next character. Simple.

Security gateways filter it out completely, as a matter of best practice and security-in-depth.

A process, let’s call it Process W, adds a bunch of U+000F to a string it received, or built, or a user entered via keyboard. Maybe it’s to packetize. Maybe to mark every word that is an anagram of the name of a famous 19th-century painter, or that represents a pizza topping. Maybe something else. This is a versatile character. Process W is done adding U+000F to the string. It stores it in a database UTF-8 encoded field. Encoding isn’t a problem. The database is happy.

Now Process X runs. Process X is meant to work with Process W and it’s well-aware of how U+000F is used. It reads the string from the database. It sees U+000F and interprets it. It chops the string into packets, or does a websearch for each famous painter, or it orders pizza. The private meaning of U+000F is known to both Process X and Process W. There is useful information encoded in-band, within a limited private context.

But now we have Process Y.
Process Y doesn’t care about packets or painters or pizza. Process Y runs outside of the private context that X and W had. Process Y translates strings into Morse code for transmission. As part of that, it replaces common words with abbreviations. Process Y doesn’t interpret U+000F. Why would it? It has no semantic value to Process Y. Process Y reads the string from the database. Internally, it clears all instances of U+000F from the string. They’re just taking up space. They’re meaningless to Y. It compiles the Morse code sequence into an audio file. But now we have Process Z. Process Z wants to take a string and mark every instance of five contiguous Latin consonants. It scrapes the database looking for text strings. It finds the string Process W created and marked. Z has no obligation to W. It’s not part of that private context. Process Z clears all instances of U+000F it finds, then inserts its own wherever it finds five-consonant clusters. It stores its results in a UTF-16LE text file. It’s allowed to do that. Nothing impossible happened here. Let’s summarize: Processes W and X established a private meaning for U+000F by agreement and interacted based on that meaning. Process Y ignored U+000F completely because it assigned no meaning to it. Process Z assigned a completely new meaning to U+000F. That’s permitted because U+000F is special and is guaranteed to have no semantics without private agreement and doesn’t need to be preserved. There is no need to escape anything. Escaping is used when a character must have more than one meaning (i.e. it is overloaded, as when it is both text and markup). U+000F only gets one meaning in any context. In a new context, the meaning gets overridden, not overloaded. That’s what makes it special. I don’t expect to see any of this in official Unicode. But I take exception to the idea that I’m suggesting something impossible. 
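The "override, not overload" rule that Process Z follows can be sketched as follows (U+000F as the hypothetical discardable character, per this thread; the helper name and the consonant-marking example are mine):

```python
NOOP = "\u000f"  # hypothetical discardable character from this thread

def adopt(text: str, positions: list[int]) -> str:
    """Override, don't overload: discard inherited NOOPs, insert our own."""
    clean = text.replace(NOOP, "")      # the old private meaning is gone
    out = []
    for i, ch in enumerate(clean):
        if i in positions:
            out.append(NOOP)            # our marker, our meaning
        out.append(ch)
    return "".join(out)

# Process Z takes over a string W marked for its own purposes:
print(repr(adopt("ab\u000fcd", [1])))   # 'a\x0fbcd' -- W's marker gone,
                                        # Z's marker at Z's position
```

No escaping is needed because the character never carries two meanings at once: adoption replaces the meaning wholesale.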
From: Philippe Verdy [mailto:verd...@wanadoo.fr]
Sent: Wednesday, July 03, 2019 04:49
To: Sławomir Osipiuk
Cc: unicode Unicode Discussion
Subject: Re: Unicode "no-op" Character?

Your goal is **impossible** to reach with Unicode. Assume such a character is "added" to the UCS; then it can appear in the text. Your goal being that it should be "guaranteed" not to be used in any text means that your "character" cannot be encoded at all.
Aw: Re: Unicode "no-op" Character?
A few suggestions:

There is a reason why the C standard library function fgetc(FILE*) returns an int instead of a char: the constant EOF (end of file) must lie outside the value range of unsigned char.

Some encodings like Base64 or Quoted-Printable use the escape character =, but make sure that you can still encode this escape character in another way.

Another possible encoding would be using a "continue" flag. For example, you could use the least significant bit to signal whether a stream ends or is continued; this allows you to encode 7 bits per byte and is used for arbitrary-length integers or other variable-length structures where terminator characters like 0x00 may be part of the data.

Sent: Wednesday, 03 July 2019 at 10:49
From: "Philippe Verdy via Unicode"
To: "Sławomir Osipiuk"
Cc: "unicode Unicode Discussion"
Subject: Re: Unicode "no-op" Character?

On Wed, Jul 3, 2019 at 06:09, Sławomir Osipiuk <sosip...@gmail.com> wrote:

I don’t think you understood me at all. I can packetize a string with any character that is guaranteed not to appear in the text.

Your goal is **impossible** to reach with Unicode. Assume such a character is "added" to the UCS; then it can appear in the text. Your goal being that it should be "guaranteed" not to be used in any text means that your "character" cannot be encoded at all. Unicode and ISO **require** that any proposed character can be used in text without limitation. Logically it would be rejected because your character would not be usable at all from the start.

So you have no choice: you must use some transport format for your "packeting", just like what is used in MIME for emails, in HTTP(S) for streaming, or in internationalized domain names.

For your escaping mechanism you have a very large choice already of characters considered special only for your chosen transport syntax.

Your goal shows a chicken-and-egg problem.
It is not solvable without creating self-contradictions immediately (and if you attempt to add some restriction to avoid the contradiction, you'll fall on cases where you can no longer transport your message, and your protocol will become unusable).
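The "continue flag" suggestion above can be sketched concretely. This is an illustrative Python sketch (the function names are mine, not from the thread), using the least significant bit of each byte as the continuation flag, exactly as the message describes: 7 data bits per byte, so no terminator byte is ever needed in the data stream.

```python
def encode_varint(value: int) -> bytes:
    """Encode a non-negative integer, 7 data bits per byte.
    The least significant bit of each byte is a continuation flag:
    1 = more bytes follow, 0 = this is the last byte."""
    out = bytearray()
    while True:
        chunk = value & 0x7F
        value >>= 7
        if value:
            out.append((chunk << 1) | 1)  # set the continue flag
        else:
            out.append(chunk << 1)        # last byte: flag clear
            return bytes(out)

def decode_varint(data: bytes) -> tuple[int, int]:
    """Decode one such integer; return (value, bytes consumed)."""
    value = 0
    for i, b in enumerate(data):
        value |= (b >> 1) << (7 * i)
        if not (b & 1):                   # continue flag clear: done
            return value, i + 1
    raise ValueError("truncated varint")
```

Because the length is encoded in-band, a value like 0x00 can freely occur in the payload, which is the property the message is after.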
Re: Unicode "no-op" Character?
Also consider that C0 controls (like STX and ETX) can already be used for packetizing, but then immediately comes the need for escaping (DLE has been used for that goal, placed just before the character to preserve in the stream content, notably before DLE itself, or STX and ETX). There's then no need at all for any new character in Unicode. But if your protocol does not allow any form of escaping, then it is broken, as it cannot transport **all** valid Unicode text.

On Wed, 3 Jul 2019 at 10:49, Philippe Verdy wrote:

> On Wed, 3 Jul 2019 at 06:09, Sławomir Osipiuk wrote:
>
>> I don't think you understood me at all. I can packetize a string with any
>> character that is guaranteed not to appear in the text.
>
> Your goal is **impossible** to reach with Unicode. Assume such a character
> is "added" to the UCS; then it can appear in text. Your goal being that
> it should be "guaranteed" not to be used in any text means that your
> "character" cannot be encoded at all. Unicode and ISO **require** that
> any proposed character can be used in text without limitation. Logically
> it would be rejected because your character would not be usable at all
> from the start.
>
> So you have no choice: you must use some transport format for your
> "packeting", just like what is used in MIME for emails, in HTTP(S) for
> streaming, or in internationalized domain names.
>
> For your escaping mechanism you already have a very large choice of
> characters considered special only for your chosen transport syntax.
>
> Your goal shows a chicken-and-egg problem. It is not solvable without
> creating self-contradictions immediately (and if you attempt to add some
> restriction to avoid the contradiction, you'll fall on cases where you
> can no longer transport your message, and your protocol will become
> unusable).
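The STX/ETX/DLE scheme described above is classic byte stuffing. A minimal sketch (function names are mine; this is the general technique, not code from the thread): any payload byte that collides with a framing byte is preceded by DLE, so **all** byte values remain transportable.

```python
STX, ETX, DLE = 0x02, 0x03, 0x10

def frame(payload: bytes) -> bytes:
    """Wrap payload in STX ... ETX, DLE-stuffing any byte that
    collides with one of the framing characters."""
    out = bytearray([STX])
    for b in payload:
        if b in (STX, ETX, DLE):
            out.append(DLE)        # escape: next byte is literal data
        out.append(b)
    out.append(ETX)
    return bytes(out)

def unframe(stream: bytes) -> bytes:
    """Recover the payload from one DLE-stuffed frame."""
    assert stream[0] == STX and stream[-1] == ETX
    payload = bytearray()
    i = 1
    while i < len(stream) - 1:
        if stream[i] == DLE:
            i += 1                 # skip the escape, take next byte verbatim
        payload.append(stream[i])
        i += 1
    return bytes(payload)
```

This is exactly why no new Unicode character is needed for framing: the escaping lives entirely in the transport syntax.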
Re: Unicode "no-op" Character?
On Wed, 3 Jul 2019 at 06:09, Sławomir Osipiuk wrote:

> I don't think you understood me at all. I can packetize a string with any
> character that is guaranteed not to appear in the text.

Your goal is **impossible** to reach with Unicode. Assume such a character is "added" to the UCS; then it can appear in text. Your goal being that it should be "guaranteed" not to be used in any text means that your "character" cannot be encoded at all. Unicode and ISO **require** that any proposed character can be used in text without limitation. Logically it would be rejected because your character would not be usable at all from the start.

So you have no choice: you must use some transport format for your "packeting", just like what is used in MIME for emails, in HTTP(S) for streaming, or in internationalized domain names.

For your escaping mechanism you already have a very large choice of characters considered special only for your chosen transport syntax.

Your goal shows a chicken-and-egg problem. It is not solvable without creating self-contradictions immediately (and if you attempt to add some restriction to avoid the contradiction, you'll fall on cases where you can no longer transport your message, and your protocol will become unusable).
RE: Unicode "no-op" Character?
I don't think you understood me at all. I can packetize a string with any character that is guaranteed not to appear in the text. Suggestions of TAB or EQUALS don't even meet that simple criterion; they often appear in text. They require some kind of special escaping mechanism.

But assume my string has a chosen character for indicating packets. Before I send it out, I want to show the string to the user. I can't just throw it into a display method; I'd have TABs or EQUALSes or UNKNOWN GLYPHs all over the place, visible to the user. I don't want that. So now I have to make a new copy of the string with my special boundary character removed, then display that copied string. Or I could keep the original string, from before I added the packet boundaries, but that works only if I predict ahead of time that I will need to display it, which in reality I might not. Either way, that means two copies of the string, one of which might be a waste. More code. More processing.

I can do all that. But why? This thread is about a tool for convenience. I don't "need" it, in the sense that a task is insoluble without it. I'm a programmer; I know how to code. I "want" it, because a tool like that would make some tasks much faster and simpler. Your proposed solution doesn't.

From: Philippe Verdy [mailto:verd...@wanadoo.fr] Sent: Saturday, June 29, 2019 15:47 To: Sławomir Osipiuk Cc: Shawn Steele; unicode Unicode Discussion Subject: Re: Unicode "no-op" Character?

If you want to "packetize" arbitrarily long Unicode text, you don't need any new magic character. Just prepend your packet with a base character used as a syntactic delimiter, one that does not combine with what follows under any normalization.
Re: Unicode "no-op" Character?
If you want to "packetize" arbitrarily long Unicode text, you don't need any new magic character. Just prepend your packet with a base character used as a syntactic delimiter, one that does not combine with what follows under any normalization.

There's a fine character for that: the TAB control. Except that during transmission it may turn into a SPACE, which would combine. (The same will happen with "=", which can combine with a combining slash.) But look at the normalization data (and consider that Unicode guarantees that no new combining pair starting with the same base character will be added): there are a LOT of suitable base characters in Unicode which you can use as a syntactic delimiter. Some examples (in the ASCII subset) include the hyphen-minus, the apostrophe-quote, and the double quotation mark.

So it's easy to split an arbitrarily long text at any character position, even in the middle of a cluster or combining sequence. It does not matter that this character may create a "cluster" with the following character; your "packetized" stream is not readable text anyway, but only a transport syntax (just like Quoted-Printable or Base64). You can also freely choose the base character at the end of each packet (newlines are not safe, as lines may be merged, but as in Base64, "=" is fine to terminate each packet, as are two ASCII quotation marks, and in fact all punctuation and symbols from ASCII; you can even use the ASCII letters and digits).

If your packets have variable lengths, you may need escaping, or you may prepend the length (in characters or in combining sequences) of each packet before the expected terminator. All this is used in MIME for attachments in emails, with the two common transport syntaxes: Quoted-Printable, which uses escaping, and Base64, which does not require any length but requires a distinctive terminator (not used to encode the data part of the "packet") for variable-length "packets". Le dim.
23 juin 2019 à 02:35, Sławomir Osipiuk via Unicode < unicode@unicode.org> a écrit : > I assure you, it wasn’t very interesting. :-) Headache-y, more like. The > diacritic thing was completely inapplicable anyway, as all our text was > plain English. I really don’t want to get into what the thing was, because > it sounds stupider the more I try to explain it. But it got the wheels > spinning in my head, and now that I’ve been reading up a lot about Unicode > and older standards like 2022/6429, it got me thinking whether there might > already be an elegant solution. > > > > But, as an example I’m making up right now, imagine you want to packetize > a large string. The packets are not all equal sized, the sizes are > determined by some algorithm. And the packet boundary may occur between a > base char and a diacritic. You insert markers into the string at the packet > boundaries. You can then store the string, copy it, display it, or pass it > to the sending function which will scan the string and know to send the > next packet when it reaches the marker. And you can now do all that without > the need to pass around extra metadata (like a list of ints of where the > packet boundaries are supposed to be) or to re-calculate the boundaries; > it’s still just a big string. If a different application sees the string, > it will know to completely ignore the packet markers; it can even strip > them out if it wants to (the canonical equivalent of the noop character is > the absence of a character). > > > > As should be obvious, I’m not recommending this as good practice. > > > > > > *From:* Shawn Steele [mailto:shawn.ste...@microsoft.com] > *Sent:* Saturday, June 22, 2019 19:57 > *To:* Sławomir Osipiuk; unicode@unicode.org > *Subject:* RE: Unicode "no-op" Character? > > > > + the list. For some reason the list’s reply header is confusing. > > > > *From:* Shawn Steele > *Sent:* Saturday, June 22, 2019 4:55 PM > *To:* Sławomir Osipiuk > *Subject:* RE: Unicode "no-op" Character? 
> > > > The original comment about putting it between the base character and the > combining diacritic seems peculiar. I’m having a hard time visualizing how > that kind of markup could be interesting? > > > > *From:* Unicode *On Behalf Of *Slawomir > Osipiuk via Unicode > *Sent:* Saturday, June 22, 2019 2:02 PM > *To:* unicode@unicode.org > *Subject:* RE: Unicode "no-op" Character? > > > > I see there is no such character, which I pretty much expected after > Google didn’t help. > > > > The original problem I had was solved long ago but the recent article > about watermarking reminded me of it, and my question was mostly out of > curiosity. The task wasn’t, strictly speaking, about “padding”, but about > marking – injecting “flag” characters at arbitrary points in a string > without affecting the resulting visible text. I think we ended up using > ESC, which is a dumb choice in retrospect, though the whole approach was a > bit of a hack anyway and the process it was for isn’t being used anymore. >
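Verdy's caveat above about "=" fusing with a combining slash can be checked mechanically: U+003D followed by U+0338 composes to U+2260 (NOT EQUAL TO) under NFC. A small Python sketch (the helper is my own, not from the thread) that tests whether a candidate delimiter survives normalization when prepended to arbitrary text:

```python
import unicodedata

def safe_delimiter(delim: str, samples: list[str]) -> bool:
    """Check that prepending `delim` to each sample survives NFC
    normalization unchanged, i.e. the delimiter does not fuse with
    what follows into a precomposed character."""
    for s in samples:
        if unicodedata.normalize("NFC", delim + s) != delim + unicodedata.normalize("NFC", s):
            return False
    return True
```

A real test would sweep all combining marks rather than a few samples, but the principle is the one stated in the message: pick a base character for which no canonical composition exists.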
New control characters! (was: Re: Unicode "no-op" Character?)
All right. Thanks to everyone who offered suggestions. I think the final choice will depend on the specific application, if I ever face this puzzle again. If nothing else, this discussion has helped me formulate what exactly it is I'm imagining, which is actually a bit different than what I started with. So, just to put it out there for the internet to archive (alongside the various proposed "unofficial" UTFs I've been reading about), here are my two proposed control characters (why just one when you can have two at twice the price?). Implementors, feel free to jump right on this. :-)

I chose to assign them to 0x0E and 0x0F because the use of ISO 2022-style stateful shifts is expressly not permitted by ISO 10646, so by my reading the existence of those code points inside a UCS stream is a roundabout error. Therefore I'm reclaiming them for something useful.

EP1 - EPHEMERAL PRIVATE SENTINEL 1 (0x0E)

EP1 is executed as a null operation at the presentation layer. The formation of ligatures, the behavior of combining characters, and similar presentation mechanisms must proceed as if EP1 were not present, even when it occurs within sequences that affect such mechanisms. EP1 is intended to be used as a private, process-internal sentinel or flag character. EP1 may be added at any position in the character stream. EP1 may be removed from the stream by any receiving process that has not established an agreement for special handling of EP1. EP1 should be removed from the stream prior to any security validation. It must not interfere with the recognition of security-sensitive keywords, sequences, or credentials.

EP2 - EPHEMERAL PRIVATE SENTINEL 2 (0x0F)

EP2 is executed as a null operation at the presentation layer. The formation of ligatures, the behavior of combining characters, and similar presentation mechanisms must proceed as if EP2 were not present, even when it occurs within sequences that affect such mechanisms. EP2 is intended to be used as a private, process-internal sentinel or flag character. EP2 may be added at any position in the character stream. EP2 may be removed from the stream by any receiving process that has not established an agreement for special handling of EP2. EP2 should be removed from the stream prior to any security validation. It must not interfere with the recognition of security-sensitive keywords, sequences, or credentials.
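Under the proposed semantics, a receiving process with no special agreement would simply strip EP1/EP2 before security validation or onward interchange. A minimal sketch of that rule (assuming the 0x0E/0x0F assignments proposed above, which are of course hypothetical):

```python
def strip_sentinels(text: str) -> str:
    """Remove the hypothetical EP1 (U+000E) and EP2 (U+000F)
    sentinels before any security check or onward interchange,
    per the proposed 'discardable' semantics."""
    return text.translate({0x0E: None, 0x0F: None})
```

The point of the proposal is that this one-liner is always a conformant thing to do: the canonical equivalent of the sentinel is the absence of a character.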
Re: Unicode "no-op" Character?
On Mon, Jun 24, 2019 at 5:35 PM David Starner via Unicode <unicode@unicode.org> wrote:

> On Sun, Jun 23, 2019 at 10:41 PM Shawn Steele via Unicode wrote:
> > IMO, since it's unlikely that anyone expects that they can transmit a
> > NUL through an arbitrary channel, unlike a random private use character.

You would be wrong. NUL is a valid codepoint like any other, except in the C standard library and its descendants, and I expect that to be maintained. And, for the most part, it is (except for emscripten).

> --
> Kie ekzistas vivo, ekzistas espero.
Re: Unicode "no-op" Character?
On Sun, Jun 23, 2019 at 10:41 PM Shawn Steele via Unicode wrote:

> Which leads us to the key. The desire is for a character that has no public
> meaning, but has some sort of private meaning. In other words it has a
> private use. Oddly enough, there is a group of characters intended for
> private use, in the PUA ;-)

Whose private use? If you have a stream of data that is being packeted for transmission, using a Private Use character is likely to mangle data that is being transmitted at some point. A NUL is likely to be the best option, IMO, since it's unlikely that anyone expects that they can transmit a NUL through an arbitrary channel, unlike a random private use character.

--
Kie ekzistas vivo, ekzistas espero.
RE: Unicode "no-op" Character?
It's discardable outside of the context/process that created it. For a receiving process there is a difference between "this character has a meaning you don't understand" and "this character had a transitory meaning that has been exhausted". The first implies that it needs to be preserved and survive round-trip transmission (in fact the Unicode standard requires that); the second implies that it can be discarded. The first implies that it should be displayed to the user, even if only as an "unknown something here"; the second implies it should be ignored completely in display.

Noncharacters have a use as internal-only sentinels, but they are difficult for an intermediate process to use if the text it receives already contains them (http://www.unicode.org/faq/private_use.html#nonchar10), and they break up combinations (they have a display effect, even if it's a subtle one). Private Use characters are nice, but they are still "part of" the text; if they are removed, the text is semantically changed. And they too display as something.

I have to go back to how the SYN control character is defined. ECMA-16/ISO 1745 says "SYN is generally removed at the receiving Terminal Installation." It has a transitory purpose that is exhausted as soon as it is received. I wish Unicode hadn't shied away from either formalizing SYN or providing some kind of equivalent. I know it wasn't part of the scope Unicode set for itself, but I can still dream.

-Original Message- From: Shawn Steele [mailto:shawn.ste...@microsoft.com] Sent: Monday, June 24, 2019 01:39 To: Sławomir Osipiuk; unicode@unicode.org Cc: 'Richard Wordingham' Subject: RE: Unicode "no-op" Character?

But... it's not actually discardable. The hypothetical "packet" architecture (using the term architecture somewhat loosely) needed the information being tunneled in by this character. If it was actually discardable, then the "noop" character wouldn't be required, as it would be discarded.
Since the character conveys meaning to some parts of the system, then it's not actually a "noop" and it's not actually "discardable". What is actually being requested isn't a character that nobody has meaning for, but rather a character that has no PUBLIC meaning. Which leads us to the key. The desire is for a character that has no public meaning, but has some sort of private meaning. In other words it has a private use. Oddly enough, there is a group of characters intended for private use, in the PUA ;-) Of course if the PUA characters interfered with the processing of the string, they'd need to be stripped, but you're sort of already in that position by having a private flag in the middle of a string. -Shawn
RE: Unicode "no-op" Character?
But... it's not actually discardable. The hypothetical "packet" architecture (using the term architecture somewhat loosely) needed the information being tunneled in by this character. If it was actually discardable, then the "noop" character wouldn't be required as it would be discarded. Since the character conveys meaning to some parts of the system, then it's not actually a "noop" and it's not actually "discardable". What is actually being requested isn't a character that nobody has meaning for, but rather a character that has no PUBLIC meaning. Which leads us to the key. The desire is for a character that has no public meaning, but has some sort of private meaning. In other words it has a private use. Oddly enough, there is a group of characters intended for private use, in the PUA ;-) Of course if the PUA characters interfered with the processing of the string, they'd need to be stripped, but you're sort of already in that position by having a private flag in the middle of a string. -Shawn -Original Message- From: Unicode On Behalf Of Slawomir Osipiuk via Unicode Sent: Saturday, June 22, 2019 6:10 PM To: unicode@unicode.org Cc: 'Richard Wordingham' Subject: RE: Unicode "no-op" Character? That's the key to the no-op idea. The no-op character could not ever be assumed to survive interchange with another process. It'd be canonically equivalent to the absence of character. It could be added or removed at any position by a Unicode-conformant process. A program could wipe all the no-ops from a string it has received, and insert its own for its own purposes. (In fact, it should wipe the old ones so as not to confuse itself.) It's "another process's discardable junk" unless known, internally-only, to be meaningful at a particular stage. While all the various (non)joiners/ignorables are interesting, none of them have this property. In fact, that might be the best description: It's not just an "ignorable", it's a "discardable". Unicode doesn't have that, does it? 
-Original Message- From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard Wordingham via Unicode Sent: Saturday, June 22, 2019 20:59 To: unicode@unicode.org Cc: Shawn Steele Subject: Re: Unicode "no-op" Character? If they're conveying an invisible message, one would have to strip out original ZWNBSP/WJ/ZWSP that didn't affect line-breaking. The weak point is that that assumes that line-break opportunities are well-defined. For example, they aren't for SE Asian text. Richard.
RE: Unicode "no-op" Character?
Ah, sorry. I meant to say that the string should always be normalized (not "sanitized") before being checked for exploits (i.e. sanitized). -Original Message- From: Sławomir Osipiuk [mailto:sosip...@gmail.com] Sent: Sunday, June 23, 2019 20:28 To: unicode@unicode.org Cc: 'Richard Wordingham' Subject: RE: Unicode "no-op" Character? The string should always be sanitized before being checked for exploits
RE: Unicode "no-op" Character?
On the subject of security, I've read through https://www.unicode.org/reports/tr36/#Deletion_of_Noncharacters which says:

"The issue is the following: A gateway might be checking for a sensitive sequence of characters, say "delete". If what is passed in is "deXlete", where X is a noncharacter, the gateway lets it through: the sequence "deXlete" may be in and of itself harmless. However, suppose that later on, past the gateway, an internal process invisibly deletes the X. In that case, the sensitive sequence of characters is formed, and can lead to a security breach."

Checking a string for a sequence of characters, then passing the string to a different function which (potentially) modifies it, then using it in a context where the security checks matter, just screams bad practice to me. There should be no modification permitted between a security check and security-sensitive use. The string should always be sanitized before being checked for exploits. Any function which modifies the characters in any way (and is not itself security-aware) should implicitly mark the string as unsafe again. Or am I off base? Security is not really my specialty, but the approach described in the TR stinks horribly to me. And in my idea, noops would be stripped as part of string sanitization.

But the more I consider it, the more I understand such a thing would have had to be built into Unicode at the earliest stages. Basically, it's too late now.

-Original Message- From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard Wordingham via Unicode Sent: Sunday, June 23, 2019 04:37 To: unicode@unicode.org Subject: Re: Unicode "no-op" Character?

Discardables are a security risk, as security filters may find it hard to take them into account. Richard.
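The ordering argued for above ("sanitize, then check") can be shown with TR36's own "deXlete" example. A hedged sketch (the keyword, helper name, and the choice of NFKC plus noncharacter removal as the "sanitize" step are my illustration, not a prescribed algorithm):

```python
import unicodedata

SENSITIVE = "delete"

def is_noncharacter(ch: str) -> bool:
    """True for U+FDD0..U+FDEF and the ..FFFE/..FFFF code point in each plane."""
    cp = ord(ch)
    return 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE

def passes_gateway(s: str) -> bool:
    """Sanitize first (normalize, drop noncharacters), *then* scan
    for the sensitive keyword -- the ordering argued for above."""
    cleaned = "".join(ch for ch in unicodedata.normalize("NFKC", s)
                      if not is_noncharacter(ch))
    return SENSITIVE not in cleaned
```

With this ordering, "de\uFDD0lete" is rejected at the gateway instead of slipping through and being reassembled into "delete" downstream.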
Re: Unicode "no-op" Character?
On Sat, 22 Jun 2019 21:10:08 -0400 Sławomir Osipiuk via Unicode wrote: > In fact, that might be the best description: It's not just an > "ignorable", it's a "discardable". Unicode doesn't have that, does it? No, though the byte order mark at the start of a file comes close. Discardables are a security risk, as security filters may find it hard to take them into account. Richard.
Re: Unicode "no-op" Character?
On Sat, 22 Jun 2019 23:56:50 + Shawn Steele via Unicode wrote:

> + the list. For some reason the list's reply header is confusing.
>
> From: Shawn Steele
> Sent: Saturday, June 22, 2019 4:55 PM
> To: Sławomir Osipiuk
> Subject: RE: Unicode "no-op" Character?
>
> The original comment about putting it between the base character and
> the combining diacritic seems peculiar. I'm having a hard time
> visualizing how that kind of markup could be interesting?

There are a number of possible interesting scenarios:

1) Chopping the string into user-perceived characters. For example, the Khmer sequences of COENG plus letter are named sequences. Akin to this is identifying resting places for a simple cursor, e.g. allowing it to be positioned between a base character and a spacing, unreordered subscript. (This last possibility overlaps with rendering.)

2) Chopping the string into collating elements. (This can require renormalisation, and may raise a rendering issue with HarfBuzz, where renormalisation is required to get marks into a suitable order for shaping. I suspect no-op characters would disrupt this renormalisation; CGJ may legitimately be used to affect rendering this way, even though it is supposed to have no other effect* on rendering.)

3) Chopping the string into default grapheme clusters. That separates a coeng from the following character with which it interacts.

*Is a Unicode-compliant *renderer* allowed to distinguish diaeresis from the umlaut mark?

Richard.
RE: Unicode "no-op" Character?
That's the key to the no-op idea. The no-op character could not ever be assumed to survive interchange with another process. It'd be canonically equivalent to the absence of character. It could be added or removed at any position by a Unicode-conformant process. A program could wipe all the no-ops from a string it has received, and insert its own for its own purposes. (In fact, it should wipe the old ones so as not to confuse itself.) It's "another process's discardable junk" unless known, internally-only, to be meaningful at a particular stage. While all the various (non)joiners/ignorables are interesting, none of them have this property. In fact, that might be the best description: It's not just an "ignorable", it's a "discardable". Unicode doesn't have that, does it? -Original Message- From: Unicode [mailto:unicode-boun...@unicode.org] On Behalf Of Richard Wordingham via Unicode Sent: Saturday, June 22, 2019 20:59 To: unicode@unicode.org Cc: Shawn Steele Subject: Re: Unicode "no-op" Character? If they're conveying an invisible message, one would have to strip out original ZWNBSP/WJ/ZWSP that didn't affect line-breaking. The weak point is that that assumes that line-break opportunities are well-defined. For example, they aren't for SE Asian text. Richard.
Re: Unicode "no-op" Character?
On Sat, 22 Jun 2019 23:56:11 + Shawn Steele via Unicode wrote: > Assuming you were using any of those characters as "markup", how > would you know when they were intentionally in the string and not > part of your marking system? If they're conveying an invisible message, one would have to strip out original ZWNBSP/WJ/ZWSP that didn't affect line-breaking. The weak point is that that assumes that line-break opportunities are well-defined. For example, they aren't for SE Asian text. Richard.
RE: Unicode "no-op" Character?
I assure you, it wasn't very interesting. :-) Headache-y, more like. The diacritic thing was completely inapplicable anyway, as all our text was plain English. I really don't want to get into what the thing was, because it sounds stupider the more I try to explain it. But it got the wheels spinning in my head, and now that I've been reading up a lot about Unicode and older standards like 2022/6429, it got me thinking whether there might already be an elegant solution. But, as an example I'm making up right now, imagine you want to packetize a large string. The packets are not all equal sized, the sizes are determined by some algorithm. And the packet boundary may occur between a base char and a diacritic. You insert markers into the string at the packet boundaries. You can then store the string, copy it, display it, or pass it to the sending function which will scan the string and know to send the next packet when it reaches the marker. And you can now do all that without the need to pass around extra metadata (like a list of ints of where the packet boundaries are supposed to be) or to re-calculate the boundaries; it's still just a big string. If a different application sees the string, it will know to completely ignore the packet markers; it can even strip them out if it wants to (the canonical equivalent of the noop character is the absence of a character). As should be obvious, I'm not recommending this as good practice. From: Shawn Steele [mailto:shawn.ste...@microsoft.com] Sent: Saturday, June 22, 2019 19:57 To: Sławomir Osipiuk; unicode@unicode.org Subject: RE: Unicode "no-op" Character? + the list. For some reason the list's reply header is confusing. From: Shawn Steele Sent: Saturday, June 22, 2019 4:55 PM To: Sławomir Osipiuk Subject: RE: Unicode "no-op" Character? The original comment about putting it between the base character and the combining diacritic seems peculiar. I'm having a hard time visualizing how that kind of markup could be interesting? 
From: Unicode On Behalf Of Slawomir Osipiuk via Unicode Sent: Saturday, June 22, 2019 2:02 PM To: unicode@unicode.org Subject: RE: Unicode "no-op" Character? I see there is no such character, which I pretty much expected after Google didn't help. The original problem I had was solved long ago but the recent article about watermarking reminded me of it, and my question was mostly out of curiosity. The task wasn't, strictly speaking, about "padding", but about marking - injecting "flag" characters at arbitrary points in a string without affecting the resulting visible text. I think we ended up using ESC, which is a dumb choice in retrospect, though the whole approach was a bit of a hack anyway and the process it was for isn't being used anymore.
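The made-up packetizing example above could look like this in Python. This is a hedged sketch of the idea, not anyone's actual code: the marker choice and helper names are mine, with U+000F standing in for the hypothetical discardable no-op.

```python
MARK = "\x0f"  # stand-in for the hypothetical discardable no-op

def mark_packets(s: str, sizes: list[int]) -> str:
    """Insert a marker after each packet of the given sizes; any
    remainder becomes the final packet. The result is still 'just
    a big string' -- no side list of boundary offsets needed."""
    out, i = [], 0
    for n in sizes:
        out.append(s[i:i + n])
        out.append(MARK)
        i += n
    out.append(s[i:])
    return "".join(out)

def send(marked: str, emit) -> None:
    """Scan the marked string, handing one packet at a time to `emit`
    whenever a marker is reached."""
    for packet in marked.split(MARK):
        if packet:
            emit(packet)
```

An application that doesn't care about the markers would, under the proposed semantics, be free to strip them, since the no-op is canonically equivalent to the absence of a character.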
RE: Unicode "no-op" Character?
Assuming you were using any of those characters as "markup", how would you know when they were intentionally in the string and not part of your marking system? -Original Message- From: Unicode On Behalf Of Richard Wordingham via Unicode Sent: Saturday, June 22, 2019 4:17 PM To: unicode@unicode.org Subject: Re: Unicode "no-op" Character? On Sat, 22 Jun 2019 17:50:49 -0400 Sławomir Osipiuk via Unicode wrote: > If faced with the same problem today, I’d probably just go with U+FEFF > (really only need a single char, not a whole delimited substring) or a > different C0 control (maybe SI/LS0) and clean up the string if it > needs to be presented to the user. You'd really want an intelligent choice between U+FEFF (ZWNBSP) (better U+2060 WJ) and U+200B (ZWSP). > I still think an “idle”/“null tag”/“noop” character would be a neat > addition to Unicode, but I doubt I can make a convincing enough case > for it. You'd still only be able to insert it between characters, not between code units, unless you were using UTF-32. Richard.
RE: Unicode "no-op" Character?
+ the list. For some reason the list's reply header is confusing.

From: Shawn Steele Sent: Saturday, June 22, 2019 4:55 PM To: Sławomir Osipiuk Subject: RE: Unicode "no-op" Character?

The original comment about putting it between the base character and the combining diacritic seems peculiar. I'm having a hard time visualizing how that kind of markup could be interesting?

From: Unicode <unicode-boun...@unicode.org> On Behalf Of Slawomir Osipiuk via Unicode Sent: Saturday, June 22, 2019 2:02 PM To: unicode@unicode.org Subject: RE: Unicode "no-op" Character?

I see there is no such character, which I pretty much expected after Google didn't help. The original problem I had was solved long ago, but the recent article about watermarking reminded me of it, and my question was mostly out of curiosity. The task wasn't, strictly speaking, about "padding", but about marking - injecting "flag" characters at arbitrary points in a string without affecting the resulting visible text. I think we ended up using ESC, which is a dumb choice in retrospect, though the whole approach was a bit of a hack anyway and the process it was for isn't being used anymore.
Re: Unicode "no-op" Character?
On Sat, 22 Jun 2019 17:50:49 -0400 Sławomir Osipiuk via Unicode wrote: > If faced with the same problem today, I’d > probably just go with U+FEFF (really only need a single char, not a > whole delimited substring) or a different C0 control (maybe SI/LS0) > and clean up the string if it needs to be presented to the user. You'd really want an intelligent choice between U+FEFF (ZWNBSP) (better U+2060 WJ) and U+200B (ZWSP). > I still think an “idle”/“null tag”/“noop” character would be a neat > addition to Unicode, but I doubt I can make a convincing enough case > for it. You'd still only be able to insert it between characters, not between code units, unless you were using UTF-32. Richard.
RE: Unicode "no-op" Character?
Indeed. There are plenty of control characters that seem useful, but they really aren’t, due to lack of support from common software. Unicode is deliberately silent about most of them, which is fair, but not always convenient. If faced with the same problem today, I’d probably just go with U+FEFF (really only need a single char, not a whole delimited substring) or a different C0 control (maybe SI/LS0) and clean up the string if it needs to be presented to the user. I still think an “idle”/“null tag”/“noop” character would be a neat addition to Unicode, but I doubt I can make a convincing enough case for it. From: J Decker [mailto:d3c...@gmail.com] Sent: Saturday, June 22, 2019 17:19 To: Sławomir Osipiuk Cc: Unicode Discussion Subject: Re: Unicode "no-op" Character? But it doesn't appear anything actually 'supports' that.
Re: Unicode "no-op" Character?
On Sat, Jun 22, 2019 at 2:04 PM Sławomir Osipiuk via Unicode <unicode@unicode.org> wrote:

> I see there is no such character, which I pretty much expected after
> Google didn't help.
>
> The original problem I had was solved long ago but the recent article
> about watermarking reminded me of it, and my question was mostly out of
> curiosity. The task wasn't, strictly speaking, about "padding", but about
> marking - injecting "flag" characters at arbitrary points in a string
> without affecting the resulting visible text. I think we ended up using
> ESC, which is a dumb choice in retrospect, though the whole approach was a
> bit of a hack anyway and the process it was for isn't being used anymore.

The spec would suggest that there are escape codes like that which can be used: APC (U+009F) and ST (String Terminator, U+009C), which bracket a sequence of characters that should not be displayed but may be used to control the application displaying them (assuming it understands them). From https://www.aivosto.com/articles/control-characters.html:

156  $9C  ST   String Terminator            ESC \   Closes a string opened by APC, DCS, OSC, PM or SOS.
159  $9F  APC  Application Program Command  ESC _   Starts an application program command string. ST will end the command. The interpretation of the command is subject to the program in question.

But it doesn't appear anything actually 'supports' that.
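For what it's worth, embedding an APC string using the 7-bit escape forms mentioned above (ESC _ ... ESC \) would look like this. A sketch only: whether any given terminal or renderer actually honors APC is, as the message notes, a separate question, and the helper name is mine.

```python
ESC = "\x1b"
APC = ESC + "_"    # 7-bit form of U+009F APPLICATION PROGRAM COMMAND
ST = ESC + "\\"    # 7-bit form of U+009C STRING TERMINATOR

def embed_apc(text: str, pos: int, command: str) -> str:
    """Insert an APC ... ST sequence at `pos`. A consumer that
    implements APC should not display the command string; one that
    does not will show garbage -- which is the practical problem."""
    return text[:pos] + APC + command + ST + text[pos:]
```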
RE: Unicode "no-op" Character?
I see there is no such character, which I pretty much expected after Google didn't help.

The original problem I had was solved long ago but the recent article about watermarking reminded me of it, and my question was mostly out of curiosity. The task wasn't, strictly speaking, about "padding", but about marking: injecting "flag" characters at arbitrary points in a string without affecting the resulting visible text. I think we ended up using ESC, which is a dumb choice in retrospect, though the whole approach was a bit of a hack anyway and the process it was for isn't being used anymore.
RE: Unicode "no-op" Character?
Sławomir Osipiuk wrote:
> Does Unicode include a character that does nothing at all? I'm talking about something that can be used for padding data without affecting interpretation of other characters, including combining chars and ligatures.

I join Shawn Steele in wondering what your "data padding" use case is for such a character. Most modern protocols don't require string fields to be exactly N characters long, or have their own mechanism for storing the real string length and ignoring any padding characters.

If you just need to fill up space at the end of a line, and need a character that has as little disruptive meaning as possible, I agree that U+FEFF is probably the closest you'll get. NULL, of course, was intended to serve exactly this purpose, but everyone has decided for themselves what the C0 code points should be used for, and "display a .notdef glyph" is one of the popular choices.

--
Doug Ewell | Thornton, CO, US | ewellic.org
Re: Unicode "no-op" Character?
Perhaps a codepoint from a private use area and another processing step to add/remove them would work for you?

On Sat, Jun 22, 2019, 1:39 AM Mark Davis ☕️ via Unicode wrote:
> There is nothing like what you are describing. Examples:
>
> 1. Display — There are a few of the Default Ignorables that are always treated as invisible, and have little effect on other characters. However, even those will generally interfere with the display of sequences (be between 'q' and U+0308 (q̈); within emoji sequences; within ligatures; etc.), line break, etc.
> 2. Interpretation — There is no character that would always be ignored by all processes. Some processes may ignore some characters (e.g. a search indexer may ignore most default ignorables), but there is nothing that all processes will ignore.
>
> The only exception would be some cooperating processes that had agreed beforehand to strip some particular character.
>
> Mark
>
> On Sat, Jun 22, 2019 at 6:49 AM Sławomir Osipiuk via Unicode <unicode@unicode.org> wrote:
>> Does Unicode include a character that does nothing at all? I’m talking about something that can be used for padding data without affecting interpretation of other characters, including combining chars and ligatures. I.e. a character that could hypothetically be inserted between a latin E and a combining acute and still produce É. The historical description of U+0016 SYNCHRONOUS IDLE seems like pretty much exactly what I want. It only has one slight disadvantage: it doesn’t work. All software I’ve tried displays it as an unknown character and it definitely breaks up combinations. And U+0000 NULL seems even worse.
>>
>> I can imagine the answer is that this thing I’m looking for isn’t a character at all and so should be the business of “a higher-level protocol” and not what Unicode was made for… but Unicode does include some odd things so I wonder if there is something like that regardless. Can anyone offer any suggestions?
>>
>> Sławomir Osipiuk
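The private-use-area suggestion above amounts to a two-process agreement: one side inserts an agreed PUA codepoint, any byte-level channel in between preserves it, and the other side strips it before display. A minimal sketch, assuming the arbitrarily chosen U+E000:

```python
# Sketch: an arbitrary Private Use Area codepoint (U+E000 here) as a
# private marker, preserved by any byte-level channel and stripped by
# the cooperating receiver before display.

PUA_MARK = "\ue000"

tagged = "he" + PUA_MARK + "llo"

# A byte-level channel (file, pipe, socket) preserves the marker as-is:
data = tagged.encode("utf-8")
assert data.decode("utf-8") == tagged

# The receiving process strips it before presenting the text:
print(data.decode("utf-8").replace(PUA_MARK, ""))  # hello
```

The trade-off versus a hypothetical "noop" character is that a PUA codepoint is *not* ignorable by default: any uncooperative renderer in between will typically show a .notdef glyph.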
Re: Unicode "no-op" Character?
There is nothing like what you are describing. Examples:

1. Display — There are a few of the Default Ignorables that are always treated as invisible, and have little effect on other characters. However, even those will generally interfere with the display of sequences (be between 'q' and U+0308 (q̈); within emoji sequences; within ligatures; etc.), line break, etc.

2. Interpretation — There is no character that would always be ignored by all processes. Some processes may ignore some characters (e.g. a search indexer may ignore most default ignorables), but there is nothing that all processes will ignore.

The only exception would be some cooperating processes that had agreed beforehand to strip some particular character.

Mark

On Sat, Jun 22, 2019 at 6:49 AM Sławomir Osipiuk via Unicode <unicode@unicode.org> wrote:
> Does Unicode include a character that does nothing at all? […]
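Mark's first point — that even default-ignorable characters interfere with combining sequences — is easy to demonstrate with the standard library's unicodedata module: a U+FEFF between a base letter and a combining mark blocks canonical composition under NFC.

```python
import unicodedata

plain = "e\u0301"              # 'e' + COMBINING ACUTE ACCENT
interrupted = "e\ufeff\u0301"  # same sequence with U+FEFF in between

# NFC composes the plain sequence into the single precomposed é ...
assert unicodedata.normalize("NFC", plain) == "\u00e9"

# ... but the intervening U+FEFF (a starter, ccc=0) blocks composition,
# so the sequence survives normalization un-composed.
assert unicodedata.normalize("NFC", interrupted) == interrupted
```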
Re: Unicode "no-op" Character?
On Saturday, 22 June 2019 at 02:14, Sławomir Osipiuk via Unicode wrote:
> Does Unicode include a character that does nothing at all? […] The historical description of U+0016 SYNCHRONOUS IDLE seems like pretty much exactly what I want.

DEL was used as such on paper tape to replace errors.

Alex.
Re: Unicode "no-op" Character?
Sounds like a great use for ZWNBSP (ZERO WIDTH NO-BREAK SPACE, U+FEFF, also used as the BOM); or, if its non-breaking behavior is unwanted, maybe ZERO WIDTH SPACE (U+200B).

On Fri, Jun 21, 2019 at 9:48 PM Sławomir Osipiuk via Unicode <unicode@unicode.org> wrote:
> Does Unicode include a character that does nothing at all? […]
RE: Unicode "no-op" Character?
I'm curious what you'd use it for?

On Friday, June 21, 2019 at 5:14 PM, Slawomir Osipiuk via Unicode wrote:
> Does Unicode include a character that does nothing at all? I'm talking about something that can be used for padding data without affecting interpretation of other characters, including combining chars and ligatures. […]
>
> Sławomir Osipiuk