Re: Another take on the English apostrophe in Unicode
On Thu, Jun 4, 2015 at 9:25 PM, David Starner prosfil...@gmail.com wrote: Hyphens generally make multiple words into one anyway. There's not really multiple hyphens the way there's separate quotes and apostrophes. Generally, but not always, just as apostrophes aren't always at a contracted word boundary. There is only one hyphen because no language (AFAIK) claims it as part of its alphabet. Leo On 7:01pm, Thu, Jun 4, 2015 Leo Broukhis l...@mailcom.com wrote: Along the same lines, we might need a MODIFIER LETTER HYPHEN, because, for example, the word ack-ack isn't decomposable into words, or even morphemes, ack and ack. Leo On Thu, Jun 4, 2015 at 6:31 PM, David Starner prosfil...@gmail.com wrote: On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer markus@gmail.com wrote: don’t is a contraction of two words, it is not one word. But as he points out, it's not a contraction of don and t; it is, at best, a contraction of do and n't. It's eliding, not punctuating. In the comments, he also brings up the examples of Don’t you mind? being okay but not *Do not you mind?, and fo’c’sle. You can't use simple regular expressions to find word boundaries. Who uses _simple_ regular expressions? You can't use any code to reliably find word boundaries in English, and that's a problem.
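As a rough illustration of the word-boundary problem discussed in this message, the following Python sketch runs a deliberately simple `\w+` regular expression over the same contraction spelled with three different apostrophe code points. The helper name `simple_words` and the sample strings are illustrative assumptions, not anything proposed in the thread, and the sketch only shows how Python's built-in `re` module behaves; it says nothing about UAX #29 or other segmenters.

```python
import re

def simple_words(text):
    """Split text into 'words' with a deliberately simple regex."""
    return re.findall(r"\w+", text)

# The same contraction written with three different apostrophe characters.
samples = {
    "U+0027 APOSTROPHE": "don't",
    "U+2019 RIGHT SINGLE QUOTATION MARK": "don\u2019t",
    "U+02BC MODIFIER LETTER APOSTROPHE": "don\u02BCt",
}

for label, text in samples.items():
    # Print the pieces the simple regex finds for each spelling.
    print(label, simple_words(text))

# A hyphenated word is split by the same regex.
print(simple_words("ack-ack"))
```

With Python's `re`, the U+0027 and U+2019 spellings come out as two pieces, while the U+02BC spelling stays in one piece because that character is a letter. The hyphenated ack-ack is split as well.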
Re: Another take on the English apostrophe in Unicode
But the point was that treating hyphens as parts of words is not generally a wrong thing. That brings us back to my original question: where's MODIFIER LETTER HYPHEN, then? A word is a sequence of letters, isn't it? :) I agree that conflating apostrophes and quotes is a source of problems, however, existence of the MODIFIER LETTER [same glyph as used for English contractions] in Unicode is a coincidence which should not have an effect on usage of apostrophes in English. Leo On Thu, Jun 4, 2015 at 11:58 PM, David Starner prosfil...@gmail.com wrote: On June 4, 2015, at 11:01 PM, Leo Broukhis l...@mailcom.com wrote: On Thu, Jun 4, 2015 at 9:25 PM, David Starner prosfil...@gmail.com wrote: Hyphens generally make multiple words into one anyway. There's not really multiple hyphens the way there's separate quotes and apostrophes. Generally, but not always, just as apostrophes aren't always at a contracted word boundary. There is only one hyphen because no language (AFAIK) claims it as part of its alphabet. But the point was that treating hyphens as parts of words is not generally a wrong thing. There is one generally consistent rule for hyphens. When apostrophes and quotes are conflated, there is no one generally acceptable rule.
Re: Another take on the English apostrophe in Unicode
On June 4, 2015, at 11:01 PM, Leo Broukhis l...@mailcom.com wrote: On Thu, Jun 4, 2015 at 9:25 PM, David Starner prosfil...@gmail.com wrote: Hyphens generally make multiple words into one anyway. There's not really multiple hyphens the way there's separate quotes and apostrophes. Generally, but not always, just as apostrophes aren't always at a contracted word boundary. There is only one hyphen because no language (AFAIK) claims it as part of its alphabet. But the point was that treating hyphens as parts of words is not generally a wrong thing. There is one generally consistent rule for hyphens. When apostrophes and quotes are conflated, there is no one generally acceptable rule.
Re: Another take on the English apostrophe in Unicode
The conflict is between linguists and programmers. In plain text the apostrophe is a punctuation mark used in place of one or more omitted letters, or as a separator that keeps letters from joining into a ligature or syllable, whether between the parts of a compound word or inside a simple word, or, finally, as a quotation mark. Yes it is ambiguous! It is. It just is! Linguists say It is. We see that. We know that. And programmers say That's wrong! We can't understand that. Just are you so stupid if you can't! The modifier letter apostrophe is a letter that is used as itself and means only itself (an ejective sound, for example). Don't use it for anything else. It just makes more confusion.
Re: Another take on the English apostrophe in Unicode
Markus Scherer wrote: How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? Would it be possible to have word-processing software where one uses CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input, and could there be a show-in-colour mode where U+2019 is displayed in cyan and U+02BC is displayed in red, while everything else is displayed in black? That is, CONTROL U+0027 and CONTROL SHIFT U+0027 respectively. If people want this facility, maybe it could be published in a Unicode Technical Report so that standardization and interoperability could be achieved. William Overington 5 June 2015
Re: Tag characters and in-line graphics (from Tag characters)
On 6/4/2015 17:03, Chris wrote: This whole discussion is about the fact that it would be technically possible to have private character sets and private agreements that your OS downloads without the user being aware of it. The sticky issues are not the questions of how to make available fonts or images for use by the OS. Instead, they concern the fact that any such model violates some pretty basic guarantees of plain text that the entire net infrastructure relies on. There are very obvious security issues. They start with tracking; every time you access a custom code point, that fact potentially results in a trackable interaction. This problem affects even the sticker solution that people are hoping for for emoji. (On my system, no external resources are displayed when I first open any message, and there is a reason for that). Beyond tracking, and beyond stickers (that is, pictures that look like pictures), a generalized custom character set would allow text that is no longer really stable. You would be able to deliver identical e-mails to people that display differently, because when you serve the custom fonts, you would be able to customize what you deliver under the same custom character set designator. While this would be a wonderful way to circumvent censorship (other than the man in the middle version), you would likewise seriously undermine the ability to filter unwanted or undesirable texts, because the custom character set engine might recognize when a request comes from a filter and not the end user. (Just the other day, I came across a hacked website that responded differently to search engines than to live users, making the hack effective for one and invisible to the other. Custom character sets would seem to just add to the hackers' arsenal here). Finally, custom character sets sound like a great idea when thinking of an extension of an existing character set. But that's not where the issues are.
The issues come in when you use the same technology to provide aliases for existing code points or for other custom characters. Aliasing undermines the ability to do search (or any other content-focused processing, from sorting to spell-check). At that point, the circle closes. When Unicode was created, the alternative then was ISO 2022, which was a standard that addressed the issue of how to switch among (albeit pre-defined) character sets to achieve, in principle, coverage equal to the union of these character sets. Unicode was created to address two main deficiencies of that situation. Unification addressed the aliasing issue, so that code points were no longer opaque but could be interpreted by software (other than display), which was the second big drawback of the patchwork of character sets. A processing model for opaque code points is possible to define, but it isn't very practical, and in the late eighties people had had enough and were glad to be quit of it. Seen from this perspective, the discussion about custom character sets presents itself as a giant step backward, undermining the very advances that underlie the rapid acceptance and spread of Unicode. A./
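The point about aliasing and search can be sketched in a few lines of Python. Everything specific here is a hypothetical assumption made for illustration: the private-use code point U+E000, the idea that some custom font draws it like the letter "a", and the helper name `contains`. No such agreement is described in the thread.

```python
import unicodedata

# Hypothetical private agreement: a custom font draws U+E000 like the letter "a".
ALIAS = "\uE000"

genuine = "search"
aliased = "se" + ALIAS + "rch"  # would look the same under that hypothetical font

def contains(haystack, needle):
    """Plain substring search, as a filter or search tool might do."""
    return needle in haystack

print(contains(genuine, "search"))   # the ordinary spelling is found
print(contains(aliased, "search"))   # the aliased spelling is not

# The alias is an opaque private-use code point (general category Co),
# and normalization leaves it untouched, so NFKC does not recover the "a".
print(unicodedata.category(ALIAS))
print(unicodedata.normalize("NFKC", aliased) == aliased)
```

Software that does not share the private agreement has no way to know what the code point is meant to be, which is the opacity problem described above.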
Re: Tag characters and in-line graphics (from Tag characters)
On 2015/06/04 17:03, Chris wrote: I wish Steve Jobs was here to give this lecture. Well, if Steve Jobs were still around, he could think about whether (and how many) users really want their private characters, and whether it was worth the time to have his engineers working on the solution. I'm not sure he would come to the same conclusion as you. This whole discussion is about the fact that it would be technically possible to have private character sets and private agreements that your OS downloads without the user being aware of it. Now if the Unicode Consortium were to decide on standardising a technological process whereby rendering engines could seamlessly download representations of custom characters without user intervention, no doubt all the vendors would support it, and all the technical mumbo jumbo of installing privately agreed character sets would be something users could leave for the technology to sort out. You are right that it would be strictly technically possible. Not only that, it has been so for 10 or 20 years. As an example, in 1996 at the WWW Conference in Paris I was participating in a workshop on internationalization for the Web, and by chance I was sitting between the participant from Adobe and the participant from Microsoft. These were the main companies working on font technology at that time, and I asked them how small it would be possible to make a font for a single character using their technologies (the purpose of such a font, as people on this thread should be able to guess, would be as part of a solution to exchange single, user-defined characters). I don't even remember their answers. The important thing here is that the idea, and the technology, have been around for a long time. So why didn't it catch on? Maybe the demand is just not as big as some contributors on this list claim.
Also, maybe while the technology itself isn't rocket science, the responsible people at the relevant companies have enough experience with technology deployment to hold back. To give an example of why the deployment aspect is important, there were various Web-like hypertext technologies around when the Web took off in the 1990s. One of them was called HyperG. It was technologically 'better' than the Web, in that it avoided broken links. But it was much more difficult to deploy, and so it is forgotten, whereas the Web took off. Regards, Martin.
Re: Tag characters and in-line graphics (from Tag characters)
Asmus Freytag wrote about security issues. This is interesting reading and I have learned a lot from the post about various security issues. Whilst the post is in this thread and follows from a post in this thread, the topic seems to have moved to the Custom characters thread. It seems to me that what you write about would not apply to the suggestion in my original post: is that correct? http://www.unicode.org/mail-arch/unicode-ml/y2015-m05/0218.html Also the following two posts. http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0009.html http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0027.html Whilst the ideas raised by Chris are interesting, they do seem to be distinctly different from what I suggested. So, for clarity, do you regard my suggested format as having any security issues, and if so, what please? I know that some people have opined that my suggested format is out of scope for Unicode, yet the scope of Unicode is what the Unicode Technical Committee decides is the scope of Unicode, and my suggested format does provide a way to include custom glyphs within a Unicode plain text document by using the new base character followed by tag characters method. William Overington 5 June 2015
Re: Another take on the English apostrophe in Unicode
On Fri, Jun 5, 2015 at 12:16 AM Leo Broukhis l...@mailcom.com wrote: I agree that conflating apostrophes and quotes is a source of problems, however, existence of the MODIFIER LETTER [same glyph as used for English contractions] in Unicode is a coincidence which should not have an effect on usage of apostrophes in English. Coincidence or not, the Unicode Consortium is not going to allocate a new code-point for the English apostrophe as long as MODIFIER LETTER APOSTROPHE exists. Any change is pretty unlikely, but changing to an existing character is vastly more likely than creating a new one.
Re: Another take on the English apostrophe in Unicode
On Fri, Jun 5, 2015 at 2:43 AM QSJN 4 UKR qsjn4...@gmail.com wrote: The conflict is between linguists and programmers. No, it's not. Yes it is ambiguous! It is. It just is! Linguists say It is. We see that. We know that. Now you programmers find some way to deal with that so you can produce useful corpuses for linguistic work. That is what this is all about: producing good linguistic interpretations of plain text, for, among others, linguists whose supply of scanned text has exceeded their ability to hand-process it. The modifier letter apostrophe is a letter that is used as itself and means only itself (an ejective sound, for example). Don't use it for anything else. It just makes more confusion. If you don't know what language a text is in, you can't tell what sounds letters make. Adding this character to English's repertoire won't change that.
Re: Another take on the English apostrophe in Unicode
I don’t have a particular position staked out. But to this discussion should be added the very interesting work done by Zwicky and Pullum arguing that the apostrophe is the 27th letter of the Latin alphabet. Neither U+2019 nor U+02BC would satisfy that position. See: Zwicky, Arnold M., and Geoffrey K. Pullum. Cliticization vs. Inflection: English N’T. Language 59, no. 3 (1983): 502–513. It’s nicely summarized and discussed here: http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/ jk -- Joel Kalvesmaki Editor in Byzantine Studies Dumbarton Oaks 202 339 6435
Re: Another take on the English apostrophe in Unicode
Markus Scherer wrote: How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? I replied: Would it be possible to have word-processing software where one uses CONTROL APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input, and could there be a show-in-colour mode where U+2019 is displayed in cyan and U+02BC is displayed in red, while everything else is displayed in black? I am wondering whether some existing software packages could be used for the character-input part by means of customized keyboard shortcuts. https://community.serif.com/forum/43862/question-about-customized-keyboard-short-cuts I realize that the cyan and red colours cannot be done at present, yet I have now thought of an alternative for now: testing what is in the text by using a special version of an open source font in which the two characters have glyphs that are distinct from one another. William Overington 5 June 2015
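Short of a colour mode or a special font, one way to test what is in a text is to make the two characters visible with a small script. The following Python sketch is only an assumption about how that might be done. The names `MARKED`, `annotate` and `report` are invented for this example, and the `<U+XXXX>` tag format is arbitrary.

```python
import unicodedata

# The two look-alike characters under discussion.
MARKED = {"\u2019", "\u02BC"}

def annotate(text):
    """Replace each U+2019 or U+02BC with a visible tag such as <U+2019>."""
    out = []
    for ch in text:
        if ch in MARKED:
            out.append("<U+%04X>" % ord(ch))
        else:
            out.append(ch)
    return "".join(out)

def report(text):
    """List (index, code point, Unicode name) for each marked character."""
    return [
        (i, "U+%04X" % ord(ch), unicodedata.name(ch))
        for i, ch in enumerate(text)
        if ch in MARKED
    ]

sample = "\u2018don\u02BCt\u2019"  # quoted word, with U+02BC inside the word
print(annotate(sample))
for entry in report(sample):
    print(entry)
```

The same approach would work with any other pair of confusable characters by changing the `MARKED` set.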
Re: ucd beta, stable filenames
Le vendredi, 5 juin 2015 à 16:48, Daniel Bünzli a écrit : and/or simply publish it in the version directory but without the suffixes (like the ucdxml files do). Or both with and without the suffix of course. Daniel
Re: Another take on the English apostrophe in Unicode
On Jun 4, 2015, at 17:34 , Markus Scherer markus@gmail.com wrote: Looks all wrong to me. don’t is a contraction of two words, it is not one word. Yes it is. Is keyboard two words? How about newspaper? If don't is two words, please tell me what two words make up won't? (Hint, neither of them is will.) Linguistically, don't and friends pass all the diagnostics that indicate they're single words. - John Burger English is taught as that squiggle being punctuation, not a letter. (Unlike, say, the Hawaiʻian ʻOkina.) You can't use simple regular expressions to find word boundaries. That's why we have UAX #29. Confusion between apostrophe and quoting -- blame the scribe who came up with the ambiguous use, not the people who gave it a number. If anything, Unicode might have made a mistake in encoding two of these that look identical. How are normal users supposed to find both U+2019 and U+02BC on their keyboards, and how are they supposed to deal with incorrect usage? markus
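The punctuation-versus-letter distinction drawn in this message is visible in the Unicode character properties themselves. As a small sketch, Python's standard `unicodedata` module can print the general category and name of the characters mentioned in the thread. The helper name `describe` is an illustrative assumption, and the values printed depend on the Unicode version bundled with the Python in use.

```python
import unicodedata

def describe(ch):
    """Return (code point, general category, name) for one character."""
    return ("U+%04X" % ord(ch), unicodedata.category(ch), unicodedata.name(ch))

# ASCII apostrophe, right single quote, modifier letter apostrophe,
# and the modifier letter commonly used for the Hawaiian ʻokina.
for ch in ["\u0027", "\u2019", "\u02BC", "\u02BB"]:
    print(describe(ch))
```

U+0027 and U+2019 are reported with punctuation categories (Po and Pf), while U+02BC and U+02BB are reported as modifier letters (Lm).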
ucd beta, stable filenames
Hello, Would it be possible in the future to publish the latest version of the ucd files without the -X.Y.ZdW suffixes under a fixed URI like http://www.unicode.org/Public/beta/ and/or simply publish it in the version directory but without the suffixes (like the ucdxml files do). With the current scheme it is hard for implementers to automate file downloads for testing with the beta. Thanks, Daniel
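Until stable names exist, an implementer could work around the suffixes on the client side. The sketch below is one assumed way of doing that in Python: it strips a suffix of the form -X.Y.ZdW that sits just before the file extension. The regular expression, the helper name `stable_name` and the sample file names are illustrative guesses based only on the pattern described in this message, not on an official naming rule.

```python
import re

# Matches a suffix such as "-8.0.0d7" immediately before the file extension.
SUFFIX = re.compile(r"-\d+\.\d+\.\d+d\d+(?=\.[A-Za-z0-9]+$)")

def stable_name(filename):
    """Return the file name with any -X.Y.ZdW beta suffix removed."""
    return SUFFIX.sub("", filename)

# Hypothetical beta file names, used only to exercise the function.
for name in ["UnicodeData-8.0.0d7.txt", "Scripts-8.0.0d12.txt", "UnicodeData.txt"]:
    print(name, "->", stable_name(name))
```

This still leaves the problem the message describes: the script has to discover the suffixed names first, for example from a directory listing, before it can download and rename them.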
Re: Another take on the English apostrophe in Unicode
QSJN 4 UKR qsjn4ukr at gmail dot com wrote: And programmers say That's wrong! We can't understand that. Just are you so stupid if you can't! You know, we really aren't all like that. Some of us actually try to meet user needs. -- Doug Ewell | http://ewellic.org | Thornton, CO
Re: Tag characters and in-line graphics (from Tag characters)
I wrote, crumpled up, and threw away about three different responses. I thought about ISO 2022 and about accessing the web for every PUA character, as Asmus mentioned, and about the size of the user base, as Martin mentioned. I thought about character properties and about ephemerality. I didn't think of the spoofing implications that Asmus described, which would affect both the automatic PUA font download and the inline drawing language. Either of these could be used to spell out, let's say, paypal.com rather convincingly and with minimal effort. I might have more experience with the PUA than many list members, having transcribed the 27,000-word Alice's Adventures in Wonderland into my constructed alphabet two years ago, in a PUA encoding, so that Michael Everson could publish it in book form. One of the many learning experiences of this project was finding out which software tools play nicely with the PUA and which don't. Some tools just worked while others would not give acceptable results with any amount of effort. At no point, however, did I suppose that a font with my alphabet, or any of the jillions of others that have been invented during a boring day in class (see Omniglot for tons of examples), should be silently downloaded to a user's computer, consuming bandwidth and disk space, without her knowledge. That's practically malware. Maybe I'm just not enough of a Distinguished Visionary to understand how insanely great this would be (unfortunately, celebrity name-dropping doesn't work with me). Unicode has stated consistently for at least 23 years that it would not ever standardize PUA usage, and over the years some UTC members have used terms like strongly discouraged and not interoperable even in the presence of an agreement. Given this, and given that no system I'm aware of magically downloads fonts for *regularly encoded characters* (I still have no font for Arabic math symbols), I personally would not expect Unicode to perform a 180 on this. 
-- Doug Ewell | http://ewellic.org | Thornton, CO
Re: ucd beta, stable filenames
On 6/5/2015 8:48 AM, Daniel Bünzli wrote: Hello, Would it be possible in the future to publish the latest version of the ucd files without the -X.Y.ZdW suffixes under a fixed URI like http://www.unicode.org/Public/beta/ and/or simply publish it in the version directory but without the suffixes (like the ucdxml files do). With the current scheme it is hard for implementers to automate file downloads for testing with the beta. +1000 Eric.
Re: Another take on the English apostrophe in Unicode
On 6/5/2015 10:29 AM, John D. Burger wrote: Linguistically, don't and friends pass all the diagnostics that indicate they're single words. If I am not mistaken, the French pomme de terre also passes the diagnostics. So we need a new space character. Eric.
Re: http://✈.ws
Whoops, sent too soon. A surprise: http://✈.ws Mark https://google.com/+MarkDavis *— Il meglio è l’inimico del bene —* On Fri, Jun 5, 2015 at 4:47 PM, Mark Davis ☕️ m...@macchiato.com wrote: