Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread Leo Broukhis
On Thu, Jun 4, 2015 at 9:25 PM, David Starner prosfil...@gmail.com wrote:

 Hyphens generally make multiple words into one anyway. There's not really
 multiple hyphens the way there's separate quotes and apostrophes.


Generally, but not always, just as apostrophes aren't always at a
contracted word boundary. There is only one hyphen because no language
(AFAIK) claims it as part of its alphabet.

Leo

 On 7:01pm, Thu, Jun 4, 2015 Leo Broukhis l...@mailcom.com wrote:

 Along the same lines, we might need a MODIFIER LETTER HYPHEN, because,
 for example, the word ack-ack isn't decomposable into words, or even
 morphemes, ack and ack.

 Leo

 On Thu, Jun 4, 2015 at 6:31 PM, David Starner prosfil...@gmail.com
 wrote:

 On Thu, Jun 4, 2015 at 2:38 PM Markus Scherer markus@gmail.com
 wrote:

 don’t is a contraction of two words, it is not one word.


 But as he points out, it's not a contraction of don and t; it is, at
 best, a contraction of do and n't. It's eliding, not punctuating. In the
 comments, he also brings up the examples of Don’t you mind? being okay
 but not *Do not you mind?, and fo’c’sle.

  You can't use simple regular expressions to find word boundaries.

 Who uses _simple_ regular expressions? You can't use any code to
 reliably find word boundaries in English, and that's a problem.
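
To illustrate the word-boundary point, here is a minimal Python sketch (an
illustration only, not something posted to the list). It compares how a
naive regular expression tokenizes "don't" written with U+0027, U+2019 and
U+02BC. It assumes Python 3's re module, where \w matches Unicode letters,
including modifier letters (category Lm), but not punctuation. The helper
name simple_words is hypothetical.

```python
import re
import unicodedata

def simple_words(text):
    """Split text into 'words' with a naive regex: runs of \\w characters."""
    return re.findall(r"\w+", text)

# Try the same contraction with each of the three apostrophe candidates.
for apostrophe in ("\u0027", "\u2019", "\u02BC"):
    word = "don" + apostrophe + "t"
    print(
        unicodedata.name(apostrophe),
        unicodedata.category(apostrophe),
        simple_words(word),
    )
```

With this naive pattern only the U+02BC spelling stays in one piece; the
two punctuation code points split the contraction into "don" and "t". A
segmenter following UAX #29 uses more elaborate rules than this.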





Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread Leo Broukhis
 But the point was that treating hyphens as parts of words is not generally a 
 wrong thing.

That brings us back to my original question: where's MODIFIER LETTER
HYPHEN, then? A word is a sequence of letters, isn't it? :)

I agree that conflating apostrophes and quotes is a source of
problems, however, existence of the MODIFIER LETTER [same glyph as
used for English contractions] in Unicode is a coincidence which
should not have an effect on usage of apostrophes in English.

Leo

On Thu, Jun 4, 2015 at 11:58 PM, David Starner prosfil...@gmail.com wrote:
 On June 4, 2015, at 11:01 PM, Leo Broukhis l...@mailcom.com wrote:



On Thu, Jun 4, 2015 at 9:25 PM, David Starner prosfil...@gmail.com wrote:

Hyphens generally make multiple words into one anyway. There's not really
 multiple hyphens the way there's separate quotes and apostrophes.

Generally, but not always, just as apostrophes aren't always at a
 contracted word boundary. There is only one hyphen because no language
 (AFAIK) claims it as part of its alphabet.

 But the point was that treating hyphens as parts of words is not generally a
 wrong thing. There is one generally consistent rule for hyphens. When
 apostrophes and quotes are conflated, there is no one generally acceptable
 rule.


Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread David Starner


On June 4, 2015, at 11:01 PM, Leo Broukhis l...@mailcom.com wrote:



On Thu, Jun 4, 2015 at 9:25 PM, David Starner prosfil...@gmail.com wrote:

Hyphens generally make multiple words into one anyway. There's not really 
multiple hyphens the way there's separate quotes and apostrophes.

Generally, but not always, just as apostrophes aren't always at a contracted 
word boundary. There is only one hyphen because no language (AFAIK) claims it 
as part of its alphabet. 

But the point was that treating hyphens as parts of words is not generally a 
wrong thing. There is one generally consistent rule for hyphens. When 
apostrophes and quotes are conflated, there is no one generally acceptable rule.


Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread QSJN 4 UKR
The conflict is between linguists and programmers. In plain text the
apostrophe is a punctuation mark used instead of letters (unreadable, one or
more), or as a separator to avoid connecting letters into a ligature or
syllable, between parts of a composite word as well as inside a simple
word, or finally, as a quotation mark. Yes it is ambiguous!
It is. It just is! Linguists say It is. We see that. We know that.
And programmers say That's wrong! We can't understand that. Just are
you so stupid if you can't!
Modifier letter apostrophe is a letter that is used as itself and means
itself (an ejective sound, e.g.) only. Don't use it otherwise. It just makes
more confusion.


Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread William_J_G Overington
Markus Scherer wrote:
 How are normal users supposed to find both U+2019 and U+02BC on their 
 keyboards, and how are they supposed to deal with incorrect usage?
Would it be possible to have wordprocessing software where one uses CONTROL 
APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input and 
could there be a show in colour mode where U+2019 is displayed in cyan and 
U+02BC is displayed in red, while everything else is displayed in black?
That is, CONTROL U+0027 and CONTROL SHIFT U+0027 respectively.
If people want this facility, maybe it could become published in a Unicode 
Technical Report so that standardization and interoperability could be achieved.
William Overington
5 June 2015
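
The "show in colour" idea above can be sketched in a few lines of Python.
This is only an illustration under stated assumptions: it targets a terminal
that understands ANSI escape sequences rather than a word processor, and the
names COLOURS and show_in_colour are hypothetical.

```python
# ANSI escape sequences; assumes a terminal that interprets them.
CYAN = "\033[36m"
RED = "\033[31m"
RESET = "\033[0m"

# U+2019 is shown in cyan, U+02BC in red, as suggested in the post.
COLOURS = {"\u2019": CYAN, "\u02BC": RED}

def show_in_colour(text):
    """Wrap each U+2019 in cyan and each U+02BC in red; leave the rest alone."""
    return "".join(
        COLOURS[ch] + ch + RESET if ch in COLOURS else ch
        for ch in text
    )

print(show_in_colour("don\u2019t vs. don\u02BCt"))
```

Everything else is passed through unchanged, so it appears in the terminal's
default colour rather than strictly in black.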


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-05 Thread Asmus Freytag (t)

On 6/4/2015 17:03 , Chris wrote:
This whole discussion is about the fact that it would be technically 
possible to have private character sets and private agreements that 
your OS downloads without the user being aware of it. 


The sticky issues are not the questions of how to make available fonts 
or images for use by the OS.


Instead, they concern the fact that any such model violates some 
pretty basic guarantees of plain text that the entire net infrastructure 
relies on.


There are very obvious security issues. They start with tracking; every 
time you access a custom code point, that fact potentially results in a 
trackable interaction. This problem affects even the sticker solution 
that people are hoping for for emoji. (On my system, no external 
resources are displayed when I first open any message, and there is a 
reason for that).


Beyond tracking, and beyond stickers (that is pictures that look like 
pictures) a generalized custom character set would allow text that is 
no longer really stable. You would be able to deliver identical e-mails 
to people that display differently, because when you serve the custom 
fonts, you would be able to customize what you deliver under the same 
custom character set designator.


While this would be a wonderful way to circumvent censorship (other than 
the man in the middle version), you would likewise seriously undermine 
the ability to filter unwanted or undesirable texts, because the custom 
character set engine might recognize when a request comes from a filter 
and not the end user. (Just the other day, I came across a hacked 
website that responded differently to search engines than to live users, 
making the hack effective for one and invisible to the other. Custom 
character sets would seem to just add to the hackers' arsenal here).


Finally, custom character sets sound like a great idea when thinking of 
an extension of an existing character set. But that's not where the 
issues are. The issues come in when you use the same technology to 
provide aliases for existing code points or for other custom characters.


Aliasing undermines the ability to do search (or any other 
content-focused processing, from sorting to spell-check).


At that point, the circle closes.

When Unicode was created, the alternative then was ISO 2022, which was a 
standard that addressed the issue of how to switch among (albeit 
pre-defined) character sets to achieve, in principle, coverage equal to 
the union of these character sets.


Unicode was created to address two main deficiencies of that situation. 
Unification addressed the aliasing issue, so that code points were no 
longer opaque but could be interpreted by software (other than 
display), which was the second big drawback of the patchwork of 
character sets. A processing model for opaque code points is possible to 
define, but it isn't very practical and in the late eighties people had 
had enough and were glad to be quit of it.


Seen from this perspective, the discussion about custom character sets 
presents itself as a giant step backward, undermining the very advances 
that underlie the rapid acceptance and spread of Unicode.


A./


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-05 Thread Martin J. Dürst

On 2015/06/04 17:03, Chris wrote:


I wish Steve Jobs was here to give this lecture.


Well, if Steve Jobs were still around, he could think about whether (and 
how many) users really want their private characters, and whether it was 
worth the time to have his engineers working on the solution. I'm not 
sure he would come to the same conclusion as you.



This whole discussion is about the fact that it would be technically possible 
to have private character sets and private agreements that your OS downloads 
without the user being aware of it.

Now if the unicode consortium were to decide on standardising a technological 
process whereby rendering engines could seamlessly download representations of 
custom characters without user intervention, no doubt all the vendors would 
support it, and all the technical mumbo jumbo of installing privately agreed 
character sets would be something users could leave for the technology to sort 
out.


You are right that it would be strictly technically possible. Not only 
that, it has been so for 10 or 20 years.


As an example, in 1996 at the WWW Conference in Paris I was 
participating in a workshop on internationalization for the Web, and by 
chance I was sitting between the participant from Adobe and the 
participant from Microsoft. These were the main companies working on 
font technology at that time, and I asked them how small it would be 
possible to make a font for a single character using their technologies 
(the purpose of such a font, as people on this thread should be able to 
guess, would be as part of a solution to exchange single, user-defined 
characters).


I don't even remember their answers. The important thing here is that the 
idea, and the technology, have been around for a long time. So why 
didn't it catch on? Maybe the demand is just not as big as some 
contributors on this list claim.


Also, maybe while the technology itself isn't rocket science, the 
responsible people at the relevant companies have enough experience with 
technology deployment to hold back. To give an example of why the 
deployment aspect is important, there were various Web-like hypertext 
technologies around when the Web took off in the 1990s. One of them was 
called HyperG. It was technologically 'better' than the Web, in that it 
avoided broken links. But it was much more difficult to deploy, and so 
it is forgotten, whereas the Web took off.


Regards,   Martin.


Re: Tag characters and in-line graphics (from Tag characters)

2015-06-05 Thread William_J_G Overington
Asmus Freytag wrote about security issues.
This is interesting reading and I have learned a lot from the post about 
various security issues.
Whilst the post is in this thread and follows from a post in this thread, the 
topic seems to have moved to the Custom characters thread.
It seems to me that what you write would not apply to my 
suggestion in my original post: is that correct?
http://www.unicode.org/mail-arch/unicode-ml/y2015-m05/0218.html
Also the following two posts.
http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0009.html
http://www.unicode.org/mail-arch/unicode-ml/y2015-m06/0027.html
Whilst the ideas raised by Chris are interesting, they do seem to be distinctly 
different from what I suggested.
So, for clarity, do you regard my suggested format as having any security 
issues, and if so, what please?
I know that some people have opined that my suggested format is out of scope 
for Unicode, yet the scope of Unicode is what the Unicode Technical Committee 
decides is the scope of Unicode, and my suggested format does provide a way to 
include custom glyphs within a Unicode plain text document by using the new 
base character followed by tag characters method.
William Overington
5 June 2015


Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread David Starner
On Fri, Jun 5, 2015 at 12:16 AM Leo Broukhis l...@mailcom.com wrote:

 I agree that conflating apostrophes and quotes is a source of
 problems, however, existence of the MODIFIER LETTER [same glyph as
 used for English contractions] in Unicode is a coincidence which
 should not have an effect on usage of apostrophes in English.


Coincidence or not, the Unicode Consortium is not going to allocate a new
code-point for the English apostrophe as long as MODIFIER LETTER APOSTROPHE
exists. Any change is pretty unlikely, but changing to an existing
character is vastly more likely than creating a new one.


Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread David Starner
On Fri, Jun 5, 2015 at 2:43 AM QSJN 4 UKR qsjn4...@gmail.com wrote:

 The conflict is between linguists and programmers.


No, it's not.


 Yes it is ambiguous!
 It is. It just is! Linguists say It is. We see that. We know that.


Now you programmers find some way to deal with that so you can produce
useful corpuses for linguistic work. Which is what this is all about, is
producing good linguistic interpretations of plain text, for, among others,
linguists whose supply of scanned text has exceeded their ability to
hand-process it.


 Modifier letter apostrophe is a letter that used as itself and means
 itself (ejective sound e.g.) only. Don't use it else. It just make
 more confusion.


If you don't know what language a text is in, you can't tell what sounds
letters make. Adding this character to English's repertoire won't change
that.


Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread Kalvesmaki, Joel
I don’t have a particular position staked out. But to this discussion should be 
added the very interesting work done by Zwicky and Pullum arguing that the 
apostrophe is the 27th letter of the Latin alphabet. Neither U+2019 nor U+02BC 
would satisfy that position. See:

Zwicky, Arnold M., and Geoffrey K. Pullum. "Cliticization vs. Inflection: 
English N’T." Language 59, no. 3 (1983): 502–513.

It’s nicely summarized and discussed here:
http://chronicle.com/blogs/linguafranca/2013/03/22/being-an-apostrophe/

jk
--
Joel Kalvesmaki
Editor in Byzantine Studies
Dumbarton Oaks
202 339 6435



Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread William_J_G Overington
Markus Scherer wrote:
 How are normal users supposed to find both U+2019 and U+02BC on their 
 keyboards, and how are they supposed to deal with incorrect usage?
I replied:
 Would it be possible to have wordprocessing software where one uses CONTROL 
 APOSTROPHE for U+2019 and CONTROL SHIFT APOSTROPHE for U+02BC for input and 
 could there be a show in colour mode where U+2019 is displayed in cyan and 
 U+02BC is displayed in red, while everything else is displayed in black?
I am wondering whether some existing software packages might be able to be used 
for the character inputting part using customized keyboard short cuts.
https://community.serif.com/forum/43862/question-about-customized-keyboard-short-cuts
I realize that the cyan and red colours cannot be done at present, yet I 
have now thought of an alternative for now: being able to test what is in 
the text by using a special version of an open source font in which the two 
characters have glyphs that are distinct from one another.
William Overington
5 June 2015


Re: ucd beta, stable filenames

2015-06-05 Thread Daniel Bünzli
Le vendredi, 5 juin 2015 à 16:48, Daniel Bünzli a écrit :
 and/or simply publish it in the version directory but without the suffixes 
 (like the ucdxml files do).

Or both with and without the suffix of course.  

Daniel





Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread John D. Burger

 On Jun 4, 2015, at 17:34 , Markus Scherer markus@gmail.com wrote:
 
 Looks all wrong to me.
 
 don’t is a contraction of two words, it is not one word.

Yes it is. Is keyboard two words? How about newspaper?

If don't is two words, please tell me what two words make up won't? (Hint, 
neither of them is will.)

Linguistically, don't and friends pass all the diagnostics that indicate 
they're single words.

- John Burger

 English is taught as that squiggle being punctuation, not a letter. (Unlike, 
 say, the Hawaiʻian ʻOkina.)
 
 You can't use simple regular expressions to find word boundaries. That's why 
 we have UAX #29.
 
 Confusion between apostrophe and quoting -- blame the scribe who came up with 
 the ambiguous use, not the people who gave it a number.
 
 If anything, Unicode might have made a mistake in encoding two of these that 
 look identical. How are normal users supposed to find both U+2019 and U+02BC 
 on their keyboards, and how are they supposed to deal with incorrect usage?
 
 markus




ucd beta, stable filenames

2015-06-05 Thread Daniel Bünzli
Hello,  

Would it be possible in the future to publish the latest version of the ucd 
files without the -X.Y.ZdW suffixes under a fixed URI like 

  http://www.unicode.org/Public/beta/

and/or simply publish it in the version directory but without the suffixes 
(like the ucdxml files do). With the current scheme it is hard for implementers to 
automate file downloads for testing with the beta.

Thanks, 

Daniel
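
As a rough sketch of what implementers have to do under the current scheme,
the following Python maps suffixed file names to stable ones. The exact
shape of the -X.Y.ZdW suffix and the example file name are assumptions made
for illustration, and stable_name and index_by_stable_name are hypothetical
helpers; nothing here contacts unicode.org.

```python
import re

# Matches a suffix such as "-8.0.0d7" (version X.Y.Z plus draft number W)
# sitting directly before the file extension. The suffix shape is assumed.
SUFFIX = re.compile(r"-\d+\.\d+\.\d+d\d+(?=\.[A-Za-z0-9]+$)")

def stable_name(filename):
    """Return filename with its -X.Y.ZdW suffix removed, if present."""
    return SUFFIX.sub("", filename)

def index_by_stable_name(filenames):
    """Map each stable name to the suffixed name found in a listing."""
    return {stable_name(name): name for name in filenames}

# Hypothetical listing, as one might scrape it from a directory index.
listing = ["UnicodeData-8.0.0d7.txt", "ReadMe.txt"]
print(index_by_stable_name(listing))
```

With stable names published on the server, a download script could skip
this step and fetch each file by a fixed URI.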




Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread Doug Ewell
QSJN 4 UKR qsjn4ukr at gmail dot com wrote:

 And programmers say That's wrong! We can't understand that. Just are
 you so stupid if you can't!

You know, we really aren't all like that. Some of us actually try to
meet user needs.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: Tag characters and in-line graphics (from Tag characters)

2015-06-05 Thread Doug Ewell
I wrote, crumpled up, and threw away about three different responses. I
thought about ISO 2022 and about accessing the web for every PUA
character, as Asmus mentioned, and about the size of the user base, as
Martin mentioned. I thought about character properties and about
ephemerality.

I didn't think of the spoofing implications that Asmus described, which
would affect both the automatic PUA font download and the inline drawing
language. Either of these could be used to spell out, let's say,
paypal.com rather convincingly and with minimal effort.
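
One modest defence against the PUA variant of this, sketched in Python
under stated assumptions: flag any host name that contains a private-use
code point (general category Co). The function names are hypothetical, and
the check says nothing about an inline drawing language or about other
kinds of look-alike characters.

```python
import unicodedata

def private_use_chars(text):
    """Return the characters in text whose general category is Co."""
    return [ch for ch in text if unicodedata.category(ch) == "Co"]

def looks_risky(hostname):
    """Flag a host name that contains any private-use code point."""
    return bool(private_use_chars(hostname))

# U+E000 stands in for a PUA character rendered by a downloaded font.
print(looks_risky("paypal.com"))
print(looks_risky("p\uE000ypal.com"))
```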

I might have more experience with the PUA than many list members, having
transcribed the 27,000-word Alice's Adventures in Wonderland into my
constructed alphabet two years ago, in a PUA encoding, so that Michael
Everson could publish it in book form. One of the many learning
experiences of this project was finding out which software tools play
nicely with the PUA and which don't. Some tools just worked while
others would not give acceptable results with any amount of effort.

At no point, however, did I suppose that a font with my alphabet, or any
of the jillions of others that have been invented during a boring day
in class (see Omniglot for tons of examples), should be silently
downloaded to a user's computer, consuming bandwidth and disk space,
without her knowledge. That's practically malware. Maybe I'm just not
enough of a Distinguished Visionary to understand how insanely great
this would be (unfortunately, celebrity name-dropping doesn't work with
me).

Unicode has stated consistently for at least 23 years that it would not
ever standardize PUA usage, and over the years some UTC members have
used terms like strongly discouraged and not interoperable even in
the presence of an agreement. Given this, and given that no system I'm
aware of magically downloads fonts for *regularly encoded characters* (I
still have no font for Arabic math symbols), I personally would not
expect Unicode to perform a 180 on this.

--
Doug Ewell | http://ewellic.org | Thornton, CO 




Re: ucd beta, stable filenames

2015-06-05 Thread Eric Muller

On 6/5/2015 8:48 AM, Daniel Bünzli wrote:

Hello,

Would it be possible in the future to publish the latest version of the ucd 
files without the -X.Y.ZdW suffixes under a fixed URI like

   http://www.unicode.org/Public/beta/

and/or simply publish it in the version directory but without the suffixes 
(like the ucdxml files do). With the current scheme it hard for implementers to 
automate file downloads for testing with the beta.




+1000

Eric.



Re: Another take on the English apostrophe in Unicode

2015-06-05 Thread Eric Muller

On 6/5/2015 10:29 AM, John D. Burger wrote:

Linguistically, don't and friends pass all the diagnostics that indicate 
they're single words.


If I am not mistaken, the French pomme de terre also passes the 
diagnostics. So we need a new space character.


Eric.



Re: http://✈.ws

2015-06-05 Thread Mark Davis ☕️
Whoops, sent too soon.

A surprise: http://✈.ws


Mark https://google.com/+MarkDavis

*— Il meglio è l’inimico del bene —*

On Fri, Jun 5, 2015 at 4:47 PM, Mark Davis ☕️ m...@macchiato.com wrote:





http://✈.ws

2015-06-05 Thread Mark Davis ☕️