Re: Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good reason why I ask

2020-02-10 Thread Steffen Nurpmeso via Unicode
wjgo_10...@btinternet.com via Unicode wrote in
<141cecf1.23e.1702ea529c1.webtop@btinternet.com>:
 |Could U+E0001 LANGUAGE TAG become undeprecated please? There is a good 
 |reason why I ask
 |
 |There is a German song, Lorelei, and I searched to find an English 
 |translation.

Regarding the Rhine and this thing of yours, there is also the German
joke from the middle of the 1950s, i think, with "Tünnes und
Schäl".

  Tünnes und Schäl stehen auf der Rheinbrücke.
  Da fällt Tünnes die Brille in den Fluß und er sagt
  "Da schau, jetzt ist mir die Brille in die Mosel gefallen",
  worauf Schäl sagt, "Mensch, Tünnes, dat is doch de Ring!",
  und Tünnes antwortet "Da kannste mal sehen wie schlecht ich ohne
  Brille sehen kann!"

  Tuennes and Schael are standing on the Rhine bridge.
  Then Tuennes's glasses fall into the river, and he says,
  "Look, now my glasses have fallen into the Moselle",
  whereupon Schael says, "Come on, Tuennes, that is the Rhine!",
  and Tuennes replies, "There you can see how badly i see
  without my glasses!"

P.S.: i cannot speak "Kölsch" aka Cologne dialect.
P.P.S.: i think i got you wrong.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Access to the Unicode technical site (was: Re: Unicode's got a new logo?)

2019-07-19 Thread Steffen Nurpmeso via Unicode
Hello Mr. Ken Whistler.

Ken Whistler wrote in <3d1676bb-f3c1-8a3e-fdc5-1c0bdd74a...@sonic.net>:
 |On 7/18/2019 11:50 AM, Steffen Nurpmeso via Unicode wrote:
 |> I also decided to enter /L2 directly from now on.
 |
 |For folks wishing to access the UTC document register, Unicode 
 |Consortium standards, and so forth, all of those links will be 
 |permanently stable. They are not impacted by the rollout of the new home 
 |page and its related content.
 |
 |If you need access to the more technical information from the UTC, 
 |CLDR-TC, ICU-TC, etc., feel free to bookmark such pages as:
 |
 |https://www.unicode.org/L2/
 |
 |for the UTC document register.
 |
 |https://www.unicode.org/charts/
 |
 |for the Unicode code charts index,
 |
 |https://www.unicode.org/versions/latest/
 |
 |for the latest version of the Unicode Standard, and so forth. All such 
 |technical links are stable on the site, and will continue to be stable.

Are these things still linked from the top homepage?
Thank you very much for the information.  (My gut feeling is that
it is tremendous that very highly qualified people care about such
vanities.)

 |For general access to the technical content on the Unicode website, see:
 |
 |https://www.unicode.org/main.html
 |
 |which provides easy link access to all the technical content areas and 
 |to the ongoing technical committee work.

I will hopefully come to truly Unicode the things i do!!
(By then programming will hopefully be real fun again.  I hope..)

I wish you a nice weekend, from soon-to-be-sunny-again Germany!

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


Re: Unicode's got a new logo?

2019-07-18 Thread Steffen Nurpmeso via Unicode
Yifán Wáng via Unicode wrote in :
 |I cannot help but notice the new home.unicode.org site embraces a new
 |logo, blue base color with a humanist type, rather than the
 |traditional one, red and geometric. Does anybody know if it means that
 |Unicode wants to renew its logo or that they serve for different
 |purposes? Which should I cite as the official logo? I think I've read
 |the description and the blog post but couldn't find an explanation.

I also decided to enter /L2 directly from now on.
I am happy that you have given me the opportunity to finally send
a mail regarding this topic.  (My apologies to the designers from
Adobe.)

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Steffen Nurpmeso via Unicode
Philippe Verdy via Unicode wrote in :
 |Padding itself does not clearly indicate the length.
 |
 |It's an artefact that **may** be inferred only in some other layers \
 |of protocols which specify when and how padding is needed (and how \
 |many padding bytes 
 |are required or accepted), it works only if these upper layer protocols \
 |are using **octets** streams, but it is still not usable for more general 
 |bitstreams (with arbitrary bit lengths).
 |
 |This RFC does not mandate/require these padding bytes and in fact many \
 |upper layer protocols do not ever need it (including UTF-7 for example), \
 |they are 
 |never necessary to infer a length in octets and insufficient for specify\
 |ing a length in bits.
 |
 |As well the usage in MIME (where there's a requirement that lines of \
 |headers or in the content body is limited to 1000 bytes) requires free \
 |splitting of 
 |Base64 (there's no agreed maximum length, some sources insist it should \
 |not be more than 72 bytes, others use 80 bytes, but mail forwarding \
 |may add other 
 |characters at start of lines, forcing them to be shorter (leaving for \
 |example a line of 72 bytes+CRLF and another line of 8 bytes+CRLF): \
 |this means that 
 |padding may not be used where one would expect them, and padding can \
 |even occur in the middle of the encoded stream (not just at end) along \

That was actually a bug in my MUA.  Other MUAs were not capable of
decoding this correctly.
Sorry :-(!!

 |with other 
 |whitespaces or separators (like "> " at start of lines in cited messages).

In fact MIME explicitly says that garbage bytes may be embedded.
Most decoders handle that right and skip them (silently, which may
itself not be right), but some standalone base64 decoders fail
miserably when such bytes are seen (openssl base64, the current
NetBSD base64 decoder), while others do not (busybox base64, for
example).
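
A minimal sketch of what i mean by "lenient" (plain C; the function
name and the test string are mine, not taken from any of the tools
above):

  #include <stdio.h>
  #include <string.h>

  /* Decode base64 the way RFC 2045 asks for: any octet outside the
   * encoding alphabet (line breaks, "> " quoting, other garbage) is
   * silently skipped instead of aborting the decode. */
  static size_t b64_decode_lenient(const char *in, unsigned char *out){
     static const char alpha[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
     unsigned long acc = 0;
     int nbits = 0;
     size_t len = 0;

     for (; *in != '\0' && *in != '='; ++in) {
        const char *p = strchr(alpha, *in);
        if (p == NULL)          /* garbage: skip, do not fail */
           continue;
        acc = (acc << 6) | (unsigned long)(p - alpha);
        nbits += 6;
        if (nbits >= 8) {
           nbits -= 8;
           out[len++] = (unsigned char)((acc >> nbits) & 0xFF);
        }
     }
     return len;
  }

  int main(void){
     /* "hello", base64-encoded, split over two lines and "> "-quoted
      * as it may arrive in a cited message */
     const char *wire = "> aGVs\r\n> bG8=\r\n";
     unsigned char buf[32];
     size_t n = b64_decode_lenient(wire, buf);
     fwrite(buf, 1, n, stdout);   /* prints: hello */
     putchar('\n');
     return 0;
  }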

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-15 Thread Steffen Nurpmeso via Unicode
Doug Ewell via Unicode wrote in <2A67B4F082F74F8AADF34BA11D885554@DougEwell>:
 |Steffen Nurpmeso wrote:
 |> Base64 is defined in RFC 2045 (Multipurpose Internet Mail Extensions
 |> (MIME) Part One: Format of Internet Message Bodies).
 |
 |Base64 is defined in RFC 4648, "The Base16, Base32, and Base64 Data
 |Encodings." RFC 2045 defines a particular implementation of base64,
 |specific to transporting Internet mail in a 7-bit environment.
 |
 |RFC 4648 discusses many of the "higher-level protocol" topics that some
 |people are focusing on, such as separating the base64-encoded output
 |into lines of length 72 (or other), alternative target code unit sets or
 |"alphabets," and padding characters. It would be helpful for everyone to
 |read this particular RFC before concluding that these topics have not
 |been considered, or that they compromise round-tripping or other
 |characteristics of base64.
 |
 |I had assumed that when Roger asked about "base64 encoding," he was
 |asking about the basic definition of base64.

Sure; i have only followed the discussion superficially, and even
though everybody can read RFCs, i felt the need to polemicize
against the (false, however i look at it) claim that "MIME actually
splits a binary object into multiple fragments at random positions".
Solely my fault.

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


Re: Base64 encoding applied to different unicode texts always yields different base64 texts ... true or false?

2018-10-13 Thread Steffen Nurpmeso via Unicode
Philippe Verdy via Unicode wrote in :
 |You forget that Base64 (as used in MIME) does not follow these rules \
 |as it allows multiple different encodings for the same source binary. \
 |MIME actually 
 |splits a binary object into multiple fragments at random positions, \
 |and then encodes these fragments separately. Also MIME uses an extension \
 |of Base64 
 |where it allows some variations in the encoding alphabet (so even the \
 |same fragment of the same length may have two distinct encodings).
 |
 |Base64 in MIME is different from standard Base64 (which never splits \
 |the binary object before encoding it, and uses a strict alphabet of \
 |64 ASCII 
 |characters, allowing no variation). So MIME requires special handling: \
 |the assumption that a binary message is encoded the same is wrong, but \
 |MIME still 
 |requires that this non unique Base64 encoding will be decoded back \
 |to the same initial (unsplit) binary object (independently of its \
 |size and 
 |independently of the splitting boundaries used in the transport, which \
 |may change during the transport).

Base64 is defined in RFC 2045 (Multipurpose Internet Mail
Extensions (MIME) Part One: Format of Internet Message Bodies).
It is a content-transfer-encoding and encodes arbitrary data
transparently into text that is 7-bit clean and both ASCII and
EBCDIC compatible (the authors point that out).
Decoding reverts this representation to its original form.
Ok, there is the CRLF newline problem, as below.
What do you mean by "splitting"?

...
The only variance is described as:

  Care must be taken to use the proper octets for line breaks if base64
  encoding is applied directly to text material that has not been
  converted to canonical form.  In particular, text line breaks must be
  converted into CRLF sequences prior to base64 encoding.  The
  important thing to note is that this may be done directly by the
  encoder rather than in a prior canonicalization step in some
  implementations.

This is MIME, it specifies (in the same RFC):

  2.10.  Lines

   "Lines" are defined as sequences of octets separated by a CRLF
   sequences.  This is consistent with both RFC 821 and RFC 822.
   "Lines" only refers to a unit of data in a message, which may or may
   not correspond to something that is actually displayed by a user
   agent.

and furthermore

  6.5.  Translating Encodings

   The quoted-printable and base64 encodings are designed so that
   conversion between them is possible.  The only issue that arises in
   such a conversion is the handling of hard line breaks in quoted-
   printable encoding output. When converting from quoted-printable to
   base64 a hard line break in the quoted-printable form represents a
   CRLF sequence in the canonical form of the data. It must therefore be
   converted to a corresponding encoded CRLF in the base64 form of the
   data.  Similarly, a CRLF sequence in the canonical form of the data
   obtained after base64 decoding must be converted to a quoted-
   printable hard line break, but ONLY when converting text data.

So we go over

  6.6.  Canonical Encoding Model

   There was some confusion, in the previous versions of this RFC,
   regarding the model for when email data was to be converted to
   canonical form and encoded, and in particular how this process would
   affect the treatment of CRLFs, given that the representation of
   newlines varies greatly from system to system, and the relationship
   between content-transfer-encodings and character sets.  A canonical
   model for encoding is presented in RFC 2049 for this reason.

to RFC 2049 where we find

 For example, in the case of text/plain data, the text
  must be converted to a supported character set and
  lines must be delimited with CRLF delimiters in
  accordance with RFC 822.  Note that the restriction on
  line lengths implied by RFC 822 is eliminated if the
  next step employs either quoted-printable or base64
  encoding.

and, later

   Conversion from entity form to local form is accomplished by
   reversing these steps. Note that reversal of these steps may produce
   differing results since there is no guarantee that the original and
   final local forms are the same.

and, later

   NOTE: Some confusion has been caused by systems that represent
   messages in a format which uses local newline conventions which
   differ from the RFC822 CRLF convention.  It is important to note that
   these formats are not canonical RFC822/MIME.  These formats are
   instead *encodings* of RFC822, where CRLF sequences in the canonical
   representation of the message are encoded as the local newline
   convention.  Note that formats which encode CRLF sequences as, for
   example, LF are not capable of representing MIME messages containing
   binary data which contains LF octets not part of CRLF line separation
   sequences.
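
(If i read all of this correctly, the canonical encoding model boils
down to: convert the local newline convention to CRLF first, feed
only that canonical form to the base64 encoder, and convert back
after decoding.  A tiny sketch, plain C, assuming the local form uses
bare LF; the function name is made up:)

  #include <stdio.h>

  /* Convert local "\n" line ends to the canonical CRLF form that MIME
   * wants to see *before* a content-transfer-encoding such as base64
   * is applied to text material (RFC 2045 / RFC 2049). */
  static size_t to_canonical_crlf(const char *in, char *out, size_t outsize){
     size_t len = 0;
     for (; *in != '\0' && len + 2 < outsize; ++in) {
        if (*in == '\n') {
           out[len++] = '\r';
           out[len++] = '\n';
        } else
           out[len++] = *in;
     }
     out[len] = '\0';
     return len;
  }

  int main(void){
     char canon[64];
     to_canonical_crlf("line one\nline two\n", canon, sizeof canon);
     /* canon now holds "line one\r\nline two\r\n"; only this canonical
      * form may be handed to the base64 encoder for text material */
     printf("%s", canon);
     return 0;
  }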

Good luck to anyone who tries to make sense of this emojibake.
My MUA still gnaws at 

Re: Tales from the Archives

2018-08-20 Thread Steffen Nurpmeso via Unicode
Terrible!

Ken Whistler wrote in <12e6ad91-89e4-ec87-85ad-8fc4ab3f6...@att.net>:
 |Steffen,
 |
 |Are you looking for the Unicode list email archives?
 |
 |https://www.unicode.org/mail-arch/
 |
 |Those contain list content going back all the way to 1994.

Dear Ken Whistler, no, and yes, having an archive is very good,
though i cannot agree with your statement from 1997-07-16 ("Plan 9
(a Unix OS) uses UTF-8"): Plan 9 feels very different from Unix.

It was just that, on one of the mailing lists i am subscribed to,
i read a citation of a Unicode statement that i had never seen on
the Unicode mailing list itself.  It is very awkward, but i _again_
cannot find what attracted my attention, even with the help of
a search engine.  I think "faith alone will reveal the true name of
shuruq" (1997-07-18).

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


Re: Tales from the Archives

2018-08-20 Thread Steffen Nurpmeso via Unicode
James Kass via Unicode wrote in :
  ...
 |Eighteen years pass, display issues have mostly gone away, nearly
 |everything works "out-of-the-box", and list traffic has dropped
 |dramatically.  Today's questions are usually either highly technical
 |or emoji-related.
 |
 |Recent threads related to emoji included some questions and issues
 |which remain unanswered in spite of the fact that there are list
 |members who know the answers.
 |
 |This gives the impression that the Unicode public list has become
 |passé.  That's almost as sad as looking down the archive posts, seeing
 |the names of the posters, and remembering colleagues who no longer
 |post.
 |
 |So I'm wondering what changed, but I don't expect an answer.

I have the impression that many things which would have been posted
here some years ago are now only available via some forums or other
browser-based services.  What is posted here seems to be mostly
a duplicate of the blog.  (And the website has its pitfalls too; for
example, [1] is linked from [2] but does not exist.)

  [1] http://www.unicode.org/resources/readinglist.html
  [2] http://www.unicode.org/publications/

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)



Re: Split a UTF-8 multi-octet sequence such that it cannot be unambiguously restored?

2017-07-24 Thread Steffen Nurpmeso via Unicode
"Costello, Roger L. via Unicode"  wrote:
 |Suppose an application splits a UTF-8 multi-octet sequence. The application \
 |then sends the split sequence to a client. The client must restore \
 |the original sequence. 
 |
 |Question: is it possible to split a UTF-8 multi-octet sequence in such \
 |a way that the client cannot unambiguously restore the original sequence?
 |
 |Here is the source of my question:
 |
 |The iCalendar specification [RFC 5545] says that long lines must be folded:
 |
 | Long content lines SHOULD be split
 |  into a multiple line representations
 |  using a line "folding" technique.
 |  That is, a long line can be split between
 |  any two characters by inserting a CRLF
 |  immediately followed by a single linear
 |  white-space character (i.e., SPACE or HTAB).
 |
 |The RFC says that, when parsing a content line, folded lines must first \
 |be unfolded using this technique:
 |
 | Unfolding is accomplished by removing
 |  the CRLF and the linear white-space
 |  character that immediately follows.
 |
 |The RFC acknowledges that simple implementations might generate improperly \
 |folded lines:
 |
 | Note: It is possible for very simple
 | implementations to generate improperly
 |  folded lines in the middle of a UTF-8
 |  multi-octet sequence.  For this reason,
 |  implementations need to unfold lines
 |  in such a way to properly restore the
 |  original sequence.

That is not what the RFC says.  It says that simple implementations
simply split lines when the length limit is reached, which might be
in the middle of a UTF-8 sequence.  The RFC is thus an improvement
over other RFCs in the email standards area, which give no hints at
all on how to do that.  Even RFC 2231, which avoids many of the
ambiguities and problems of RFC 2047 (for a different purpose, but
still), does not state it so explicitly for the reverse character
set conversion (which i for one perform _once_ after joining the
chunks back together, but that is not a written word and, thus, ...).
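
(For what it is worth: since CR, LF, SPACE and HTAB are plain ASCII
octets and can never occur inside a UTF-8 multi-octet sequence,
unfolding on the raw octets _before_ any UTF-8 decoding restores the
original sequence unambiguously.  A minimal sketch in C; the function
name and the sample value are mine:)

  #include <stdio.h>

  /* RFC 5545 unfolding: drop every CRLF that is immediately followed
   * by one SPACE or HTAB, working on raw octets.  Because these
   * octets are plain ASCII they cannot be part of a UTF-8 multi-octet
   * sequence, so even a sequence folded in its middle is restored. */
  static size_t ical_unfold(const char *in, size_t inlen, char *out){
     size_t o = 0;
     for (size_t i = 0; i < inlen; ++i) {
        if (in[i] == '\r' && i + 2 < inlen && in[i + 1] == '\n' &&
              (in[i + 2] == ' ' || in[i + 2] == '\t')) {
           i += 2;        /* skip CR, LF and the single folding WSP */
           continue;
        }
        out[o++] = in[i];
     }
     return o;
  }

  int main(void){
     /* "SUMMARY:Grüße" folded in the middle of the two-octet "ü"
      * (0xC3 0xBC) by a very simple implementation */
     const char folded[] = "SUMMARY:Gr\xC3\r\n \xBC\xC3\x9F" "e\r\n";
     char buf[64];
     size_t n = ical_unfold(folded, sizeof folded - 1, buf);
     fwrite(buf, 1, n, stdout);   /* prints SUMMARY:Grüße and a CRLF */
     return 0;
  }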

--steffen
|
|Der Kragenbaer,The moon bear,
|der holt sich munter   he cheerfully and one by one
|einen nach dem anderen runter  wa.ks himself off
|(By Robert Gernhardt)


Re: "A Programmer's Introduction to Unicode"

2017-03-15 Thread Steffen Nurpmeso
"Doug Ewell"  wrote:
 |Philippe Verdy wrote:
 |>>> Well, you do have eleven bits for flags per codepoint, for example.
 |>>
 |>> That's not UCS-4; that's a custom encoding.
 |>>
 |>> (any UCS-4 code unit) & 0xFFE00000 == 0
 |
 |(changing to "UTF-32" per Ken's observation)
 |
 |> Per definition yes, but UTC-4 is not Unicode.
 |
 |I guess it's not. What is UTC-4, anyway? Another name for a UWG meeting
 |held in 1989?
 |
 |> As well (any UCS-4 code unit) & 0xFFE00000 == 0 (i.e. 21 bits) is not
 |> Unicode, UTF-32 is Unicode (more restrictive than just 21 bits which
 |> would allow 32 planes instead of just the 17 first ones).
 |
 |I used bitwise arithmetic strictly to address Steffen's premise that the
 |11 "unused bits" in a UTF-32 code unit were available to store metadata
 |about the code point. Of course UTF-32 does not allow 0x110000 through
 |0x1FFFFF either.
 |
 |> I suppose he meant 21 bits, not 11 bits which covers only a small part
 |> of the BMP.
 |
 |No, his comment "you do have eleven bits for flags per codepoint" pretty
 |clearly referred to using the "extra" 11 bits beyond what is needed to
 |hold the Unicode scalar value.

It surely is a weak argument for a general string encoding.  But
sometimes, and for local use cases, it surely is valid.  You could,
for example, store the wcwidth(3) plus a grapheme cluster codepoint
count in these bits of the first codepoint of a cluster, and then
hide that storage detail behind an accessor interface.
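
A minimal sketch of that (the field layout and names are invented;
the packed form must of course never leave the accessor interface):

  #include <stdint.h>
  #include <stdio.h>

  /* Private in-memory cell: the low 21 bits hold the Unicode scalar
   * value, the 11 spare bits cache a wcwidth(3)-like column width and
   * a grapheme cluster codepoint count. */
  #define CELL_CP(c)      ((uint32_t)(c) & 0x1FFFFFu)
  #define CELL_WIDTH(c)   (((uint32_t)(c) >> 21) & 0x3u)    /* 0..2 columns */
  #define CELL_CLUSTER(c) (((uint32_t)(c) >> 23) & 0x1FFu)  /* up to 511 cps */

  static uint32_t cell_make(uint32_t cp, unsigned width, unsigned cluster_len){
     return (cp & 0x1FFFFFu) | ((uint32_t)(width & 0x3u) << 21) |
        ((uint32_t)(cluster_len & 0x1FFu) << 23);
  }

  int main(void){
     /* U+1F600, two columns wide, sole codepoint of its cluster */
     uint32_t c = cell_make(0x1F600u, 2, 1);
     printf("U+%04X width=%u cluster=%u\n", (unsigned)CELL_CP(c),
        (unsigned)CELL_WIDTH(c), (unsigned)CELL_CLUSTER(c));
     return 0;
  }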

--steffen


Re: "A Programmer's Introduction to Unicode"

2017-03-14 Thread Steffen Nurpmeso
Alastair Houghton  wrote:
 |On 13 Mar 2017, at 21:10, Khaled Hosny  wrote:
 |> On Mon, Mar 13, 2017 at 07:18:00PM +, Alastair Houghton wrote:
 |>> On 13 Mar 2017, at 17:55, J Decker  wrote:
 |>>> 
 |>>> I liked the Go implementation of character type - a rune type - \
 |>>> which is a codepoint, and strings that return runes by index.
 |>>> https://blog.golang.org/strings
 |>> 
 |>> IMO, returning code points by index is a mistake.  It over-emphasises
 |>> the importance of the code point, which helps to continue the notion
 |>> in some developers’ minds that code points are somehow “characters”.
 |>> It also leads to people unnecessarily using UCS-4 as an internal
 |>> representation, which seems to have very few advantages in practice
 |>> over UTF-16.
 |> 
 |> But there are many text operations that require access to Unicode code
 |> points. Take for example text layout, as mapping characters to glyphs
 |> and back has to operate on code points. The idea that you never need to
 |> work with code points is too simplistic.
 |
 |I didn’t say you never needed to work with code points.  What I said \
 |is that there’s no advantage to UCS-4 as an encoding, and that there’s \

Well, you do have eleven bits for flags per codepoint, for example.

 |no advantage to being able to index a string by code point.  As it \

With UTF-32 you can take the codepoint as-is and look it up in the
Unicode classification tables.

 |happens, I’ve written the kind of code you cite as an example, including \
 |glyph mapping and OpenType processing, and the fact is that it’s no \
 |harder to do it with a UTF-16 string than it is with a UCS-4 string. \
 | Yes, certainly, surrogate pairs need to be decoded to map to glyphs; \
 |but that’s a *trivial* matter, particularly as the code point to glyph \
 |mapping is not 1:1 or even 1:N - it’s N:M, so you already need to cope \
 |with being able to map multiple code units in the string to multiple \
 |glyphs in the result.

If you have to iterate over a string to perform some high-level
processing then UTF-8 is an almost equally fine choice, for the very
same reasons you bring up.  And if the usage-pattern "hotness"
pictures shown at the beginning of this thread are correct, then the
size overhead of UTF-8 that the UTF-16 proponents point out turns
out to be a flop.

But i for one gave up on making a stand against UTF-16 or BOMs.
In fact i have come to think UTF-16 is a pretty nice in-memory
representation, and it is a small step to get from it to the real
codepoint that you need in order to decide what something is and
what has to be done with it.  I don't know whether i would really
use it for this purpose, though; i am pretty sure that my core
Unicode functions will (start to /) continue to use UTF-32, because
codepoint-to-codepoint(s) mappings are what the standard describes,
and anything else can be implemented on top of that.  I.e., you can
store three UTF-32 codepoints in a single uint64_t, and i would
shoot myself in the foot if i made this accessible via a UTF-16 or
UTF-8 converter, imho; instead, i (will) make it accessible directly
as UTF-32, and that serves all other formats equally well.

Of course, if it is clear that you are UTF-16 all the way through,
then you can save the conversion, but (the) most (widespread)
Uni(x|ces) are UTF-8 based and it looks as if that will stay.  Yes,
yes, you can nonetheless use UTF-16, but it will most likely not
save you anything on the database side, due to storage alignment
requirements and the necessity to be able to access the data
somewhere.  You can have a single index-lookup array and
a dynamically sized database storage which uses two-byte alignment,
of course; then i can imagine UTF-16 is for the better.  I never
looked at how ICU does it, but i have been impressed by sheer data
facts ^.^
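
For illustration, the three-codepoints-in-a-uint64_t packing could
look like this (a sketch; the names are invented, and nothing of it
would be visible outside the accessors):

  #include <stdint.h>
  #include <stdio.h>

  /* Three 21-bit Unicode scalar values fit in 63 of the 64 bits; the
   * accessor hands them out as plain UTF-32 again, so callers never
   * see the packed form. */
  #define PACK3(a,b,c) \
     (((uint64_t)(a) & 0x1FFFFFu) | (((uint64_t)(b) & 0x1FFFFFu) << 21) | \
      (((uint64_t)(c) & 0x1FFFFFu) << 42))

  static uint32_t unpack3(uint64_t packed, unsigned idx){
     return (uint32_t)((packed >> (21u * idx)) & 0x1FFFFFu);
  }

  int main(void){
     /* three arbitrary scalar values: U+0061, U+00E4, U+1F600 */
     uint64_t p = PACK3(0x0061u, 0x00E4u, 0x1F600u);
     for (unsigned i = 0; i < 3; ++i)
        printf("U+%04X\n", (unsigned)unpack3(p, i));
     return 0;
  }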

--steffen



Re: I'm excited about the proposal to add a brontosaurus emoji codepoint

2016-08-29 Thread Steffen Nurpmeso
Leonardo Boiko  wrote:
 |We obviously need an emoji for every species name listed within The \
 |Official Registry of Zoological Nomenclature.

Ride it out.  Ride it out.
Oh, it shouldn't take that much longer if we all go for it.

--steffen


Re: Encoding the Mayan Script:

2016-06-04 Thread Steffen Nurpmeso
 |http://blog.unicode.org/2016/06/encoding-mayan-script-your-adopt.html
 |
 |This is great news. Congratulations to both UTC and the sponsors for
 |helping to fund this worthwhile encoding effort.

I concur with all my heart!

  Are uschê ocher zîch warâl K'itschê' ub'î'.

Good luck!

  Are uzîchoschik wa'e:
k'ä kaz'ininoq, k'ä katschamamoq, kaz'inonik,
  k'ä kasilanik, k'ä kalolonik, katölona putsch upa kâch.

May the force be with you!

--steffen



Re: Surrogates and noncharacters

2015-05-12 Thread Steffen Nurpmeso
Hans Aberg haber...@telia.com wrote:
 | On 12 May 2015, at 16:50, Philippe Verdy verd...@wanadoo.fr wrote:
 | Indeed, that is why UTF-8 was invented for use in Unix-like environments.
 | 
 | Not the main reason: communication protocols, and data storage \
 | is also based on 8-bit code units (even if storage group \
 | them by much larger blocks).
 |
 |There is some history here:
 |  https://en.wikipedia.org/wiki/UTF-8#History

What happened was this:

  http://doc.cat-v.org/bell_labs/utf-8_history

--steffen


Re: Question about “Uppercase” in DerivedCoreProperties.txt

2014-11-10 Thread Steffen Nurpmeso
Philippe Verdy verd...@wanadoo.fr wrote:
 |glibc is not more broken than any other C library implementing toupper and
 |tolower from the legacy ctype standard library. These are old APIs that
 |are just widely used and still have valid contexts where they are simple and
 |safe to use. But they are not meant to convert text.

Hah!  Legacy is good..  I wish a usable successor were already
standardized by ISO C.
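
(The classical example of why the character-by-character API cannot
"convert text": the full uppercase of U+00DF LATIN SMALL LETTER SHARP S
is the two-letter "SS", which a 1:1 mapping like towupper(3) simply
cannot deliver; on common implementations the character just comes
back unchanged.  A tiny demonstration, assuming a UTF-8 locale is
installed:)

  #include <locale.h>
  #include <stdio.h>
  #include <wctype.h>

  int main(void){
     setlocale(LC_ALL, "de_DE.UTF-8");   /* assumed to be available */
     wint_t sharp_s = 0x00DF;            /* LATIN SMALL LETTER SHARP S */
     wint_t up = towupper(sharp_s);
     /* full casing would give "SS" (or U+1E9E); the 1:1 legacy API
      * typically returns U+00DF unchanged */
     printf("towupper(U+00DF) = U+%04lX\n", (unsigned long)up);
     return 0;
  }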

--steffen


Re: Question about “Uppercase” in DerivedCoreProperties.txt

2014-11-10 Thread Steffen Nurpmeso
Philippe Verdy verd...@wanadoo.fr wrote:
 |Successors to convert strings instead of just isolated characters (sorry,
 |they are NOT what we need to handle texts, they are not even equivalent
 |to Unicode characters, they are just code units, most often 8-bit with
 |char or 16-bit only with wchar_t !) already exist in all C libraries
 |(including glibc), under different names unfortunately (this is the main
 |cause why there are complex header files trying to find the appropriate
 |name, and providing a default basic implementation that just scans
 |individual characters to filter them with tolower and toupper: this is a
 |bad practice,

glibc is the _only_ standard C library i know of that provides its
own homebrew functionality for this (and in a way that i personally
do not want to, and never will, work with).
Even the newest ISO C offers no help at all, so no ISO C programmer
can expect to use any standard facility before 2020, if that is the
time frame; and then operating systems have to adhere to that
standard, and then programmers have to be convinced to use those
functions.
Until then different solutions will have to be used.

--steffen


Re: Question about “Uppercase” in DerivedCoreProperties.txt

2014-11-10 Thread Steffen Nurpmeso
Philippe Verdy verd...@wanadoo.fr wrote:
 |The standard C++ string package could have then used this standard
 |internally in the methods exposed in its API. I cannot understand this
 |simple effort was never done on such basic functionality needed and used in
 |almost all softwares and OSes.

There are plenty of other things one can bang one's head against as
necessary, _that_ is for sure.  Overwhelmingly many, the pessimist
may say.

--steffen


Re: Off-topic: Tate Britain After Dark

2014-08-13 Thread Steffen Nurpmeso
William_J_G Overington wjgo_10...@btinternet.com wrote:
 |http://afterdark.tate.org.uk

Let's just hope those won't cause any damage!
Heaven only knows who wrote their control software...

And how beautiful that old stuff looks when lit up by headlights!

--steffen


Re: Apparent discrepanccy between FAQ and Age.txt

2014-06-10 Thread Steffen Nurpmeso
Hello,

Karl Williamson pub...@khwilliamson.com wrote:
 |The FAQ http://www.unicode.org/faq/private_use.html#sentinels
 |says that the last 2 code points on the planes except BMP were made 
 |noncharacters in TUS 3.1.  DerivedAge.txt gives 2.0 for these.

The comments in DerivedAge.txt (purely informational, except for the
@missing lines) state very clearly:

 # - The supplementary private use code points and the non-character code points
 #   were assigned in version 2.0, but not specifically listed in the UCD
 #   until versions 3.0 and 3.1 respectively.

 |The conformance wording about U+FFFE and U+FFFF changed somewhat in
 |Unicode 2.0, but these were still the only two code points with this 
 |unique status
 |
 |Unicode 3.1 [2001] was the watershed for the development of 
 |noncharacters in the standard. Unicode 3.1 was the first version to add 
 |supplementary characters to the standard. As a result, it also had to 
 |come to grips with the fact the ISO/IEC 10646-2:2001 had reserved the 
 |last two code points for every plane as not a character

Less scattering of information would be a pretty cool thing
nonetheless.  I.e., i think it would be less academic but much nicer
if no FAQ were necessary at all because the standard as such covered
the background information, too.
I remember that one of the reasons i stopped any effort to go with
(the roughly 120 German Mark book of) Unicode 3.0 was that i was
unable to wrap my head around a combining Arabic example somewhere;
you need access to the technical reports to get it done.
--steffen



Re: Corner cases (was: Re: UTF-16 Encoding Scheme and U+FFFE)

2014-06-06 Thread Steffen Nurpmeso
Doug Ewell d...@ewellic.org wrote:
 |Philippe Verdy verdy underscore p at wanadoo dot fr wrote:
 | Not necessarily true.
 |
 | [602 words]
 |
 |This has nothing to do with the scenario I described, which involved
 |removing a BOM from the start of an arbitrary fragment of data,
 |thereby corrupting the data because the BOM was actually a ZWNBSP.
 |
 |If you have an arbitrary fragment of data, don't fiddle with it.
 |
 |If you know enough about the data to fiddle with it safely, it's not
 |arbitrary.

Yeah!
E.g., on the all-UTF-8 Plan9 research operating system:

  ?0[9front.update_bomb_git]$ git ls-files --with-tree=master --|wc -l
 44983
  ?0[9front.update_bomb_git]$ git grep -lI `print '\ufeff'` master|wc -l
12
  ?0[9front.update_bomb_git]$ git grep -lI `print '\ufeff'` master
  master:9front.hg/lib/font/bit/MAP
  master:9front.hg/lib/glass
  master:9front.hg/sys/lib/troff/font/devutf/0100to25ff
  master:9front.hg/sys/lib/troff/font/devutf/C
  master:9front.hg/sys/lib/troff/font/devutf/CW
  master:9front.hg/sys/lib/troff/font/devutf/H
  master:9front.hg/sys/lib/troff/font/devutf/LucidaSans
  master:9front.hg/sys/lib/troff/font/devutf/PA
  master:9front.hg/sys/lib/troff/font/devutf/R
  master:9front.hg/sys/lib/troff/font/devutf/R.nomath
  master:9front.hg/sys/src/ape/lib/utf/runetype.c
  master:9front.hg/sys/src/libc/port/runetype.c

--steffen


Re: Guillements in Email

2014-05-02 Thread Steffen Nurpmeso
Sorry for not replying in the thread, and for jumping in in general;
i'm currently «jolly well fed up» with dealing with mail, but...

Philippe Verdy wrote:
  |I do not criticize the fact of using quoted-printable; but the
  |fact of NOT using it to preserve characters; based on an
  |arbitrary selection of characters

It is also thinkable that Google -- definitely capable of ESMTP
(RFC 1869) -- only falls back to QP or Base64 if the message would
otherwise not conform to the standard.
I.e., i'm thinking of line length issues here, which is not unlikely
given that today everybody composes in ...-based textboxes, and 1000
bytes are reached pretty soon.
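
A sketch of the decision i imagine (the limits come from RFC 5322 and
RFC 1869; the function name is made up and surely not what Google
actually uses):

  #include <stdbool.h>
  #include <stddef.h>
  #include <stdio.h>

  /* Decide whether a body can go out as-is over ESMTP, or whether it
   * has to fall back to quoted-printable/base64: any line longer than
   * 998 octets (the RFC 5322 limit) forces an encoding even if the
   * peer announced 8BITMIME, and 8-bit data forces one if it did not. */
  static bool needs_transfer_encoding(const char *body, bool peer_8bitmime){
     size_t linelen = 0;
     for (const char *p = body; *p != '\0'; ++p) {
        if (*p == '\n') {
           linelen = 0;
           continue;
        }
        if (++linelen > 998)
           return true;                      /* overlong line */
        if (!peer_8bitmime && (unsigned char)*p > 0x7F)
           return true;                      /* 8-bit data, 7-bit peer */
     }
     return false;
  }

  int main(void){
     printf("%d\n", needs_transfer_encoding("short ascii line\n", true));
     return 0;
  }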

--steffen



Re: ID_Start, ID_Continue, and stability extensions

2014-04-28 Thread Steffen Nurpmeso
Markus Scherer markus@gmail.com wrote:
 |On Fri, Apr 25, 2014 at 6:05 AM, Steffen Nurpmeso sdao...@yandex.com wrote:
 |So imho it's a bit like «Kraut und Rüben» («higgledy-piggledy»,
 | says http://www.dict.cc/?s=Kraut+und+R%C3%BCben).
 |
 |Ich weiß was das bedeutet :-)

hmmm, possibly a bit of strong wording.
It was in no way a personal attack against a real person.
Unicode grew over two decades; it is only logical that this results
in loose ends here and there.

 |I parse most of the UCD .txt files with a Python script and munge them into

Ugh, this sounds terrible!  Programmers should have the option to
choose the right tools for the right tasks; i mean, payment and
everything is nice, but in the end it is our own lifetime...

 |Unicode also publishes XML versions of the data, with most or all

Yes, sorry, but i'm not taking a soapy bath in a privately owned
ocean; instead i am dealing with a washtub.
150 MB of shock-headed data that even machines have trouble with!
In the end the text files i need will be a tenth of that, and i work
with them (especially UnicodeData.txt) uncountable times, i.e.,
direct human-to-text interaction.

 |You could also just use a library that provides these properties, rather
 |than roll your own.
 |Shameless plug for ICU here which has most of the low-level properties in
 |source code (from a generator), so no data loading for those. Ask the
 |icu-support
 |list http://site.icu-project.org/contacts for help if needed.

But there still *are* products their creators can be proud of, so
no need for modesty of any kind, imho.
It is of course not as common as in other cultures -- say, Turkish
goldsmiths, African silversmiths, or Japanese swordsmiths and
ceramists et cetera -- but all the more remarkable for that.

 |http://www.unicode.org/reports/tr18/#Compatibility_Properties

Maybe i will turn to a two-pass approach for my own little project,
in order to use the final categories.  Right now i'm single-pass and
am thus required to use ugly things like, e.g.,

  ..
  {.name=Other_Alphabetic, .props=sct_ALPHA, .addprint=true},
  {.name=Ideographic, .props=sct_IDEOGRAPH, .addprint=true},
  ..
  /* Control characters, including the Zl and Zp separators (imho misplaced
   * and should go C) are not PRINTable */
  if (pp->addprint && !(p & (sct_Cc | sct_Cs | sct_Co | sct_Zl | sct_Zp))) {
     p |= sct_PRINT;
     /* And whitespace is not GRAPHical */
     if (!(p & sct_Zs))
        p |= sct_GRAPH;
  }
  ..

 |Viele Grüße,

Oh.  No mention of this brilliant idea of mine, PropRecipe.txt?
Have a nice weekend. :)

Ciao,

--steffen

P.S.:

 |Google Internationalization Engineering

Oh Google, cute little thing you.



Re: ID_Start, ID_Continue, and stability extensions

2014-04-25 Thread Steffen Nurpmeso
Hello,

Markus Scherer markus@gmail.com wrote:
 |On Thu, Apr 24, 2014 at 12:56 PM, Steffen Nurpmeso sdaode\
 |n...@yandex.com wrote:
 | Markus Scherer markus@gmail.com wrote:
 ||I strongly recommend you parse the derived properties rather than trying
 | to
 ||follow the derivation formula, because that can change over time.
 |
 | ..this file includes only those core properties that have
 | themselves a derivation-may-change property?
 |
 |I don't know what that means.

 |What I tried to say is, if you need ID_Start, then parse ID_Start from
 |DerivedCoreProperties.txt. That's more stable (and easier than parsing the
 |pieces and deriving
 |
 |#  Lu + Ll + Lt + Lm + Lo + Nl
 |#+ Other_ID_Start
 |#- Pattern_Syntax
 |#- Pattern_White_Space
 |
 |yourself.

But i *do* need to parse quite a few pieces (since i'm hardly
interested in ID_Start only)!

Unicode has DerivedAge.txt (i don't know where that is derived
from) and i need to parse PropList.txt anyway (to get the full
list of whitespace characters, for example).

So imho it's a bit like «Kraut und Rüben» («higgledy-piggledy»,
says http://www.dict.cc/?s=Kraut+und+R%C3%BCben).

 |For example, at least one of the derivation formulas (for Alphabetic) is
 |changing from 6.3 to 7.0.

That is interesting or frightening, i don't know yet.

Wouldn't it make sense to introduce a single PropListsJoined.txt
that does it all?  Or, for the sake of small and possibly
space-constrained projects..

  ?0[steffen@sherwood ]$ (cd ~/arena/docs.coding/unicode/data;
   ll DerivedCore* PropList*)
   100 [.]   99531 25 Sep  2013 PropList.txt
   820 [.]  836985 25 Sep  2013 DerivedCoreProperties.txt

..and this is what i would do: offer a new file, say, Formula.txt,
which defines exactly the necessary formula, e.g., to quote your
example

 Alphabetic
  UnicodeData.txt
  PropList.txt
 + Lu + Ll + Lt
 + Lm
 + Lo + Nl
 + Other_ID_Start
 - Pattern_Syntax
 - Pattern_White_Space
 =

That concept seems to be scalable at first glance.  Old parsers
will no longer generate correct data in the future, if i understood
correctly?  At least a formula-compatibility version tag should be
added somewhere, so that parsers can automatically prevent
themselves from generating incorrect data.
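
In code the derivation itself would then stay a pure set expression
over already-parsed per-codepoint flags, something like (flag names
invented for the sketch):

  #include <stdbool.h>
  #include <stdint.h>
  #include <stdio.h>

  /* One bit per property, filled while parsing UnicodeData.txt and
   * PropList.txt; an ID_Start-style derivation then is nothing but
   * a set expression over these bits. */
  enum {
     P_Lu = 1u << 0, P_Ll = 1u << 1, P_Lt = 1u << 2, P_Lm = 1u << 3,
     P_Lo = 1u << 4, P_Nl = 1u << 5,
     P_Other_ID_Start = 1u << 6,
     P_Pattern_Syntax = 1u << 7,
     P_Pattern_White_Space = 1u << 8
  };

  static bool derive_id_start(uint32_t props){
     uint32_t incl = P_Lu | P_Ll | P_Lt | P_Lm | P_Lo | P_Nl |
           P_Other_ID_Start;
     uint32_t excl = P_Pattern_Syntax | P_Pattern_White_Space;
     return (props & incl) != 0 && (props & excl) == 0;
  }

  int main(void){
     printf("%d\n", derive_id_start(P_Lu));                    /* 1 */
     printf("%d\n", derive_id_start(P_Lo | P_Pattern_Syntax)); /* 0 */
     return 0;
  }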

I don't know why there need to be megabytes of duplicated data.
Ach; and i'm not going to start dreaming of better support for ISO
C / POSIX character classes.  (Oh.  ...It's surely fruitless.)
Ciao,

--steffen



Re: ID_Start, ID_Continue, and stability extensions

2014-04-24 Thread Steffen Nurpmeso
Markus Scherer markus@gmail.com wrote:
 |I strongly recommend you parse the derived properties rather than trying to
 |follow the derivation formula, because that can change over time.

..so this file includes only those core properties whose derivation
may itself change?
(I hesitated a long time before writing this, though.)

--steffen


Re: Names for control characters

2014-03-13 Thread Steffen Nurpmeso
 |So then the wizard Unicode and the warlock 10646 started casting
 |their spells together.

Fantastic reading.

 |Shazaamaazama! Pockety spoketi! Keeeraack!

History is made by winners.

--steffen