Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-07-05 Thread Denis Jacquerye
On Thu, Jul 4, 2013 at 12:07 PM, Michael Everson ever...@evertype.com wrote:
 On 4 Jul 2013, at 03:56, Phillips, Addison addi...@lab126.com wrote:

 I don't disagree with the potential need for changing the decomposition. 
 That discussion seems clear and is only muddled by talking about variant, 
 language-sensitive rendering. That isn't the only consideration, right?

 No, Addison, we can't change the decomposition. That would invalidate all the 
 data everywhere in Latvia.

 I disagree that language tagging is not a valid means of getting language 
 specific shaping (which could solve a specific problem). This is hardly 
 confined to CJK or Latvian. Minority languages can, in fact, take advantage 
 of it, within reason (documentation is a problem and it presupposes that 
 glyph support is available). In fact, in some ways, language based glyph 
 selection is possibly easier to achieve because the number of 
 implementations is relatively small.

 The problem is in pretending that a cedilla and a comma below are equivalent 
 because some script fonts in France or Turkey routinely write some sort of 
 undifferentiated tick for ç. :-)

Sure, they are not equivalent, but stop pretending it is only in some
script fonts: the page http://typophile.com/node/49347 has plenty of
examples where it is not in script fonts. In some languages the
cedilla can have a shape similar to that of a comma; it's a fact.
Any native speaker will tell you the comma-like form and others are
acceptable. Just look at lemonde.fr or zaman.com.tr: both very popular
newspapers use webfonts with a non-classic cedilla (Le Monde uses TheMix,
and even in print it uses TheAntiqua with its comma-like cedilla, while
Zaman uses a custom font with an attached tick-like cedilla).
This is not the majority practice, but it is frequent enough.

 As far as I can see the only solution is:

 Mandate that only the comma-below shape is appropriate for Ḑḑ Ģģ Ķķ Ļļ Ņņ Ŗŗ 
 despite their decomposition to cedilla.
 Encode a set of undecomposable Dd Gg Kk Ll Nn Rr with invariant cedilla for 
 display of that glyph with those base letters.

 The only strangeness here is that D̦d̦ G̦g̦ K̦k̦ L̦l̦ N̦n̦ R̦r̦ with genuine 
 combining comma below are confusable with the Latvian/Livonian letters, but 
 that is already the case.
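A quick check with Python's standard unicodedata module (an illustrative sketch, not part of the original mail) confirms both points at once: the Latvian/Livonian letters canonically decompose to U+0327 COMBINING CEDILLA, while base letter + U+0326 COMBINING COMMA BELOW remains a distinct, confusable, non-equivalent sequence:

```python
import unicodedata

# Latvian/Livonian "comma below" letters are encoded with canonical
# decompositions to COMBINING CEDILLA (U+0327).
for precomposed, base in [("ģ", "g"), ("ķ", "k"), ("ļ", "l"),
                          ("ņ", "n"), ("ŗ", "r")]:
    assert unicodedata.normalize("NFD", precomposed) == base + "\u0327"
    # base + COMBINING COMMA BELOW (U+0326) is a different sequence:
    # NFC does not fuse it into the Latvian letter, so the two are
    # confusable but not canonically equivalent.
    assert unicodedata.normalize("NFC", base + "\u0326") != precomposed

print("Latvian letters decompose with U+0327, not U+0326")
```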

 None of this addresses the problem of plain text representation or the 
 potential need to represent what are apparently different characters with a 
 single encoding. But if it is just presentation we're talking about... how 
 does this differ from, for example, Serbian vs Russian?

 What, the italic lowercase т? That is really not comparable to this issue.

 Michael Everson * http://www.evertype.com/






--
Denis Moyogo Jacquerye
African Network for Localisation http://www.africanlocalisation.net/
Nkótá ya Kongó míbalé --- http://info-langues-congo.1sd.org/
DejaVu fonts --- http://www.dejavu-fonts.org/




Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Richard Wordingham
On Thu, 04 Jul 2013 22:19:11 -0700
Stephan Stiller stephan.stil...@gmail.com wrote:

 I know of standards for transcribing foreign alphabets (by /target/ 
 locale – Are they relevant here? If so, which?), but 

This may well depend on both source and target locale!  How often
will locale have to be broken down on a non-local basis?  Different
newspapers in the same city may have different conventions!

 ...there must also
 be popular practices (by /source/ locale) that have developed among
 each locale's residents for traveling.

Also old conventions for limited telegraphic equipment and computer
systems. For example, there is a Romanian tradition of converting
combining squiggle below to a following 'z'.  I've not seen that
tradition used by British newspapers - the squiggle is just dropped.
I've seen French comments in Fortran that just drop all the accents -
most disconcerting to read!

Richard.




Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-07-05 Thread Philippe Verdy
All this discussion is going nowhere.
What would be more decisive would be the fact that these shapes of cedilla
had contrasting uses in any language. As far as I can tell, this has not
been demonstrated (not even in Romanian).
So the proposal is to disunify characters that are already encoded, with
the addition of new confusables. Do we need that for any distinction in any
language? Does Marshallese really care about the shape of its cedilla or
comma below when they are perceived as equivalent, or used interchangeably?
For now we've not seen any assertion stating that the use of the wrong
preferred shape is an orthographic error.
The orthography is flexible enough not to care about such visual
differences, and that's why font styles can also be changed without changing
the actual meaning of the text.
That's why nobody ever complained that LeMonde.fr used a comma-like cedilla
for French. People don't care about this visual detail; it is just a
stylistic choice.
But people will care about stable orthographies if plain text searches or
matches don't find the text they are supposed to find. That's why for
French we now need a collation rule that will equate all these shape
variants (when they are not canonically equivalent). But adding more visual
confusables will just impact French now, by forcing us to add an additional
collation rule to equate these non-canonical equivalents. That done, we
will continue to ignore these differences in French (with a very minor
binary difference for searches).
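As an illustration of the kind of match-time folding such a collation rule would perform, here is a minimal Python sketch (`fold_below_marks` is a hypothetical helper name; a real implementation would use an ICU collation tailoring rather than a hand-rolled fold):

```python
import unicodedata

def fold_below_marks(s: str) -> str:
    """Decompose, then treat COMBINING COMMA BELOW (U+0326) and
    COMBINING CEDILLA (U+0327) as the same mark for matching."""
    return unicodedata.normalize("NFD", s).replace("\u0326", "\u0327")

# Turkish ş (s + cedilla) and Romanian ș (s + comma below) are NOT
# canonically equivalent, so a plain normalized comparison fails...
assert unicodedata.normalize("NFD", "\u015f") != unicodedata.normalize("NFD", "\u0219")
# ...but the fold equates them, as the collation rule described above would:
assert fold_below_marks("\u015f") == fold_below_marks("\u0219")
print("cedilla and comma below folded together for matching")
```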

But for exact matches (when the encoded text is used as an identifier, such
as filenames), we will still want to make sure that encoded strings are
canonically equivalent. This won't be possible with the new proposed
characters (meaning that they won't be used in French). But for Marshallese,
these new recommended alternatives will create new difficulties, without
really solving the problem.

I don't think that language tagging is even necessary: a Marshallese user
who wants to see commas or cedillas as he prefers for these characters
can just set a personal preference in his user profile, stating
that he prefers reading text as in Marshallese, and the system will just
render these encoded cedillas/commas below as expected for Marshallese,
even if the encoded text is not written in Marshallese. And I don't think
that anyone will complain (except if that user cannot use a Marshallese
locale, because it is not supported by his software environment...
something that can change... in which case he will just see documents, web
sites and OS interfaces localized in another language, in which the visual
rendering of Marshallese will still be coherent with the context of another
language which has different visual preferences).

I think that this situation is similar to the visual representation of
sinograms, which depends on the user's locale preferences: whether he is
Japanese, Korean, mainland Chinese, southern Chinese (in Hong Kong or Macau),
Singaporean, or lives in another country anywhere else in the world. The
situation can be solved by adding user preferences in his local environment
to select the preferred set of shapes when explicit language tagging of
documents is not applicable.

2013/7/5 Denis Jacquerye moy...@gmail.com

 On Thu, Jul 4, 2013 at 12:07 PM, Michael Everson ever...@evertype.com
 wrote:
  On 4 Jul 2013, at 03:56, Phillips, Addison addi...@lab126.com wrote:
 
  I don't disagree with the potential need for changing the
 decomposition. That discussion seems clear and is only muddled by talking
 about variant, language sensitive rendering. That isn't the only
 consideration, right?
 
  No, Addison, we can't change the decomposition, That would invalidate
 all the data everywhere in Latvia.
 
  I disagree that language tagging is not a valid means of getting
 language specific shaping (which could solve a specific problem). This is
 hardly confined to CJK or Latvian. Minority languages can, in fact, take
 advantage of it, within reason (documentation is a problem and it
 presupposes that glyph support is available). In fact, in some ways,
 language based glyph selection is possibly easier to achieve because the
 number of implementations is relatively small.
 
  The problem is in pretending that a cedilla and a comma below are
 equivalent because in some script fonts in France or Turkey routinely write
 some sort of undifferentiated tick for ç. :-)

 Sure they are not equivalent, but stop pretending it is only in some
 script fonts, the page http://typophile.com/node/49347 has plenty of
 examples where it is not in script fonts. In some languages the
 cedilla can have a shape similar to that of a comma, it's a fact.
 Any native speaker will tell you the comma-like form and others are
 acceptable. Just look at lemonde.fr or zaman.com.tr, both very popular
 newspapers use webfonts with non classic cedilla (Le Monde uses TheMix
 —even in print it uses TheAntiqua with their comma-like cedilla— and
 Zaman uses 

Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Stephan Stiller

Hi Richard,


I know of standards for transcribing foreign alphabets (by /target/
locale – Are they relevant here? If so, which?) [...]

This may well depend on both source and target locale!  How often
will locale have to be broken down on a non-local basis?  Different
newspapers in the same city may have different conventions!
It also depends on the time/era, and sometimes there's just a mess. I 
recall the chaotic variation in the rendering of names of Eastern 
European composers in German (and that of foreign names in general). And 
I think it's futile to try to precisely describe journalistic practice 
in this domain.


What I had in mind was more specific: Germans are supposed to convert 
[ä,ö,ü,ß] to [ae,oe,ue,ss], though I don't know what's considered 
best/legal wrt documents required for entering the US, for example.


I was thinking that there might be a similar Icelandic tradition of 
mapping non-{A, ..., Z}-letters into the {A, ..., Z}∪punctuation space, 
for the purpose of filling out forms in another country and such. In 
that regard, I was wondering whether any of the numerous transcription 
schemes that are floating around (and are sometimes backed by 
standardization bodies) play a role here and are prescriptive or (if 
they are not prescriptive) followed to some extent.



For example, there is a Romanian tradition of converting
combining squiggle below to a following 'z'.
squiggle – you're reminding me of /that other thread/ going on right 
now ;-)



drop all the accents
That (more generally: finding a root / approximation / approximating 
digraph) might be the most common method (my wild guess based on casual 
observation, and it's not exactly a particularly difficult guess to 
make), but for ð/þ it's not clear what people will do. I'll save the 
list the obvious speculation and let someone who knows answer directly.


Stephan



Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-07-05 Thread Martin J. Dürst

On 2013/07/05 16:04, Denis Jacquerye wrote:

On Thu, Jul 4, 2013 at 12:07 PM, Michael Everson ever...@evertype.com wrote:



The problem is in pretending that a cedilla and a comma below are equivalent 
because in some script fonts in France or Turkey routinely write some sort of 
undifferentiated tick for ç. :-)


Can we make sure we have covered this from the other side? Are there any 
languages where there is a letter where both the form with a cedilla and 
the form with a comma below are used, and are distinguished? In other 
words, are there any languages where a user seeing a wrong form would be 
confused as to what the letter is, rather than being potentially 
surprised or annoyed at the details of the shape?


Regards,   Martin.




Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Philippe Verdy
2013/7/5 Richard Wordingham richard.wording...@ntlworld.com

 I've seen French comments in Fortran that just drop all the accents -
 most disconcerting to read!

This is an old problem. It first appeared because of the lack of Unicode
support in famous historic programming languages (and it persists today in
common languages like C, C++, Basic, Fortran, Cobol, Ada, Pascal...), but now
it should be noted that many French programmers have a very poor level of
orthography and don't know how to select the proper accents (they also
often don't care about correct capitalization, using English rules in their
programs, or copying the capitalization of English in program UIs, such as
capitals on every word, even where it would not be used in English; correct
punctuation is also frequently ignored.
In program comments, nobody cares about that, because users of applications
will normally not see them, and because i18n and l10n are handled elsewhere
in code and data).

If you are trying to learn French, you won't understand the contents found
in many popular French-speaking talk areas or forums (and it will be worse
in many short Facebook statuses, tweets, SMS, and chats: if you're not
comfortable with the most common deviations, you will hate these places,
and will not use them very often for some time, then will learn how to use
them only with selected users and communities...).

They are generally horrible, and there's even a widely accepted policy of
not complaining about the orthography used by others (because this raises
flaming and pollutes more). Yes, we are horrified, but most will silently
adapt to that fact. After all, this is a living language, and languages will
evolve with such simplifications, or initial abuses that will be
integrated sooner or later. As long as nobody complains about those that
use the standard orthography, and there's still an effort to make oneself
understood, the orthography will accept some of these simplified forms, if
they don't introduce too much confusion.
We remember the battles about the 1990 French orthographic reform: it was
highly criticized, but anyway now people accept it indifferently... except
some administrations that still want to use an official jargon with
standardized words, expressions, and orthographies.


Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Martin J. Dürst

On 2013/07/05 17:25, Stephan Stiller wrote:


What I had in mind was more specific: Germans are supposed to convert
[ä,ö,ü,ß] to [ae,oe,ue,ss], though I don't know what's considered
best/legal wrt documents required for entering the US, for example.


I have always used Duerst on plane tickets and the like. On the customs 
form that you have to fill in when entering the US (the green one), I 
always just write Dürst; paper is patient. I have added Durst as an 
additional alternate spelling on a long-term visa application form once, 
just in case.


My impression is that US customs officials are either quite 
knowledgeable or quite tolerant on such issues (or a mixture of both). 
The same applies to customs officials in other countries I have traveled 
to, and other people at airports and such. I guess they get used to 
these cases quite quickly, seeing so many passports each day.


Regards,   Martin.




RE: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Jonathan Rosenne
Google is your friend – some clues:

 

http://www.forbes.com/profile/johanna-sigurdardottir/

 

http://en.wikipedia.org/wiki/Althingi

 

Jony

 

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Stephan Stiller
Sent: Friday, 5 July 2013 11:25
To: unicode@unicode.org
Subject: Re: writing in an alphabet with fewer letters: letter replacements

 

Hi Richard,




I know of standards for transcribing foreign alphabets (by /target/ 
locale – Are they relevant here? If so, which?) [...]

This may well depend on both source and target locale!  How often
will locale have to be broken down on a non-local basis?  Different
newspapers in the same city may have different conventions!

It also depends on the time/era, and sometimes there's just a mess. I recall 
the chaotic variation in the rendering of names of Eastern European composers 
in German (and that of foreign names in general). And I think it's futile to 
try to precisely describe journalistic practice in this domain.

What I had in mind was more specific: Germans are supposed to convert [ä,ö,ü,ß] 
to [ae,oe,ue,ss], though I don't know what's considered best/legal wrt 
documents required for entering the US, for example.

I was thinking that there might be a similar Icelandic tradition of mapping 
non-{A, ..., Z}-letters into the {A, ..., Z}∪punctuation space, for the purpose 
of filling out forms in another country and such. In that regard, I was 
wondering whether any of the numerous transcription schemes that are floating 
around (and are sometimes backed by standardization bodies) play a role here 
and are prescriptive or (if they are not prescriptive) followed to some extent.




For example, there is a Romanian tradition of converting
combining squiggle below to a following 'z'.

squiggle – you're reminding me of that other thread going on right now ;-)




drop all the accents

That (more generally: finding a root / approximation / approximating digraph) 
might be the most common method (my wild guess based on casual observation, and 
it's not exactly a particularly difficult guess to make), but for ð/þ it's not 
clear what people will do. I'll save the list the obvious speculation and let 
someone who knows answer directly.

Stephan



Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Stephan Stiller

Hi Jonathan,

I definitely appreciate the partial datapoints from your links, but


Google is your friend

by itself doesn't lead us closer to a real answer, and in this case I 
think that there are at least some good answers, and in any case some 
answers will be better than others.


This reminds me of former South Korean president 이승만 (not exactly a 
sympathetic figure), whose most common English rendering (Syngman 
Rhee) doesn't follow any system of transcription I'm aware of. (For 
Chinese, historical figures seem to be predominantly rendered in pinyin 
now, though I haven't tried to do a thorough check including TW etc, and 
Sun Yat-sen is a famous exception. I think Korean figures mostly follow 
the Revised Romanization now, but Rhee persists and stands out.)


Another interesting case I know is that of a Bhutanese gentleman I met 
in an airport: the name in his passport wasn't listed in the original 
Dzongkha (with Bhutanese Tibetan writing) at all (and nowhere in the 
passport, according to him) but only with Latin letters.


Stephan



RE: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Erkki I Kolehmainen
The fallback for ETH (ð,Ð) is normally d,D and the fallback for THORN (þ,Þ) is 
normally th,Th. 
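These conventions can be sketched as a simple replacement table (an illustrative Python sketch only; `FALLBACKS` and `fall_back` are hypothetical names, and as noted below there is no authoritative standard behind this exact table):

```python
import unicodedata

# Illustrative fallback table: ETH/THORN per the conventions above, plus
# the German ä/ö/ü/ß conventions mentioned elsewhere in this thread.
FALLBACKS = {
    "ð": "d", "Ð": "D", "þ": "th", "Þ": "Th",
    "ä": "ae", "ö": "oe", "ü": "ue", "ß": "ss",
    "Ä": "Ae", "Ö": "Oe", "Ü": "Ue",
}

def fall_back(name: str) -> str:
    mapped = "".join(FALLBACKS.get(ch, ch) for ch in name)
    # For anything left over, just drop the accents (the convention
    # mentioned for British newspapers in this thread).
    nfd = unicodedata.normalize("NFD", mapped)
    return "".join(ch for ch in nfd if not unicodedata.combining(ch))

print(fall_back("Þórður"))  # Thordur
print(fall_back("Dürst"))   # Duerst
```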

 

I’m not aware of any authoritative source for all of the fallbacks. Several 
years ago there was a CEN project trying to define the European fallbacks, but 
the project team could not deliver something generally acceptable.

 

Regards, Erkki  

 

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] 
On Behalf Of Stephan Stiller
Sent: 5 July 2013 8:19
To: Unicode Public
Subject: writing in an alphabet with fewer letters: letter replacements

 

Hi folks,

For languages whose alphabets aren't too far apart (I'm thinking mostly of the 
set of Latin-derived alphabets), what is a good place for finding out how 
letter replacements for letters that are missing in a different country/locale 
are done?

For example, how will an Icelander normally write his name on a form in a 
foreign country that is lacking ð and þ?

I know of standards for transcribing foreign alphabets (by target locale – Are 
they relevant here? If so, which?), but there must also be popular practices 
(by source locale) that have developed among each locale's residents for 
traveling.

Are there comprehensive resources? If not, are efforts underway?

Stephan



Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Stephan Stiller


My impression is that US customs officials are either quite 
knowledgeable or quite tolerant on such issues (or a mixture of both). 
The same applies to customs officials in other countries I have 
traveled to, and other people at airports and such.

Thanks. (And, I don't have the knowledge to agree or disagree.)

I can't resist mentioning the case of Edward Snowden's middle name in 
Hong Kong :-) The issue there was a different one from what I am asking 
about, though, and you never know whether such things actually make a 
difference.


But in general, trying to figure things out and then being consistent 
will be good advice.


Stephan




RE: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Jonathan Rosenne
Hi Stephan,

 

Tell me about it. The official transliteration for Hebrew to the Latin script 
is obsolete, and the situation in this country is a mess.

 

Best regards,

 

Jonathan (Jony) Rosenne

 

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Stephan Stiller
Sent: Friday, 5 July 2013 12:21
To: unicode@unicode.org
Subject: Re: writing in an alphabet with fewer letters: letter replacements

 

Hi Jonathan,

I definitely appreciate the partial datapoints from your links, but



Google is your friend

by itself doesn't lead us closer to a real answer, and in this case I think 
that there are at least some good answers, and in any case some answers will be 
better than others.

This reminds me of former South Korean president 이승만 (not exactly a sympathetic 
figure), whose most common English rendering (Syngman Rhee) doesn't follow 
any system of transcription I'm aware of. (For Chinese, historical figures seem 
to be predominantly rendered in pinyin now, though I haven't tried to do a 
thorough check including TW etc, and Sun Yat-sen is a famous exception. I think 
Korean figures mostly follow the Revised Romanization now, but Rhee persists 
and stands out.)

Another interesting case I know is that of a Bhutanese gentleman I met in an 
airport: the name in his passport wasn't listed in the original Dzongkha (with 
Bhutanese Tibetan writing) at all (and nowhere in the passport, according to 
him) but only with Latin letters.

Stephan



Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Stephan Stiller

Hey Jonathan,


The official transliteration for Hebrew to the Latin script is obsolete

What is the latest recommended scheme?


and the situation in this country is a mess
Let me guess: it has to do with the number of spelling variants in names 
of /aliyah/ immigrants? I've always wondered whether someone named 
Kahn will be spelled כהן as a new citizen of Israel.


Stephan



RE: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-07-05 Thread Erkki I Kolehmainen
And I'm sorry for having supported you then, since the Romanians claimed at the 
time that they could not live with a font variation, since they needed to be 
able to have a distinction between s and t with cedilla and comma below in the 
same text, only to come up with a national standard with only the comma below. 
Furthermore, the Romanian texts are not at all consistent in this area. 

Sincerely, Erkki

-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] 
On Behalf Of Michael Everson
Sent: 5 July 2013 12:11
To: unicode Unicode Discussion
Subject: Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

...

I fought this battle back when I supported the Romanian disunification of their 
letters from the Turkish ones. We're just finishing the job now, as far as I 
can see. 

Michael Everson * http://www.evertype.com/







Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-07-05 Thread Michael Everson
On 5 Jul 2013, at 11:27, Erkki I Kolehmainen e...@iki.fi wrote:

 And I'm sorry for having supported you then, since the Romanians claimed at 
 the time that they could not live with a font variation, since they needed to 
 be able to have a distinction between s and t with cedilla and comma below in 
 the same text,

They do, Erkki, when a text contains both Romanian and Turkish. Each language 
should be represented correctly. 

 only to come up with a national standard with only the comma below.

I don't think we have to worry about that old 8-bit code table at this point. 

 Furthermore, the Romanian texts are not at all consistent in this area. 

No, and the fault lies with 8859. More and more new texts however conform to 
the expected encoding. 

By the way, what we did for the Romanians at the time was to add precomposed Ș 
and ș to the standard in addition to the existing precomposed Ş ş. Note that 
the former could at that time already be supported by the UCS, as S s with 
combining comma below. 
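The effect Everson describes is directly visible in normalization (a quick illustrative Python sketch): the precomposed Romanian letters and the combining comma-below sequences are canonically equivalent, while the Turkish cedilla letters remain distinct:

```python
import unicodedata

# S/s + COMBINING COMMA BELOW (U+0326) compose to the precomposed
# Romanian letters U+0218/U+0219 under NFC...
assert unicodedata.normalize("NFC", "S\u0326") == "\u0218"  # Ș
assert unicodedata.normalize("NFC", "s\u0326") == "\u0219"  # ș
# ...while the Turkish letter with cedilla decomposes to U+0327 and is
# not equivalent to the comma-below forms.
assert unicodedata.normalize("NFD", "\u015f") == "s\u0327"  # ş
assert unicodedata.normalize("NFC", "s\u0327") != "\u0219"

print("Romanian comma-below and Turkish cedilla letters stay distinct")
```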

Michael Everson * http://www.evertype.com/





Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Ian Clifton
Philippe Verdy verd...@wanadoo.fr writes:

 2013/7/5 Richard Wordingham richard.wording...@ntlworld.com

 I've seen French comments in Fortran that just drop all the
 accents -
 most disconcerting to read!
 This is an old problem. It first appeared because lack of Unicode
 support in famous historic programming languages (and it persists
 today in common languages like C, C++, Basic, Fortran, Cobol, Ada,
 Pascal...), but now it should be noted that many French programmers

This is no longer true of Ada, at least: both the standard and the
popular GNAT implementation support Unicode:

with Ada.Text_Io;
procedure Testu is
   Grüß : constant String := "Grüezi mitenand";
begin
   Ada.Text_Io.Put_Line(Grüß);
end Testu;

-- 
Ian ◎




RE: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Jonathan Rosenne
The official transliteration for Hebrew to the Latin script was, in my opinion,
based on German. Thus ו was w, etc. It was revised in 2011 but the revised
version is not in common use.

 

Kahn would normally be קאהן.

 

Here is a press article about it (in Hebrew):

 

http://www.nrg.co.il/online/1/ART1/438/793.html

 

The “up to date” official version may be found at:

http://hebrew-academy.huji.ac.il/hahlatot/TheTranscription/Documents/ATAR1.pdf

 

 

Best regards,

 

Jonathan (Jony) Rosenne

 

From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Stephan Stiller
Sent: Friday, 5 July 2013 12:57
To: unicode@unicode.org
Subject: Re: writing in an alphabet with fewer letters: letter replacements

 

Hey Jonathan,




The official transliteration for Hebrew to the Latin script is obsolete

What is the latest recommended scheme?




and the situation in this country is a mess

Let me guess: it has to do with the number of spelling variants in names of 
aliyah immigrants? I've always been wondering whether someone named Kahn will 
be spelled כהן as a new citizen of Israel.

Stephan



ISO 2955

2013-07-05 Thread Dreiheller, Albrecht
A topic that is different from, but related to, the current discussion ("writing
in an alphabet with fewer letters: letter replacements") is the question of
writing units with limited character sets.
This is not a somewhat academic question but a real, existing problem in some
situations.
situations.

You might remember there was a standard named
ISO 2955-1983, "Information processing -- Representation of SI and other units
in systems with limited character sets".
It gave some rules for writing "deg" for "°", "u" for "µ", "m2" for "m²",
and so on.

However, it was withdrawn in 2001.
Does anybody know whether there is a successor standard?
If not, does someone know the reason?
DIN 66030 gives a solution for German; are there similar standards for other
countries, especially Japan and China?

Let me briefly outline the context of the question:

Even if ISO 2955 was designed with a focus on systems using ISO 646 (ASCII) as 
the character set,
there are certain issues even in Unicode-based systems, especially in Japanese 
and Chinese contexts.

Such a context can briefly be described as follows:
- Multi language user interface (language switching is possible)
- Only one font per language, plain text only (no font linking, no font 
fallback, no formatted text)
- East Asian languages use standard third party fonts, like TrueType fonts.
Any kind of embedded devices are good examples.

The simple and often followed rule for East Asian fonts is to cover a 
well-known language-specific
encoding standard, like GB2312 or GBK for Simplified Chinese, Shift-JIS for 
Japanese,
EUC-KR or UHC for Korean, and BIG5 for Traditional Chinese.
Since these standards contain several alphabets like Cyrillic, Latin, Greek, 
Bopomofo, Kana, and so on,
the easiest way for font manufacturers is to use the encoding standard as a 
guideline for the
character set coverage, even for fonts designed for Unicode-based systems.

Now you buy and install a standard font, and your Japanese or Chinese user 
interface looks beautiful
on the target system. Everyone is happy.
However, this is true only until you have texts with units like "km·m/s²" or 
"µA/mm²".
The translation was done on a Unicode and SI basis, but the font covers only 
Shift-JIS, GBK,
or BIG5, respectively.
Thus U+00B2 Superscript Two, U+00B3 Superscript Three and U+00B5 Micro Sign  
are missing in the font
and cannot be displayed. Fonts without these limitations are not easy to obtain.
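The ISO 2955-style substitutions mentioned above can be sketched as a simple mapping (an illustrative Python sketch; `UNIT_FALLBACKS` and `limit_charset` are hypothetical names, the table covers only the characters discussed here, and the "." for "·" entry is an assumption rather than a quote from the standard):

```python
# ISO 2955-style fallbacks: "deg" for °, "u" for µ, digits for the
# superscripts, so that SI unit strings survive fonts/encodings with
# Shift-JIS-, GBK-, or BIG5-level coverage.
UNIT_FALLBACKS = str.maketrans({
    "°": "deg", "µ": "u", "²": "2", "³": "3", "·": ".",
})

def limit_charset(unit: str) -> str:
    return unit.translate(UNIT_FALLBACKS)

print(limit_charset("µA/mm²"))   # uA/mm2
print(limit_charset("km·m/s²"))  # km.m/s2
# The fallback forms are plain ASCII, so any font/encoding can carry them:
limit_charset("µA/mm²").encode("ascii")
```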

Thanks in advance for any answer.
Albrecht




Re: Latvian and Marshallese Ad Hoc Report (cedilla and comma below)

2013-07-05 Thread Philippe Verdy
2013/7/5 Michael Everson ever...@evertype.com

 On 5 Jul 2013, at 08:04, Denis Jacquerye moy...@gmail.com wrote:

  The problem is in pretending that a cedilla and a comma below are
 equivalent because in some script fonts in France or Turkey routinely write
 some sort of undifferentiated tick for ç. :-)
 
  Sure they are not equivalent, but stop pretending it is only in some
 script fonts, the page http://typophile.com/node/49347 has plenty of
 examples where it is not in script fonts. In some languages the cedilla can
 have a shape similar to that of a comma, it's a fact.

 Yes, well, if there are non-script fonts which have this feature, it
 nevertheless derived from handwriting. Would any French primer for young
 children routinely use a full-formed C WITH COMMA BELOW C̦ c̦ regularly
 throughout? No. Would readers of Le Monde notice if all the fonts one day
 shifted to C̦ c̦? Of course they would -- and I'd wager €100 they would
 protest, and loudly.

  Any native speaker will tell you the comma-like form and others are
 acceptable.

 Not by any means in all contexts. In genuine taste-tests, Ç ç would be
 universally selected as the more correct form by French users. C̦ c̦
 would not be.


Actually users will only protest if the shape is not attached or is displayed
too far below.

But the exact shape does not matter much: an attached vertical tick, a
comma touching the bottom of the letter (9-shaped, a )-shaped diagonal
stroke, or a diagonal triangle), or the 5-shaped standard cedilla will all
be accepted. It will also be accepted if it's a small mirrored c, or a
right half-circle, whether or not connected to the base of the letter by a
thin vertical or diagonal stroke.

As long as this is coherent with the general font style, and it is clearly
visible and not confused with a dot below. Handwritten French texts
frequently do not use the standard 5-shaped glyph, but some attached
diagonal stroke connected to the center bottom of the letter c.

[notes]
With the exception of untranslated foreign toponyms, personal names, and
possible trademarks, only the letters c and C take a cedilla in French,
and most users cannot type the cedilla below the capital letter C with
their standard keyboard layout, e.g. on Windows, MacOS, or Linux, without
complication or without using a personal customized layout.

Word processors, and spell checkers in web browsers, propose a correction
for the frequent word "Ça", the most common case where the cedilla is
missing below C.

There is also the expression "Ç'a", the contraction of "Ça" followed by
the conjugated auxiliary verb "a", used in the compound past (passé
composé) tense; but this rare form is avoided by most users, who use the
imperfect (imparfait) or simple past (passé simple), i.e. non-compound
tenses, or who use the synonym "Cela" before the compound past. (Some more
advanced French users will avoid it because of the phonological
alliteration of "Cela a...", where "Ç'a..." is still preferable to respect
the correct tense concordance of sentences and correct phonology,
including the contraction; some users also do not use the contraction, and
say or write "Ça a...".) Many users avoid the difficulty caused by "Ça" on
their keyboard by writing its synonym "Cela".

Other cases for capital C with cedilla occur only in text written in
all-caps style (which should not be overused, and is generally limited to
a few paragraphs for strong notices, like disclaimers of responsibility in
contracts and licences, or to strengthen a single word like GARÇON(S)
vs. FILLE(S) within long sentences, but not in isolated cases like data
column headers, which should still be written Garçon(s) or garçon(s)).
Words containing a cedilla are frequent only because of the word "ça"; the
presence of a cedilla outside the word "ça" is still low in French, mostly
in conjugated verbs whose infinitive ends in -cer (like "nous enlaçons").

In other words, the real difficulty of the cedilla in French is to have it
properly displayed below non-initial lowercase c; most of these cases are
in conjugated verbs and a few common nouns like "garçon(s)" and
"glaçon(s)", the less frequent words "limaçon(s)" and "colimaçon(s)", plus
some well-known toponyms like "Curaçao" (the island in the Dutch Antilles,
or the name of a liqueur well known for its blue color)... In all these
cases, the shape of the cedilla does not really matter, as long as some
mark is present below the c for correct reading. The initial forms "Ca"
and "C'", instead of the correct forms "Ça" and "Ç'", are frequent, but do
not cause a reading problem when they occur at the beginning of a
sentence.

In fact, this is perceived as a typographic problem more than an
orthographic one (just like the shape of the apostrophe, which is
preferably the 9-shaped high comma ’ but is also absent from keyboards,
which offer only the vertical tick of the encoded ASCII apostrophe-quote).


Re: ISO 2955

2013-07-05 Thread Jukka K. Korpela

2013-07-05 17:01, Dreiheller, Albrecht wrote:


A topic that is different from, but related to, the current discussion
("writing in an alphabet with fewer letters: letter replacements") is the
question of writing units with limited character sets.
This is not a merely academic question but a real, existing problem in
some situations.


It is, even though many Unicode-minded people would like it to be 
definitely water under the bridge. But the problem is less common than 
many other people think. In most cases, it is a matter of not knowing 
how to type the correct characters. There are many tools for the 
purpose, but no really universal way, so people need to learn specific 
methods, which may vary by situation.


It’s usually not the character repertoire that is limited but people’s 
ability to type characters.



You might remember there was a standard named
ISO 2955-1983  Information processing -- Representation of SI and other
units in systems with limited character sets.


It seems to be now freely available at
http://freepdfdb.org/pdf/international-standard-68243320.html
Whether that’s legal or not is fairly irrelevant in practice, as the 
standard has historical relevance only.



However it was withdrawn in 2001.
Does anybody know whether there is a successor standard?


There isn’t.


If not, does someone know the reason?


It was not useful. In the rare cases where the repertoire is really 
limited, people will use some ad hoc notations and hopefully explain 
them, or avoid the problem. In text, you can simply write “micrometers” 
if you cannot type “µm”. A standard on such a marginal issue might be 
seen even as misleading: it might make people think that the specific 
notations there are widely known and understood. They aren’t. So if you 
need to use some notations due to lack of some special characters, it is 
better that you realize that there is no generally known method to do 
so, and you realize that you need to explain your notations.


Yucca






Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Richard Wordingham
On Thu, 04 Jul 2013 22:19:11 -0700
Stephan Stiller stephan.stil...@gmail.com wrote:

 Hi folks,
 
 For languages whose alphabets aren't too far apart (I'm thinking
 mostly of the set of Latin-derived alphabets), what is a good place
 for finding out how letter replacements for letters that are missing
 in a different country/locale are done?

A good source might be the rules for the name in the first line of the
'machine readable passports'.  Unfortunately, a quick hunt on the
internet failed to find any such rules.

Richard.



Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Philippe Verdy
It seems that each country issuing passports has its own national rules
for transliterating personal names on passports. They will use the
national alphabet, extended with some national transliteration rules for
other alphabets (to Basic Latin with a few extensions, or using just the
letters of Latin alphabets commonly used in regional languages; in Japan
or China, they could possibly use the simplified bopomofo alphabet or kana
syllabaries), or they will ask people to define and register their own
transliteration in the national registry.

Passports also contain other transliterated items, such as toponyms and
country names (but countries are also encoded, or shown at least in
English; European passports contain a few static preprinted texts in a
dozen languages, using the Latin, Greek and Cyrillic alphabets for the
national languages; but not everything is translated). They also contain
dates and numbers, using Western Arabic digits. But official seals may
display anything.

I have absolutely no information about what is encoded in the machine
readable part of my passport, or in accompanying leaflets or stickers
applied to it such as visas, or whether this data is updated when crossing
a border. But I know that this data now contains some biometric data (more
to come) and a digitally signed photograph and personal signature (the
content of this data is subject to change, notably because of US demands;
some travelers need a new accompanying form added to their existing
passport for travelling to or via the US, which they will get when
requesting a visa; some old passports are also refused and need to be
replaced). Not everything is in the passport, and travel agencies will
also request other information that will be transmitted before the trip is
authorized.

My opinion is that there is no stable standard; countries change their
rules every year by mutual negotiations, by additional national
restrictions, or after special international events (and will inform the
travel agencies). Travelers should be informed by travel agencies about
the procedures for visas and their scope (whether the visas or national
identity cards are valid across multiple countries within a free-travel
area whose member countries apply the same rules).

A day will come when some countries will request DNA identification data
produced in the origin country and certified by its approved national
labs, or will take sealed DNA samples on entry (for short touristic trips,
these may be analyzed later if the person does not leave the country after
the visa expires or does not take the return flight sold by the travel
agency, to save lab analysis costs in the visited country). But digitally
signed biometric data or photos are completely out of scope for Unicode.


2013/7/5 Richard Wordingham richard.wording...@ntlworld.com

 On Thu, 04 Jul 2013 22:19:11 -0700
 Stephan Stiller stephan.stil...@gmail.com wrote:

  Hi folks,
 
  For languages whose alphabets aren't too far apart (I'm thinking
  mostly of the set of Latin-derived alphabets), what is a good place
  for finding out how letter replacements for letters that are missing
  in a different country/locale are done?

 A good source might be the rules for the name in the first line of the
 'machine readable passports'.  Unfortunately, a quick hunt on the
 internet failed to find any such rules.

 Richard.




Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Richard Wordingham
On Fri, 5 Jul 2013 21:36:24 +0200
Philippe Verdy verd...@wanadoo.fr wrote:

 I have absolutely no information about what is encoded in the machine
 readable part of my passport,

See http://www.icao.int/publications/Documents/9303_p1_v1_cons_fr.pdf ,
especially Appendice 8  (p IV-50).  The English version is available as
http://www.icao.int/publications/Documents/9303_p1_v1_cons_en.pdf ,
especially Appendix 8 (p IV-47).

Richard.



Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Stephan Stiller



See http://www.icao.int/publications/Documents/9303_p1_v1_cons_fr.pdf ,
especially Appendice 8  (p IV-50).  The English version is available as
http://www.icao.int/publications/Documents/9303_p1_v1_cons_en.pdf ,
especially Appendix 8 (p IV-47).

I suppose you can't go wrong with what your own passport says :-)

Some obligatory comments:

 * The -XX variants (Ñ → NXX (N) and Ü → UXX (UE)) can't be intended
   for human use.
 * Ŋ → N (shown there with lower-case ŋ – if the implication is that it
   cannot occur word-initially, that is not stated, and it is not clear
   that all the other letters can) is surprising.
 * disallowed: Ä↛A , Ö↛O , Ü↛U  (as are: Å↛A , Ø↛O)


Stephan



Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Richard Wordingham
On Fri, 5 Jul 2013 20:49:23 +0100
Richard Wordingham richard.wording...@ntlworld.com wrote:

 The English version is available as
 http://www.icao.int/publications/Documents/9303_p1_v1_cons_en.pdf ,
 especially Appendix 8 (p IV-47).

And there are recommended foldings in Appendix 9!

Richard.



Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Stephan Stiller



I suppose you can't go wrong with what your own passport says

On second thought ...


  * disallowed: Ä↛A , Ö↛O , Ü↛U  (as are: Å↛A , Ø↛O)

... I have a Turkish friend for whom it is Ö→O, not OE. This calls into 
question the general applicability of these rules. A few years ago he 
also told me that it's nice that Germany has recommended replacement 
rules because Turkey doesn't. The linked-to document is dated 2006, but 
he told me this after 2006. His knowledge might have been out of date 
(maybe Turkey now does Ö→OE), but in the light of this the extent to 
which these rules reflect popular usage remains very much unclear, and 
we all seem to be agreeing that it'd be unlikely if practice were 
uniform anyways.


Stephan



RE: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Jonathan Rosenne
Latin and Cyrillic? That's it?

Best regards,

Jonathan (Jony) Rosenne


-Original Message-
From: unicode-bou...@unicode.org [mailto:unicode-bou...@unicode.org] On Behalf 
Of Richard Wordingham
Sent: יום ו 05 יולי 2013 22:49
To: Unicode Public
Subject: Re: writing in an alphabet with fewer letters: letter replacements

On Fri, 5 Jul 2013 21:36:24 +0200
Philippe Verdy verd...@wanadoo.fr wrote:

 I have absolutely no information about what is encoded in the machine 
 readable part of my passport,

See http://www.icao.int/publications/Documents/9303_p1_v1_cons_fr.pdf , 
especially Appendice 8  (p IV-50).  The English version is available as 
http://www.icao.int/publications/Documents/9303_p1_v1_cons_en.pdf , especially 
Appendix 8 (p IV-47).

Richard.





Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread Philippe Verdy
The only relevant part is:
[quote]Machine-verifiable biometric element. A unique physical element of
personal identification (for example an iris pattern, a fingerprint, or
facial characteristics), stored on a travel document in a form readable
and verifiable by machine.[/quote]

No more details about the data stored on the magnetic stripe or in the
RFID chip. The document only gives information about the printed text
readable by OCR, plus a few other mechanical security features used in
building them.

There is also some small data perforated into some pages; I have no idea
whether it is secured or contains anything other than a unique ID of the
passport itself, the rest of the data being accessed over computer
networks.



2013/7/5 Richard Wordingham richard.wording...@ntlworld.com

 On Fri, 5 Jul 2013 21:36:24 +0200
 Philippe Verdy verd...@wanadoo.fr wrote:

  I have absolutely no information about what is encoded in the machine
  readable part of my passport,

 See http://www.icao.int/publications/Documents/9303_p1_v1_cons_fr.pdf ,
 especially Appendice 8  (p IV-50).  The English version is available as
 http://www.icao.int/publications/Documents/9303_p1_v1_cons_en.pdf ,
 especially Appendix 8 (p IV-47).

 Richard.




Re: writing in an alphabet with fewer letters: letter replacements

2013-07-05 Thread David Starner
On Fri, Jul 5, 2013 at 10:42 AM, Richard Wordingham
richard.wording...@ntlworld.com wrote:
 This is no longer true of Ada, at least: both the standard and the
 popular GNAT implementation support Unicode:

Grüß : constant String := "Grüezi mitenand";

 That only demonstrates Latin-1 support!  It seems the current
 Ada standard supports the original ISO 10646 31 bits.

That may only demonstrate Latin-1 support, but full Unicode is still
there. The base library includes one non-ASCII character in the
include files:

   Pi : constant :=
  3.14159_26535_89793_23846_26433_83279_50288_41971_69399_37511;
   π  : constant := Pi;

If you're looking at the current Ada standard, Ada 2012, it explicitly
references ISO/IEC 10646:2011.

--
Kie ekzistas vivo, ekzistas espero.




Re: weltformel.c asking for peer review

2013-07-05 Thread Roman Czyborra
Dear Philippe, thank you for your observations, critiques and
recommendations:

On Fri, Jul 5, 2013 at 8:59 PM, Philippe Verdy verd...@wanadoo.fr wrote:

 Why did it reach the Unicode Sarasvati list ?


Because Unicode is about Finding The Best Representation Of The Universe's
Codes (=Languages).


 If you ask for help about what's wrong with this undocumented program,
 maybe you should consider just these loops:

for(w=0;--w;)for(z=0;z--;)for(y=0;y--;)for(x=0;x--;)

 - the w loop will run from -1 down to -128, then from 127 down to 1, so
 its body never runs with w=0
 - the z, y, x loops will never run (the initial value is 0, and you test
 it in the condition using post-decrementation, so it is the initial value
 that gets tested)


Right, my Heureka was a premature ejaculation, due to the compiler
accepting my program as syntactically correct and my psyche's reasoning
feeling overly confident that this was already it.  I got suspicious after
the computation process took 99% CPU load and a slim 384k for over a
couple of hours, and eventually found and fixed the second problem myself,
but not the w=0 one.

Talking to Doctor http://de.wikipedia.org/wiki/Marc_Benecke, who certified
that my weltformel seems lupenrein doktorabel (flawlessly
doctorate-worthy) to him, I decided I have to rename w,x,y,z into h,i,j,k
in honor of http://de.wikipedia.org/wiki/Plancksches_Wirkungsquantum and
http://en.wikipedia.org/wiki/Quaternion_group, and am still pondering
whether to
- go from signed char to unsigned char and just manually work with two's
complement
- go from char to int but count h from 1 to 127 and i,j,k from -127 to 127
- loopholing my miniature universe so that information from 127 propagates
in 1c to -128 even though the real universe seems to need at least
(2^2^256h)^4 if not infinite timespace

In other words, the initialization does nothing, and leaves all cells at
 their initial value 0 (from calloc); only the following line sets one
 differently.




b(10,0,0,0)=7; /* urknall! 10 is just random choice */

 This is a case of over-optimization, using some untested assumptions.


Assuming my initialization had successfully filled with an equal
oscillating load of minuses and pluses, I just wanted to visualize the
electromagnetic vacuum wave before urknall for ten steps.  These could of
course also be reduced to zero, but I need exactly one mutating urknall to
turn the perfectly ordered harmony into a germ of life and chaos that can
evolve and create biological structures like hydrogen and Brad Pitt and
Angelina Jolie.

Use integers in loops, and unsigned chars for your 4GB work buffer
 containing the hypercube:

main(){int w,x,y,z; unsigned char*c=calloc(1L<<32,sizeof(unsigned char));
for(w=256;--w;)for(z=256;--z;)for(y=256;--y;)for(x=256;--x;)b(w,x,y,z)=(x+y+z)%2?1:4;

 Anyway this program is unnecessarily slow for computing 6 sets of 256x256
 PPM bitmaps with RGB color space.


I realized I had tried to recompute each tomographic slice 64k times,
which definitely shows another sign of the prototype's immaturity.

You can certainly avoid allocating 4GB of memory (many systems have a total
 of 4GB, including for the OS and the shared memory space for transferring
 textures to video accelerators).


You mean something like a growing cycle realloc(=4);ppm();free();?

You should also avoid creating so many files per directory (the filesystem
 or your shell does not like sorting so many files; the implementers of web
 browser caches know that, and distribute files into separate directories).
 First, you have 6 distinct sets, which could have their own folders; then
 you have 65536 files per set, which you can split into 256 subsets of 256
 files.

 - The files set p(w, x, 256, 256) should be in files wx/w/x.extension
 - The files set p(w, 256, y, 256) should be in files wy/w/y.extension
 - The files set p(w, 256, 256, z) should be in files wz/w/z.extension
 - The files set p(256, x, y, 256) should be in files xy/x/y.extension
 - The files set p(256, x, 256, z) should be in files xz/x/z.extension
 - The files set p(256, 256, y, z) should be in files yz/y/z.extension


Perfect, I love your purely NoSQL model and will incorporate it and credit
you!

This just means creating the 6 subdirectories wx, wy, wz, xy, xz
 and yz, each one containing 256 subdirectories.


And each of these 6 subdirs contains 256 subdirs containing 256 files,
each with 256 rows of 256 colors.
They can each further be compressed by ppmlabel -text $filename $filename |
pamenlarge 4 | pnmtopng > $filename.png.
And any 256 can further be combined from the screenshots into three
video/mp4 files through ppmtompeg $CONFIG or ffmpeg $ARGS.

These bitmaps are also overlarge (PPM files for the RGB color space with 1
 bit of information per pixel are 8 times larger than necessary, using 1
 full storage byte per pixel and per color plane instead of just 1 bit). So
 each 256x256 bitmap takes 192KB plus 13 bytes for the magic header. As you
 store all bitmaps in the same current directory, you'll get
 256x256*6=393216 files, taking 193KB each, i.e. a giant storage space of 

Re: weltformel.c asking for peer review

2013-07-05 Thread Roman Czyborra
Oops, total URL fuckup, here are the tested errata:

https://en.wikipedia.org/wiki/Mark_Benecke
https://en.wikipedia.org/wiki/Wolfram%27s_2-state_3-symbol_Turing_machine
https://en.wikipedia.org/wiki/Half-integers
https://en.wikipedia.org/wiki/Quaternion_group
https://en.wikipedia.org/wiki/Planck_constant
https://en.wikipedia.org/wiki/Speed_of_light
https://en.wikipedia.org/wiki/Theory_of_everything
