Re: UCD in XML or in CSV?

2018-08-31 Thread Marcel Schneider via Unicode
On 31/08/18 19:59 Ken Whistler via Unicode wrote:
[…]
> Second, one of the main obligations of a standards organization is 
> *stability*. People may well object to the ad hoc nature of the UCD data 
> files that have been added over the years -- but it is a *stable* 
> ad-hockery. The worst thing the UTC could do, IMO, would be to keep 
> tweaking formats of data files to meet complaints about one particular 
> parsing inconvenience or another. That would create multiple points of 
> discontinuity between versions -- worse than just having to deal with 
> the ongoing growth in the number of assigned characters and the 
> occasional addition of new data files and properties to the UCD.

I did not want to cause trouble by asking to move conventions back and forth.
I wanted to learn why UnicodeData.txt was released as a draft without a header 
or anything of the kind, given that Unicode knew well in advance that the scheme 
adopted at first release would be kept stable for decades, or forever. 

Then I’d like to learn how Unicode came to not devise a consistent scheme
for all the UCD files, if any such scheme could have been devised, so that people 
can assess whether complaints about inconsistencies are well founded or not. 
It is not enough for me that a given ad-hockery is stable; IMO a standards body 
should also be accountable for its being well designed. That is not what is said 
about UnicodeData.txt, although it is the only file in the UCD effectively 
formatted for streamlined processing. Was there not enough time to think about 
a header line and a file header? With a header line the format would be flexible, 
and most of the problems would be solved simply by specifying that parsers 
should start by counting the fields before creating their storage arrays. 
We are lacking a real history of Unicode explaining why everybody was in such 
a hurry; “authors falling like flies” is the only hint that 
comes to mind.
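
To illustrate, here is a minimal Python sketch of such a parser; the header line 
is hypothetical (today’s UnicodeData.txt has none), so the sketch falls back to 
the 15 documented semicolon-separated fields:

import csv

DOCUMENTED_FIELDS = 15  # UnicodeData.txt currently has 15 semicolon-separated fields

def load_unicodedata(path):
    with open(path, encoding="utf-8") as f:
        reader = csv.reader(f, delimiter=";")
        first = next(reader)
        # Hypothetical header line starting with "#": count its fields
        # before creating the storage arrays; otherwise assume the
        # documented field count and keep the first data row.
        if first and first[0].startswith("#"):
            field_count, rows = len(first), []
        else:
            field_count, rows = DOCUMENTED_FIELDS, [first]
        rows.extend(row for row in reader if len(row) == field_count)
        return field_count, rows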

And given that Unicode appears to have missed that opportunity, I would like to 
discuss whether it is time to add a more polished file for better usability.

> 
> Keep in mind that there is more to processing the UCD than just 
> "latest". People who just focus on grabbing the very latest version of 
> the UCD and updating whatever application they have are missing half the 
> problem. There are multiple tools out there that parse and use multiple 
> *versions* of the UCD. That includes the tooling that is used to 
> maintain the UCD (which parses *all* versions), and the tooling that 
> creates UCD in XML, which also parses all versions. Then there is 
> tooling like unibook, to produce code charts, which also has to adapt to 
> multiple versions, and bidi reference code, which also reads multiple 
> versions of UCD data files. Those are just examples I know off the top 
> of my head. I am sure there are many other instances out there that fit 
> this profile. And none of the applications already built to handle 
> multiple versions would welcome having to permanently build in tracking 
> particular format anomalies between specific versions of the UCD.

That point is clear to me, and even when suggesting changes to BidiMirrored.txt,
I offered alternatives that keep the existing file stable and add a new, 
enhanced file. But what is totally unclear to me is what role old versions play 
in compiling the latest data. Computing deltas is fine, research on a particular 
topic in old data is fine, but why would one need to parse *all* versions to 
produce the newest products?
> 
> Third, please remember that folks who come here complaining about the 
> complications of parsing the UCD are a very small percentage of a very 
> small percentage of a very small percentage of interested parties. 
> Nearly everybody who needs UCD data should be consuming it as a 
> secondary source (e.g. for reference via codepoints.net), or as a 
> tertiary source (behind specialized API's, regex, etc.), or as an end 
> user (just getting behavior they expect for characters in applications). 
> Programmers who actually *need* to consume the raw UCD data files and 
> write parsers for them directly should actually be able to deal with the 
> format complexity -- and, if anything, slowing them down to make them 
> think about the reasons for the format complexity might be a good thing, 
> as it tends to put the lie to the easy initial assumption that the UCD 
> is nothing more than a bunch of simple attributes for all the code points.

That makes no sense to me. The raw UCD data is and remains a primary source;
I see no way to consume it as a secondary or tertiary source. 
Do you mean consuming it via secondary or tertiary sources? Then we 
actually consume those sources instead of the raw UCD data.
These sources are fine for getting information about some 
particular code points, but most of the tools I can think of do not allow 
filtering values, computing overviews, or adding data, as one can do 
in spreadsheet software. Honestly, are we so few people 

Re: Private Use areas

2018-08-31 Thread William_J_G Overington via Unicode
Hi

I have now found the following document.

http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-404.pdf
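
Purely for illustration, a record for one Private Use Area code point might 
look something like the following (the field names are invented and not part of 
any agreed format); here it is serialized with Python's json module:

import json

# Invented field names, purely illustrative -- not any agreed or standard format.
pua_record = {
    "codepoint": "U+E000",
    "name": "EXAMPLE PRIVATE USE SYMBOL",
    "general_category": "So",          # the category the private agreement suggests
    "bidi_class": "L",
    "agreement": "http://example.org/pua-agreement",  # hypothetical reference URL
}

print(json.dumps(pua_record, indent=2))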

William Overington

Friday 31 August 2018



Original message
>From : wjgo_10...@btinternet.com
Date : 2018/08/31 - 21:43 (GMTDT)
To : m...@kli.org, unicode@unicode.org
Subject : Re: Private Use areas

Hi

Thank you for your posts from earlier today.

Actually I learned about JSON yesterday and I am thinking that using JSON could 
well be a good idea.

I found a helpful page with diagrams.

http://www.json.org/

Although I hope that a format for recording information about the properties of 
particular uses of Private Use Area characters will come to be implemented in 
practice, and that that format can be applied wherever desired, and indeed I 
would be happy to participate in a group project, I do not know enough about 
Unicode properties to play a major role or to lead such a project.

William Overington

Friday 31 August 2018




Re: Private Use areas

2018-08-31 Thread William_J_G Overington via Unicode
Hi

Thank you for your posts from earlier today.

Actually I learned about JSON yesterday and I am thinking that using JSON could 
well be a good idea.

I found a helpful page with diagrams.

http://www.json.org/

Although I hope that a format for recording information about the properties of 
particular uses of Private Use Area characters will come to be implemented in 
practice, and that that format can be applied wherever desired, and indeed I 
would be happy to participate in a group project, I do not know enough about 
Unicode properties to play a major role or to lead such a project.

William Overington

Friday 31 August 2018



Re: Private Use areas

2018-08-31 Thread Mark E. Shoulson via Unicode

On 08/28/2018 04:26 AM, William_J_G Overington via Unicode wrote:

Hi
  
Mark E. Shoulson wrote:
  

I'm not sure what the advantage is of using circled characters instead of plain 
old ascii.
  
My thinking is that "plain old ascii" might be used in the text encoded in the file. Sometimes a file containing Private Use Area characters is a mix of regular Unicode Latin characters with just a few Private Use Area characters mixed in with them. So my suggestion of using circled characters is for disambiguation purposes. The circled characters in the PUAINFO sequence would not be displayed if a special software program were being used to read in the text file, then act upon the information that is encoded using the circled characters.


What if circled characters are used in the text encoded in the file?  
They're characters too, people use them and all.  Whenever you designate 
some characters to be used in a way outside their normal meaning, you 
have the problem of how to use them *with* their normal meaning.  So 
there are various escaping schemes and all.  So in XML, all characters 
have their normal meanings—except <, >, and &, which mean something 
special and change the interpretations of other nearby characters (so 
"bold" is a word in English that appears in the text, but "" is 
part of an instruction to the renderer that doesn't appear in the 
text.)  And the price is that those three characters have to be 
expressed differently (  ).  I don't really see what you 
gain by branding some large swath of unicode ("circled characters") as 
"special" and not meaning their usual selves, and for that matter making 
these hard-to-type characters *necessary* for using your scheme, when 
you could do something like what XML does, and say "everything between < 
and > is to be interpreted specially, and there, these characters have 
the following meanings" and then have some other way of expressing those 
two reserved characters.  (not saying you need to do it XML's way, but 
something like that: reserve a small number of characters that have to 
be escaped, not some huge chunk.)
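
To make that trade-off concrete, a minimal Python illustration using only the 
standard library: the three reserved characters are escaped, and everything 
else, circled characters included, keeps its normal meaning.

from xml.sax.saxutils import escape, unescape

text = 'plain text with <b>, &, and circled characters: ⓐ ⓑ ⓒ'

encoded = escape(text)       # &lt; &gt; &amp; replace the three reserved characters
decoded = unescape(encoded)  # round-trips back to the original string

assert decoded == text
print(encoded)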
  
My thinking is that using this method just adds some encoded information at the start of the text file and does not require the whole document to become designated as a file conformant to a particular markup format.


That's another way of saying that this is a markup format which accepts 
a large variety of plain texts.  Because you ARE talking about making a 
"particular markup format," just a different and new one.


I guess there's not even any reason for me to argue the point, though, 
since it is up to you how to design your markup language, and you can 
take advice (or not) from anyone you like.  Draw up some design, find 
some interested people, start a discussion, and work it out.  (but not 
here; this list is for discussing Unicode.)


~mark


Re: Private Use areas

2018-08-31 Thread Mark E. Shoulson via Unicode

On 08/28/2018 11:58 AM, William_J_G Overington via Unicode wrote:

Asmus Freytag wrote:


There are situations where an ad-hoc markup language seems to fulfill a need that is not 
well served by the existing full-fledged markup languages. You find them in internet 
"bulletin boards" or services like GitHub, where pure plain text is too 
restrictive but the required text styles purposefully limited - which makes the syntactic 
overhead of a full-featured mark-up language burdensome.

I am thinking of such an ad-hoc special purpose markup language.

I am thinking of something like a special purpose version of the FORTH computer 
language being used but with no user definitions, no comparison operations and 
no loops and no compiler. Just a straight run through as if someone were typing 
commands into FORTH in interactive mode at a keyboard. Maybe no need for spaces 
between commands. For example, circled R might mean use Right-to-left text 
display.


That starts to sound no longer "ad-hoc", but that is not a well-defined 
term anyway.  You're essentially describing a special-purpose markup 
language or protocol, or perhaps even programming language.  Which is 
quite reasonable; you should (find some other interested people and) 
work out some of the details and start writing up parsers and such.

I am thinking that there could be three stacks, one for code points and one for 
numbers and one for external reference strings such as for accessing a web page 
or a PDF (Portable Document Format) document or listing an International 
Standard Book Number and so on. Code points could be entered by circled H 
followed by circled hexadecimal characters followed by a circled character to 
indicate Push onto the code point stack. Numbers could be entered in base 10, 
followed by a circled character to mean Push onto the number stack. A later 
circled character could mean to take a certain number of code points (maybe 
just 1, or maybe 0) from the character stack and a certain number of numbers 
(maybe just 1, or maybe just 0) from the number stack and use them to set some 
property.

It could all be very lightweight software-wise, just reading the characters of 
the sequence of circled characters and obeying them one by one just one time 
only on a single run through, with just a few, such as the circled digits, each 
having its meaning dependent upon a state variable such as, for a circled 
digit, whether data entry is currently hexadecimal or base 10.
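
A purely hypothetical Python sketch of the kind of single-pass interpreter 
described above (the command characters and their meanings are invented for 
illustration only):

# Hypothetical single-pass, stack-based interpreter; the commands are invented.
CIRCLED_DIGITS = {c: i for i, c in enumerate("⓪①②③④⑤⑥⑦⑧⑨")}  # circled A-F omitted for brevity

def run(commands):
    code_points, numbers = [], []   # two of the three proposed stacks
    value, hex_mode = 0, False
    for ch in commands:
        if ch == "Ⓗ":               # switch the accumulator to hexadecimal entry
            hex_mode, value = True, 0
        elif ch in CIRCLED_DIGITS:
            value = value * (16 if hex_mode else 10) + CIRCLED_DIGITS[ch]
        elif ch == "Ⓟ":             # push onto the code point stack
            code_points.append(value)
            value, hex_mode = 0, False
        elif ch == "Ⓝ":             # push onto the number stack
            numbers.append(value)
            value = 0
    return code_points, numbers

print(run("Ⓗ①⓪Ⓟ④②Ⓝ"))  # -> ([16], [42])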


I still don't see why you're fixated on using circled characters. You're 
already dealing with a markup-language type setup, why not do what other 
markup schemes do?  You reserve three or four characters and use them to 
designate when other characters are not being used in their normal sense 
but are being used as markup.  In XML, when characters are inside '<>' 
tags, they are not "plain text" of the document, but they mean other 
things—perhaps things like "right-to-left" or "reference this web page" 
and so forth, which are exactly the kinds of things you're talking about 
here.  If you don't want to use plain ascii characters because then you 
couldn't express plain ascii in your text, you're left with exactly the 
same problem with circled characters: you can't express circled 
characters in your text.  While that is a smaller problem, it can be 
eliminated altogether by various schemes used by XML or RTF or 
lightweight markup languages.  Reserve a few special characters to give 
meanings to the others, and arrange for ways to escape your handful of 
reserved characters so you can express them.  More straightforward to 
say "you have to escape <, >, and & characters" than to say "you have to 
escape all circled characters."


Anyway, this is clearly a whole new high-level protocol you need (or 
want) to work out, which would *use* Unicode (just like XML and JSON 
do), but doesn't really affect or involve it (Unicode is all about the 
"plain text".  Kind of getting off-topic, but get some people interested 
and start a mailing list to discuss it.  Good luck!


~mark


Re: UCD in XML or in CSV?

2018-08-31 Thread Ken Whistler via Unicode




On 8/31/2018 1:36 AM, Manuel Strehl via Unicode wrote:

For codepoints.net I use that data to stuff everything in a MySQL
database.


Well, for some sense of "everything", anyway. ;-)

People having this discussion should keep in mind a few significant points.

First, the UCD proper isn't "everything", extensive as it is. There are 
also other significant sets of data that the UTC maintains about 
characters in other formats, as well, including the data files 
associated with UTS #46 (IDNA-related), UTS #39 (confusables mapping, 
etc.), UTS #10 (collation), UTR #25 (a set of math-related property 
values), and UTS #51 (emoji-related). The emoji-related data has now 
strayed into the CLDR space, so a significant amount of the information 
about emoji characters is now carried as CLDR tags. And then there is 
various other information about individual characters (or small sets of 
characters) scattered in the core spec -- some in tables, some not, as 
well as mappings to dozens of external standards. There is no actual 
definition anywhere of what "everything" actually is. Further, it is a 
mistake to assume that every character property just associates a simple 
attribute with a code point. There are multiple types of mappings, 
complex relational and set properties, and so forth.


The UTC attempts to keep a fairly clear line around what constitutes the 
"UCD proper" (including Unihan.zip), in part so that it is actually 
possible to run the tools that create the XML version of the UCD, for 
folks who want to consume a more consistent, single-file format version 
of the data. But be aware that that isn't everything -- nor would there 
be much sense in trying to keep expanding the UCD proper to actually 
represent "everything" in one giant DTD.


Second, one of the main obligations of a standards organization is 
*stability*. People may well object to the ad hoc nature of the UCD data 
files that have been added over the years -- but it is a *stable* 
ad-hockery. The worst thing the UTC could do, IMO, would be to keep 
tweaking formats of data files to meet complaints about one particular 
parsing inconvenience or another. That would create multiple points of 
discontinuity between versions -- worse than just having to deal with 
the ongoing growth in the number of assigned characters and the 
occasional addition of new data files and properties to the UCD.


Keep in mind that there is more to processing the UCD than just 
"latest". People who just focus on grabbing the very latest version of 
the UCD and updating whatever application they have are missing half the 
problem. There are multiple tools out there that parse and use multiple 
*versions* of the UCD. That includes the tooling that is used to 
maintain the UCD (which parses *all* versions), and the tooling that 
creates UCD in XML, which also parses all versions. Then there is 
tooling like unibook, to produce code charts, which also has to adapt to 
multiple versions, and bidi reference code, which also reads multiple 
versions of UCD data files. Those are just examples I know off the top 
of my head. I am sure there are many other instances out there that fit 
this profile. And none of the applications already built to handle 
multiple versions would welcome having to permanently build in tracking 
particular format anomalies between specific versions of the UCD.


Third, please remember that folks who come here complaining about the 
complications of parsing the UCD are a very small percentage of a very 
small percentage of a very small percentage of interested parties. 
Nearly everybody who needs UCD data should be consuming it as a 
secondary source (e.g. for reference via codepoints.net), or as a 
tertiary source (behind specialized API's, regex, etc.), or as an end 
user (just getting behavior they expect for characters in applications). 
Programmers who actually *need* to consume the raw UCD data files and 
write parsers for them directly should actually be able to deal with the 
format complexity -- and, if anything, slowing them down to make them 
think about the reasons for the format complexity might be a good thing, 
as it tends to put the lie to the easy initial assumption that the UCD 
is nothing more than a bunch of simple attributes for all the code points.
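
As one small illustration of the "tertiary source" point, Python's standard 
unicodedata module already answers most per-character property questions 
without touching any raw UCD file:

import unicodedata

ch = "\ufb01"                         # U+FB01 LATIN SMALL LIGATURE FI
print(unicodedata.name(ch))           # LATIN SMALL LIGATURE FI
print(unicodedata.category(ch))       # Ll
print(unicodedata.decomposition(ch))  # <compat> 0066 0069
print(unicodedata.unidata_version)    # UCD version the module was built against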


--Ken



Re: CLDR (was: Private Use areas)

2018-08-31 Thread Marcel Schneider via Unicode
On 31/08/18 07:27 Janusz S. Bień via Unicode wrote:
[…]
> > Given NamesList.txt / Code Charts comments are kept minimal by design, 
> > one couldn’t simply pop them into XML or whatever, as the result would be 
> > disappointing and call for completion in the aftermath. Yet another task 
> > competing with CLDR survey.
> 
> Please elaborate. It's not clear for me what do you mean.

These comments are designed for the Code Charts and as such must not aim at
exhaustiveness. E.g. we have lists of related languages ending in an ellipsis. 
Once this is popped into XML, i.e. extracted from NamesList.txt and fed into an 
extensible, unconstrained format (with no constraints on available space or on 
the number and length of comments, and so on), any gap is felt as a 
discriminating neglect, and there will be a huge rush to add data. Yet Unicode 
has not set up products where that data could be published: not in the Code 
Charts (for the above-mentioned reason), and not in ICU, insofar as the 
additional information does not match a known demand on the user side 
(localizing software does not mean providing scholarly exhaustive information 
about supported characters). The use would be in character pickers providing 
all available information about a given character. That is why Unicode should 
prioritize CLDR for CLDR users, rather than extra information for the web.

> 
> > Reviewing CLDR data is IMO top priority.
> > There are many flaws to be fixed in many languages including in English.
> > A lot of useful digest charts are extracted from XML there,
> 
> Which XML? where?

More precisely it is LDML, the CLDR-specific XML.
What I called “digest charts” are the charts found here:

http://www.unicode.org/cldr/charts/34/

The access is via this page:

http://cldr.unicode.org/index/downloads

where the charts are in the Charts column, while the raw data is under SVN Tag.

> 
> > and we really 
> > need to go through the data and correct the many many errors, please.
> 
> Some time ago I tried to have a close look at the Polish locale and
> found the CLDR site prohibitively confusing.

I experienced some trouble too, mainly because "SVN Tag" is a counter-intuitive 
label for the access to the XML data (unless one knows about Subversion).
Polish data is found here:

https://www.unicode.org/cldr/charts/34/summary/pl.html

The access is via the top of the "Summary" index page (showing root data):

https://www.unicode.org/cldr/charts/34/summary/root.html

You may wish to particularly check the By-Type charts:

https://www.unicode.org/cldr/charts/34/by_type/index.html

Here I’d suggest focusing first on alphabetic information and on punctuation.

https://www.unicode.org/cldr/charts/34/by_type/core_data.alphabetic_information.punctuation.html

Under Latin (a table caption, without an anchor) we can see what punctuation 
Polish has compared to other locales using the same script.
The exact character appears when hovering over the header row.
E.g. U+2011 NON-BREAKING HYPHEN is systematically missing, which is 
an error in almost every locale using the hyphen. The TC is about to correct that.

Further you will see that while Polish uses the apostrophe
https://slowodnia.tumblr.com/post/136492530255/the-use-of-apostrophe-in-polish
CLDR does not have the correct apostrophe for Polish, as opposed to e.g. French.
You may wish to note that from now on, both U+0027 APOSTROPHE and 
U+0022 QUOTATION MARK are ruled out in almost all locales, given that the 
characters preferred in publishing are U+2019 and, for Polish, the U+201E and 
U+201D that are already found in CLDR pl.

Note however that according to the information provided by English Wikipedia:
https://en.wikipedia.org/wiki/Quotation_mark#Polish
Polish also uses single quotes, which by contrast are still missing in CLDR.

Now you might understand what I meant when pointing out that there are still 
many errors in many languages in CLDR, including in English.

Best regards,

Marcel

> 
> Best regards
> 
> Janusz
> 
> -- 
> , 
> Janusz S. Bien
> emeryt (emeritus)
> https://sites.google.com/view/jsbien
> 
>



Re: UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)

2018-08-31 Thread Manuel Strehl via Unicode
To handle the UCD XML file a streaming parser like Expat is necessary.

For codepoints.net I use that data to stuff everything in a MySQL
database. If anyone is interested, the code for that is Open Source:

https://github.com/Codepoints/unicode2mysql/

The example for handling the large XML file can be found here:

https://github.com/Codepoints/unicode2mysql/blob/master/bin/ucd_to_sql.py
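
A much reduced sketch of the same streaming idea, using only the Python 
standard library (this is not the actual ucd_to_sql.py, which does considerably 
more):

import xml.etree.ElementTree as ET

NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"   # namespace defined by UAX #42

def iter_chars(path):
    # Stream ucd.nounihan.flat.xml without building the whole tree in memory.
    for _event, elem in ET.iterparse(path, events=("end",)):
        if elem.tag == NS + "char" and "cp" in elem.attrib:
            yield elem.get("cp"), elem.get("gc"), elem.get("na")
            elem.clear()              # release attributes and children once read

for cp, gc, name in iter_chars("ucd.nounihan.flat.xml"):
    print(cp, gc, name)
    break                             # first assigned code point only, as a smoke test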

For me it's currently much easier to have all the data in a single
place, e.g. a large XML file, than spread over a multitude of files
_with different ad-hoc syntaxes_.

The situation would possibly be different, though, if the UCD data
were split into several files of the same format. (Be it JSON, CSV,
YAML, XML, TOML, whatever. Just be consistent.)

Nota bene: That is also true for the emoji data, which consists as of
now of five plain text files with similar but not identical formats.

Cheers,
Manuel
On Fri, 31 Aug 2018 at 08:19, Marius Spix via Unicode
wrote:
>
> A good compromise between human readability, machine processability and
> filesize would be using YAML.
>
> Unlike JSON, YAML supports comments, anchors and references, multiple
> documents in a file and several other features.
>
> Regards,
>
> Marius Spix
>
>
> On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via Unicode
> wrote:
>
> > On 30/08/18 23:34 Philippe Verdy via Unicode wrote:
> > >
> > > Well, an alternative to XML is JSON, which is more compact and
> > > faster/simpler to process;
> >
> > Thanks for pointing the problem and the solution alike. Indeed the
> > main drawback of the XML format of UCD is that it results in an
> > “insane” filesize. “Insane” was applied to the number of semicolons
> > in UnicodeData.txt, but that is irrelevant. What is really insane is
> > the filesize of the XML versions of the UCD. Even without Unihan, it
> > may take up to a minute or so to load in a text editor.
> >
> > > however JSON has no explicit schema, unless the schema is being
> > > made part of the data itself, complicating its structure (with many
> > > levels of arrays of arrays, in which case it becomes less easy to
> > > read by humans, but more adapted to automated processes for fast
> > > processing).
> > >
> > > I'd say that the XML alone is enough to generate any JSON-derived
> > > dataset that will conform to the schema an application expects to
> > > process fast (and with just the data it can process, excluding
> > > various extensions still not implemented). But the fastest
> > > implementations are also based on data tables encoded in code (such
> > > as DLL or Java classes), or custom database formats (such as
> > > Berkeley dB) generated also automatically from the XML, without the
> > > processing cost of decompression schemes and parsers.
> > >
> > > Still today, even if XML is not the usual format used by
> > > applications, it is still the most interoperable format that allows
> > > building all sorts of applications in all sorts of languages: the
> > > cost of parsing is left to an application builder/compiler.
> >
> > I’ve tried an online tool to get ucd.nounihan.flat.xml converted to
> > CSV. The tool is great and offers a lot of options, but given the
> > “insane” file size, my browser was up for over two hours of trouble
> > until I shut down the computer manually. From what I could see in the
> > result field, there are many bogus values, meaning that their
> > presence is useless in the tags of most characters. And while many
> > attributes have cryptic names in order to keep the file size minimal,
> > some attributes have overlong values, ie the design is inconsistent.
> > Eg in every character we read: jg="No_Joining_Group" That is bogus.
> > One would need to take them off the tags of most characters, and even
> > in the characters where they are relevant, the value would be simply
> > "No". What’s the use of abbreviating "Joining Group" to "jg" in the
> > attribute name if in the value it is written out? And I’m quoting from
> > U+. Further many values are set to a crosshatch, instead of
> > simply being removed from the characters where they are empty. Then
> > the many instances of "undetermined script" resulting in *two*
> > attributes with "Zyyy" value. Then in almost each character we’re told
> > that it is not a whitespace, not a dash, not a hyphen, and not a
> > quotation mark: Dash="N" WSpace="N" Hyphen="N" QMark="N" One couldn’t
> > tell that UCD does actually benefit from the flexibility of XML,
> > given that many attributes are systematically present even where they
> > are useless. Perhaps ucd-*.xml would be two thirds, half, or one
> > third their actual size if they were properly designed.
> >
> > > Some apps embed the compilers themselves and use a stored cache for
> > > faster processing: this approach allows easy updates by detecting
> > > changes in the XML source, and then downloading them.
> > >
> > > But in CLDR such updates are generally not automated : the general
> > > scheme 

Re: CLDR (was: Private Use areas)

2018-08-31 Thread Manuel Strehl via Unicode
The XML files in these folders:

https://unicode.org/repos/cldr/tags/latest/common/

But I agree. I spent an extreme amount of time getting somewhat used to
cldr.unicode.org and the data repo, and still I have no clue where to find a
concrete piece of information without digging into the site.
On Fri, 31 Aug 2018 at 07:22, Janusz S. Bień via Unicode
wrote:
>
> On Thu, Aug 30 2018 at  2:27 +0200, unicode@unicode.org writes:
>
> [...]
>
> > Given NamesList.txt / Code Charts comments are kept minimal by design,
> > one couldn’t simply pop them into XML or whatever, as the result would be
> > disappointing and call for completion in the aftermath. Yet another task
> > competing with CLDR survey.
>
> Please elaborate. It's not clear for me what do you mean.
>
> > Reviewing CLDR data is IMO top priority.
> > There are many flaws to be fixed in many languages including in English.
> > A lot of useful digest charts are extracted from XML there,
>
> Which XML? where?
>
> > and we really
> > need to go through the data and correct the many many errors, please.
>
> Some time ago I tried to have a close look at the Polish locale and
> found the CLDR site prohibitively confusing.
>
> Best regards
>
> Janusz
>
> --
>  ,
> Janusz S. Bien
> emeryt (emeritus)
> https://sites.google.com/view/jsbien
>



Re: UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)

2018-08-31 Thread Marius Spix via Unicode
A good compromise between human readability, machine processability and
filesize would be using YAML.

Unlike JSON, YAML supports comments, anchors and references, multiple
documents in a file and several other features.
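
For illustration, a small Python snippet (assuming the PyYAML package) 
exercising some of those features, none of which plain JSON offers:

import yaml  # assumes the PyYAML package is installed

DOCS = """
# Comments are allowed, unlike in JSON.
defaults: &shared          # an anchor...
  bidi_mirrored: false
  joining_group: none
U+0028:
  inherits: *shared        # ...reused here by reference
  bidi_mirrored: true
---
# A second document in the same file.
U+0029:
  bidi_mirrored: true
"""

for doc in yaml.safe_load_all(DOCS):
    print(doc)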

Regards,

Marius Spix


On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via Unicode
wrote:

> On 30/08/18 23:34 Philippe Verdy via Unicode wrote:
> >
> > Well, an alternative to XML is JSON, which is more compact and
> > faster/simpler to process;
> 
> Thanks for pointing the problem and the solution alike. Indeed the
> main drawback of the XML format of UCD is that it results in an
> “insane” filesize. “Insane” was applied to the number of semicolons
> in UnicodeData.txt, but that is irrelevant. What is really insane is
> the filesize of the XML versions of the UCD. Even without Unihan, it
> may take up to a minute or so to load in a text editor.
> 
> > however JSON has no explicit schema, unless the schema is being
> > made part of the data itself, complicating its structure (with many
> > levels of arrays of arrays, in which case it becomes less easy to
> > read by humans, but more adapted to automated processes for fast
> > processing).
> >
> > I'd say that the XML alone is enough to generate any JSON-derived
> > dataset that will conform to the schema an application expects to
> > process fast (and with just the data it can process, excluding
> > various extensions still not implemented). But the fastest
> > implementations are also based on data tables encoded in code (such
> > as DLL or Java classes), or custom database formats (such as
> > Berkeley dB) generated also automatically from the XML, without the
> > processing cost of decompression schemes and parsers.
> >
> > Still today, even if XML is not the usual format used by
> > applications, it is still the most interoperable format that allows
> > building all sorts of applications in all sorts of languages: the
> > cost of parsing is left to an application builder/compiler.
> 
> I’ve tried an online tool to get ucd.nounihan.flat.xml converted to
> CSV. The tool is great and offers a lot of options, but given the
> “insane” file size, my browser was up for over two hours of trouble
> until I shut down the computer manually. From what I could see in the
> result field, there are many bogus values, meaning that their
> presence is useless in the tags of most characters. And while many
> attributes have cryptic names in order to keep the file size minimal,
> some attributes have overlong values, ie the design is inconsistent.
> Eg in every character we read: jg="No_Joining_Group" That is bogus.
> One would need to take them off the tags of most characters, and even
> in the characters where they are relevant, the value would be simply
> "No". What’s the use of abbreviating "Joining Group" to "jg" in the
> attribute name if in the value it is written out? And I’m quoting from
> U+. Further many values are set to a crosshatch, instead of
> simply being removed from the characters where they are empty. Then
> the many instances of "undetermined script" resulting in *two*
> attributes with "Zyyy" value. Then in almost each character we’re told
> that it is not a whitespace, not a dash, not a hyphen, and not a
> quotation mark: Dash="N" WSpace="N" Hyphen="N" QMark="N" One couldn’t
> tell that UCD does actually benefit from the flexibility of XML,
> given that many attributes are systematically present even where they
> are useless. Perhaps ucd-*.xml would be two thirds, half, or one
> third their actual size if they were properly designed.
> 
> > Some apps embed the compilers themselves and use a stored cache for
> > faster processing: this approach allows easy updates by detecting
> > changes in the XML source, and then downloading them.
> >
> > But in CLDR such updates are generally not automated : the general
> > scheme evolves over time and there are complex dependencies to
> > check so that some data becomes usable
> 
> Should probably read *un*usable.
> 
> > (frequently you need to implement some new algorithms to follow the
> > processing rules documented in CLDR, or to use data not completely
> > validated, or to allow applications to provide their overrides from
> > insufficiently complete datasets in CLDR, even if CLDR provides a
> > root locale and applications are supposed to follow the BCP47
> > fallback resolution rules; applications also have their own need
> > about which language codes they use or need, and CLDR provides many
> > locales that many applications are still not prepared to render
> > correctly, and many application users complain if an application is
> > partly translated and contains too many fallbacks to another
> > language, or worse to another script).
> 
> So the case is even worse than what I could see when looking into
> CLDR. Many countries, including France, don’t care about the data of
> their own locale in CLDR, but I’m not going to vent about that on
> Unicode Public, because that