Re: UCD in XML or in CSV? (is: UCD data consumption)

2018-09-01 Thread Marcel Schneider via Unicode
I’m not responding without thinking, as I was blamed for doing before,
but it is painful for me to dig into what Ken explained about how
we should be consuming UCD data. I’ll now try to bring some more clarity
to the topic.

> On 31/08/18 19:59 Ken Whistler via Unicode wrote:
> […]
> > 
> > Third, please remember that folks who come here complaining about the 
> > complications of parsing the UCD are a very small percentage of a very 
> > small percentage of a very small percentage of interested parties. 

OK, among an average of 700 list subscribers, relatively few ever complain
about anything, let alone about this particular topic. But we should always
keep in mind that many folks out there complaining about Unicode don’t come
here to do so.

> > Nearly everybody who needs UCD data should be consuming it as a 
> > secondary source (e.g. for reference via codepoints.net), or as a 
> > tertiary source (behind specialized API's, regex, etc.),

As already suggested, “as” should probably read “via” in that part.

> > or as an end 
> > user (just getting behavior they expect for characters in applications). 

That is more than a simple statement about who is consuming UCD data in which
way, since you say “should.” There seem to be assumptions that diving into the
raw data is discouraged; that folks reading file headers are not doing well;
that the data should be assembled only in certain ways; and that ignorant
people shouldn’t open the UCD cupboard to pick a file they deem useful.

If so, then it might be surprising to know that when I submitted my feedback
on bidi-mirroring issues with mathematical symbols,
http://www.unicode.org/L2/L2017/17438-bidi-math-fdbk.html
I had started as a quasi-end-user not getting the behavior I expected for
characters in browsers. I was looking for characters that are bidi-mirrored by
glyph exchange, as implemented in web browsers, because I wanted end users to
be able to experience bidi-mirroring as it works. Unexpectedly, a number of
math symbols did not mirror, despite many of them even being scalar neighbors.

> > Programmers who actually *need* to consume the raw UCD data files and 
> > write parsers for them directly should actually be able to deal with the 
> > format complexity -- and, if anything, slowing them down to make them 
> > think about the reasons for the format complexity might be a good thing, 

I can see one main reason for the format complexity, and that is that data
from various properties don’t necessarily telescope the same way to make for
small files. The complexity of the UCD would then mainly be self-induced by the
practice of packing data into one small file per property rather than adding
the value to each relevant code point in one large list, as in UnicodeData.txt.
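
To make “telescoping” concrete, here is a minimal Python sketch that collapses
per-code-point values into runs. The function name `telescope` is only
illustrative and not part of any UCD tooling; the four sample code points and
their gc and sc values are real, but the tiny excerpt is chosen by me.

```python
def telescope(values):
    """Collapse {code_point: value} into (first, last, value) runs."""
    runs = []
    for cp in sorted(values):
        value = values[cp]
        # Extend the previous run if it is adjacent and has the same value.
        if runs and runs[-1][1] == cp - 1 and runs[-1][2] == value:
            runs[-1] = (runs[-1][0], cp, value)
        else:
            runs.append((cp, cp, value))
    return runs


# U+0028..U+002B: the script is identical, the general category is not.
sc = {0x28: "Zyyy", 0x29: "Zyyy", 0x2A: "Zyyy", 0x2B: "Zyyy"}
gc = {0x28: "Ps", 0x29: "Pe", 0x2A: "Po", 0x2B: "Sm"}

sc_runs = telescope(sc)  # a single run covering all four code points
gc_runs = telescope(gc)  # four runs of one code point each
```

The same code points give one line for one property and four for the other,
which is why a per-property layout suits some properties much better than
others.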

While I’m now taking the time to write this up because I’m committed to
processing that information, we can think of many, many people who don’t like
to be slowed down trying to find out why Unicode changed the UCD design, when
following the original idea of a large CSV list would have been
straightforward, possibly by setting up a new one if the first one got stuck.
What I can figure out is that whenever a new property was added, that
particular property was always thought of as being the last one.
(At some point the many files were then dumped into the known XML files.)

If the UCD is to be made of small files, it is necessarily complex, and the
conclusion is that there should be another large CSV grid to make things
simple again, and as lightweight as they can be.

> > as it tends to put the lie to the easy initial assumption that the UCD 
> > is nothing more than a bunch of simple attributes for all the code points.

Did you try the sentence with “simple” taken out? It appears to me not to
be a lie then. One attribute comes to mind that is so complex that its
design even changed over time, despite Unicode’s commitment to stability.
The Bidi_Mirroring_Glyph property was originally designed to include “best-fit”
pairs for least-bad display in applications not supporting RTL glyphs
(i.e. without OpenType support), with legibility of math formulae in mind.
Later (probably due to a poorly written OpenType spec), no more best-fit pairs
were added to BidiMirroring.txt, as if OpenType implementers would not remove
the best-fit pairs anyway before using the file (although the spec says to use
it as-is). That then led to the display problem pointed out above.
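
Removing the best-fit pairs before use can be sketched in a few lines of
Python. This assumes that best-fit pairs are flagged with “[BEST FIT]” in the
trailing comment of each BidiMirroring.txt line; the function name
`mirroring_pairs` is made up for the example, and the three sample lines only
follow the style of the file.

```python
def mirroring_pairs(lines, include_best_fit=True):
    """Map code point -> mirrored code point from 'cp; cp # comment' lines."""
    pairs = {}
    for line in lines:
        data, _, comment = line.partition("#")
        data = data.strip()
        if not data:
            continue  # blank or comment-only line
        # Assumption: best-fit pairs carry this marker in the comment.
        if not include_best_fit and "[BEST FIT]" in comment:
            continue
        source, mirror = (int(field.strip(), 16) for field in data.split(";"))
        pairs[source] = mirror
    return pairs


sample = [
    "0028; 0029 # LEFT PARENTHESIS",
    "0029; 0028 # RIGHT PARENTHESIS",
    "2243; 22CD # [BEST FIT] ASYMPTOTICALLY EQUAL TO",
]
all_pairs = mirroring_pairs(sample)
exact_pairs = mirroring_pairs(sample, include_best_fit=False)
```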

I’ll spare you the particular problem related to 3 pairs of symbols with tilde,
and the missing Bidi_Mirroring_Type property, given that UTC was not interested.

So you can understand that I’m not unaware of the complexity of the UCD. Still,
I don’t think that this could be an argument for not publishing a medium-size
CSV file with scalar values listed as in UnicodeData.txt.
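
What I mean by such a grid, as a rough Python sketch: one row per code point,
one column per property. The function name `write_grid`, the choice of columns
and the semicolon delimiter (borrowed from UnicodeData.txt) are my assumptions,
not an existing UCD file format.

```python
import csv
import io


def write_grid(properties, out):
    """Write one CSV row per code point, one column per property.

    `properties` maps a property name to a {code_point: value} dict.
    """
    names = sorted(properties)
    code_points = sorted(set().union(*(p.keys() for p in properties.values())))
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerow(["cp"] + names)
    for cp in code_points:
        # A property without a value for this code point leaves the cell empty.
        row = [f"{cp:04X}"] + [properties[n].get(cp, "") for n in names]
        writer.writerow(row)


props = {
    "gc": {0x28: "Ps", 0x29: "Pe", 0x41: "Lu"},
    "bmg": {0x28: "0029", 0x29: "0028"},
}
buffer = io.StringIO()
write_grid(props, buffer)
grid = buffer.getvalue()
```

Such a file opens directly in a spreadsheet, and a new property only adds a
column.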

> 
> […]
> Even Excel Starter, that I have, is a great tool helping
> to perform tasks I fail to get with other tools, even spreadsheet software.

Ie not every spreadsheet 

Re: UCD in XML or in CSV? (is: Parsing UCD in XML)

2018-09-01 Thread Marcel Schneider via Unicode
On 31/08/18 10:47 Manuel Strehl via Unicode wrote:
> 
> To handle the UCD XML file a streaming parser like Expat is necessary.

Thanks for the tip. However, for my needs Expat looks like overkill, and I’m
looking for a much simpler standalone tool that just converts XML to CSV.
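
For the record, the kind of converter I have in mind could look roughly like
the Python sketch below, which streams the file through the standard library’s
`xml.etree.ElementTree.iterparse` instead of loading it at once. The function
name `ucdxml_to_csv` and the default column choice are made up, and the sample
document only imitates the layout of the UCD XML files as I understand it
(`char` elements carrying properties as attributes).

```python
import csv
import io
import xml.etree.ElementTree as ET


def ucdxml_to_csv(source, out, attributes=("cp", "na", "gc")):
    """Stream <char> elements from `source`; write chosen attributes as CSV."""
    writer = csv.writer(out, delimiter=";", lineterminator="\n")
    writer.writerow(attributes)
    for _, element in ET.iterparse(source, events=("end",)):
        # Compare local names so the XML namespace need not be hard-coded.
        if element.tag.rsplit("}", 1)[-1] == "char":
            writer.writerow([element.get(a, "") for a in attributes])
        element.clear()  # release the element's contents once handled


sample = b"""<?xml version="1.0" encoding="utf-8"?>
<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
 <repertoire>
  <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu"/>
  <char cp="0042" na="LATIN CAPITAL LETTER B" gc="Lu"/>
 </repertoire>
</ucd>"""
result = io.StringIO()
ucdxml_to_csv(io.BytesIO(sample), result)
```

Entries that describe a range rather than a single code point have no `cp`
attribute and would need extra handling; the sketch just leaves that cell empty.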

> 
> For codepoints.net I use that data […]

Very good site IMO, as it compiles a lot of useful information trying to
maximize human readability.

Nice to have added the Adopt-a-character button, too.

Thanks,

Marcel



Re: UCD in XML or in CSV? (is: UCD in YAML)

2018-09-01 Thread Marcel Schneider via Unicode
Thank you, Marius, for the example. Indeed I now see that YAML is a powerful
means of making a file intuitively readable while drastically reducing file
size.

BTW what I conjectured about the role of line breaks is true for CSV too, and
any semicolon-separated file downloaded from the UCD becomes unusable when
displayed straight in the built-in text editor of Windows, given that Unicode
uses Unix EOL.
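
A workaround on the consumer side is to rewrite the line endings before
opening the file. A minimal Python sketch (the function name `to_crlf` is
made up, and the sample is two lines in the style of BidiMirroring.txt):

```python
def to_crlf(data: bytes) -> bytes:
    """Return `data` with every line ending normalized to CR LF."""
    # Collapse existing CR LF and lone CR to LF first, so nothing is doubled.
    unix = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return unix.replace(b"\n", b"\r\n")


sample = b"0028; 0029 # LEFT PARENTHESIS\n0029; 0028 # RIGHT PARENTHESIS\n"
converted = to_crlf(sample)
```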

Still, for use in spreadsheets, YAML needs to be converted to CSV, although
that might not crash the browser as large XML does.

Regards,

Marcel

On 01/09/18 09:18 Marius Spix via Unicode wrote:
> 
> Hello Marcel,
> 
> YAML supports references, so you can refer to another character’s
> properties.
> 
> Example:
> 
> […]
> 
> Regards,
> 
> Marius Spix
> 
> 
> On Sat, 1 Sep 2018 08:00:02 +0200 (CEST)
> Marcel Schneider wrote:
> 
[…]



Re: UCD in XML or in CSV?

2018-09-01 Thread Richard Wordingham via Unicode
On Fri, 31 Aug 2018 10:36:45 +0200
Manuel Strehl via Unicode wrote:

> For me it's currently much easier to have all the data in a single
> place, e.g. a large XML file, than spread over a multitude of files
> _with different ad-hoc syntaxes_.
> 
> The situation would possibly be different, though, if the UCD data
> would be split in several files of the same format. (Be it JSON, CSV,
> YAML, XML, TOML, whatever. Just be consistent.)

Most properties are stored in pretty much the same format in the UCD
files. UnicodeData.txt is the major exception; it seems to date from
when the set of properties was expected to be stable.

The big exception is set-valued properties.  PropList.txt can be viewed
as having an odd syntax for storing the set of miscellaneous Boolean
properties for which the codepoint has the value of 'true'.
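
Read that way, inverting the file gives each code point its set of Boolean
properties. A minimal Python sketch, assuming the usual
`range ; Property # comment` line layout; the function name
`boolean_property_sets` is invented, and the sample lines are only in the
style of PropList.txt.

```python
from collections import defaultdict


def boolean_property_sets(lines):
    """Map each code point to the set of Boolean properties listed for it."""
    result = defaultdict(set)
    for line in lines:
        data = line.split("#", 1)[0].strip()
        if not data:
            continue  # blank or comment-only line
        span, name = (field.strip() for field in data.split(";"))
        first, _, last = span.partition("..")
        # A single code point is treated as a range of length one.
        for cp in range(int(first, 16), int(last or first, 16) + 1):
            result[cp].add(name)
    return dict(result)


sample = [
    "0009..000D    ; White_Space # control characters",
    "0020          ; White_Space # SPACE",
    "0009..000D    ; Pattern_White_Space # control characters",
    "0020          ; Pattern_White_Space # SPACE",
    "002D          ; Dash # HYPHEN-MINUS",
]
sets = boolean_property_sets(sample)
```

Code points that appear nowhere in the file simply have the empty set.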

Richard.


Re: UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)

2018-09-01 Thread Marius Spix via Unicode
Hello Marcel,

YAML supports references, so you can refer to another character’s
properties.

Example:

repertoire: 
 char:
  -
   name_alias: 
    - [NUL,abbreviation]
    - ["NULL",control]
   cp:
   na1: "NULL"
   props: &
    age: "1.1"
    na: ""
    JSN: ""
    gc: Cc
    ccc: 0
    dt: none
    dm: "#"
    nt: None
    nv: NaN
    bc: BN
    bpt: n
    bpb: "#"
    Bidi_M: N
    bmg: ""
    suc: "#"
    slc: "#"
    stc: "#"
    uc: "#"
    lc: "#"
    tc: "#"
    scf: "#"
    cf: "#"
    jt: U
    jg: No_Joining_Group
    ea: N
    lb: CM
    sc: Zyyy
    scx: Zyyy
    Dash: N
    WSpace: N
    Hyphen: N
    QMark: N
    Radical: N
    Ideo: N
    UIdeo: N
    IDSB: N
    IDST: N
    hst: NA
    DI: N
    ODI: N
    Alpha: N
    OAlpha: N
    Upper: N
    OUpper: N
    Lower: N
    OLower: N
    Math: N
    OMath: N
    Hex: N
    AHex: N
    NChar: N
    VS: N
    Bidi_C: N
    Join_C: N
    Gr_Base: N
    Gr_Ext: N
    OGr_Ext: N
    Gr_Link: N
    STerm: N
    Ext: N
    Term: N
    Dia: N
    Dep: N
    IDS: N
    OIDS: N
    XIDS: N
    IDC: N
    OIDC: N
    XIDC: N
    SD: N
    LOE: N
    Pat_WS: N
    Pat_Syn: N
    GCB: CN
    WB: XX
    SB: XX
    CE: N
    Comp_Ex: N
    NFC_QC: Y
    NFD_QC: Y
    NFKC_QC: Y
    NFKD_QC: Y
    XO_NFC: N
    XO_NFD: N
    XO_NFKC: N
    XO_NFKD: N
    FC_NFKC: "#"
    CI: N
    Cased: N
    CWCF: N
    CWCM: N
    CWKCF: N
    CWL: N
    CWT: N
    CWU: N
    NFKC_CF: "#"
    InSC: Other
    InPC: NA
    PCM: N
    blk: ASCII
    isc: ""

  -
   cp: 0001
   na1: "START OF HEADING"
   name_alias:
    - [SOH,abbreviation]
    - [START OF HEADING,control]
   props: *





Regards,

Marius Spix


On Sat, 1 Sep 2018 08:00:02 +0200 (CEST)
Marcel Schneider wrote:

> On 31/08/18 08:25 Marius Spix via Unicode wrote:
> > 
> > A good compromise between human readability, machine processability
> > and filesize would be using YAML.
> > 
> > Unlike JSON, YAML supports comments, anchors and references,
> > multiple documents in a file and several other features.
> 
> Thanks for the advice. I already use YAML syntax highlighting to
> display XCompose files, which use the colon as a separator, too.
> 
> Did you figure out how YAML would fit UCD data? It appears to rely
> heavily on line breaks, which may get lost as data travels across
> environments. XML indentation is only a readability feature and
> irrelevant to content. The structure is independent of invisible
> characters and is stable as long as graphic characters are not
> corrupted (though that may happen). Line breaks are odd in that they
> are inconsistent across OSes, because Unicode was denied the right to
> impose a single standard in that matter. The result is mashed-up
> files, and I fear YAML might not hold up.
> 
> Like XML, YAML needs to repeat attribute names in every instance.
> That is precisely what CSV gets around, at the expense of
> readability in plain text. Personally I could use YAML as I use
> XML, for lookup in the text editor, but I’m afraid that there is no
> advantage over CSV with respect to file size.
> 
> Regards,
> 
> Marcel
> > 
> > Regards,
> > 
> > Marius Spix
> > 
> > 
> > On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via
> > Unicode wrote:
> > 
> […]





Re: UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)

2018-09-01 Thread Marcel Schneider via Unicode
On 31/08/18 08:25 Marius Spix via Unicode wrote:
> 
> A good compromise between human readability, machine processability and
> filesize would be using YAML.
> 
> Unlike JSON, YAML supports comments, anchors and references, multiple
> documents in a file and several other features.

Thanks for the advice. I already use YAML syntax highlighting to display
XCompose files, which use the colon as a separator, too.

Did you figure out how YAML would fit UCD data? It appears to rely heavily
on line breaks, which may get lost as data travels across environments.
XML indentation is only a readability feature and irrelevant to content. The
structure is independent of invisible characters and is stable as long as
graphic characters are not corrupted (though that may happen). Line breaks are
odd in that they are inconsistent across OSes, because Unicode was denied the
right to impose a single standard in that matter. The result is mashed-up
files, and I fear YAML might not hold up.

Like XML, YAML needs to repeat attribute names in every instance. That
is precisely what CSV gets around, at the expense of readability in
plain text. Personally I could use YAML as I use XML, for lookup in
the text editor, but I’m afraid that there is no advantage over CSV with
respect to file size.
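
To put a rough number on the repetition, here is a small Python sketch that
serializes the same three records once as CSV and once as a YAML-like listing
that names every attribute in every record. The records are a tiny excerpt
picked for illustration, the listing is hand-rolled rather than produced by a
YAML library, and the helper names are made up.

```python
import csv
import io

records = [
    {"cp": "0028", "gc": "Ps", "bc": "ON", "Bidi_M": "Y"},
    {"cp": "0029", "gc": "Pe", "bc": "ON", "Bidi_M": "Y"},
    {"cp": "0041", "gc": "Lu", "bc": "L", "Bidi_M": "N"},
]


def as_csv(rows):
    """Attribute names appear once, in the header line."""
    out = io.StringIO()
    writer = csv.DictWriter(out, fieldnames=list(rows[0]), delimiter=";",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


def as_key_value(rows):
    """YAML-like listing that names every attribute in every record."""
    blocks = []
    for row in rows:
        lines = [f"  {key}: {value}" for key, value in row.items()]
        lines[0] = "- " + lines[0].lstrip()  # start of a list item
        blocks.append("\n".join(lines))
    return "\n".join(blocks) + "\n"


csv_size = len(as_csv(records).encode("utf-8"))
kv_size = len(as_key_value(records).encode("utf-8"))
```

The comparison leaves out anchors and references, which can shrink the listing
when many records share the same values.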

Regards,

Marcel
> 
> Regards,
> 
> Marius Spix
> 
> 
> On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via Unicode
> wrote:
> 
[…]