Re: UCD in XML or in CSV? (is: UCD data consumption)
I’m not responding without thinking, as I was blamed for when I did, but it is painful for me to dig into what Ken explained about how we should be consuming UCD data. I’ll now try to bring some more clarity to the topic.

> On 31/08/18 19:59 Ken Whistler via Unicode wrote:
> […]
> >
> > Third, please remember that folks who come here complaining about the
> > complications of parsing the UCD are a very small percentage of a very
> > small percentage of a very small percentage of interested parties.

OK, among an average of 700 list subscribers, relatively few ever complain about anything, let alone about this particular topic. But we should always keep in mind that many folks out there complaining about Unicode don’t come here to do so.

> > Nearly everybody who needs UCD data should be consuming it as a
> > secondary source (e.g. for reference via codepoints.net), or as a
> > tertiary source (behind specialized API's, regex, etc.),

As already suggested, “as” should probably read “via” in that part.

> > or as an end
> > user (just getting behavior they expect for characters in applications).

That is more than a simple statement about who is consuming UCD data which way, as you say “should.” There seem to be assumptions that it is discouraged to dive into the raw data; that folks reading file headers are not doing well; that the data should be assembled only in certain ways; and that ignorant people shouldn’t open the UCD cupboard to pick a file they deem useful.

If so, then it might be surprising to know that when submitting a proposal about Bidi-mirroring mathematical symbols issues feedback (http://www.unicode.org/L2/L2017/17438-bidi-math-fdbk.html), I had started out as a quasi-end-user not getting the behavior I expected for characters in browsers. I was looking for characters that are bidi-mirrored by glyph exchange, as it is implemented in web browsers, because I wanted end-users to be able to experience bidi-mirroring as it works. Unexpectedly, a number of math symbols did not mirror, although many of them are even scalar neighbors.

> > Programmers who actually *need* to consume the raw UCD data files and
> > write parsers for them directly should actually be able to deal with the
> > format complexity -- and, if anything, slowing them down to make them
> > think about the reasons for the format complexity might be a good thing,

I can see one main reason for the format complexity: data from various properties don’t necessarily telescope the same way to make for small files. The complexity of UCD would then mainly be self-induced, by packing data into one small file per property rather than adding the value to each relevant code point in one large list, as in UnicodeData.txt.

While I’m now taking the time to write this up because I’m committed to processing that information, we can think of many people who don’t like to be slowed down trying to find out why Unicode changed the UCD design, when following the original idea of a large CSV list would have been straightforward, possibly by setting up a new one if the first one got stuck. What I can figure out is that whenever a new property was added, that particular property was thought of as being the last one. (At some point the many files were then dumped into the known XML files.) If UCD is to be made of small files, it is necessarily complex, and the conclusion is that there should be another large CSV grid to make things simple again, and as lightweight as they can be.

> > as it tends to put the lie to the easy initial assumption that the UCD
> > is nothing more than a bunch of simple attributes for all the code points.

Did you try the sentence with “simple” taken out? It then no longer appears to me to be a lie.

One attribute comes to mind that is so complex that its design even changed over time, despite Unicode’s commitment to stability. The Bidi_Mirroring_Glyph property was originally designed to include “best-fit” pairs for least-bad display in applications not supporting RTL glyphs (i.e. without OpenType support), with legibility of math formulae in mind. Later (probably due to a poorly written OpenType spec), no more best-fit pairs were added to BidiMirroring.txt, as if OpenType implementers weren’t supposed to remove the best-fit pairs anyway prior to using the file (while the spec says to use it as-is). That then led to the display problem pointed out above. I’m sparing you the particular problem related to 3 pairs of symbols with tilde, and the missing Bidi_Mirroring_Type property, given that UTC was not interested.

So you can understand that I’m not unaware of the complexity of UCD. Still, I don’t think that this could be an argument for not publishing a medium-size CSV file with scalar values listed as in UnicodeData.txt.

> > […]
> Even Excel Starter, that I have, is a great tool helping
> to perform tasks I fail to get with other tools, even spreadsheet software.

Ie not every spreadsheet
Re: UCD in XML or in CSV? (is: Parsing UCD in XML)
On 31/08/18 10:47 Manuel Strehl via Unicode wrote:
>
> To handle the UCD XML file a streaming parser like Expat is necessary.

Thanks for the tip. However, for my needs Expat looks like overkill, and I’m looking for a much simpler standalone tool that just converts XML to CSV.

> For codepoints.net I use that data […]

A very good site IMO, as it compiles a lot of useful information while trying to maximize human readability. Nice to have added the Adopt-a-character button, too.

Thanks,
Marcel
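For the XML-to-CSV conversion asked about above, a minimal sketch in Python may be enough, without a separate Expat binding. It assumes the flat variant of the UCD XML (every `<char>` element carries all its attributes itself, nothing inherited from a `<group>`), and the column selection and the function name `ucd_xml_to_csv` are hypothetical choices for illustration. Only `<char>` elements are handled; reserved, surrogate and noncharacter entries are skipped.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Namespace of the UCD XML files.
UCD_NS = "{http://www.unicode.org/ns/2003/ucd/1.0}"

def ucd_xml_to_csv(xml_source, csv_out, columns=("cp", "na", "gc", "sc")):
    """Stream <char> elements and write one semicolon-separated row each.

    `columns` is an illustrative selection of attribute short names.
    Returns the number of rows written (header excluded).
    """
    writer = csv.writer(csv_out, delimiter=";", lineterminator="\n")
    writer.writerow(columns)
    count = 0
    # iterparse reads the input incrementally instead of building the
    # whole tree before any row can be written.
    for _event, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != UCD_NS + "char":
            continue
        attrs = dict(elem.attrib)
        # Ranges carry first-cp/last-cp instead of cp.
        if "cp" not in attrs and "first-cp" in attrs:
            attrs["cp"] = attrs["first-cp"] + ".." + attrs["last-cp"]
        writer.writerow([attrs.get(col, "") for col in columns])
        count += 1
        # Drop the attributes and children of the processed element;
        # an empty element stays attached to its parent.
        elem.clear()
    return count

# Tiny in-memory sample standing in for the real file.
sample = b"""<?xml version="1.0" encoding="utf-8"?>
<ucd xmlns="http://www.unicode.org/ns/2003/ucd/1.0">
 <repertoire>
  <char cp="0041" na="LATIN CAPITAL LETTER A" gc="Lu" sc="Latn"/>
  <char first-cp="3400" last-cp="4DB5" na="CJK UNIFIED IDEOGRAPH-#" gc="Lo" sc="Hani"/>
 </repertoire>
</ucd>"""
out = io.StringIO()
rows = ucd_xml_to_csv(io.BytesIO(sample), out)
print(out.getvalue())
```

For the real data, the same function can be given a file opened in binary mode as `xml_source` and a text file opened with `newline=""` as `csv_out`.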
Re: UCD in XML or in CSV? (is: UCD in YAML)
Thank you Marius for the example. Indeed I now see that YAML is a powerful means for a file to have an intuitive readability while drastically reducing file size.

BTW what I conjectured about the role of line breaks is true for CSV too, and any file downloaded from UCD on a semicolon separator basis becomes unusable when displayed straight in the built-in text editor of Windows, given Unicode uses Unix EOL.

Still, for use in spreadsheets, YAML needs to be converted to CSV, although that might not crash the browser as large XML does.

Regards,
Marcel

On 01/09/18 09:18 Marius Spix via Unicode wrote:
>
> Hello Marcel,
>
> YAML supports references, so you can refer to another character’s
> properties.
>
> Example:
>
> repertoire:
>   char:
>     -
>       name_alias:
>         - [NUL,abbreviation]
>         - ["NULL",control]
>       cp:
>       na1: "NULL"
>       props: &
>         age: "1.1"
>         na: ""
>         JSN: ""
>         gc: Cc
>         ccc: 0
>         dt: none
>         dm: "#"
>         nt: None
>         nv: NaN
>         bc: BN
>         bpt: n
>         bpb: "#"
>         Bidi_M: N
>         bmg: ""
>         suc: "#"
>         slc: "#"
>         stc: "#"
>         uc: "#"
>         lc: "#"
>         tc: "#"
>         scf: "#"
>         cf: "#"
>         jt: U
>         jg: No_Joining_Group
>         ea: N
>         lb: CM
>         sc: Zyyy
>         scx: Zyyy
>         Dash: N
>         WSpace: N
>         Hyphen: N
>         QMark: N
>         Radical: N
>         Ideo: N
>         UIdeo: N
>         IDSB: N
>         IDST: N
>         hst: NA
>         DI: N
>         ODI: N
>         Alpha: N
>         OAlpha: N
>         Upper: N
>         OUpper: N
>         Lower: N
>         OLower: N
>         Math: N
>         OMath: N
>         Hex: N
>         AHex: N
>         NChar: N
>         VS: N
>         Bidi_C: N
>         Join_C: N
>         Gr_Base: N
>         Gr_Ext: N
>         OGr_Ext: N
>         Gr_Link: N
>         STerm: N
>         Ext: N
>         Term: N
>         Dia: N
>         Dep: N
>         IDS: N
>         OIDS: N
>         XIDS: N
>         IDC: N
>         OIDC: N
>         XIDC: N
>         SD: N
>         LOE: N
>         Pat_WS: N
>         Pat_Syn: N
>         GCB: CN
>         WB: XX
>         SB: XX
>         CE: N
>         Comp_Ex: N
>         NFC_QC: Y
>         NFD_QC: Y
>         NFKC_QC: Y
>         NFKD_QC: Y
>         XO_NFC: N
>         XO_NFD: N
>         XO_NFKC: N
>         XO_NFKD: N
>         FC_NFKC: "#"
>         CI: N
>         Cased: N
>         CWCF: N
>         CWCM: N
>         CWKCF: N
>         CWL: N
>         CWT: N
>         CWU: N
>         NFKC_CF: "#"
>         InSC: Other
>         InPC: NA
>         PCM: N
>         blk: ASCII
>         isc: ""
>
>     -
>       cp: 0001
>       na1: "START OF HEADING"
>       name_alias:
>         - [SOH,abbreviation]
>         - [START OF HEADING,control]
>       props: *
>
>
> Regards,
>
> Marius Spix
>
>
> On Sat, 1 Sep 2018 08:00:02 +0200 (CEST) Marcel Schneider wrote:
> […]
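On the line-ending point in the reply above, a small sketch of how a downloaded file with Unix line endings could be rewritten with Windows line endings before it is opened in an editor that only understands CRLF. The function name `to_crlf` is hypothetical, and the sketch deliberately ignores lone CR characters.

```python
def to_crlf(data: bytes) -> bytes:
    """Return `data` with LF and CRLF line endings both written as CRLF."""
    # Collapse existing CRLF to LF first so that it is not doubled,
    # then expand every LF to CRLF.
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")

# Works on raw bytes, so the semicolon-separated content is untouched.
converted = to_crlf(b"0041;LATIN CAPITAL LETTER A\n0042;LATIN CAPITAL LETTER B\n")
```

Working on bytes rather than decoded text keeps the conversion independent of the file's encoding, as long as that encoding is ASCII-compatible like UTF-8.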
Re: UCD in XML or in CSV?
On Fri, 31 Aug 2018 10:36:45 +0200
Manuel Strehl via Unicode wrote:

> For me it's currently much easier to have all the data in a single
> place, e.g. a large XML file, than spread over a multitude of files
> _with different ad-hoc syntaxes_.
>
> The situation would possibly be different, though, if the UCD data
> would be split in several files of the same format. (Be it JSON, CSV,
> YAML, XML, TOML, whatever. Just be consistent.)

Most properties are stored in pretty much the same format in the UCD files. UnicodeData.txt is the major exception; it seems to date from when the set of properties was expected to be stable. The big exception is set-valued properties. PropList.txt can be viewed as having an odd syntax for storing the set of miscellaneous Boolean properties for which the codepoint has the value of 'true'.

Richard.
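To illustrate the common format mentioned above, here is a minimal sketch of a reader for files laid out as "code point or range ; value # comment", which is the layout PropList.txt uses. The function names `parse_range_file` and `properties_of` are hypothetical, and the sample lines are a shortened stand-in for the real file. Read this way, PropList.txt does yield, for each code point, the set of Boolean properties that are true for it.

```python
from collections import defaultdict

def parse_range_file(lines):
    """Parse 'cp-or-range ; value # comment' lines.

    Returns {value: [(first, last), ...]} with code points as integers.
    """
    props = defaultdict(list)
    for line in lines:
        data = line.split("#", 1)[0].strip()  # drop the trailing comment
        if not data:
            continue  # blank or comment-only line
        fields = [field.strip() for field in data.split(";")]
        first, _, last = fields[0].partition("..")
        # A single code point is stored as a range of length one.
        props[fields[1]].append((int(first, 16), int(last or first, 16)))
    return dict(props)

def properties_of(cp, props):
    """Return the set of property names whose ranges contain `cp`."""
    return {name for name, ranges in props.items()
            if any(lo <= cp <= hi for lo, hi in ranges)}

sample = [
    "# PropList-style sample",
    "0009..000D    ; White_Space # Cc   [5] <control-0009>..<control-000D>",
    "0020          ; White_Space # Zs       SPACE",
    "",
    "002D          ; Dash # Pd       HYPHEN-MINUS",
    "002D          ; Hyphen # Pd       HYPHEN-MINUS",
]
props = parse_range_file(sample)
print(properties_of(0x2D, props))
```

The same reader should cope with other semicolon-separated UCD files that have two fields per line; files with more fields would need the extra columns handled separately.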
Re: UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)
Hello Marcel,

YAML supports references, so you can refer to another character’s
properties.

Example:

repertoire:
  char:
    -
      name_alias:
        - [NUL,abbreviation]
        - ["NULL",control]
      cp:
      na1: "NULL"
      props: &
        age: "1.1"
        na: ""
        JSN: ""
        gc: Cc
        ccc: 0
        dt: none
        dm: "#"
        nt: None
        nv: NaN
        bc: BN
        bpt: n
        bpb: "#"
        Bidi_M: N
        bmg: ""
        suc: "#"
        slc: "#"
        stc: "#"
        uc: "#"
        lc: "#"
        tc: "#"
        scf: "#"
        cf: "#"
        jt: U
        jg: No_Joining_Group
        ea: N
        lb: CM
        sc: Zyyy
        scx: Zyyy
        Dash: N
        WSpace: N
        Hyphen: N
        QMark: N
        Radical: N
        Ideo: N
        UIdeo: N
        IDSB: N
        IDST: N
        hst: NA
        DI: N
        ODI: N
        Alpha: N
        OAlpha: N
        Upper: N
        OUpper: N
        Lower: N
        OLower: N
        Math: N
        OMath: N
        Hex: N
        AHex: N
        NChar: N
        VS: N
        Bidi_C: N
        Join_C: N
        Gr_Base: N
        Gr_Ext: N
        OGr_Ext: N
        Gr_Link: N
        STerm: N
        Ext: N
        Term: N
        Dia: N
        Dep: N
        IDS: N
        OIDS: N
        XIDS: N
        IDC: N
        OIDC: N
        XIDC: N
        SD: N
        LOE: N
        Pat_WS: N
        Pat_Syn: N
        GCB: CN
        WB: XX
        SB: XX
        CE: N
        Comp_Ex: N
        NFC_QC: Y
        NFD_QC: Y
        NFKC_QC: Y
        NFKD_QC: Y
        XO_NFC: N
        XO_NFD: N
        XO_NFKC: N
        XO_NFKD: N
        FC_NFKC: "#"
        CI: N
        Cased: N
        CWCF: N
        CWCM: N
        CWKCF: N
        CWL: N
        CWT: N
        CWU: N
        NFKC_CF: "#"
        InSC: Other
        InPC: NA
        PCM: N
        blk: ASCII
        isc: ""

    -
      cp: 0001
      na1: "START OF HEADING"
      name_alias:
        - [SOH,abbreviation]
        - [START OF HEADING,control]
      props: *


Regards,

Marius Spix


On Sat, 1 Sep 2018 08:00:02 +0200 (CEST) Marcel Schneider wrote:

> On 31/08/18 08:25 Marius Spix via Unicode wrote:
> >
> > A good compromise between human readability, machine processability
> > and filesize would be using YAML.
> >
> > Unlike JSON, YAML supports comments, anchors and references,
> > multiple documents in a file and several other features.
>
> Thanks for advice. Already I do use YAML syntaxic highlighting to
> display XCompose files, that use the colon as a separator, too.
>
> Did you figure out how YAML would fit UCD data? It appears to heavily
> rely on line breaks, that may get lost as data turns around across
> environments. XML indentation is only a readability feature and
> irrelevant to content. The structure is independent of invisible
> characters and is stable if only graphics are not corrupted (while it
> may happen that they are). Linebreaks are odd in that they are
> inconsistent across OSes, because Unicode was denied the right to
> impose a unique standard in that matter. The result is mashed-up
> files, and I fear YAML might not hold out.
>
> Like XML, YAML needs to repeat attribute names in every instance.
> That is precisely what CSV gets around of, at the expense of
> readability in plain text. Personally I could use YAML as I do use
> XML for lookup in the text editor, but I’m afraid that there is no
> advantage over CSV with respect to file size.
>
> Regards,
>
> Marcel
> >
> > Regards,
> >
> > Marius Spix
> >
> >
> > On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via
> > Unicode wrote:
> > > […]
Re: UCD in XML or in CSV? (was: Re: Unicode Digest, Vol 56, Issue 20)
On 31/08/18 08:25 Marius Spix via Unicode wrote:
>
> A good compromise between human readability, machine processability and
> filesize would be using YAML.
>
> Unlike JSON, YAML supports comments, anchors and references, multiple
> documents in a file and several other features.

Thanks for the advice. I already use YAML syntax highlighting to display XCompose files, which use the colon as a separator, too.

Did you figure out how YAML would fit UCD data? It appears to rely heavily on line breaks, which may get lost as data moves across environments. XML indentation is only a readability feature and irrelevant to content. The structure is independent of invisible characters and is stable if only graphics are not corrupted (while it may happen that they are). Linebreaks are odd in that they are inconsistent across OSes, because Unicode was denied the right to impose a unique standard in that matter. The result is mashed-up files, and I fear YAML might not hold out.

Like XML, YAML needs to repeat attribute names in every instance. That is precisely what CSV gets around, at the expense of readability in plain text. Personally I could use YAML as I do use XML for lookup in the text editor, but I’m afraid that there is no advantage over CSV with respect to file size.

Regards,
Marcel

> Regards,
>
> Marius Spix
>
>
> On Fri, 31 Aug 2018 06:58:37 +0200 (CEST) Marcel Schneider via Unicode
> wrote:
> […]