Re: [gentoo-dev] New schema language for metadata validation?

2016-01-27 Thread Ulrich Mueller
> On Wed, 27 Jan 2016, Michał Górny wrote:

> Do we use the  variant anywhere? If not, I suggest we
> drop it since it's completely unclear to me and only pollutes the
> schema.

+1

Here is the commit (from 2003) that introduced the packages
element:
https://sources.gentoo.org/cgi-bin/viewvc.cgi/gentoo/xml/htdocs/dtd/metadata.dtd?revision=1.4=markup

Ulrich




Re: [gentoo-dev] New schema language for metadata validation?

2016-01-27 Thread Dirkjan Ochtman
On Wed, Jan 27, 2016 at 4:43 PM, Michał Górny  wrote:
> Yes, that part makes some sense. Except that it immediately follows
> braces which makes me think it applies only to the thing in the braces.
> Furthermore, the use of {} vs () seems pretty much random, and the &
> is completely unclear what it could mean (I'm not reading the docs
> here!). I look at it and I wonder if that forces some ordering or not,
> if it supports interspersing or not. And finally, the fact that '*'
> follows closing brace on the other end of file does not help
> readability.

A full expression runs from a keyword to the closing brace (e.g.
"element  { }"). Parentheses are only used to group expressions
that use the infix operators ("," for sequence, i.e. a fixed order,
"|" for alternatives, and "&" for interleaving), because there's no
implicit precedence order. Yes, the quantifier comes at the end. On the
other hand, you could rewrite the large expression to be a named
pattern, and then put the quantifier after the pattern name if you
prefer that style. That would probably make more sense for large
named patterns. :)
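
To make the difference between the three infix operators concrete, here is
a toy model in Python. It is only an illustration under strong assumptions:
every operand is a single required element, children are given as a list of
element names, and the function names are made up. It is not how a RELAX NG
validator actually works.

```python
def matches_sequence(children, names):
    """RNC "a, b": the named elements must appear in exactly this order."""
    return list(children) == list(names)


def matches_interleave(children, names):
    """RNC "a & b": the same elements, but in any order."""
    return sorted(children) == sorted(names)


def matches_choice(children, names):
    """RNC "a | b": exactly one of the alternatives."""
    return len(children) == 1 and children[0] in names
```

So with the operands "a" and "b", the child list ["b", "a"] is rejected by
the sequence check but accepted by the interleave check.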

Cheers,

Dirkjan



Re: [gentoo-dev] New schema language for metadata validation?

2016-01-27 Thread Ulrich Mueller
> On Wed, 27 Jan 2016, Michał Górny wrote:

> First of all, I don't like RELAX-NG Compact at all. It looks like
> someone tried hard to combine some variation of BNF, DOCTYPE and
> something else in order to get something that is both readable and
> compact. And got a result that doesn't meet either criterion. It
> looks like some terrible mixture of over-verbose descriptive text
> format with a lot of enigmatic symbols that are not even clear what
> they apply to.

> Secondly, RELAX-NG and XML Schema look pretty similar in volume.
> However, XML Schema looks definitely more readable, robust and
> XML-ish (and doesn't use camelcase!). Furthermore, as far as I'm
> aware XML Schema is more widely supported (not sure if that applies
> to any tools we're considering).

> Therefore, I'd suggest we just ship properly hand-written XML
> Schema, with some nice comments. I don't see a reason to ship any
> RELAX-NG files unless we actually have tools that support only that.

Emacs nXML mode supports only RNC. Do we have a tool (i.e. a package
in the tree) for automatic conversion from XML Schema to RNC?

Ulrich




Re: [gentoo-dev] New schema language for metadata validation?

2016-01-27 Thread Michał Górny
On Tue, 26 Jan 2016 20:52:09 +0100
Dirkjan Ochtman  wrote:

> TL;DR: I think we should switch from DTD to RELAX NG (compact syntax,
> ideally) for our XML validation needs. It is more expressive and more
> readable.

Oh, one more thing.

Do we use the  variant anywhere? If not, I suggest we drop
it since it's completely unclear to me and only pollutes the schema.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New schema language for metadata validation?

2016-01-27 Thread Dirkjan Ochtman
On Wed, Jan 27, 2016 at 4:20 PM, Michał Górny  wrote:
> First of all, I don't like RELAX-NG Compact at all. It looks like
> someone tried hard to combine some variation of BNF, DOCTYPE
> and something else in order to get something that is both readable
> and compact. And got a result that doesn't meet either criterion.
> It looks like some terrible mixture of over-verbose descriptive text
> format with a lot of enigmatic symbols that are not even clear what
> they apply to.

Wow, that's surprising to me! I found that a lot of the compact syntax
made immediate sense to me as I was already familiar with what ?*+
mean from EBNF and regular expressions. For me, it's mostly how much
less verbose it is than a full XML syntax that makes it easier to
comprehend and manipulate.

> Secondly, RELAX-NG and XML Schema look pretty similar in volume.
> However, XML Schema looks definitely more readable, robust and XML-ish
> (and doesn't use camelcase!). Furthermore, as far as I'm aware XML
> Schema is more widely supported (not sure if that applies to any tools
> we're considering).

I agree that XML Schema is probably more widely supported, though it'd
be hard to assess by how much. On the other hand, I find XML Schema much
less readable; and it feels like "more XML-ish" is just because it
uses namespaces a lot more, and is more commonly used? Indeed, to me
the fact that RELAX NG is less XML-ish is a positive aspect.

> Therefore, I'd suggest we just ship properly hand-written XML Schema,
> with some nice comments. I don't see a reason to ship any RELAX-NG
> files unless we actually have tools that support only that.

I'd be curious what Michael, Ulrich, and others think.

Cheers,

Dirkjan



Re: [gentoo-dev] New schema language for metadata validation?

2016-01-27 Thread Michał Górny
On Wed, 27 Jan 2016 16:28:13 +0100
Dirkjan Ochtman  wrote:

> On Wed, Jan 27, 2016 at 4:20 PM, Michał Górny  wrote:
> > First of all, I don't like RELAX-NG Compact at all. It looks like
> > someone tried hard to combine some variation of BNF, DOCTYPE
> > and something else in order to get something that is both readable
> > and compact. And got a result that doesn't meet either criterion.
> > It looks like some terrible mixture of over-verbose descriptive text
> > format with a lot of enigmatic symbols that are not even clear what
> > they apply to.  
> 
> Wow, that's surprising to me! I found that a lot of the compact syntax
> made immediate sense to me as I was already familiar with what ?*+
> mean from EBNF and regular expressions. For me, it's mostly how much
> less verbose it is than a full XML syntax that makes it easier to
> comprehend and manipulate.

Yes, that part makes some sense. Except that it immediately follows
braces which makes me think it applies only to the thing in the braces.
Furthermore, the use of {} vs () seems pretty much random, and the &
is completely unclear what it could mean (I'm not reading the docs
here!). I look at it and I wonder if that forces some ordering or not,
if it supports interspersing or not. And finally, the fact that '*'
follows closing brace on the other end of file does not help
readability.

> > Secondly, RELAX-NG and XML Schema look pretty similar in volume.
> > However, XML Schema looks definitely more readable, robust and XML-ish
> > (and doesn't use camelcase!). Furthermore, as far as I'm aware XML
> > Schema is more widely supported (not sure if that applies to any tools
> > we're considering).  
> 
> I agree that XML Schema is probably more widely supported, though it'd
> be hard to assess by how much. On the other hand, I find XML Schema much
> less readable; and it feels like "more XML-ish" is just because it
> uses namespaces a lot more, and is more commonly used? Indeed, to me
> the fact that RELAX NG is less XML-ish is a positive aspect.

When you use XML, use XML. If you don't want XML, don't use XML. Don't
try to make non-XML out of XML.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New schema language for metadata validation?

2016-01-27 Thread Michael Orlitzky
On 01/27/2016 10:28 AM, Dirkjan Ochtman wrote:
> 
>> Therefore, I'd suggest we just ship properly hand-written XML Schema,
>> with some nice comments. I don't see a reason to ship any RELAX-NG
>> files unless we actually have tools that support only that.
> 
> I'd be curious what Michael, Ulrich, and others think.
> 

I'm ambivalent; I gave up my emotional attachment to XML back when HTML
jumped the shark. Being able to type-check a document is important, but
XSD/RNG are a means to an end and I could put up with either one.




Re: [gentoo-dev] New schema language for metadata validation?

2016-01-27 Thread Michał Górny
On Tue, 26 Jan 2016 20:52:09 +0100
Dirkjan Ochtman  wrote:

> All,
> 
> TL;DR: I think we should switch from DTD to RELAX NG (compact syntax,
> ideally) for our XML validation needs. It is more expressive and more
> readable.
> 
> Most people who know anything about XML stuff know that DTDs are not
> that great a solution for validation. Their expressive power is very
> limited; there are a few examples of this in our metadata.dtd [1].
> For a few years now, I've wanted to see if we could replace
> metadata.dtd with something in RELAX NG, which is a more modern XML
> schema language; it's an ISO standard with an emphasis on readability
> both for humans and for tools (by using a rigorous formalism). Some
> arguments in favor of RELAX NG (and some counter-arguments) are
> enumerated on Tim Bray's weblog [2]. I've created a compact syntax
> schema for metadata that can validate all metadata.xml files currently
> in the tree, as an example [3].
> 
> Some arguments against:
> 
> - Not enough tool support for RELAX NG: I'd be curious to hear what
> tools you want to use. At least libxml2 supports RELAX NG natively.
> The Python lxml library uses that support to provide pretty simple
> RELAX NG validation. libxml2 does not have native compact syntax
> support, but I maintain a simple library called rnc2rng [4] that is
> used transparently by lxml if installed. rnc2rng also comes with a
> rnc2rng command-line script to do the conversion.
> 
> - Performance: in a quick test with lxml (backed by libxml2), RELAX NG
> validation takes very similar time compared to DTD. Testing with
> ~19000 metadata.xml files in the tree, with DTD (best of 3):
> 
> real    0m2.861s
> user    0m2.560s
> sys     0m0.296s
> 
> With RNC (best of 3):
> 
> real    0m3.058s
> user    0m2.688s
> sys     0m0.364s
> 
> We could probably easily maintain an XML Schema shadow schema if
> that's really desired, but I would be in favor of making RELAX NG our
> main schema language. I can easily do the work to update repoman for
> this (I've already refactored the metadata code in repoman). What
> other stuff would need to be updated?
> 
> Comments?

Could you post a generated .rng and XML Schema files for comparison?
They don't have to be perfect conversions, just to see how different
they are.

-- 
Best regards,
Michał Górny





Re: [gentoo-dev] New schema language for metadata validation?

2016-01-27 Thread Dirkjan Ochtman
On Tue, Jan 26, 2016 at 9:52 PM, Michael Orlitzky  wrote:
> I would appreciate examples of some common tasks like validating
> projects.xml, but since we don't have those now, it's not critical.
> This used to be kinda straightforward with xmllint,
>
>   $ xmllint --valid --noout projects.xml && echo "OK"

The closest equivalent for this is just:

xmllint --relaxng metadata.rng --noout metadata.xml && echo "OK"

I.e., you have to specify the schema file manually (also, as mentioned
before, libxml2 does not support RNC natively, so you have to convert
to RNG first -- but we can keep those around). You can use a non-HTTPS
URL for --relaxng, as well.
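
For scripting, the command above could be wrapped roughly as follows. This
is only a sketch: the function names are made up for illustration, it
assumes the xmllint binary from libxml2 is on PATH, and it uses only the
--relaxng and --noout options already shown above.

```python
import subprocess


def xmllint_relaxng_cmd(schema, document):
    """Build the xmllint argument list for RELAX NG (XML syntax) validation."""
    return ["xmllint", "--relaxng", schema, "--noout", document]


def validate(schema, document):
    """Return True if xmllint exits with status 0, i.e. the document validates.

    Assumes xmllint is installed; raises FileNotFoundError otherwise.
    """
    result = subprocess.run(xmllint_relaxng_cmd(schema, document))
    return result.returncode == 0


# Example call (file names taken from the command above):
#   validate("metadata.rng", "metadata.xml")
```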

There is a standard to link an XML file to a RELAX NG (XML or compact
syntax) schema, here:

http://www.w3.org/TR/xml-model/

But libxml2 does not seem to support it; that is, substituting the
DOCTYPE for an xml-model processing instruction and then using xmllint
--valid does not do the right thing (it complains there's no DOCTYPE).

Cheers,

Dirkjan



Re: [gentoo-dev] New schema language for metadata validation?

2016-01-27 Thread Brian Dolbec
On Wed, 27 Jan 2016 14:39:02 +0100
Dirkjan Ochtman  wrote:

> On Wed, Jan 27, 2016 at 1:09 PM, Michał Górny 
> wrote:
> > Could you post a generated .rng and XML Schema files for comparison?
> > They don't have to be perfect conversions, just to see how different
> > they are.  
> 
> Here's the RNG, generated with dev-python/rnc2rng:
> 
> https://raw.githubusercontent.com/djc/gentoo-data-dtd/metadata-rnc/metadata.rng
> 
> The best way to convert from RELAX NG to XML Schema seems to be with
> trang; I downloaded an older binary and a JDK on my laptop, but
> couldn't easily get it to run. I don't really have a Gentoo machine on
> which I want to install the whole Java shebang, so maybe someone else
> can run a quick conversion?
> 
> Cheers,
> 
> Dirkjan
> 

That looks very easy to read and modify.

-- 
Brian Dolbec 




Re: [gentoo-dev] New schema language for metadata validation?

2016-01-27 Thread Michael Orlitzky
On 01/27/2016 04:22 AM, Dirkjan Ochtman wrote:
> On Tue, Jan 26, 2016 at 9:52 PM, Michael Orlitzky  wrote:
>> I would appreciate examples of some common tasks like validating
>> projects.xml, but since we don't have those now, it's not critical.
>> This used to be kinda straightforward with xmllint,
>>
>>   $ xmllint --valid --noout projects.xml && echo "OK"
> 
> The closest equivalent for this is just:
> 
> xmllint --relaxng metadata.rng --noout metadata.xml && echo "OK"
> 

Ok, so basically the same situation we have now. You have to figure out
where to get the rng and supply it manually...


> I.e., you have to specify the schema file manually (also, as mentioned
> before, libxml2 does not support RNC natively, so you have to convert
> to RNG first -- but we can keep those around). You can use a non-HTTPS
> URL for --relaxng, as well.
> 
> There is a standard to link an XML file to a RELAX NG (XML or compact
> syntax) schema, here:
> 
> http://www.w3.org/TR/xml-model/

Nice! I was looking for this.


> But libxml2 does not seem to support it; that is, substituting the
> DOCTYPE for an xml-model processing instruction and then using xmllint
> --valid does not do the right thing (it complains there's no DOCTYPE).

But does it /complain/ about the xml-model? Is it safe to add that to
our XML files (in terms of tooling and stability of the spec)? If so, I
can at least script the validation: parse the href from xml-model, fetch
it somehow, run it through rnc2rng, and then pass it to xmllint.

Or we could even generate the rng files automatically and host them like
we do the DTDs to skip a step.
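
The first step of the scripted validation described above (parsing the href
from xml-model) could look roughly like this in Python, using only the
standard library. It is a sketch, not a tested tool: the function name and
the example URL are hypothetical, and fetching the schema, converting it
with rnc2rng and calling xmllint are left out.

```python
import re
from xml.dom import minidom

# Pseudo-attributes inside the processing instruction, e.g. href="..."
PSEUDO_ATTR = re.compile(r"""([\w:.-]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


def xml_model_hrefs(xml_text):
    """Return the href of every document-level xml-model processing instruction."""
    doc = minidom.parseString(xml_text)
    hrefs = []
    for node in doc.childNodes:
        if (node.nodeType == node.PROCESSING_INSTRUCTION_NODE
                and node.target == "xml-model"):
            attrs = {}
            for match in PSEUDO_ATTR.finditer(node.data):
                # Group 2 holds a double-quoted value, group 3 a single-quoted one.
                value = match.group(2) if match.group(2) is not None else match.group(3)
                attrs[match.group(1)] = value
            if "href" in attrs:
                hrefs.append(attrs["href"])
    return hrefs


# Hypothetical document; the schema URL is a placeholder, not a real location.
sample = (
    '<?xml-model href="https://example.org/metadata.rnc"'
    ' type="application/relax-ng-compact-syntax"?>'
    "<pkgmetadata/>"
)
print(xml_model_hrefs(sample))
```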




Re: [gentoo-dev] New schema language for metadata validation?

2016-01-27 Thread Dirkjan Ochtman
On Wed, Jan 27, 2016 at 1:09 PM, Michał Górny  wrote:
> Could you post a generated .rng and XML Schema files for comparison?
> They don't have to be perfect conversions, just to see how different
> they are.

Here's the RNG, generated with dev-python/rnc2rng:

https://raw.githubusercontent.com/djc/gentoo-data-dtd/metadata-rnc/metadata.rng

The best way to convert from RELAX NG to XML Schema seems to be with
trang; I downloaded an older binary and a JDK on my laptop, but
couldn't easily get it to run. I don't really have a Gentoo machine on
which I want to install the whole Java shebang, so maybe someone else
can run a quick conversion?

Cheers,

Dirkjan



Re: [gentoo-dev] New schema language for metadata validation?

2016-01-27 Thread Dirkjan Ochtman
On Wed, Jan 27, 2016 at 3:21 PM, Michael Orlitzky  wrote:
>> But libxml2 does not seem to support it; that is, substituting the
>> DOCTYPE for an xml-model processing instruction and then using xmllint
>> --valid does not do the right thing (it complains there's no DOCTYPE).
>
> But does it /complain/ about the xml-model? Is it safe to add that to
> our XML files (in terms of tooling and stability of the spec)? If so, I
> can at least script the validation: parse the href from xml-model, fetch
> it somehow, run it through rnc2rng, and then pass it to xmllint.

It does not seem to complain about the xml-model, so that should be
quite viable.

Can I ask what your interest is? What tools are you involved with that
would want to use this?

> Or we could even generate the rng files automatically and host them like
> we do the DTDs to skip a step.

Yeah, that's what I was thinking. Too bad that we have to drag around
both, but I think the advantages in terms of readability and
modifiability for RNC and tool support for RNG really do make it the
best solution to have canonical RNCs with pre-generated RNGs.

Cheers,

Dirkjan



Re: [gentoo-dev] New schema language for metadata validation?

2016-01-27 Thread Ulrich Mueller
> On Wed, 27 Jan 2016, Dirkjan Ochtman wrote:

> Here's the RNG, generated with dev-python/rnc2rng:

> https://raw.githubusercontent.com/djc/gentoo-data-dtd/metadata-rnc/metadata.rng

> The best way to convert from RELAX NG to XML Schema seems to be with
> trang; I downloaded an older binary and a JDK on my laptop, but
> couldn't easily get it to run. I don't really have a Gentoo machine on
> which I want to install the whole Java shebang, so maybe someone else
> can run a quick conversion?

Voila: http://dev.gentoo.org/~ulm/metadata.xsd

This was generated with:
$ trang -I rng -O xsd \
    https://raw.githubusercontent.com/djc/gentoo-data-dtd/metadata-rnc/metadata.rng \
    metadata.xsd

Ulrich




Re: [gentoo-dev] New schema language for metadata validation?

2016-01-27 Thread Michał Górny
On Wed, 27 Jan 2016 14:39:02 +0100
Dirkjan Ochtman  wrote:

> On Wed, Jan 27, 2016 at 1:09 PM, Michał Górny  wrote:
> > Could you post a generated .rng and XML Schema files for comparison?
> > They don't have to be perfect conversions, just to see how different
> > they are.  
> 
> Here's the RNG, generated with dev-python/rnc2rng:
> 
> https://raw.githubusercontent.com/djc/gentoo-data-dtd/metadata-rnc/metadata.rng
> 
> The best way to convert from RELAX NG to XML Schema seems to be with
> trang; I downloaded an older binary and a JDK on my laptop, but
> couldn't easily get it to run. I don't really have a Gentoo machine on
> which I want to install the whole Java shebang, so maybe someone else
> can run a quick conversion?

Thanks to you and to ulm for the XML Schema conversion. Now my points.

First of all, I don't like RELAX-NG Compact at all. It looks like
someone tried hard to combine some variation of BNF, DOCTYPE
and something else in order to get something that is both readable
and compact. And got a result that doesn't meet either criterion.
It looks like some terrible mixture of over-verbose descriptive text
format with a lot of enigmatic symbols that are not even clear what
they apply to.

Secondly, RELAX-NG and XML Schema look pretty similar in volume.
However, XML Schema looks definitely more readable, robust and XML-ish
(and doesn't use camelcase!). Furthermore, as far as I'm aware XML
Schema is more widely supported (not sure if that applies to any tools
we're considering).

Therefore, I'd suggest we just ship properly hand-written XML Schema,
with some nice comments. I don't see a reason to ship any RELAX-NG
files unless we actually have tools that support only that.

-- 
Best regards,
Michał Górny





[gentoo-dev] New schema language for metadata validation?

2016-01-26 Thread Dirkjan Ochtman
All,

TL;DR: I think we should switch from DTD to RELAX NG (compact syntax,
ideally) for our XML validation needs. It is more expressive and more
readable.

Most people who know anything about XML stuff know that DTDs are not
that great a solution for validation. Their expressive power is very
limited; there are a few examples of this in our metadata.dtd [1].
For a few years now, I've wanted to see if we could replace
metadata.dtd with something in RELAX NG, which is a more modern XML
schema language; it's an ISO standard with an emphasis on readability
both for humans and for tools (by using a rigorous formalism). Some
arguments in favor of RELAX NG (and some counter-arguments) are
enumerated on Tim Bray's weblog [2]. I've created a compact syntax
schema for metadata that can validate all metadata.xml files currently
in the tree, as an example [3].

Some arguments against:

- Not enough tool support for RELAX NG: I'd be curious to hear what
tools you want to use. At least libxml2 supports RELAX NG natively.
The Python lxml library uses that support to provide pretty simple
RELAX NG validation. libxml2 does not have native compact syntax
support, but I maintain a simple library called rnc2rng [4] that is
used transparently by lxml if installed. rnc2rng also comes with a
rnc2rng command-line script to do the conversion.

- Performance: in a quick test with lxml (backed by libxml2), RELAX NG
validation takes very similar time compared to DTD. Testing with
~19000 metadata.xml files in the tree, with DTD (best of 3):

real    0m2.861s
user    0m2.560s
sys     0m0.296s

With RNC (best of 3):

real    0m3.058s
user    0m2.688s
sys     0m0.364s
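
A "best of 3" comparison like the one above could be reproduced with a small
harness along these lines. This is not the script that produced the numbers
above, just a sketch: the function names are made up, it measures wall-clock
time only (no user/sys split), and validate_all() assumes lxml is installed
and that the schema is available in RELAX NG XML syntax.

```python
import time


def best_of(func, repeats=3):
    """Run func() `repeats` times; return the fastest wall-clock time in seconds."""
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        timings.append(time.perf_counter() - start)
    return min(timings)


def validate_all(paths, schema_path):
    """Validate each file against a RELAX NG (XML syntax) schema via lxml.

    lxml is imported lazily so that best_of() can be used without it.
    """
    from lxml import etree

    schema = etree.RelaxNG(etree.parse(schema_path))
    return [schema.validate(etree.parse(path)) for path in paths]


# Example call, with `paths` being a list of metadata.xml files:
#   best_of(lambda: validate_all(paths, "metadata.rng"))
```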

We could probably easily maintain an XML Schema shadow schema if
that's really desired, but I would be in favor of making RELAX NG our
main schema language. I can easily do the work to update repoman for
this (I've already refactored the metadata code in repoman). What
other stuff would need to be updated?

Comments?

Cheers,

Dirkjan

[1] https://github.com/djc/gentoo-data-dtd/blob/metadata-rnc/metadata.dtd
[2] https://www.tbray.org/ongoing/When/200x/2006/11/27/Choose-Relax
[3] https://github.com/djc/gentoo-data-dtd/blob/metadata-rnc/metadata.rnc
[4] https://github.com/djc/rnc2rng



Re: [gentoo-dev] New schema language for metadata validation?

2016-01-26 Thread Michael Orlitzky
On 01/26/2016 02:52 PM, Dirkjan Ochtman wrote:
> All,
> 
> TL;DR: I think we should switch from DTD to RELAX NG (compact syntax,
> ideally) for our XML validation needs. It is more expressive and more
> readable.
> 

A great idea.


> What other stuff would need to be updated?
> 

I would appreciate examples of some common tasks like validating
projects.xml, but since we don't have those now, it's not critical.
This used to be kinda straightforward with xmllint,

  $ xmllint --valid --noout projects.xml && echo "OK"

but now that www.gentoo.org is on HTTPS, even that doesn't work. There's
an example in the devmanual that needs updating along those lines (at
the bottom of the page). In fact, all of the herds junk needs to be updated:

  https://devmanual.gentoo.org/ebuild-writing/misc-files/metadata/

Our DTDs are available under https://www.gentoo.org/dtd/ -- do we need
to put the rnc files somewhere accessible? Or do we only need the DTDs
public for the DOCTYPE declarations?

Thanks for preemptively hacking repoman.