Re: [basex-talk] whitespace around comments
On Fri, 2013-04-05 at 11:31 +0200, Dirk Kirsten wrote:
> So if you could point out some details as why this is not conforming
> behaviour, this would be interesting.

It's a requirement in the XML Spec that the XML parser pass all whitespace back to the application. Some whitespace may be marked as not significant, but that is only possible if there's a DTD and the space is in a context where only elements would be valid, not #PCDATA.

There's no formal specification, although constructing an XDM instance from an infoset, and constructing an infoset from XML, does not entail discarding these spaces: chopping internal whitespace nodes in mixed content contexts is not sanctioned by any version of any XML specification, with any setting of xml:space.

I think the onus would be on you to justify the non-standard behaviour. On the other hand I can see its uses too. But I don't want it, and always turn it off with BaseX :-)

Best,

Liam

--
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml
___
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
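To illustrate the point that a parser hands whitespace through to the application, here is a minimal sketch using Python's standard-library minidom rather than BaseX. The sample document is made up for illustration; no DTD is involved, so nothing can be marked as insignificant.

```python
from xml.dom import minidom

# Hypothetical indented, data-style document: the line breaks and spaces
# between the elements are character data, not markup.
SOURCE = "<list>\n  <item>one</item>\n  <item>two</item>\n</list>"

root = minidom.parseString(SOURCE).documentElement

# Collect the text nodes that are direct children of <list>.
whitespace_children = [
    node.data
    for node in root.childNodes
    if node.nodeType == node.TEXT_NODE
]

# Each entry is a whitespace-only string that the parser passed through.
for data in whitespace_children:
    print(repr(data))
```

Whether an application then keeps or discards these nodes is its own decision, which is exactly what the thread is arguing about.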
Re: [basex-talk] whitespace around comments
I’d like to add some more info on why we initially decided to chop whitespace, and why a sudden change of the default value may break existing applications (if you know the details, simply skip this section):

Many XML documents contain whitespace-only text nodes for properly indenting elements. In highly structured data (i.e., when not working with mixed content), these nodes are in fact completely irrelevant. For example, if the following document…

    <xml>
      <a>X</a>
    </xml>

…is parsed with CHOP set to true, we will get a document with a single text node. The following query…

    for $t in //text()
    return replace node $t with 'x'

…will generate the following result:

    <xml>
      <a>x</a>
    </xml>

If we set CHOP to false, the document will have three text nodes, two of them whitespace-only, and the same query will create the following result document:

    <xml>x<a>x</a>x</xml>

This is just one example to demonstrate that a sudden change of the default for CHOP would most probably lead to unwanted side effects in existing applications. Another side effect: databases are expected to increase in size, as all whitespace nodes will get their own node ids, will be fully stored and indexed, etc.

However, I completely agree that the removal of whitespace may lead to serious changes in mixed content, and I readily admit that we weren’t aware of all the implications some years ago when we started designing the database. While I still believe that our storage copes pretty well with today’s requirements, I would love to have some weeks off to completely rebuild it, and include optimizations for all kinds of features that are relevant today (including larger ranges for node ids and namespaces, or support for other tree formats such as JSON).

Thanks for reading,
Christian
___

On Sat, Apr 13, 2013 at 8:28 AM, Liam R E Quin l...@w3.org wrote:
> [...]
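The CHOP example from Christian's message can be approximated outside BaseX. The sketch below uses Python's minidom; the `chop` function is a hypothetical stand-in that assumes chopping means "trim every text node and drop the ones that end up empty", which is what the examples in this thread suggest. The loop over text nodes plays the role of the `replace node $t with 'x'` query. Serialization here is not indented, so the chopped result prints on one line.

```python
from xml.dom import minidom

SOURCE = "<xml>\n  <a>X</a>\n</xml>"
XML_WS = " \t\r\n"  # the four whitespace characters defined by XML


def text_nodes(node):
    """Collect all text nodes below `node` in document order."""
    found = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            found.append(child)
        else:
            found.extend(text_nodes(child))
    return found


def chop(doc):
    """Assumed CHOP=true behaviour: trim text nodes, drop empty ones."""
    for text in text_nodes(doc):
        stripped = text.data.strip(XML_WS)
        if stripped:
            text.data = stripped
        else:
            text.parentNode.removeChild(text)
    return doc


def replace_all_text(doc, value):
    """Rough equivalent of the query: overwrite every text node."""
    for text in text_nodes(doc):
        text.data = value
    return doc.documentElement.toxml()


kept = replace_all_text(minidom.parseString(SOURCE), "x")
chopped = replace_all_text(chop(minidom.parseString(SOURCE)), "x")

print(kept)     # three text nodes were replaced
print(chopped)  # only the text node inside <a> was replaced
```

The same update touches three nodes in one case and one node in the other, which is why flipping the default can change the output of existing applications.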
Re: [basex-talk] whitespace around comments
Hi Christian,

On 12.04.2013 at 10:49, Christian Grün christian.gr...@gmail.com wrote:
> our CHOP flag is subject to frequent discussions, which is why we will
> eventually change the default to FALSE.

I really second that!

> For now, we are still a little bit resistant, as such a change will
> change the behavior of existing BaseX applications out there, so we’ll
> probably combine the switch with the next major release. For now, you
> can preserve whitespace by e.g.
>
> -- adding the line CHOP=false in your .basex configuration file
> -- using the basex command-line flag -w
> -- using set chop false as first command,
>
> or setting the options in any other way described in our Wiki [1].

The problem is that you will become aware of this only AFTER you have created a DB and worked with it. Unfortunately, users are not informed when creating a DB that they should think about whitespace. And there is no reason a user should assume that creating a DB would semantically change their data.

In the Digital Humanities, it is all about mixed content (another major issue, I think), as in TEI-annotated data, and of course this involves whitespace.

The worst thing at the moment is that you cannot get back your whitespace once you figure out that you should have actively preserved it. I had to recreate the DB and recode node IDs in dependent DBs and so on.

So, yes please, make preserving whitespace the default behavior!

Best regards
Cerstin

--
Dr. phil. Cerstin Mahlow
Universität Basel
Departement Sprach- und Literaturwissenschaften
Fachbereich Deutsche Sprach- und Literaturwissenschaft
Nadelberg 4
4051 Basel
Schweiz
Tel: +41 61 267 07 65
Fax: +41 61 267 34 40
Mail: cerstin.mah...@unibas.ch
Web: http://www.oldphras.net
Re: [basex-talk] whitespace around comments
> The problem is, that you will be aware of this only AFTER you created
> a DB and worked with it. Unfortunately, users are not informed when
> creating a DB that they should think about whitespace. And there is no
> reason a user should assume that creating a DB would semantically
> change their data. [...]

Yes, I absolutely agree. After all, it’s always tricky to handle issues that have some historical roots.

To improve things a little, I have added support for xml:space attributes in the latest snapshot [1]. If you add this attribute to an element, all whitespace in the descendant text nodes will be preserved:

    <a xml:space="preserve">
      <b>abc</b>
    </a>

Note that the XML snippet above now contains three text nodes instead of one, which means that the generated database will obviously take more space. If you want to reduce memory consumption, the xml:space attribute should either be added only to the relevant elements…

    <a>
      <b xml:space="preserve">abc</b>
    </a>

…or the XML indentation should be removed from the document:

    <a xml:space="preserve"><b>abc</b></a>

Hope this helps,
Christian

[1] http://files.basex.org/releases/latest/
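The node counts Christian mentions can be checked with a small sketch. This is not BaseX code: `chop` below is a hypothetical Python function that assumes chopping trims text nodes and drops empty ones, and that `xml:space="preserve"` switches this off for an element and its descendants (with `xml:space="default"` switching it back on).

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four whitespace characters defined by XML


def chop(node, preserve=False):
    """Assumed chopping that skips subtrees under xml:space="preserve"."""
    if node.nodeType == node.ELEMENT_NODE:
        value = node.getAttribute("xml:space")
        if value == "preserve":
            preserve = True
        elif value == "default":
            preserve = False
    for child in list(node.childNodes):  # copy: we may remove children
        if child.nodeType == child.TEXT_NODE:
            if not preserve:
                stripped = child.data.strip(XML_WS)
                if stripped:
                    child.data = stripped
                else:
                    node.removeChild(child)
        else:
            chop(child, preserve)


def count_text_nodes(node):
    """Count text nodes anywhere below `node`."""
    total = 0
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            total += 1
        else:
            total += count_text_nodes(child)
    return total


def chopped_count(source):
    doc = minidom.parseString(source)
    chop(doc)
    return count_text_nodes(doc)


on_root = chopped_count('<a xml:space="preserve">\n  <b>abc</b>\n</a>')
on_leaf = chopped_count('<a>\n  <b xml:space="preserve">abc</b>\n</a>')
no_indent = chopped_count('<a xml:space="preserve"><b>abc</b></a>')

print(on_root, on_leaf, no_indent)
```

Placing the attribute on the root keeps the two indentation nodes as well as the content, while the other two variants keep only the text inside `b`.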
Re: [basex-talk] whitespace around comments
On Fri, 2013-04-05 at 11:15 +0200, Michael Piotrowski wrote:
> On 2013-04-05, Michael Seiferle m...@basex.org wrote:
> chopping certainly *does* change the semantics--that's precisely why
> I've argued before that it shouldn't be on by default.

Agreed, but Christian has already said it will be off by default in the next release.

I have seen a commercial SGML formatter that had a similar behaviour used for aircraft manuals, where there was actually a possibility of lives lost and unlimited civil damage liability as a result of numbers run together, but I failed to get the people in charge to understand why it made a difference.

> (and BaseX doesn't honor xml:space either).

The latest snapshot does.

Liam

--
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml
Re: [basex-talk] whitespace around comments
On 2013-04-05, Michael Seiferle m...@basex.org wrote:
> As chopping does not change any semantics (at least with regards to
> what XML thinks of semantically important) but only aesthetics this is
> enabled by default.

I'm sorry to disagree, but chopping certainly *does* change the semantics--that's precisely why I've argued before that it shouldn't be on by default. The problem becomes obvious with mixed content, e.g., with chopping enabled

    <doc>
      <p>Lorem ipsum <em>dolor</em> <x>sit</x> amet ...</p>
    </doc>

becomes

    <doc>
      <p>Lorem ipsum<em>dolor</em><x>sit</x>amet ...</p>
    </doc>

which is *not* the same, and AFAICT this is not conforming behavior (and BaseX doesn't honor xml:space either).

I do understand that whitespace chopping as currently implemented is useful for some data-oriented applications, even if it is not conforming, but by default, the behavior should conform to the XML standard.

Best regards

--
Dr.-Ing. Michael Piotrowski, M.A. m...@cl.uzh.ch
Institute of Computational Linguistics, University of Zurich
Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
* OUT NOW: Natural Language Processing for Historical Texts *
http://morganclaypool.com/doi/abs/10.2200/S00436ED1V01Y201207HLT017
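The effect on mixed content is easy to reproduce. The following sketch uses Python's ElementTree, not BaseX; `chop` is a hypothetical approximation that assumes chopping trims every text node and drops the ones that become empty, which matches the before/after pair in the message above.

```python
import xml.etree.ElementTree as ET

XML_WS = " \t\r\n"  # the four whitespace characters defined by XML

SOURCE = "<doc>\n  <p>Lorem ipsum <em>dolor</em> <x>sit</x> amet ...</p>\n</doc>"


def chop(root):
    """Assumed chopping: trim all text, discard what becomes empty."""
    for node in root.iter():
        # ElementTree stores text before the first child in .text and
        # text following an element's end tag in .tail.
        node.text = (node.text or "").strip(XML_WS) or None
        node.tail = (node.tail or "").strip(XML_WS) or None
    return root


original = ET.fromstring(SOURCE)
chopped = chop(ET.fromstring(SOURCE))

# The string value of <p> is what a reader, an index, or a search sees.
before = "".join(original.find("p").itertext())
after = "".join(chopped.find("p").itertext())
serialized = ET.tostring(chopped, encoding="unicode")

print(before)  # Lorem ipsum dolor sit amet ...
print(after)   # Lorem ipsumdolorsitamet ...
```

The words run together in the string value of the paragraph, so the change is visible to any full-text query and not only in the serialized markup.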
Re: [basex-talk] whitespace around comments
Hello Michael,

You are certainly right that with mixed content and the example you have given here chopping does make a semantic difference. However, you can disable this behaviour so BaseX does what you want. So the only reason I see why one should change the default behaviour would be because the default is not conformant to some XML standard. However, I cannot find any specifics in the spec about which is the expected behaviour, so in my opinion BaseX is doing nothing wrong here.

I see that this behaviour might be surprising for some users, but this might as well be the case if it were the other way round. Additionally, if we changed this now it would break application code, and unless there is a good reason (i.e. BaseX is actually doing something wrong or non-compliant) I don't see why one should change the default.

So if you could point out some details as to why this is not conforming behaviour, this would be interesting.

Cheers,
Dirk

On Fri, Apr 5, 2013 at 11:15 AM, Michael Piotrowski m...@cl.uzh.ch wrote:
> [...]

--
Dirk Kirsten, BaseX GmbH, http://basex.org
|-- Registered office: Blarerstrasse 56, 78462 Konstanz
|-- Register court Freiburg, HRB: 708285, Managing directors:
|   Dr. Christian Grün, Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22
Re: [basex-talk] whitespace around comments
Dirk,

On 2013-04-05, Dirk Kirsten d...@basex.org wrote:
> You are certainly right that with mixed content and the example you
> have given here chopping does make a semantic difference. However, you
> can disable this behaviour so BaseX does what you want. So the only
> reason I see why one should change the default behaviour would be
> because the default is not conformant to some XML standard. However, I
> cannot find any specifics in the spec about which is the expected
> behaviour, so in my opinion BaseX is doing nothing wrong here.

Well, if you agree that chopping may alter the semantics of a document, wouldn't you agree that applying such a transformation *by default* is a bad idea?

With respect to the XML specification, section 2.10 White Space Handling says:

    An XML processor MUST always pass all characters in a document
    that are not markup through to the application.

Yes, the spec is vague with respect to whitespace handling, and the existence of the xml:space attribute shows that different behaviors--including potentially corrupting ones--are possible. I would therefore interpret the spec to mean that by default all characters should be preserved, but that other behaviors are possible.

> I see that this behaviour might be surprising for some users, but this
> might as well be the case if it were the other way round.

No, because their documents wouldn't be corrupted. You can easily remove all whitespace afterwards if you decide you don't need it, but once it's gone, it's gone and cannot be restored. That's the problem.

> Additionally, if we would change this now it would break application
> code and unless there is a good reason (i.e. BaseX is actually doing
> something wrong or non-compliant) I don't see why one should change
> the default.

Well, I'm not on a crusade or anything, so if you believe that it's a good idea to corrupt, by default, all documents containing mixed content on import, or if this behavior must be kept for compatibility, so be it. I just wanted to point out that whitespace chopping may, in fact, alter the semantics of documents--it's not as harmless as it may seem.

Best regards

--
Dr.-Ing. Michael Piotrowski, M.A. m...@cl.uzh.ch
Institute of Computational Linguistics, University of Zurich
Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
* OUT NOW: Natural Language Processing for Historical Texts *
http://morganclaypool.com/doi/abs/10.2200/S00436ED1V01Y201207HLT017
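The "once it's gone, it's gone" argument can be made concrete: if two different inputs produce the same chopped output, no procedure can recover the original from the stored result. A minimal sketch in Python, where `chop` is again a hypothetical approximation (trim all text, drop what becomes empty) and the two inputs are made up for illustration:

```python
import xml.etree.ElementTree as ET

XML_WS = " \t\r\n"  # the four whitespace characters defined by XML


def chop(source):
    """Parse, apply the assumed chopping, and serialize again."""
    root = ET.fromstring(source)
    for node in root.iter():
        node.text = (node.text or "").strip(XML_WS) or None
        node.tail = (node.tail or "").strip(XML_WS) or None
    return ET.tostring(root, encoding="unicode")


# Two documents that differ only in one inter-element space.
with_space = "<p><em>dolor</em> <x>sit</x></p>"
without_space = "<p><em>dolor</em><x>sit</x></p>"

same_after_chop = chop(with_space) == chop(without_space)
print(same_after_chop)
```

Going the other way is harmless: a document stored with its whitespace can still be chopped later, which is the asymmetry the message points out.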
Re: [basex-talk] whitespace around comments
Michael (other than me :-)) you are obviously right.

--
Kind regards
Michael Seiferle

On Fri, Apr 5, 2013 at 12:29 PM, Michael Piotrowski m...@cl.uzh.ch wrote:
> [...]
Re: [basex-talk] whitespace around comments
http://www.w3.org/TR/REC-xml/#sec-white-space

> ...On the other hand, significant white space that should be
> preserved...

So since your parser by default creates significant whitespace where there was none, and removes it where there was, perhaps it could be fixed please, without the user needing to take special steps. Also that would make doc() agree with let:= as I mentioned above.