Hi Wendell, thanks for your point of view. If you decide not to introduce a schema for your data, and if you have the chance to prepare your input before adding it to the database, you may now mark all your mixed content with xml:space="preserve".
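For what it's worth, here is a minimal sketch of what such a preparation step could look like. It assumes Python and its standard xml.etree.ElementTree module. The helper names (has_mixed_content, mark_mixed_content) and the heuristic for spotting mixed content are made up for illustration and are not part of BaseX:

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the xml:space attribute; the "xml" prefix is
# always bound to this namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def has_mixed_content(elem):
    """Heuristic (an assumption): child elements plus non-whitespace text."""
    if len(elem) == 0:
        return False
    if elem.text and elem.text.strip():
        return True
    # Text that follows a child element is stored in that child's .tail.
    return any(child.tail and child.tail.strip() for child in elem)


def mark_mixed_content(root):
    """Set xml:space="preserve" on mixed-content elements; return the count."""
    marked = 0
    for elem in root.iter():
        if has_mixed_content(elem) and elem.get(XML_SPACE) is None:
            elem.set(XML_SPACE, "preserve")
            marked += 1
    return marked


sample = (
    "<div>\n"
    "  <p>Some <b>bold</b> <i>text</i>.</p>\n"
    "  <list><item>x</item></list>\n"
    "</div>"
)
root = ET.fromstring(sample)
mark_mixed_content(root)
print(ET.tostring(root, encoding="unicode"))
```

Only the p element is marked here; div and list contain nothing but elements and indentation, so they are left alone. Since xml:space applies to an element and everything inside it, marking the outermost mixed-content element is enough.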
One question to Liam: do you remember why "strip" is not a valid option for the xml:space attribute?

Christian
______________________________________

> Liam points out something very important: it is possible in principle
> to distinguish between whitespace that can be safely discarded (by
> design) and whitespace that can't -- if you have a schema or other
> specification that represents this design.
>
> As he notes, the XML Rec distinguishes between "significant" and
> "insignificant" whitespace by reference to content models that do and
> don't include #PCDATA (that is, whitespace that appears in "element
> content" or "mixed content"; cf
> http://www.w3.org/TR/REC-xml/#dt-elemcontent). If your content model
> for div says (p+), then whitespace between the 'p' element children of
> a 'div' (but not inside them) may often be judged safe to discard. (At
> least in a system in which a schema is used as a warrant of fitness
> for processing.)
>
> When technologies such as XQuery or XSLT are designed to work with and
> without schemas, however -- or where schemas cannot be considered as
> reliable indicators of markup semantics -- even relying on this
> mechanism can't solve the problem (to say nothing of deciding which
> schema languages you support). However, it can help to mitigate it.
>
> Then too, even XSLT 1.0 has strip-space and preserve-space
> configuration to indicate to a processor where it can "chop"
> whitespace. While it's a bit crude (it treats all elements with the
> same name the same), it can be useful.
>
> Over the longer term, therefore, I think that (1) CHOP needs to be
> "false" by default, (2) it should be possible to turn it on (just as I
> am learning how to turn it off), and also (3) that we also need more
> flexible and configurable means for discriminating how it should work,
> with and without schemas to reference.
>
> Cheers, Wendell
>
> On Sat, Apr 13, 2013 at 7:05 AM, Christian Grün
> <christian.gr...@gmail.com> wrote:
>> I’d like to add some more info on why we initially decided to chop
>> whitespaces, and why a sudden change of the default value may break
>> existing applications (if you know the details, simply skip this
>> section..):
>>
>> Many XML documents contain whitespace-only text nodes for properly
>> indenting elements. In highly structured data (i.e., when not working
>> with mixed content), these nodes are in fact completely irrelevant.
>> For example, if the following document…
>>
>>   <xml>
>>     <a>X</a>
>>   </xml>
>>
>> …is parsed with CHOP set to true, we will get a document with a single
>> text node. The following query…
>>
>>   for $t in //text()
>>   return replace node $t with 'x'
>>
>> …will generate the following result:
>>
>>   <xml>
>>     <a>x</a>
>>   </xml>
>>
>> If we set CHOP to false, the document will have three text nodes, two
>> of them whitespace-only, and the same query will create the following
>> result document:
>>
>>   <xml>x<a>x</a>x</xml>
>>
>> This is just one example to demonstrate that a sudden change of the
>> default for chop would most probably lead to unwanted side effects in
>> existing applications. Another side effect: databases are expected to
>> increase in size, as all whitespace nodes will get their own node ids,
>> will be fully stored and indexed, etc.
>>
>> However, I completely agree that the removal of whitespaces may lead
>> to serious changes in mixed contents, and I easily admit that we
>> haven’t been aware of all the implications some years ago when we
>> started off designing the database.
>> While I still believe that our
>> storage copes pretty well with today’s requirements, I would love to
>> have some weeks off to completely rebuild it, and include
>> optimizations for all kinds of features that are relevant today
>> (including larger ranges for node ids and namespaces, or support for
>> other tree formats such as json).
>>
>> Thanks for reading,
>> Christian
>> ___________________________
>>
>> On Sat, Apr 13, 2013 at 8:28 AM, Liam R E Quin <l...@w3.org> wrote:
>>> On Fri, 2013-04-05 at 11:31 +0200, Dirk Kirsten wrote:
>>>
>>>> So if you could point out some details as why this is not conforming
>>>> behaviour, this would be interesting.
>>>
>>> It's a requirement in the XML Spec that the XML parser pass all
>>> whitespace back to the application. Some whitespace may be marked as not
>>> significant - that is only possible if there's a DTD and the space is in
>>> a context where only elements would be valid, not #PCDATA. There's no
>>> formal specification, although constructing an XDM instance from an
>>> infoset, and constructing an infoset from XML, does not entail
>>> discarding these spaces:
>>> Chopping internal whitespace nodes in mixed content contexts is not
>>> sanctioned by any version of any XML specification, with any setting of
>>> xml:space. I think the onus would be on you to justify the non-standard
>>> behaviour.
>>>
>>> On the other hand I can see its uses too.
>>> But I don't want it, and
>>> always turn it off with BaseX :-)
>>>
>>> Best,
>>>
>>> Liam
>>>
>>> --
>>> Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
>>> Pictures from old books: http://fromoldbooks.org/
>>> Ankh: irc.sorcery.net irc.gnome.org freenode/#xml
>>>
>>> _______________________________________________
>>> BaseX-Talk mailing list
>>> BaseX-Talk@mailman.uni-konstanz.de
>>> https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
>
> --
> Wendell Piez | http://www.wendellpiez.com
> XML | XSLT | electronic publishing
> Eat Your Vegetables
> _____oo_________o_o___ooooo____ooooooo_^
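
The three-text-node example from the quoted 7:05 AM message can be reproduced outside the database. The following is only a sketch, assuming Python's standard xml.dom.minidom. The chop helper is a rough stand-in that removes whitespace-only text nodes; it is not BaseX's actual CHOP implementation, and all function names are hypothetical:

```python
from xml.dom import minidom

# The sample document; the exact indentation is an assumption.
DOC = "<xml>\n  <a>X</a>\n</xml>"


def text_nodes(node):
    """Yield every text node below node, in document order."""
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            yield child
        else:
            yield from text_nodes(child)


def chop(node):
    """Drop whitespace-only text nodes (rough stand-in for CHOP = true)."""
    for t in list(text_nodes(node)):
        if not t.data.strip():
            t.parentNode.removeChild(t)


def replace_all_text(node, value):
    """Mimic: for $t in //text() return replace node $t with 'x'."""
    for t in text_nodes(node):
        t.data = value


# Whitespace kept: three text nodes, two of them whitespace-only.
kept = minidom.parseString(DOC)
print(len(list(text_nodes(kept))))
replace_all_text(kept, "x")
print(kept.documentElement.toxml())

# Whitespace chopped: a single text node remains.
chopped = minidom.parseString(DOC)
chop(chopped)
print(len(list(text_nodes(chopped))))
replace_all_text(chopped, "x")
print(chopped.documentElement.toxml())
```

With whitespace kept, the replacement touches all three text nodes and the indentation turns into content; with the whitespace-only nodes removed, only the text inside the a element changes.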