Re: [basex-talk] Fwd: whitespace around comments

Christian Grün Thu, 18 Apr 2013 13:04:19 -0700

Hi Wendell,

Thanks for your point of view. If you decide not to introduce a schema
for your data, and if you have the chance to prepare your input before
adding it to the database, you may now mark all your mixed content
with xml:space="preserve".
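
A minimal sketch of such a preparation step, in Python with the standard
library's xml.dom.minidom. The function names and the heuristic for "mixed
content" (an element with both element children and non-whitespace text
children) are assumptions made for illustration, not part of BaseX. How the
attribute is treated on import depends on the BaseX version and its CHOP
handling.

```python
from xml.dom import minidom


def is_mixed(elem):
    """Heuristic (an assumption): an element has mixed content if it has
    at least one element child and at least one non-whitespace text child."""
    has_elem = any(c.nodeType == c.ELEMENT_NODE for c in elem.childNodes)
    has_text = any(
        c.nodeType == c.TEXT_NODE and c.data.strip() for c in elem.childNodes
    )
    return has_elem and has_text


def mark_mixed_content(xml_text):
    """Return the document element as a string, with xml:space="preserve"
    set on every element that the heuristic classifies as mixed content."""
    doc = minidom.parseString(xml_text)
    for elem in doc.getElementsByTagName("*"):
        if is_mixed(elem):
            elem.setAttribute("xml:space", "preserve")
    return doc.documentElement.toxml()


# Hypothetical sample: only <p> mixes text and elements, so only <p> is marked.
sample = "<div>\n  <p>Some <b>bold</b> <i>text</i></p>\n</div>"
print(mark_mixed_content(sample))
```

Elements that contain only other elements plus indentation (such as the div
in the sample) are left unmarked, so their whitespace can still be chopped.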


One question for Liam: do you remember why "strip" is not a valid
option for the xml:space attribute?

Christian
______________________________________

> Liam points out something very important: it is possible in principle
> to distinguish between whitespace that can be safely discarded (by
> design) and whitespace that can't -- if you have a schema or other
> specification that represents this design.
>
> As he notes, the XML Rec distinguishes between "significant" and
> "insignificant" whitespace by reference to content models that do and
> don't include #PCDATA (that is, whitespace that appears in "element
> content" or "mixed content"; cf
> http://www.w3.org/TR/REC-xml/#dt-elemcontent). If your content model
> for div says (p+), then whitespace between the 'p' element children of
> a 'div' (but not inside them) may often be judged safe to discard. (At
> least in a system in which a schema is used as a warrant of fitness
> for processing.)
>
> When technologies such as XQuery or XSLT are designed to work with and
> without schemas, however -- or where schemas cannot be considered as
> reliable indicators of markup semantics -- even relying on this
> mechanism can't solve the problem (to say nothing of deciding which
> schema languages you support). However, it can help to mitigate it.
>
> Then too, even XSLT 1.0 has strip-space and preserve-space
> configuration to indicate to a processor where it can "chop"
> whitespace. While it's a bit crude (it treats all elements with the
> same name the same), it can be useful.
>
> Over the longer term, therefore, I think that (1) CHOP needs to be
> "false" by default, (2) it should be possible to turn it on (just as I
> am learning how to turn it off), and (3) we need more
> flexible and configurable means for discriminating how it should work,
> with and without schemas to reference.
>
> Cheers, Wendell
>
>
>
> On Sat, Apr 13, 2013 at 7:05 AM, Christian Grün
> <christian.gr...@gmail.com> wrote:
>> I’d like to add some more info on why we initially decided to chop
>> whitespace, and why a sudden change of the default value may break
>> existing applications (if you know the details, simply skip this
>> section):
>>
>> Many XML documents contain whitespace-only text nodes for properly
>> indenting elements. In highly structured data (i.e., when not working
>> with mixed content), these nodes are in fact completely irrelevant.
>> For example, if the following document…
>>
>> <xml>
>>   <a>X</a>
>> </xml>
>>
>> …is parsed with CHOP set to true, we will get a document with a single
>> text node. The following query…
>>
>>   for $t in //text()
>>   return replace node $t with 'x'
>>
>> …will generate the following result:
>>
>> <xml>
>>   <a>x</a>
>> </xml>
>>
>> If we set CHOP to false, the document will have three text nodes, two
>> of them whitespace-only, and the same query will create the following
>> result document:
>>
>> <xml>x<a>x</a>x</xml>
>>
>> This is just one example to demonstrate that a sudden change of the
>> default for chop would most probably lead to unwanted side effects in
>> existing applications. Another side effect: databases are expected to
>> increase in size, as all whitespace nodes will get their own node ids,
>> will be fully stored and indexed, etc.
>>
>> However, I completely agree that the removal of whitespace may lead
>> to serious changes in mixed content, and I readily admit that we
>> weren’t aware of all the implications some years ago when we
>> started off designing the database. While I still believe that our
>> storage copes pretty well with today’s requirements, I would love to
>> have some weeks off to completely rebuild it, and include
>> optimizations for all kinds of features that are relevant today
>> (including larger ranges for node ids and namespaces, or support for
>> other tree formats such as json).
>>
>> Thanks for reading,
>> Christian
>> ___________________________
>>
>> On Sat, Apr 13, 2013 at 8:28 AM, Liam R E Quin <l...@w3.org> wrote:
>>> On Fri, 2013-04-05 at 11:31 +0200, Dirk Kirsten wrote:
>>>
>>>> So if you could point out some details as why this is not conforming
>>>> behaviour, this would be interesting.
>>>
>>> It's a requirement in the XML Spec that the XML parser pass all
>>> whitespace back to the application. Some whitespace may be marked as not
>>> significant - that is only possible if there's a DTD and the space is in
>>> a context where only elements would be valid, not #PCDATA. There's no
>>> formal specification, although constructing an XDM instance from an
>>> infoset, and constructing an infoset from XML, does not entail
>>> discarding these spaces.
>>> Chopping internal whitespace nodes in mixed content contexts is not
>>> sanctioned by any version of any XML specification, with any setting of
>>> xml:space. I think the onus would be on you to justify the non-standard
>>> behaviour.
>>>
>>> On the other hand I can see its uses too. But I don't want it, and
>>> always turn it off with BaseX :-)
>>>
>>> Best,
>>>
>>> Liam
>>>
>>> --
>>> Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
>>> Pictures from old books: http://fromoldbooks.org/
>>> Ankh: irc.sorcery.net irc.gnome.org freenode/#xml
>>>
>>> _______________________________________________
>>> BaseX-Talk mailing list
>>> BaseX-Talk@mailman.uni-konstanz.de
>>> https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk
>
>
>
> --
> Wendell Piez | http://www.wendellpiez.com
> XML | XSLT | electronic publishing
> Eat Your Vegetables
> _____oo_________o_o___ooooo____ooooooo_^
