Re: [basex-talk] whitespace around comments

2013-04-13 Thread Liam R E Quin
On Fri, 2013-04-05 at 11:31 +0200, Dirk Kirsten wrote:

 So if you could point out some details as why this is not conforming
 behaviour, this would be interesting.

It's a requirement in the XML Spec that the XML parser pass all
whitespace back to the application. Some whitespace may be marked as not
significant - that is only possible if there's a DTD and the space is in
a context where only elements would be valid, not #PCDATA. There's no
formal specification, although constructing an XDM instance from an
infoset, and constructing an infoset from XML, does not entail
discarding these spaces:
Chopping internal whitespace nodes in mixed content contexts is not
sanctioned by any version of any XML specification, with any setting of
xml:space. I think the onus would be on you to justify the non-standard
behaviour.

On the other hand I can see its uses too. But I don't want it, and
always turn it off with BaseX :-)

Best,

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml

___
BaseX-Talk mailing list
BaseX-Talk@mailman.uni-konstanz.de
https://mailman.uni-konstanz.de/mailman/listinfo/basex-talk


Re: [basex-talk] whitespace around comments

2013-04-13 Thread Christian Grün
I’d like to add some more info on why we initially decided to chop
whitespace, and why a sudden change of the default value may break
existing applications (if you know the details, simply skip this
section):

Many XML documents contain whitespace-only text nodes for properly
indenting elements. In highly structured data (i.e., when not working
with mixed content), these nodes are in fact completely irrelevant.
For example, if the following document…

<xml>
  <a>X</a>
</xml>

…is parsed with CHOP set to true, we will get a document with a single
text node. The following query…

  for $t in //text()
  return replace node $t with 'x'

…will generate the following result:

<xml>
  <a>x</a>
</xml>

If we set CHOP to false, the document will have three text nodes, two
of them whitespace-only, and the same query will create the following
result document:

<xml>x<a>x</a>x</xml>

This is just one example to demonstrate that a sudden change of the
default for chop would most probably lead to unwanted side effects in
existing applications. Another side effect: databases are expected to
increase in size, as all whitespace nodes will get their own node ids,
will be fully stored and indexed, etc.
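
The effect described above can be reproduced outside BaseX. The following is a minimal sketch in Python using the standard xml.dom.minidom parser, not BaseX code. It assumes that "chopping" means trimming leading and trailing whitespace from every text node and dropping nodes that become empty, which is how the examples in this thread behave. The names XML_WS, text_nodes and replace_all_text are made up for illustration.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four XML whitespace characters
DOC = "<xml>\n  <a>X</a>\n</xml>"


def text_nodes(node):
    """Collect all text nodes below `node` in document order."""
    found = []
    for child in node.childNodes:
        if child.nodeType == child.TEXT_NODE:
            found.append(child)
        else:
            found.extend(text_nodes(child))
    return found


def replace_all_text(xml, chop):
    """Parse `xml`, optionally chop whitespace, then replace every
    remaining text node with 'x' (like the query in the message)."""
    root = minidom.parseString(xml).documentElement
    if chop:
        for t in text_nodes(root):
            t.data = t.data.strip(XML_WS)
            if not t.data:
                t.parentNode.removeChild(t)
    nodes = text_nodes(root)
    for t in nodes:
        t.data = "x"
    # Return the number of text nodes and the serialized result.
    return len(nodes), root.toxml()


print(replace_all_text(DOC, chop=True))   # one text node
print(replace_all_text(DOC, chop=False))  # three text nodes
```

The serialization here is unindented, so the chopped result differs from the indented output shown above only in layout; the number of text nodes and the unchopped result are the same.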

However, I completely agree that the removal of whitespace may lead
to serious changes in mixed content, and I readily admit that we
weren’t aware of all the implications some years ago when we
started off designing the database. While I still believe that our
storage copes pretty well with today’s requirements, I would love to
have some weeks off to completely rebuild it, and include
optimizations for all kinds of features that are relevant today
(including larger ranges for node ids and namespaces, or support for
other tree formats such as JSON).

Thanks for reading,
Christian


Re: [basex-talk] whitespace around comments

2013-04-12 Thread Cerstin Elisabeth Mahlow
Hi Christian,

On 12.04.2013 at 10:49, Christian Grün christian.gr...@gmail.com wrote:

 our CHOP flag is subject to frequent discussions, which is why we will
 eventually change the default to FALSE.

I really second that!

 For now, we are still a little
 bit resistant, as such a change will change the behavior of existing
 BaseX applications out there, so we’ll probably combine the switch
 with the next major release.
 
 For now, you can preserve whitespace by, e.g.:
 
 -- adding the line CHOP=false in your .basex configuration file
 -- using the basex command-line flag -w
 -- using set chop false as first command, or setting the options in
 any other way described in our Wiki [1].


The problem is that you become aware of this only AFTER you have created a DB
and worked with it.  Unfortunately, users are not informed when creating a DB
that they should think about whitespace.  And there is no reason a user should
assume that creating a DB would semantically change their data.

In the Digital Humanities, it is all about mixed content (another major issue, 
I think) as in TEI-annotated data and of course this involves whitespace.  The 
worst thing at the moment is that you cannot get back your whitespace once you 
figure out that you should have preserved it actively.  I had to recreate the 
DB and recode node-IDs in dependent DBs and so on.

So, yes please, make preserving whitespace the default behavior!

Best regards

Cerstin
-- 
Dr. phil. Cerstin Mahlow

Universität Basel
Departement Sprach- und Literaturwissenschaften
Fachbereich Deutsche Sprach- und Literaturwissenschaft
Nadelberg 4
4051 Basel
Schweiz

Tel:  +41 61 267 07 65
Fax: +41 61 267 34 40
Mail: cerstin.mah...@unibas.ch
Web: http://www.oldphras.net



Re: [basex-talk] whitespace around comments

2013-04-12 Thread Christian Grün
 The problem is, that you will be aware of this only AFTER you created a DB 
 and worked with it.  Unfortunately, users are not informed when creating a DB 
 that they should think about whitespace.  And there is no reason a user 
 should assume that creating a DB would semantically change their data. [...]

Yes, I absolutely agree. After all, it’s always tricky to handle
issues that have some historical roots.

To improve things a little, I have added support for the xml:space
attribute in the latest snapshot [1]. If you add this attribute to an
element, all whitespace in the descendant text nodes will be
preserved:

  <a xml:space="preserve">
    <b>abc</b>
  </a>

Note that the XML snippet above now contains three text nodes instead
of one, which means that the generated database will obviously take
more space. If you want to reduce memory consumption, the xml:space
attribute should either be added only to the relevant elements..

  <a>
    <b xml:space="preserve">abc</b>
  </a>

..or the XML indentations should be removed from the document:

  <a xml:space="preserve"><b>abc</b></a>
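
How a chopping step can honour xml:space may be easier to see in code. This is a minimal sketch in Python with xml.dom.minidom, not the BaseX implementation. It assumes that chopping trims every text node and drops the ones that become empty, and that the nearest xml:space declaration on an ancestor element decides whether a text node is left alone. The helper names preserves_space, chop and chopped are hypothetical.

```python
from xml.dom import minidom

XML_NS = "http://www.w3.org/XML/1998/namespace"
XML_WS = " \t\r\n"  # the four XML whitespace characters


def preserves_space(element):
    """True if the nearest xml:space on `element` or an ancestor is 'preserve'."""
    node = element
    while node is not None and node.nodeType == node.ELEMENT_NODE:
        value = node.getAttributeNS(XML_NS, "space")
        if value:
            return value == "preserve"
        node = node.parentNode
    return False


def chop(element):
    """Trim text nodes and drop empty ones, except under xml:space='preserve'."""
    for child in list(element.childNodes):
        if child.nodeType == child.TEXT_NODE:
            if preserves_space(element):
                continue
            child.data = child.data.strip(XML_WS)
            if not child.data:
                element.removeChild(child)
        elif child.nodeType == child.ELEMENT_NODE:
            chop(child)


def chopped(xml):
    """Parse, chop, and serialize the document element."""
    root = minidom.parseString(xml).documentElement
    chop(root)
    return root.toxml()


# Attribute on the outer element: the indentation nodes survive as well.
print(chopped('<a xml:space="preserve">\n  <b>abc</b>\n</a>'))
# Attribute only on the relevant element: the indentation is still chopped.
print(chopped('<a>\n  <b xml:space="preserve"> abc </b>\n</a>'))
```

In this sketch the second form keeps the whitespace inside b while the indentation around it is removed, which corresponds to the space-saving variant suggested in the message.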

Hope this helps,
Christian

[1] http://files.basex.org/releases/latest/


Re: [basex-talk] whitespace around comments

2013-04-12 Thread Liam R E Quin
On Fri, 2013-04-05 at 11:15 +0200, Michael Piotrowski wrote:
 On 2013-04-05, Michael Seiferle m...@basex.org wrote:
 chopping certainly *does* change the
 semantics--that's precisely why I've argued before that it shouldn't be
 on by default.

Agreed, but Christian has already said it will be off by default in the
next release.

I have seen a commercial SGML formatter that had a similar behaviour
used for aircraft manuals, where there was actually a possibility of
lives lost and unlimited civil damage liability as a result of numbers
run together, but I failed to get the people in charge to understand why
it made a difference.

 (and BaseX doesn't honor xml:space either).

The latest snapshot does.

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org freenode/#xml



Re: [basex-talk] whitespace around comments

2013-04-05 Thread Michael Piotrowski
On 2013-04-05, Michael Seiferle m...@basex.org wrote:

 As chopping does not change any semantics (at least with regards to
 what XML thinks of semantically important) but only aesthetics this is
 enabled by default.

I'm sorry to disagree, but chopping certainly *does* change the
semantics--that's precisely why I've argued before that it shouldn't be
on by default.

The problem becomes obvious with mixed content, e.g., with chopping
enabled

<doc>
  <p>Lorem ipsum <em>dolor</em> <x>sit</x> amet ...</p>
</doc>

becomes

<doc>
  <p>Lorem ipsum<em>dolor</em><x>sit</x>amet ...</p>
</doc>

which is *not* the same, and AFAICT this is not conforming behavior (and
BaseX doesn't honor xml:space either).
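
The mixed-content problem can be checked with any DOM. Below is a minimal sketch in Python with xml.dom.minidom, not BaseX itself. It assumes that chopping trims leading and trailing whitespace from each text node and drops nodes that end up empty, as in the example above. The names chop and string_value are illustrative only.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four XML whitespace characters
DOC = "<doc>\n  <p>Lorem ipsum <em>dolor</em> <x>sit</x> amet ...</p>\n</doc>"


def chop(node):
    """Trim every text node below `node` and drop the ones left empty."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            child.data = child.data.strip(XML_WS)
            if not child.data:
                node.removeChild(child)
        else:
            chop(child)


def string_value(node):
    """Concatenate all descendant text, similar to XPath string()."""
    if node.nodeType == node.TEXT_NODE:
        return node.data
    return "".join(string_value(c) for c in node.childNodes)


dom = minidom.parseString(DOC)
p = dom.getElementsByTagName("p")[0]

before = string_value(p)
chop(dom.documentElement)
after = string_value(p)

print(before)     # Lorem ipsum dolor sit amet ...
print(after)      # Lorem ipsumdolorsitamet ...
print(p.toxml())  # <p>Lorem ipsum<em>dolor</em><x>sit</x>amet ...</p>
```

The word boundaries exist only as whitespace next to the inline elements, so once those characters are gone the text of the paragraph reads as run-together words.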

I do understand that whitespace chopping as currently implemented is
useful for some data-oriented applications, even if it is not
conforming, but by default, the behavior should conform to the XML
standard.

Best regards

-- 
Dr.-Ing. Michael Piotrowski, M.A. m...@cl.uzh.ch
Institute of Computational Linguistics, University of Zurich
Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
* OUT NOW: Natural Language Processing for Historical Texts
* http://morganclaypool.com/doi/abs/10.2200/S00436ED1V01Y201207HLT017


Re: [basex-talk] whitespace around comments

2013-04-05 Thread Dirk Kirsten
Hello Michael,

You are certainly right that with mixed content, and the example you have
given here, chopping does make a semantic difference.
However, you can disable this behaviour so BaseX does what you want. So the
only reason I see why one should change the default behaviour would be
because the default is not conformant to some XML standard. However, I
cannot find any specifics in the spec about which is the expected behaviour,
so in my opinion BaseX is doing nothing wrong here.
I see that this behaviour might be surprising for some users, but this
might as well be the case if it were the other way round. Additionally, if
we changed this now it would break application code, and unless there
is a good reason (i.e. BaseX is actually doing something wrong or
non-compliant) I don't see why one should change the default.
So if you could point out some details as to why this is not conforming
behaviour, this would be interesting.

Cheers,
Dirk


-- 
Dirk Kirsten, BaseX GmbH, http://basex.org
|-- Firmensitz: Blarerstrasse 56, 78462 Konstanz
|-- Registergericht Freiburg, HRB: 708285, Geschäftsführer:
|   Dr. Christian Grün, Alexander Holupirek, Michael Seiferle
`-- Phone: 0049 7531 28 28 676, Fax: 0049 7531 20 05 22


Re: [basex-talk] whitespace around comments

2013-04-05 Thread Michael Piotrowski
Dirk,

On 2013-04-05, Dirk Kirsten d...@basex.org wrote:

 You are certainly right that with mixed content and the example you have
 given here chopping does make a semantic difference.
 However, you can disable this behaviour so BaseX does what you want. So the
 only reason I see why one should change the default behaviour would be
 because the default is not conformant to some XML standard. However, I
 cannot find any specifics in the spec about which is the expected behaviour,
 so in my opinion BaseX is doing nothing wrong here.

Well, if you agree that chopping may alter the semantics of a document,
wouldn't you agree that applying such a transformation *by default* is a
bad idea?

With respect to the XML specification, section 2.10 White Space
Handling says:

  An XML processor MUST always pass all characters in a document that
  are not markup through to the application.

Yes, the spec is vague with respect to whitespace handling, and the
existence of the xml:space attribute shows that different
behaviors--including potentially corrupting ones--are possible.  I would
therefore interpret the spec to mean that by default all characters should
be preserved, but that other behaviors are possible.

 I see that this behaviour might be surprising for some users, but this
 might as well be the case if it were the other way round.

No, because their documents wouldn't be corrupted.  You can easily
remove all whitespace afterwards if you decide you don't need it, but
once it's gone, it's gone and cannot be restored.  That's the problem.
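
The asymmetry can be stated concretely: chopping maps different documents to the same result, so no procedure can recover the original from the chopped form. A minimal sketch in Python with xml.dom.minidom follows. It assumes that chopping trims each text node and drops empty ones, as in the examples earlier in the thread. The function chopped and the two sample documents are made up for illustration.

```python
from xml.dom import minidom

XML_WS = " \t\r\n"  # the four XML whitespace characters


def chop(node):
    """Trim every text node below `node` and drop the ones left empty."""
    for child in list(node.childNodes):
        if child.nodeType == child.TEXT_NODE:
            child.data = child.data.strip(XML_WS)
            if not child.data:
                node.removeChild(child)
        else:
            chop(child)


def chopped(xml):
    """Parse, chop, and serialize the document element."""
    root = minidom.parseString(xml).documentElement
    chop(root)
    return root.toxml()


# Two documents that differ only in whitespace between inline elements.
WITH_SPACES = "<p>values <n>1</n> <n>2</n></p>"
RUN_TOGETHER = "<p>values<n>1</n><n>2</n></p>"

print(WITH_SPACES == RUN_TOGETHER)                    # False
print(chopped(WITH_SPACES) == chopped(RUN_TOGETHER))  # True
```

Going the other way is always possible: a database that stored the whitespace can still serve a chopped view on request, which is the point made in the paragraph above.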

 Additionally, if we would change this now it would break application
 code and unless there is a good reason (i.e. BaseX is actually doing
 something wrong or non-compliant) I don't see why one should change
 the default.

Well, I'm not on a crusade or anything, so if you believe that it's a
good idea to corrupt, by default, all documents containing mixed content
on import, or if this behavior must be kept for compatibility, so be it.
I just wanted to point out that whitespace chopping may, in fact, alter
the semantics of documents--it's not as harmless as it may seem.

Best regards

-- 
Dr.-Ing. Michael Piotrowski, M.A. m...@cl.uzh.ch
Institute of Computational Linguistics, University of Zurich
Phone +41 44 63-54313 | OpenPGP public key ID 0x1614A044
* OUT NOW: Natural Language Processing for Historical Texts
* http://morganclaypool.com/doi/abs/10.2200/S00436ED1V01Y201207HLT017


Re: [basex-talk] whitespace around comments

2013-04-05 Thread Michael Seiferle
Michael (other than me :-)) you are obviously right.


—
Kind regards
Michael Seiferle



Re: [basex-talk] whitespace around comments

2013-04-05 Thread jidanni
http://www.w3.org/TR/REC-xml/#sec-white-space
...On the other hand, significant white space that should be preserved...

So since your parser by default creates significant whitespace where there was
none, and removes it where there was, perhaps it could be fixed please, without
the user needing to take special steps. Also that would make doc() agree with
let:= as I mentioned above.