Re: lxml/ElementTree and .tail

2006-11-19 Thread Uche Ogbuji

Fredrik Lundh wrote:
 Uche Ogbuji wrote:

  I certainly have never liked the aspects of the ElementTree API under
  present discussion.  But that's not as important as the fact that I
  think the above statement is misleading.  There has always been a
  battle in XML between the people who think the serialization is
  preeminent, and those who believe some data model is preeminent, but
  the reality is that XML 1.0 (and 1.1) is a spec *defined* by its
  serialization.

 sure, the computing world is and has always been full of people who want
 the simplest thing to look a lot harder than it actually is.  after all,
 *they* spent lots of time reading all the specifications, they've bought
 all the books, and went to all the seminars, so it's simply not fair
 when others are cheating.

You sound bitter about something.  Don't worry, it's really not all
that serious.

 in reality, *all* interchange formats are easier to understand and use
 if you focus on a (complete or intentionally simplified) data model of
 the things being interchanged, and treat various artifacts of the
 byte-stream used by the wire format as artifacts, historical accidents
 based on what specification happened to be written before the other, or
 what some guy did or did not do in the seventies, as accidents, and
 esoteric arcana disseminated on limited-distribution mailing lists as
 about as relevant for your customer as last week's episode of American Idol.

The fact that the XML Infoset is hardly used outside W3C XML Schema,
and that the XPath data model is far more common, and that focus on the
serialization is even more common than that is a matter of everyday
practicality.

And oh by the way, this thread is all about *your* customer's
complaining.  And your response is to give them your philosophical take
on XML.  Doesn't that contradict what you're saying above?

Oh never mind.  You posted something misleading, and I posted another
point of view.  I know you're incapable of any disagreement that
doesn't devolve into a full-scale flame-war.  Sometimes I have time for
that sort of thing.  This is not one of those times, so this is
probably where I get off.

--
Uche Ogbuji   Fourthought, Inc.
http://uche.ogbuji.net   http://fourthought.com
http://copia.ogbuji.net   http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: lxml/ElementTree and .tail

2006-11-19 Thread Uche Ogbuji
Paul McGuire wrote:
 Thankfully, I'm largely on the periphery of that universe (except for being
 a sometimes victim).  But it is certainly frustrating to see many of the OMG
 concepts of the 90's reimplemented in Java services, and then again in
 XML/SOAP, with no detectable awareness that these messaging and
 serialization problems have been considered before, and much more
 thoroughly.

You'll be surprised at how many XMLers agree that Web services are a
pretty inept reinvention of CORBA.  I was pretty much slain by this
take:

http://wanderingbarque.com/nonintersecting/2006/11/15/the-s-stands-for-simple

I think Duncan Grisby of OmniORB put it most succinctly when he pointed
out that SOAP and friends are more complex, more bloated, and less
interoperable than CORBA ever was.  But they use XML so they get the
teacher's pet treatment.


 I liked XML when I could read it and hack it out in Notepad.

You still can, and don't let anyone tell you otherwise.  I've always
argued that XML doesn't work unless it's Notepad-hackable.  I do
usually allow an exception for SVG.

 I like
 attributes, which puts me on the outs with most XML zealots who forswear the
 use of attributes on purely academic grounds (they defeat the future
 possible expansion of an attribute's value into more complex substructure).

Really?  Do you have any references for this?  I haven't seen much
criticism of attributes since the very early days, and almost all XML
technologies make heavy use of attributes.  Here's my take:

http://www.ibm.com/developerworks/xml/library/x-eleatt.html

As you can see, elements and attributes get equal billing.

 I dislike namespaces, especially the default xmlns kind, as they make me
 take extra steps when retrieving nodes via Xpaths; and everyone seems to
 think their application needs namespaces, when there is no threat that these
 tags will ever get mixed up with anyone else's.

Namespaces are possibly the worst thing to have ever happened to XML.
Again, my take:

http://www.ibm.com/developerworks/xml/library/x-namcar.html

And yes, default namespaces are about 50% of the problem with
namespace.  QNames in content (which are of course an abuse of
namespaces) are almost all of the other 50%.  I call them hidden
namespaces:

http://copia.ogbuji.net/blog/2006-08-14/Some_thoug
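
For readers who have not hit this themselves, here is a minimal sketch of the "extra steps" a default namespace forces on path lookups. It uses the standard-library xml.etree.ElementTree on a current Python 3, not any tool named above, and the urn:example:feed namespace and the element names are invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical document: the default xmlns applies to every element in it.
doc = ET.fromstring(
    '<feed xmlns="urn:example:feed"><entry>one</entry></feed>'
)

# The "obvious" lookup finds nothing, because the element's real name
# is the pair (namespace URI, local name), not just "entry".
plain = doc.find("entry")

# The extra step: bind a prefix of our own choosing to the URI and use it.
ns = {"f": "urn:example:feed"}
qualified = doc.find("f:entry", ns)

print(plain)           # None
print(qualified.text)  # one
```

The prefix "f" exists only in the query; the document itself never uses it, which is part of why default namespaces surprise people.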

--
Uche Ogbuji   Fourthought, Inc.
http://uche.ogbuji.net   http://fourthought.com
http://copia.ogbuji.net   http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/



Re: lxml/ElementTree and .tail

2006-11-19 Thread Diez B. Roggisch
 You'll be surprised at how many XMLers agree that Web services are a
 pretty inept reinvention of CORBA.  I was pretty much slain by this
 take:
 
 http://wanderingbarque.com/nonintersecting/2006/11/15/the-s-stands-for-simple

Thanks for that! Sums up nicely my experiences, and gave me a good chuckle!

While I liked the idea of AXIS reflecting my java code in the first
place (as long as interoperability only meant I can test my own code),
it sucked so hard when trying to make it work with anything else
(including python of course).

And I don't know why, having complained about this style of inverse
interface generation on so many other occasions (e.g. COM interfaces in
VStudio, JBuilder GUI design and so on), I could never quite put my
finger on what disturbed me about SOAP.

Probably because looking at a WSDL immediately made me shrink away
from that mess and hope that there must be _some_ merciful deity that
will produce that crap for me, so that I never asked myself the right
questions.

Diez


Re: lxml/ElementTree and .tail

2006-11-19 Thread Fredrik Lundh
Uche Ogbuji wrote:

  The fact that the XML Infoset is hardly used outside W3C XML Schema,
  and that the XPath data model is far more common, and that focus on
  the serialization is even more common than that is a matter of
  everyday practicality.

everyday interoperability problems, that is.  yesterday, someone 
reported a bug in Python's xml.dom because he couldn't get it to 
serialize the string "&nbsp;" as "&nbsp;".  earlier today, someone
asked how to work around an XML parser that didn't understand
namespace prefixes.

 And oh by the way, this thread is all about *your* customer's
 complaining.

from what I can tell, it was *your* customer posting FUD about a 
different library, not my customer asking for help with a specific 
problem.  this is free software; people who use a piece of software 
count a *lot* more than people who don't want to use it.

  This is not one of those times, so this is probably where I get off.

I'll be looking forward to your next O'Reilly article.

/F



Re: lxml/ElementTree and .tail

2006-11-19 Thread Chas Emerick
On Nov 19, 2006, at 9:55 AM, Fredrik Lundh wrote:

 And oh by the way, this thread is all about *your* customer's
 complaining.

 from what I can tell, it was *your* customer posting FUD about a
 different library, not my customer asking for help with a specific
 problem.  this is free software; people who use a piece of software
 count a *lot* more than people who don't want to use it.

Holy hell Fredrik -- I hadn't even *downloaded* 4suite before I  
posted my original question.  I've tried to be nice, tried to be  
complimentary, and tried to be diplomatic, so it would be nice if  
*everyone* would stop casting aspersions or otherwise speculating  
about my intentions.  Flame amongst yourselves, but leave me out of it.

- Chas


Re: lxml/ElementTree and .tail

2006-11-19 Thread Fredrik Lundh
Uche Ogbuji wrote:

 The fact that the XML Infoset is hardly used outside W3C XML Schema,
 and that the XPath data model is far more common,

and for the bystanders, it should be noted that the Infoset is pretty 
much the same thing as the XPath data model; it's mostly just that the 
specifications use different names for the same concept.  if you cut 
through the vocabulary, it's all about a tree of elements, plus text and 
attributes and a few more (but usually less interesting) things.  it's a 
bit like arguing that

    class Person(object):
        __slots__ = ["name"]
        def __init__(self, name):
            self.name = name

and

    class Employee:
        def __init__(self, first_name, last_name):
            self.full_name = first_name + " " + last_name

and

    employee_name = ...

are entirely different things, and not just three more or less
convenient ways to store exactly the same information.

/F



Re: lxml/ElementTree and .tail

2006-11-19 Thread Damjan
 sure, the computing world is and has always been full of people who want
 the simplest thing to look a lot harder than it actually is.  after all,
 *they* spent lots of time reading all the specifications, they've bought
 all the books, and went to all the seminars, 

and have been sold all the expensive proprietary tools

 so it's simply not fair when others are cheating.

-- 
damjan


Re: lxml/ElementTree and .tail

2006-11-18 Thread Fredrik Lundh
Uche Ogbuji wrote:

 I certainly have never liked the aspects of the ElementTree API under
 present discussion.  But that's not as important as the fact that I
 think the above statement is misleading.  There has always been a
 battle in XML between the people who think the serialization is
 preeminent, and those who believe some data model is preeminent, but
 the reality is that XML 1.0 (and 1.1) is a spec *defined* by its
 serialization.

sure, the computing world is and has always been full of people who want 
the simplest thing to look a lot harder than it actually is.  after all, 
*they* spent lots of time reading all the specifications, they've bought 
all the books, and went to all the seminars, so it's simply not fair 
when others are cheating.

in reality, *all* interchange formats are easier to understand and use 
if you focus on a (complete or intentionally simplified) data model of 
the things being interchanged, and treat various artifacts of the 
byte-stream used by the wire format as artifacts, historical accidents 
based on what specification happened to be written before the other, or 
what some guy did or did not do in the seventies, as accidents, and 
esoteric arcana disseminated on limited-distribution mailing lists as 
about as relevant for your customer as last week's episode of American Idol.

(XML is a bit unusual in this respect, but that's probably just some 
variation of the bikeshed effect.  it's just text, and everyone with
a keyboard knows what that is, so we don't need to use established 
software engineering practices, or think about security *at all* 
(Billion laughs? XXE?) or, for that matter, learn from people who've
been doing data interchange in other domains since the dawn of time. 
and when they do appear anyway, and mess with our technology in ways 
that we haven't authorized, without reading our books or going to our 
seminars or subscribing to our mailing lists, we can write them off as 
clueless muppet teenage genius code-jockeys, and keep patting ourselves
on the back, while the rest of the world is busy routing around
us, switching to well-understood XML subsets or other serialization 
formats, simpler and more flexible data models, simpler API:s, and
more robust code.  and Python ;-)

/F



Re: lxml/ElementTree and .tail

2006-11-18 Thread Paul McGuire
Fredrik Lundh wrote:

 (XML is a bit unusual in this respect, but that's probably just some 
 variation of the bikeshed effect.  it's just text, and everyone with
 a keyboard knows what that is, so we don't need to use established 
 software engineering practices, or think about security *at all* (Billion 
 laughs? XXE?) or, for that matter, learn from people who've
 been doing data interchange in other domains since the dawn of time. and 
 when they do appear anyway, and mess with our technology in ways that we 
 haven't authorized, without reading our books or going to our seminars or 
 subscribing to our mailing lists, we can write them off as clueless 
 muppet teenage genius code-jockeys, and keep patting ourselves on the
 back, while the rest of the world is busy routing around us, switching to 
 well-understood XML subsets or other serialization formats, simpler and 
 more flexible data models, simpler API:s, and
 more robust code.  and Python ;-)


maybe time to switch to decaf... :)





Re: lxml/ElementTree and .tail

2006-11-18 Thread Fredrik Lundh
Paul McGuire wrote:

 maybe time to switch to decaf... :)

do you disagree with my characterization of the state of the XML universe?

/F



Re: lxml/ElementTree and .tail

2006-11-18 Thread Paul McGuire
Fredrik Lundh wrote:
 Paul McGuire wrote:

 maybe time to switch to decaf... :)

 do you disagree with my characterization of the state of the XML universe?

 /F

Thankfully, I'm largely on the periphery of that universe (except for being 
a sometimes victim).  But it is certainly frustrating to see many of the OMG 
concepts of the 90's reimplemented in Java services, and then again in 
XML/SOAP, with no detectable awareness that these messaging and 
serialization problems have been considered before, and much more 
thoroughly.

I liked XML when I could read it and hack it out in Notepad.  I like 
attributes, which puts me on the outs with most XML zealots who forswear the 
use of attributes on purely academic grounds (they defeat the future 
possible expansion of an attribute's value into more complex substructure). 
I dislike namespaces, especially the default xmlns kind, as they make me 
take extra steps when retrieving nodes via Xpaths; and everyone seems to 
think their application needs namespaces, when there is no threat that these 
tags will ever get mixed up with anyone else's.

No, I was mostly amused (which I thought was your intent, given the trailing 
smiley) at your breathless, quasi-rant against the XML milieu in general - I 
think your one sentence went on for about 15 lines!

-- Paul 




Re: lxml/ElementTree and .tail

2006-11-18 Thread Chas Emerick

On Nov 18, 2006, at 5:09 AM, Fredrik Lundh wrote:

 Uche Ogbuji wrote:

 I certainly have never liked the aspects of the ElementTree API under
 present discussion.  But that's not as important as the fact that I
 think the above statement is misleading.  There has always been a
 battle in XML between the people who think the serialization is
 preeminent, and those who believe some data model is preeminent, but
 the reality is that XML 1.0 (and 1.1) is a spec *defined* by its
 serialization.

 sure, the computing world is and has always been full of people who  
 want
 the simplest thing to look a lot harder than it actually is.  after  
 all,
 *they* spent lots of time reading all the specifications, they've  
 bought
 all the books, and went to all the seminars, so it's simply not fair
 when others are cheating.

[snip]

 and keep patting ourselves
 on the back, while the rest of the world is busy routing around
 us, switching to well-understood XML subsets or other serialization
 formats, simpler and more flexible data models, simpler API:s, and
 more robust code.  and Python ;-)

That's flatly unrealistic.  If you'll remember, I'm not one of those  
people that are specification-driven -- I hadn't even *heard* of  
Infoset until earlier this week!  However, I am driven to ensure that  
the code I (and we) write works *as others expect* when confronted by  
any of the billions of XML documents out there.  Simpler is better,  
and better is better (thus why I am in python-land), unless that  
simplicity makes it difficult to play nicely with others.  Shrugging  
off the way everyone else does things reminds me of various CSS  
fanatics I know of that simply won't use tables or IE CSS  
compatibility hacks, even if that's what's needed to get things to work.

I've never been involved in any XML battles, but to Uche's point, I  
would speculate (only on the basis of personal interactions and  
anecdotes) that some overwhelming majority of the developers out  
there care for nothing but the serialization, simply because that's  
how one plays nicely with others.  I would count myself in that group  
as well, although I do recognize that there is a worthy academic  
exercise in exploring the data-model-centric XML worldview.

OT: Uche, 4suite XML is tops!  Thank you very much for that.

- Chas


Re: lxml/ElementTree and .tail

2006-11-18 Thread Fredrik Lundh
Chas Emerick wrote:

 and keep patting ourselves
 on the back, while the rest of the world is busy routing around
 us, switching to well-understood XML subsets or other serialization
 formats, simpler and more flexible data models, simpler API:s, and
 more robust code.  and Python ;-)
 
 That's flatly unrealistic.  If you'll remember, I'm not one of those  
 people that are specification-driven -- I hadn't even *heard* of  
 Infoset until earlier this week!

The rant wasn't directed at you or anyone special, but I don't really 
think you got the point of it either.  Which is a bit strange, because 
it sounded like you *were* working on extracting information from messy 
documents, so the "it's about the data, dammit" way of thinking
shouldn't be news to you.

And the "routing around" is not unrealistic, it is a *fact*; JSON and
POX are killing the full XML/Schema/SOAP stack for communication, XHTML 
is pretty much dead as a wire format, people are apologizing in public 
for their use of SOAP, AJAX is quickly turning into AJAJ, few people 
care about the more obscure details of the XML 1.0 standard (when did 
you last see a conditional section? or even a DTD?), dealing with huge 
XML data sets is still extremely hard compared to just uploading the 
darn thing to a database and doing the crunching in SQL, and nobody uses 
XML 1.1 for anything.

Practicality beats purity, and the Internet routes around damage, every 
single time.

  overwhelming majority of the developers out there care for nothing
  but the serialization, simply because that's how one plays nicely
  with others.

The problem is if you only stare at the serialization, your code *won't* 
play nicely with others.  At the serialization level, it's easy to think 
that CDATA sections are different from other text, that character 
references are different from ordinary characters, that you should 
somehow be able to distinguish between <tag></tag> and <tag/>, that
namespace prefixes are more important than the namespace URI, that an
&nbsp; in an XHTML-style stream is different from a U+00A0 character in
memory, and so on.  In my experience, serialization-only thinking (at 
the receiving end) is the single most common cause for interoperability 
problems when it comes to general XML interchange.

But when you focus on the data model, and treat the serialization as an 
implementation detail, to be addressed by a library written by someone 
who's actually read the specifications a few more times than you have, 
all those problems tend to just go away.  Things just work.
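
A minimal sketch of that claim, using the standard-library xml.etree.ElementTree on a current Python 3. The documents and the urn:example namespace are invented for illustration. Each pair differs only in serialization, and the parser hands back the same data either way.

```python
import xml.etree.ElementTree as ET

# Two serializations of the same content: an empty element written both
# ways, and text using character references vs. CDATA plus a raw U+00A0.
variants = [
    "<doc><tag></tag><t>a &lt; b&#160;c</t></doc>",
    "<doc><tag/><t><![CDATA[a < b]]>\u00a0c</t></doc>",
]
parsed = [ET.fromstring(v) for v in variants]
texts = [p.find("t").text for p in parsed]
print(texts[0] == texts[1])  # True

# Different prefixes (or none at all) bound to the same URI give the same
# element name once parsed.
ns_a = ET.fromstring('<a:item xmlns:a="urn:example"/>')
ns_b = ET.fromstring('<item xmlns="urn:example"/>')
print(ns_a.tag == ns_b.tag)  # True
```

Code that compares the parsed values never needs to know which spelling the sender happened to use.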

And in practice, of course, most software engineers understand this, and 
care about this.  After all, good software engineering is about 
abstractions and decoupling and designing things so you can focus on one 
part of the problem at a time.  And about making your customer happy, 
and having fun while doing that.  Not staying up all night to look for 
an obscure interoperability problem that you finally discover is caused 
by someone using a CDATA section where you expected a character 
reference, in 0.1% of all production records, but in none of the files 
in your test data set.

(By the way, did ET fail to *read* your XML documents?  I thought your 
complaint was that it didn't put the things it read in a place where you 
expected them to be, and that you didn't have time to learn how to deal 
with that because you had more important things to do, at the time?)

/F



Re: lxml/ElementTree and .tail

2006-11-18 Thread Chas Emerick

On Nov 18, 2006, at 11:29 AM, Fredrik Lundh wrote:

 Chas Emerick wrote:

 and keep patting ourselves
 on the back, while the rest of the world is busy routing around
 us, switching to well-understood XML subsets or other serialization
 formats, simpler and more flexible data models, simpler API:s, and
 more robust code.  and Python ;-)

 That's flatly unrealistic.  If you'll remember, I'm not one of those
 people that are specification-driven -- I hadn't even *heard* of
 Infoset until earlier this week!

 The rant wasn't directed at you or anyone special, but I don't really
 think you got the point of it either.  Which is a bit strange, because
 it sounded like you *were* working on extracting information from
 messy documents, so the "it's about the data, dammit" way of thinking
 shouldn't be news to you.

No, it's not any kind of news at all, and I'm very sympathetic to  
your specific perspective (and have advocated it in other contexts  
and circumstances, where appropriate).  And yes, we are in fact  
ensuring that we get from the HTML/XHTML/text/PDF/etc serialization  
we have to consume to a uniform, normalized, and clean data model  
in as few steps as possible.  However, in those few steps, we have to  
recognize the functional reality of how each data representation is  
used out in the world in order to translate it into a uniform model  
for our own purposes.  In concrete terms, that means that an end tag  
in an XHTML serialization means that that element is closed, done,  
finit.  Any other representation of that serialization doesn't  
correspond properly with the intent of that HTML document's author.

 And the "routing around" is not unrealistic, it is a *fact*; JSON and
 POX are killing the full XML/Schema/SOAP stack for communication,  
 XHTML
 is pretty much dead as a wire format, people are apologizing in public
 for their use of SOAP, AJAX is quickly turning into AJAJ, few people
 care about the more obscure details of the XML 1.0 standard (when did
 you last see a conditional section? or even a DTD?), dealing with huge
 XML data sets is still extremely hard compared to just uploading the
 darn thing to a database and doing the crunching in SQL, and nobody  
 uses
 XML 1.1 for anything.

 Practicality beats purity, and the Internet routes around damage,  
 every
 single time.

I agree 100% -- but I would have thought that that's a point I would  
have made.  The model that ET uses seems like a purified  
representation of a mixed-content serialization, exactly because it  
is geared to an ideal rather than the practical realities of mixed  
content and expectations thereof.

For what it's worth, our current effort is directed towards providing  
significant stores/feeds of XML/PDF/HTML/text/etc in something that  
can be dropped into a RDBMS.  Perhaps that's the source of the  
impedance between us: you view Infoset as a functional replacement  
for serialization-dependent XML, whereas we are focussed on what  
could be broadly described as a translation from one to the other.

 overwhelming majority of the developers out there care for nothing
 but the serialization, simply because that's how one plays nicely
 with others.

 The problem is if you only stare at the serialization, your code  
 *won't*
 play nicely with others.  At the serialization level, it's easy to  
 think
 that CDATA sections are different from other text, that character
 references are different from ordinary characters, that you should
 somehow be able to distinguish between <tag></tag> and <tag/>, that
 namespace prefixes are more important than the namespace URI, that an
 &nbsp; in an XHTML-style stream is different from a U+00A0
 character in
 memory, and so on.  In my experience, serialization-only thinking (at
 the receiving end) is the single most common cause for  
 interoperability
 problems when it comes to general XML interchange.

I agree with all of that.  I would again refer to the pervasive view  
of what end tags mean -- that's what I was primarily referring to  
with the term 'serialization'.

 (By the way, did ET fail to *read* your XML documents?  I thought your
 complaint was that it didn't put the things it read in a place  
 where you
 expected them to be, and that you didn't have time to learn how to  
 deal
 with that because you had more important things to do, at the time?)

No, it doesn't put things in the right places, so I consider that a  
failure of the model.  I don't see why I should have spent time  
learning how to deal with that when another very comprehensive  
library is available that does meet expectations.  *shrug*

Further, the fact that ET/lxml works the way that it does makes me  
think that there may be some other landmines in the underlying model  
that we might not have discovered until some days, weeks, etc., had  
passed, so there's a much greater comfort level in working with a  
library that explicitly supports the model that we expect (and was  
assumed when 

Re: lxml/ElementTree and .tail

2006-11-18 Thread Fredrik Lundh
Chas Emerick wrote:

 Further, the fact that ET/lxml works the way that it does makes me  
 think that there may be some other landmines in the underlying model  
 that we might not have discovered until some days, weeks, etc., had  
 passed

so the real reason you posted your original post was to spread some FUD, 
not to get help?  that's a bit disappointing.

/F



Re: lxml/ElementTree and .tail

2006-11-18 Thread Chas Emerick
On Nov 18, 2006, at 1:12 PM, Fredrik Lundh wrote:

 Chas Emerick wrote:

 Further, the fact that ET/lxml works the way that it does makes me
 think that there may be some other landmines in the underlying model
 that we might not have discovered until some days, weeks, etc., had
 passed

 so the real reason you posted your original post was to spread some  
 FUD,
 not to get help?  that's a bit disappointing.

<sarcasm>
Yeah, that's exactly it.  In fact, if you look back at the head of
this thread, you'll see how I was looking to disparage ET.  I
especially wanted to make sure ET's API doesn't get any traction in
the python community.  It's especially important that ET not find
popular success and acclaim -- I'd have quite a bit to gain from it
remaining a niche library.
</sarcasm>

Fredrik, I wasn't attempting to spread anything.  I was confused, I  
posed some illustrative examples, and asked for people's thoughts.   
Your reply gave me the right vocabulary to find more information  
(i.e. about Infoset), and I replied with an overview of what I had
learned so as to benefit anyone with similar questions or confusion  
in the future.  A discussion ensued.

ET (and lxml) is obviously extremely successful, widely used, and for  
good reason.  It's just not right for us, but you incorrectly  
surmised that I was simply lazy by not modifying/extending ET/lxml to  
make it suitable for our purposes even when other libraries existed  
that better meshed with our requirements.  I tried to answer as  
straightforwardly as possible, and (regrettably, it turns out)  
included the fact that I had worried that our apparent conceptual  
differences indicated that we might find other instances where
ET/lxml works differently than we would expect.  I think that's very
rational, and doesn't speak poorly of ET in any way (especially given  
its obvious success elsewhere).

- Chas


Re: lxml/ElementTree and .tail

2006-11-17 Thread Uche Ogbuji
Fredrik Lundh wrote:
 Chas Emerick wrote:
  If I'm wrong, just chalk it up to the fact that this is the first
  time I've ever looked at the Infoset spec, and I'm simply confused.

 the Infoset spec *is* the essence of XML; if you don't realize that an
 XML document is just a serialization of a very simple data model, you're
 bound to be fighting with XML all the time.

I certainly have never liked the aspects of the ElementTree API under
present discussion.  But that's not as important as the fact that I
think the above statement is misleading.  There has always been a
battle in XML between the people who think the serialization is
preeminent, and those who believe some data model is preeminent, but
the reality is that XML 1.0 (and 1.1) is a spec *defined* by its
serialization.  Infoset is a secondary and optional spec.  In fact, I
think it's clear that Infoset is not even the preeminent *data model*
of the XML world.  That distinction goes to the XPath data model, which
is quite different from the Infoset.

--
Uche Ogbuji   Fourthought, Inc.
http://uche.ogbuji.net   http://fourthought.com
http://copia.ogbuji.net   http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/



Re: lxml/ElementTree and .tail

2006-11-16 Thread Fredrik Lundh
Stefan Behnel wrote:

 If you want to copy part of the removed element back into the tree, feel free
 to do so.

and that can of course be done with a short helper function.

when removing elements from trees, I often set the tag for those 
elements to some garbage value during processing, and then call 
something like

http://effbot.org/zone/element-bits-and-pieces.htm#cleanup

to clean things up before serializing the tree.
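
As a rough sketch of what such a short helper could look like (this is not the cleanup function at the link above; remove_keep_tail is a hypothetical name, and the code assumes the standard-library xml.etree.ElementTree on a current Python 3):

```python
import xml.etree.ElementTree as ET

def remove_keep_tail(parent, child):
    """Remove child from parent, but keep the text that followed it."""
    index = list(parent).index(child)
    if child.tail:
        if index == 0:
            # No previous sibling: the tail becomes part of parent.text.
            parent.text = (parent.text or "") + child.tail
        else:
            # Otherwise it extends the previous sibling's tail.
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + child.tail
    parent.remove(child)

doc = ET.fromstring("<p>Hello <b>bold</b> world</p>")
remove_keep_tail(doc, doc.find("b"))
result = ET.tostring(doc, encoding="unicode")
print(result)
```

Whether the tail should go to the parent's text or to the preceding sibling depends only on where the removed element sat; no text is invented or dropped.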

/F



Re: lxml/ElementTree and .tail

2006-11-16 Thread Paul Boddie
Stefan Behnel wrote:


[Remove an element, remove following nodes]

 Yes, it is. Just look at the API. It's an attribute of an Element, isn't it?
 What other API do you know where removing an element from a data structure
 leaves part of the element behind?

I guess it depends on what you regard an element to be...

[...]

 IMHO, DOM has a pretty significant mismatch with Python.

...in the DOM or otherwise:

http://www.w3.org/TR/2006/REC-xml-20060816/#sec-logical-struct

Paul



Re: lxml/ElementTree and .tail

2006-11-16 Thread Fredrik Lundh
Paul Boddie wrote:

 Yes, it is. Just look at the API. It's an attribute of an Element, isn't it?
 What other API do you know where removing an element from a data structure
 leaves part of the element behind?
 
 I guess it depends on what you regard an element to be...

Stefan said Element, not element.

Element is a class in the ElementTree module, which can be used to 
*represent* an XML element in an XML infoset, including all the data 
*inside* the XML element, and any data *between* that XML element and 
the next one (which is always character data, of course).

It's not very difficult, really; especially if you, as Stefan said, 
think in infoset terms rather than "a sequence of little piggies" terms.
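
A small illustration of that description, assuming the standard-library xml.etree.ElementTree on a current Python 3 (the sample document is made up):

```python
import xml.etree.ElementTree as ET

# Mixed content: text appears inside <b> and again after its end tag.
root = ET.fromstring("<p>Hello <b>bold</b> world</p>")
b = root.find("b")

print(repr(root.text))  # text before the first child element
print(repr(b.text))     # text *inside* the <b> element
print(repr(b.tail))     # text *between* </b> and the next tag

# Because the tail belongs to the Element object, removing the Element
# takes the trailing text with it.
root.remove(b)
serialized = ET.tostring(root, encoding="unicode")
print(serialized)
```

The last line shows the behaviour this thread is arguing about: " world" disappears along with the b element.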

/F



Re: lxml/ElementTree and .tail

2006-11-16 Thread Paul Boddie
Fredrik Lundh wrote:

 It's not very difficult, really; especially if you, as Stefan said,
 think in infoset terms rather than a sequence of little piggies terms.

Are piggies part of the infoset too? Does the Piggie class represent a
piggie from the infoset plus a stretch of the road to the market? ;-)

Paul



Re: lxml/ElementTree and .tail

2006-11-16 Thread Chas Emerick
Thanks for the comments and thoughts.  I must admit that I have an  
overwhelming feeling of having just stepped into the middle of a  
complex, heated conversation without having heard the preamble.

(FYI, this reply is only an attempt to help those that come  
afterwards -- I'm not looking to advocate much of anything here.)

Fredrik's invocation of the infoset term led me to a couple of  
quick searches that clarified the state of play.  Here he sets the  
stage for the .tail behaviour that I originally posted about:

http://effbot.org/zone/element-infoset.htm

And it looks like there have been tussles over other mismatches in  
expectations before, specifically around how namespaces are handled:

http://groups.google.com/group/comp.lang.python/browse_thread/thread/31b2e9f4a8f7338c
http://nixforums.org/ntopic43901.html

From what I can see, there are more than a few people that have  
stumbled with ElementTree's API because of their preexisting  
expectations, which others have probably correctly bucketed as  
implementation details.  This comes as quite a shock to those who  
have stumbled (including myself) who have, lo these many years, come  
to view those details as the only standard that matters (perhaps  
simply because those details have been so consistent in our experience).

Which, in my view, is just fine -- different strokes for different  
folks, and all that.  When I originally started poking around the  
python xml world, I was somewhat confused as to why 4suite/Domlette  
existed, as it seemed pretty clear that ElementTree had crystallized  
a lot of mindshare, and has a very attractive API to boot.   
Thankfully, I can now see its appeal, and am very glad it's around,  
as it seems to have all of those comfortable implementation details  
that I've been looking for. :-)

As for the infoset vs. sequence of piggies nut: if ElementTree's  
infoset approach is technically correct, then wouldn't it also be  
correct to use a .head attribute instead of a .tail attribute?  Example:

<a>first<b>middle</b>last</a>

might be represented as:

Element a: head='', text='last'
 Element b: head='first', text='middle'
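
For comparison, a sketch of how that hypothetical .head view could be derived from what ElementTree actually stores (.text and .tail). The function name head_view is invented here, and no such attribute exists in ElementTree; this only assumes the standard library's xml.etree.ElementTree.

```python
import xml.etree.ElementTree as ET


def head_view(elem, head=""):
    """Return (tag, head, text) rows for the hypothetical .head model.

    head: the character data immediately before the element's start tag.
    text: the character data immediately before the element's end tag.
    """
    children = list(elem)
    own_text = (children[-1].tail if children else elem.text) or ""
    rows = [(elem.tag, head, own_text)]
    preceding = elem.text or ""
    for child in children:
        rows.extend(head_view(child, preceding))
        preceding = child.tail or ""
    return rows


root = ET.fromstring("<a>first<b>middle</b>last</a>")
stored = (root.text, root[0].text, root[0].tail)  # what ElementTree keeps
rows = head_view(root)
```

Both layouts hold the same character data; they differ only in which neighbouring element each run of text is attached to.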

If I'm wrong, just chalk it up to the fact that this is the first  
time I've ever looked at the Infoset spec, and I'm simply confused.   
If that IS a technically-valid way to represent the above xml  
fragment . . . then I guess I'll make sure to tread more carefully in  
the future around tools that work in infoset terms.  For me, it turns  
out that sequences of piggies really are important, at least in  
contexts where XML is merely a means to an end (either because of the  
attractiveness of the toolsets or because we must cope with what  
we're provided as input) and where consistency with existing tools  
(like those that adhere to DOM level 2/3) and expectations are  
critical.  I think this is what Paul was nodding towards with his  
original response to Stefan's response.

Cheers,

- Chas

On Nov 16, 2006, at 5:11 AM, Fredrik Lundh wrote:

 Paul Boddie wrote:

 Yes, it is. Just look at the API. It's an attribute of an  
 Element, isn't it?
 What other API do you know where removing an element from a data  
 structure
 leaves part of the element behind?

 I guess it depends on what you regard an element to be...

 Stefan said Element, not element.

 Element is a class in the ElementTree module, which can be used to
 *represent* an XML element in an XML infoset, including all the data
 *inside* the XML element, and any data *between* that XML element and
 the next one (which is always character data, of course).

 It's not very difficult, really; especially if you, as Stefan said,
 think in infoset terms rather than a sequence of little piggies terms.

 /F


Re: lxml/ElementTree and .tail

2006-11-16 Thread Fredrik Lundh
Chas Emerick wrote:

  might be represented as:
 
  Element a: head='', text='last'
   Element b: head='first', text='middle'

sure, and you could use a text subtype instead that kept track of the 
elements above it, and let the elements be sequences of their siblings 
instead of their children, and perhaps stuff everything in a dictionary. 
  such a construct would also be able to hold the same data, and be very 
hard to use in most normal situations.

 If I'm wrong, just chalk it up to the fact that this is the first  
 time I've ever looked at the Infoset spec, and I'm simply confused.   

the Infoset spec *is* the essence of XML; if you don't realize that an 
XML document is just a serialization of a very simple data model, you're 
bound to be fighting with XML all the time.

but ET doesn't implement the Infoset spec as it is, of course: it uses a 
*simplified* model, carefully optimized for the large percentage of all 
XML formats that simply doesn't use mixed content.  if you're doing 
document-style processing, you sometimes need to add an extra assignment 
or two, but unless you're doing *only* document-style processing, ET's 
API gives you a net win.  (and even if you're doing only document-style 
processing, ET's speed and memory footprint gives you a net win over 
most competing technologies).

/F



Re: lxml/ElementTree and .tail

2006-11-16 Thread Chas Emerick
On Nov 16, 2006, at 7:25 AM, Fredrik Lundh wrote:

 If I'm wrong, just chalk it up to the fact that this is the first
 time I've ever looked at the Infoset spec, and I'm simply confused.

 the Infoset spec *is* the essence of XML; if you don't realize that an
 XML document is just a serialization of a very simple data model,  
 you're
 bound to be fighting with XML all the time.

The principle and the practice diverge significantly in our neck of  
the woods.  The current project involves consuming and making sense  
of extraordinarily (and typically unnecessarily) complex XHTML.  Of  
course, as you say, those documents are still serializations of a  
simple data model, but the types of manipulations we do happen to  
butt up very uncomfortably with the way ET does things.

 but ET doesn't implement the Infoset spec as it is, of course: it  
 uses a
 *simplified* model, carefully optimized for the large percentage of  
 all
 XML formats that simply doesn't use mixed content.  if you're doing
 document-style processing, you sometimes need to add an extra  
 assignment
 or two, but unless you're doing *only* document-style processing, ET's
 API gives you a net win.  (and even if you're doing only document- 
 style
 processing, ET's speed and memory footprint gives you a net win over
 most competing technologies).

Yeah, documents are all we do -- XML just happens to be a pleasant  
intermediate format, and something we need to consume.  The notion of  
nicely-formatted XML is entirely foreign to the work that we do --  
in fact, our current focus is (in part) dragging decidedly  
unstructured data out of those XHTML documents (among other source  
formats) and putting them into a reasonable, useful structure.

I took some time last night to bang out some functions that squeezed  
ET's model (via lxml) into doing what we need, and it ended up  
requiring a lot more B&D than I like.  At that point, I swung over to  
4suite, which dropped into place quite nicely.

*shrug* I guess we're just in the minority with regard to our API  
requirements -- we happen to live in the corner cases.  I'm certainly  
glad to have made the detour on a different path for a bit though.

- Chas


Re: lxml/ElementTree and .tail

2006-11-16 Thread Fredrik Lundh
Chas Emerick wrote:

 The principle and the practice diverge significantly in our neck of  
 the woods.  The current project involves consuming and making sense  
 of extraordinarily (and typically unnecessarily) complex XHTML.

wasn't your original complaint that ET didn't do the right thing when 
you removed elements from a mixed-content tree? (something that can be 
trivially handled with a 2-line helper function)
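
A helper of that kind might look roughly like the sketch below. It uses the standard library's xml.etree.ElementTree and a made-up name, remove_keep_tail; it is a little longer than two lines because it spells out both cases (first child versus later child).

```python
import xml.etree.ElementTree as ET


def remove_keep_tail(parent, child):
    """Remove `child` from `parent`, leaving the text that followed it."""
    if child.tail:
        index = list(parent).index(child)
        if index:
            # Append to the tail of the preceding sibling.
            previous = parent[index - 1]
            previous.tail = (previous.tail or "") + child.tail
        else:
            # No preceding sibling: the text joins the parent's own text.
            parent.text = (parent.text or "") + child.tail
    parent.remove(child)


plain = ET.fromstring("<a>head<b>inside</b>tail</a>")
plain.remove(plain[0])  # default behaviour: the tail goes with <b>
plain_result = ET.tostring(plain, encoding="unicode")

kept = ET.fromstring("<a>head<b>inside</b>tail</a>")
remove_keep_tail(kept, kept[0])  # DOM-like behaviour: the tail stays
kept_result = ET.tostring(kept, encoding="unicode")
```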

why mutate the tree if all you want is to extract information from it? 
doesn't sound very efficient to me...

/F



Re: lxml/ElementTree and .tail

2006-11-16 Thread Chas Emerick

On Nov 16, 2006, at 8:12 AM, Fredrik Lundh wrote:

 Chas Emerick wrote:

 The principle and the practice diverge significantly in our neck of
 the woods.  The current project involves consuming and making sense
 of extraordinarily (and typically unnecessarily) complex XHTML.

 wasn't your original complaint that ET didn't do the right thing  
 when
 you removed elements from a mixed-content tree? (something that can be
 trivially handled with a 2-line helper function)

Yes, that was the initial issue, but the delta between Elements and  
DOM-style elements leads to other issues.  There's no doubt that the  
needed helpers are simple, but all things being equal, not having to  
carry them around anywhere we're doing DOM manipulations is a big plus.

 why mutate the tree if all you want is to extract information from it?
 doesn't sound very efficient to me...

Because we're far from doing anything that is regular or one-off in  
nature.  We're systematizing the extraction of data from functionally  
unstructured content, and it's flatly necessary to normalize the  
XHTML into something that can be easily consumed by the processes  
we've built that can do that content-data extraction/conversion from  
plain text, XML, PDF, and now XHTML.

Remember, corner cases. :-)

- Chas


Re: lxml/ElementTree and .tail

2006-11-16 Thread Stefan Behnel
Fredrik Lundh wrote:
 Stefan Behnel wrote:
 
 If you want to copy part of the removed element back into the tree,
 feel free to do so.
 
 and that can of course be done with a short helper function.

Oh, and obviously with a custom Element class in lxml that does this
automatically for you behind the scenes.

http://codespeak.net/lxml/element_classes.html
http://codespeak.net/lxml/element_classes.html#default-class-lookup
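
The lxml class-lookup API is described at the links above and is not reproduced here. As a rough analogue of the same idea, the sketch below uses only the standard library's xml.etree.ElementTree: an Element subclass whose remove() keeps the tail text, installed through TreeBuilder's element_factory. The names TailKeepingElement and parse_keeping_tails are invented for this example, and this is the stdlib mechanism swapped in, not lxml's.

```python
import xml.etree.ElementTree as ET


class TailKeepingElement(ET.Element):
    """Element whose remove() leaves the removed child's tail in the tree."""

    def remove(self, child):
        if child.tail:
            index = list(self).index(child)
            if index:
                previous = self[index - 1]
                previous.tail = (previous.tail or "") + child.tail
            else:
                self.text = (self.text or "") + child.tail
            child.tail = None  # avoid duplicating the text on re-insertion
        super().remove(child)


def parse_keeping_tails(xml_text):
    """Parse a string so that every element is a TailKeepingElement."""
    builder = ET.TreeBuilder(element_factory=TailKeepingElement)
    parser = ET.XMLParser(target=builder)
    parser.feed(xml_text)
    return parser.close()


frag = parse_keeping_tails("<a>head<b>inside</b>tail</a>")
frag.remove(frag[0])
result = ET.tostring(frag, encoding="unicode")
```

With the behaviour living in the class, calling code can use the ordinary remove() and never has to carry a separate helper around.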

Stefan


Re: lxml/ElementTree and .tail

2006-11-16 Thread Stefan Behnel
Chas Emerick wrote:
 the delta between Elements and DOM-style elements leads to other issues.
 There's no doubt that the needed helpers are simple, but all things being
 equal, not having to carry them around anywhere we're doing DOM
 manipulations is a big plus.
 
 Because we're far from doing anything that is regular or one-off in nature.
 We're systematizing the extraction of data from functionally unstructured
 content, and it's flatly necessary to normalize the XHTML into something
 that can be easily consumed by the processes we've built that can do that
 content-data extraction/conversion from plain text, XML, PDF, and now
 XHTML.
 
 Remember, corner cases. :-)

Hmm, then I really don't get why you didn't just write a customised XHTML API
on top of lxml's custom Element classes feature. Hiding XML language specific
behaviour directly in the Element classes really helps in getting your code
clean, especially in larger code bases.

Stefan


Re: lxml/ElementTree and .tail

2006-11-16 Thread Fredrik Lundh
Paul Boddie wrote:

 It's not very difficult, really; especially if you, as Stefan said,
 think in infoset terms rather than a sequence of little piggies terms.

 Are piggies part of the infoset too? Does the Piggie class represent a
 piggie from the infoset plus a stretch of the road to the market? ;-)

no, they just appear in serialized XML.  if you want concrete piggies, you have
to wrap ET's iterparse function, or perhaps the XMLParser class.
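
One way to get such a flat sequence is sketched below with the XMLParser route: a custom parser target that records start tags, text runs and end tags in document order. This assumes the standard library's xml.etree.ElementTree; the names PiggyCollector and flat_events are made up for illustration.

```python
import xml.etree.ElementTree as ET


class PiggyCollector:
    """Parser target that records a flat list of markup and text events."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        # The parser may deliver one text run in several chunks; merge them.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + text)
        else:
            self.events.append(("text", text))

    def close(self):
        return self.events


def flat_events(xml_text):
    parser = ET.XMLParser(target=PiggyCollector())
    parser.feed(xml_text)
    return parser.close()


events = flat_events("<a>head<b>inside</b>tail</a>")
```

In this view the text after the end of b is simply the next item in the sequence and belongs to no element in particular.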

/F 





Re: lxml/ElementTree and .tail

2006-11-15 Thread Stefan Behnel
Hi,

Chas Emerick wrote:
 I looked around for an ElementTree-specific mailing list, but found none
 -- my apologies if this is too broad a forum for this question.

The lxml mailing list is always happy to receive feedback, but it's fine to
ask here if it's not lxml specific.


 I've been using the lxml variant of the ElementTree API.
 it shares the use of a .tail attribute.  I
 ran headlong into this aspect of the API while doing some DOM
 manipulations, and it's got me pretty confused.
 
 Example:
 
 >>> from lxml import etree as ET
 >>> frag = ET.XML('<a>head<b>inside</b>tail</a>')
 >>> b = frag.xpath('//b')[0]
 >>> b
 <Element b at 71cbe8>
 >>> b.text
 'inside'
 >>> b.tail
 'tail'
 >>> frag.remove(b)
 >>> ET.tostring(frag)
 '<a>head</a>'
 
 As you can see, the .tail text is removed as part of the b element --
 but it IS NOT part of the b element.

Yes, it is. Just look at the API. It's an attribute of an Element, isn't it?
What other API do you know where removing an element from a data structure
leaves part of the element behind?

If you want to copy part of the removed element back into the tree, feel free
to do so.


 Performing the same operations with the Java DOM api
 (Sorry for the Java comparison, but that's where I first cut my teeth on
 XML, and that's where my expectations were formed.)
 
 That's a pretty significant mismatch in functionality.

IMHO, DOM has a pretty significant mismatch with Python.


 I ran this issue past a few people I know who've worked with and written
 about ElementTree, and their response to this apparent divergence
 between the ET DOM API and standard DOM APIs was roughly: that's just
 the way it is.

It's just a matter of understanding (or getting used to) the API. You might
want to stop thinking in terms of '<' and '>' and rather embrace the API
itself as a way to work with the XML Infoset (rather than the XML DOM).

Stefan