Re: lxml/ElementTree and .tail
Fredrik Lundh wrote:
> Uche Ogbuji wrote:
>> I certainly have never liked the aspects of the ElementTree API under present discussion. But that's not as important as the fact that I think the above statement is misleading. There has always been a battle in XML between the people who think the serialization is preeminent, and those who believe some data model is preeminent, but the reality is that XML 1.0 (and 1.1) is a spec *defined* by its serialization.
>
> sure, the computing world is and has always been full of people who want the simplest thing to look a lot harder than it actually is. after all, *they* spent lots of time reading all the specifications, they've bought all the books, and went to all the seminars, so it's simply not fair when others are cheating.

You sound bitter about something. Don't worry, it's really not all that serious.

> in reality, *all* interchange formats are easier to understand and use if you focus on a (complete or intentionally simplified) data model of the things being interchanged, and treat various artifacts of the byte-stream used by the wire format as artifacts, historical accidents based on what specification happened to be written before the other, or what some guy did or did not do in the seventies, as accidents, and esoteric arcana disseminated on limited-distribution mailing lists as about as relevant for your customer as last week's episode of American Idol.

The fact that the XML Infoset is hardly used outside W3C XML Schema, and that the XPath data model is far more common, and that focus on the serialization is even more common than that is a matter of everyday practicality.

And oh by the way, this thread is all about *your* customer's complaining. And your response is to give them your philosophical take on XML. Doesn't that contradict what you're saying above?

Oh never mind. You posted something misleading, and I posted another point of view. I know you're incapable of any disagreement that doesn't devolve into a full-scale flame-war. Sometimes I have time for that sort of thing. This is not one of those times, so this is probably where I get off.

--
Uche Ogbuji
Fourthought, Inc.
http://uche.ogbuji.net    http://fourthought.com
http://copia.ogbuji.net   http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/
--
http://mail.python.org/mailman/listinfo/python-list
Re: lxml/ElementTree and .tail
Paul McGuire wrote:
> Thankfully, I'm largely on the periphery of that universe (except for being a sometimes victim). But it is certainly frustrating to see many of the OMG concepts of the 90's reimplemented in Java services, and then again in XML/SOAP, with no detectable awareness that these messaging and serialization problems have been considered before, and much more thoroughly.

You'll be surprised at how many XMLers agree that Web services are a pretty inept reinvention of CORBA. I was pretty much slain by this take:

http://wanderingbarque.com/nonintersecting/2006/11/15/the-s-stands-for-simple

I think Duncan Grisby of OmniORB put it most succinctly when he pointed out that SOAP and friends are more complex, more bloated, and less interoperable than CORBA ever was. But they use XML so they get the teacher's pet treatment.

> I liked XML when I could read it and hack it out in Notepad.

You still can, and don't let anyone tell you otherwise. I've always argued that XML doesn't work unless it's Notepad-hackable. I do usually allow an exception for SVG.

> I like attributes, which puts me on the outs with most XML zealots who forswear the use of attributes on purely academic grounds (they defeat the future possible expansion of an attribute's value into more complex substructure).

Really? Do you have any references for this? I haven't seen much criticism of attributes since the very early days, and almost all XML technologies make heavy use of attributes. Here's my take:

http://www.ibm.com/developerworks/xml/library/x-eleatt.html

As you can see, elements and attributes get equal billing.

> I dislike namespaces, especially the default xmlns kind, as they make me take extra steps when retrieving nodes via Xpaths; and everyone seems to think their application needs namespaces, when there is no threat that these tags will ever get mixed up with anyone else's.

Namespaces are possibly the worst thing to have ever happened to XML. Again, my take:

http://www.ibm.com/developerworks/xml/library/x-namcar.html

And yes, default namespaces are about 50% of the problem with namespaces. QNames in content (which are of course an abuse of namespaces) are almost all of the other 50%. I call them hidden namespaces:

http://copia.ogbuji.net/blog/2006-08-14/Some_thoug

--
Uche Ogbuji
Fourthought, Inc.
http://uche.ogbuji.net    http://fourthought.com
http://copia.ogbuji.net   http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/
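For readers wondering what the "extra steps" with default namespaces look like in practice, here is a minimal sketch using the standard library's xml.etree.ElementTree. The `urn:example:orders` URI, the element names and the `o` prefix are all invented for illustration; nothing here comes from the posts above.

```python
import xml.etree.ElementTree as ET

# Hypothetical document using a default namespace (the URI is invented).
doc = ET.fromstring(
    '<order xmlns="urn:example:orders"><item>widget</item></order>'
)

# A bare name finds nothing: the element's real name includes the namespace.
bare = doc.find("item")

# Extra step, variant 1: map a prefix of your own choosing to the URI.
ns = {"o": "urn:example:orders"}
prefixed = doc.find("o:item", ns)

# Extra step, variant 2: spell out the URI in {uri}name form.
clark = doc.find("{urn:example:orders}item")

print(bare, prefixed.text, clark.text)
```

Note that the prefix used in the query need not match anything in the document; only the URI has to agree.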
Re: lxml/ElementTree and .tail
> You'll be surprised at how many XMLers agree that Web services are a pretty inept reinvention of CORBA. I was pretty much slain by this take:
>
> http://wanderingbarque.com/nonintersecting/2006/11/15/the-s-stands-for-simple

Thanks for that! Sums up nicely my experiences, and gave me a good chuckle!

While I liked the idea of AXIS reflecting my Java code in the first place (as long as interoperability only meant I can test my own code), it sucked so hard when trying to make it work with anything else (including Python of course). And I don't know why I've complained about this style of inverse interface generation on so many other occasions (e.g. COM interfaces in VStudio, JBuilder GUI design and so on), but could never quite put my finger on what disturbed me about SOAP.

Probably because looking at a WSDL immediately made me shrink away from that mess and hope that there must be _some_ merciful deity that will produce that crap for me, so that I never asked myself the right questions.

Diez
Re: lxml/ElementTree and .tail
Uche Ogbuji wrote:
> The fact that the XML Infoset is hardly used outside W3C XML Schema, and that the XPath data model is far more common, and that focus on the serialization is even more common than that is a matter of everyday practicality.

everyday interoperability problems, that is. yesterday, someone reported a bug in Python's xml.dom because he couldn't get it to serialize the string "&nbsp;" as "&nbsp;". earlier today, someone asked how to work around an XML parser that didn't understand namespace prefixes.

> And oh by the way, this thread is all about *your* customer's complaining.

from what I can tell, it was *your* customer posting FUD about a different library, not my customer asking for help with a specific problem. this is free software; people who use a piece of software count a *lot* more than people who don't want to use it.

> This is not one of those times, so this is probably where I get off.

I'll be looking forward to your next O'Reilly article.

/F
Re: lxml/ElementTree and .tail
On Nov 19, 2006, at 9:55 AM, Fredrik Lundh wrote:
>> And oh by the way, this thread is all about *your* customer's complaining.
>
> from what I can tell, it was *your* customer posting FUD about a different library, not my customer asking for help with a specific problem. this is free software; people who use a piece of software count a *lot* more than people who don't want to use it.

Holy hell Fredrik -- I hadn't even *downloaded* 4suite before I posted my original question.

I've tried to be nice, tried to be complimentary, and tried to be diplomatic, so it would be nice if *everyone* would stop casting aspersions or otherwise speculating about my intentions. Flame amongst yourselves, but leave me out of it.

- Chas
Re: lxml/ElementTree and .tail
Uche Ogbuji wrote:
> The fact that the XML Infoset is hardly used outside W3C XML Schema, and that the XPath data model is far more common, and

for the bystanders, it should be noted that the Infoset is pretty much the same thing as the XPath data model; it's mostly just that the specifications use different names for the same concept. if you cut through the vocabulary, it's all about a tree of elements, plus text and attributes and a few more (but usually less interesting) things.

it's a bit like arguing that

    class Person(object):
        __slots__ = ["name"]
        def __init__(self, name):
            self.name = name

and

    class Employee:
        def __init__(self, first_name, last_name):
            self.full_name = first_name + " " + last_name

and

    employee_name = ...

are entirely different things, and not just three more or less convenient ways to store exactly the same information.

/F
Re: lxml/ElementTree and .tail
sure, the computing world is and has always been full of people who want the simplest thing to look a lot harder than it actually is. after all, *they* spent lots of time reading all the specifications, they've bought all the books, and went to all the seminars, and have been sold all the expensive proprietary tools so it's simply not fair when others are cheating.

--
damjan
Re: lxml/ElementTree and .tail
Uche Ogbuji wrote:
> I certainly have never liked the aspects of the ElementTree API under present discussion. But that's not as important as the fact that I think the above statement is misleading. There has always been a battle in XML between the people who think the serialization is preeminent, and those who believe some data model is preeminent, but the reality is that XML 1.0 (and 1.1) is a spec *defined* by its serialization.

sure, the computing world is and has always been full of people who want the simplest thing to look a lot harder than it actually is. after all, *they* spent lots of time reading all the specifications, they've bought all the books, and went to all the seminars, so it's simply not fair when others are cheating.

in reality, *all* interchange formats are easier to understand and use if you focus on a (complete or intentionally simplified) data model of the things being interchanged, and treat various artifacts of the byte-stream used by the wire format as artifacts, historical accidents based on what specification happened to be written before the other, or what some guy did or did not do in the seventies, as accidents, and esoteric arcana disseminated on limited-distribution mailing lists as about as relevant for your customer as last week's episode of American Idol.

(XML is a bit unusual in this respect, but that's probably just some variation of the bikeshed effect. it's just text, and everyone with a keyboard knows what that is, so we don't need to use established software engineering practices, or think about security *at all* (Billion laughs? XXE?) or, for that matter, learn from people who've been doing data interchange in other domains since the dawn of time.

and when they do appear anyway, and mess with our technology in ways that we haven't authorized, without reading our books or going to our seminars or subscribing to our mailing lists, we can write them off as clueless muppet teenage genius code-jockeys, and keep patting ourselves on the back, while the rest of the world is busy routing around us, switching to well-understood XML subsets or other serialization formats, simpler and more flexible data models, simpler API:s, and more robust code. and Python ;-)

/F
Re: lxml/ElementTree and .tail
Fredrik Lundh wrote:
> (XML is a bit unusual in this respect, but that's probably just some variation of the bikeshed effect. it's just text, and everyone with a keyboard knows what that is, so we don't need to use established software engineering practices, or think about security *at all* (Billion laughs? XXE?) or, for that matter, learn from people who've been doing data interchange in other domains since the dawn of time.
>
> and when they do appear anyway, and mess with our technology in ways that we haven't authorized, without reading our books or going to our seminars or subscribing to our mailing lists, we can write them off as clueless muppet teenage genius code-jockeys, and keep patting ourselves on the back, while the rest of the world is busy routing around us, switching to well-understood XML subsets or other serialization formats, simpler and more flexible data models, simpler API:s, and more robust code. and Python ;-)

maybe time to switch to decaf... :)
Re: lxml/ElementTree and .tail
Paul McGuire wrote:
> maybe time to switch to decaf... :)

do you disagree with my characterization of the state of the XML universe?

/F
Re: lxml/ElementTree and .tail
Fredrik Lundh wrote:
> Paul McGuire wrote:
>> maybe time to switch to decaf... :)
>
> do you disagree with my characterization of the state of the XML universe?
>
> /F

Thankfully, I'm largely on the periphery of that universe (except for being a sometimes victim). But it is certainly frustrating to see many of the OMG concepts of the 90's reimplemented in Java services, and then again in XML/SOAP, with no detectable awareness that these messaging and serialization problems have been considered before, and much more thoroughly.

I liked XML when I could read it and hack it out in Notepad. I like attributes, which puts me on the outs with most XML zealots who forswear the use of attributes on purely academic grounds (they defeat the future possible expansion of an attribute's value into more complex substructure). I dislike namespaces, especially the default xmlns kind, as they make me take extra steps when retrieving nodes via Xpaths; and everyone seems to think their application needs namespaces, when there is no threat that these tags will ever get mixed up with anyone else's.

No, I was mostly amused (which I thought was your intent, given the trailing smiley) at your breathless, quasi-rant against the XML milieu in general - I think your one sentence went on for about 15 lines!

-- Paul
Re: lxml/ElementTree and .tail
On Nov 18, 2006, at 5:09 AM, Fredrik Lundh wrote:
> Uche Ogbuji wrote:
>> I certainly have never liked the aspects of the ElementTree API under present discussion. But that's not as important as the fact that I think the above statement is misleading. There has always been a battle in XML between the people who think the serialization is preeminent, and those who believe some data model is preeminent, but the reality is that XML 1.0 (and 1.1) is a spec *defined* by its serialization.
>
> sure, the computing world is and has always been full of people who want the simplest thing to look a lot harder than it actually is. after all, *they* spent lots of time reading all the specifications, they've bought all the books, and went to all the seminars, so it's simply not fair when others are cheating.

[snip]

> and keep patting ourselves on the back, while the rest of the world is busy routing around us, switching to well-understood XML subsets or other serialization formats, simpler and more flexible data models, simpler API:s, and more robust code. and Python ;-)

That's flatly unrealistic. If you'll remember, I'm not one of those people that are specification-driven -- I hadn't even *heard* of Infoset until earlier this week! However, I am driven to ensure that the code I (and we) write works *as others expect* when confronted by any of the billions of XML documents out there. Simpler is better, and better is better (thus why I am in python-land), unless that simplicity makes it difficult to play nicely with others. Shrugging off the way everyone else does things reminds me of various CSS fanatics I know of that simply won't use tables or IE CSS compatibility hacks, even if that's what's needed to get things to work.

I've never been involved in any XML battles, but to Uche's point, I would speculate (only on the basis of personal interactions and anecdotes) that some overwhelming majority of the developers out there care for nothing but the serialization, simply because that's how one plays nicely with others. I would count myself in that group as well, although I do recognize that there is a worthy academic exercise in exploring the data-model-centric XML worldview.

OT: Uche, 4suite XML is tops! Thank you very much for that.

- Chas
Re: lxml/ElementTree and .tail
Chas Emerick wrote:
>> and keep patting ourselves on the back, while the rest of the world is busy routing around us, switching to well-understood XML subsets or other serialization formats, simpler and more flexible data models, simpler API:s, and more robust code. and Python ;-)
>
> That's flatly unrealistic. If you'll remember, I'm not one of those people that are specification-driven -- I hadn't even *heard* of Infoset until earlier this week!

The rant wasn't directed at you or anyone special, but I don't really think you got the point of it either. Which is a bit strange, because it sounded like you *were* working on extracting information from messy documents, so the "it's about the data, dammit" way of thinking shouldn't be news to you.

And the "routing around" is not unrealistic, it's a *fact*; JSON and POX are killing the full XML/Schema/SOAP stack for communication, XHTML is pretty much dead as a wire format, people are apologizing in public for their use of SOAP, AJAX is quickly turning into AJAJ, few people care about the more obscure details of the XML 1.0 standard (when did you last see a conditional section? or even a DTD?), dealing with huge XML data sets is still extremely hard compared to just uploading the darn thing to a database and doing the crunching in SQL, and nobody uses XML 1.1 for anything.

Practicality beats purity, and the Internet routes around damage, every single time.

> overwhelming majority of the developers out there care for nothing but the serialization, simply because that's how one plays nicely with others.

The problem is if you only stare at the serialization, your code *won't* play nicely with others. At the serialization level, it's easy to think that CDATA sections are different from other text, that character references are different from ordinary characters, that you should somehow be able to distinguish between <tag></tag> and <tag/>, that namespace prefixes are more important than the namespace URI, that an &nbsp; in an XHTML-style stream is different from a U+00A0 character in memory, and so on. In my experience, serialization-only thinking (at the receiving end) is the single most common cause for interoperability problems when it comes to general XML interchange.

But when you focus on the data model, and treat the serialization as an implementation detail, to be addressed by a library written by someone who's actually read the specifications a few more times than you have, all those problems tend to just go away. Things just work.

And in practice, of course, most software engineers understand this, and care about this. After all, good software engineering is about abstractions and decoupling and designing things so you can focus on one part of the problem at a time. And about making your customer happy, and having fun while doing that. Not staying up all night to look for an obscure interoperability problem that you finally discover is caused by someone using a CDATA section where you expected a character reference, in 0.1% of all production records, but in none of the files in your test data set.

(By the way, did ET fail to *read* your XML documents? I thought your complaint was that it didn't put the things it read in a place where you expected them to be, and that you didn't have time to learn how to deal with that because you had more important things to do, at the time?)

/F
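The serialization differences listed above can be checked against the standard library's xml.etree.ElementTree. This is only a sketch: the sample strings and the `urn:example` namespace are invented, and a numeric reference stands in for `&nbsp;`, since a named entity would additionally need a DTD or entity table that this example leaves out.

```python
import xml.etree.ElementTree as ET

# Three spellings of the same character data.
spellings = [
    "<p>a &lt; b</p>",            # predefined entity
    "<p>a &#60; b</p>",           # numeric character reference
    "<p><![CDATA[a < b]]></p>",   # CDATA section
]
texts = [ET.fromstring(s).text for s in spellings]

# Start/end tag pair versus empty-element tag.
pair = ET.fromstring("<tag></tag>")
short = ET.fromstring("<tag/>")

# Different prefixes bound to the same (invented) namespace URI.
x = ET.fromstring('<x:doc xmlns:x="urn:example"/>')
y = ET.fromstring('<y:doc xmlns:y="urn:example"/>')

# A character reference for U+00A0 is just that character once parsed.
nbsp = ET.fromstring("<p>&#160;</p>").text

print(texts, x.tag, y.tag, repr(nbsp))
```

After parsing, nothing in the tree records which spelling the sender used, which is the point being made about treating the serialization as an implementation detail.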
Re: lxml/ElementTree and .tail
On Nov 18, 2006, at 11:29 AM, Fredrik Lundh wrote:
> Chas Emerick wrote:
>>> and keep patting ourselves on the back, while the rest of the world is busy routing around us, switching to well-understood XML subsets or other serialization formats, simpler and more flexible data models, simpler API:s, and more robust code. and Python ;-)
>>
>> That's flatly unrealistic. If you'll remember, I'm not one of those people that are specification-driven -- I hadn't even *heard* of Infoset until earlier this week!
>
> The rant wasn't directed at you or anyone special, but I don't really think you got the point of it either. Which is a bit strange, because it sounded like you *were* working on extracting information from messy documents, so the "it's about the data, dammit" way of thinking shouldn't be news to you.

No, it's not any kind of news at all, and I'm very sympathetic to your specific perspective (and have advocated it in other contexts and circumstances, where appropriate). And yes, we are in fact ensuring that we get from the HTML/XHTML/text/PDF/etc serialization we have to consume to a uniform, normalized, and clean data model in as few steps as possible. However, in those few steps, we have to recognize the functional reality of how each data representation is used out in the world in order to translate it into a uniform model for our own purposes. In concrete terms, that means that an end tag in an XHTML serialization means that that element is closed, done, finit. Any other representation of that serialization doesn't correspond properly with the intent of that HTML document's author.

> And the "routing around" is not unrealistic, it's a *fact*; JSON and POX are killing the full XML/Schema/SOAP stack for communication, XHTML is pretty much dead as a wire format, people are apologizing in public for their use of SOAP, AJAX is quickly turning into AJAJ, few people care about the more obscure details of the XML 1.0 standard (when did you last see a conditional section? or even a DTD?), dealing with huge XML data sets is still extremely hard compared to just uploading the darn thing to a database and doing the crunching in SQL, and nobody uses XML 1.1 for anything.
>
> Practicality beats purity, and the Internet routes around damage, every single time.

I agree 100% -- but I would have thought that that's a point I would have made. The model that ET uses seems like a purified representation of a mixed-content serialization, exactly because it is geared to an ideal rather than the practical realities of mixed content and expectations thereof.

For what it's worth, our current effort is directed towards providing significant stores/feeds of XML/PDF/HTML/text/etc in something that can be dropped into a RDBMS. Perhaps that's the source of the impedance between us: you view Infoset as a functional replacement for serialization-dependent XML, whereas we are focussed on what could be broadly described as a translation from one to the other.

>> overwhelming majority of the developers out there care for nothing but the serialization, simply because that's how one plays nicely with others.
>
> The problem is if you only stare at the serialization, your code *won't* play nicely with others. At the serialization level, it's easy to think that CDATA sections are different from other text, that character references are different from ordinary characters, that you should somehow be able to distinguish between <tag></tag> and <tag/>, that namespace prefixes are more important than the namespace URI, that an &nbsp; in an XHTML-style stream is different from a U+00A0 character in memory, and so on. In my experience, serialization-only thinking (at the receiving end) is the single most common cause for interoperability problems when it comes to general XML interchange.

I agree with all of that. I would again refer to the pervasive view of what end tags mean -- that's what I was primarily referring to with the term 'serialization'.

> (By the way, did ET fail to *read* your XML documents? I thought your complaint was that it didn't put the things it read in a place where you expected them to be, and that you didn't have time to learn how to deal with that because you had more important things to do, at the time?)

No, it doesn't put things in the right places, so I consider that a failure of the model. I don't see why I should have spent time learning how to deal with that when another very comprehensive library is available that does meet expectations. *shrug* Further, the fact that ET/lxml works the way that it does makes me think that there may be some other landmines in the underlying model that we might not have discovered until some days, weeks, etc., had passed, so there's a much greater comfort level in working with a library that explicitly supports the model that we expect (and was assumed when
Re: lxml/ElementTree and .tail
Chas Emerick wrote:
> Further, the fact that ET/lxml works the way that it does makes me think that there may be some other landmines in the underlying model that we might not have discovered until some days, weeks, etc., had passed

so the real reason you posted your original post was to spread some FUD, not to get help? that's a bit disappointing.

/F
Re: lxml/ElementTree and .tail
On Nov 18, 2006, at 1:12 PM, Fredrik Lundh wrote:
> Chas Emerick wrote:
>> Further, the fact that ET/lxml works the way that it does makes me think that there may be some other landmines in the underlying model that we might not have discovered until some days, weeks, etc., had passed
>
> so the real reason you posted your original post was to spread some FUD, not to get help? that's a bit disappointing.

<sarcasm>
Yeah, that's exactly it. In fact, if you look back at the head of this thread, you'll see how I was looking to disparage ET. I especially wanted to make sure ET's API doesn't get any traction in the python community. It's especially important that ET not find popular success and acclaim -- I'd have quite a bit to gain from it remaining a niche library.
</sarcasm>

Fredrik, I wasn't attempting to spread anything. I was confused, I posed some illustrative examples, and asked for people's thoughts. Your reply gave me the right vocabulary to find more information (i.e. about Infoset), and I replied with an overview of what I had learned so as to benefit anyone with similar questions or confusion in the future. A discussion ensued.

ET (and lxml) is obviously extremely successful, widely used, and for good reason. It's just not right for us, but you incorrectly surmised that I was simply lazy by not modifying/extending ET/lxml to make it suitable for our purposes even when other libraries existed that better meshed with our requirements. I tried to answer as straightforwardly as possible, and (regrettably, it turns out) included the fact that I had worried that our apparent conceptual differences indicated that we might find other instances where ET/lxml works differently than we would expect. I think that's very rational, and doesn't speak poorly of ET in any way (especially given its obvious success elsewhere).

- Chas
Re: lxml/ElementTree and .tail
Fredrik Lundh wrote:
> Chas Emerick wrote:
>> If I'm wrong, just chalk it up to the fact that this is the first time I've ever looked at the Infoset spec, and I'm simply confused.
>
> the Infoset spec *is* the essence of XML; if you don't realize that an XML document is just a serialization of a very simple data model, you're bound to be fighting with XML all the time.

I certainly have never liked the aspects of the ElementTree API under present discussion. But that's not as important as the fact that I think the above statement is misleading. There has always been a battle in XML between the people who think the serialization is preeminent, and those who believe some data model is preeminent, but the reality is that XML 1.0 (and 1.1) is a spec *defined* by its serialization. Infoset is a secondary and optional spec. In fact, I think it's clear that Infoset is not even the preeminent *data model* of the XML world. That distinction goes to the XPath data model, which is quite different from the Infoset.

--
Uche Ogbuji
Fourthought, Inc.
http://uche.ogbuji.net    http://fourthought.com
http://copia.ogbuji.net   http://4Suite.org
Articles: http://uche.ogbuji.net/tech/publications/
Re: lxml/ElementTree and .tail
Stefan Behnel wrote:
> If you want to copy part of the removed element back into the tree, feel free to do so.

and that can of course be done with a short helper function.

when removing elements from trees, I often set the tag for those elements to some garbage value during processing, and then call something like

http://effbot.org/zone/element-bits-and-pieces.htm#cleanup

to clean things up before serializing the tree.

/F
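As an illustration of the kind of short helper mentioned above, here is a minimal sketch that removes a child but keeps the text that followed it. The name `remove_keep_tail` is hypothetical and this is not the cleanup function from the linked page; it only assumes the standard xml.etree.ElementTree API.

```python
import xml.etree.ElementTree as ET

def remove_keep_tail(parent, child):
    """Remove child from parent, re-attaching child's tail text to the tree.

    Hypothetical helper, written for this example only.
    """
    siblings = list(parent)
    index = siblings.index(child)
    tail = child.tail or ""
    if index == 0:
        # No earlier sibling: the tail joins the parent's leading text.
        parent.text = (parent.text or "") + tail
    else:
        # Otherwise it joins the tail of the previous sibling.
        previous = siblings[index - 1]
        previous.tail = (previous.tail or "") + tail
    parent.remove(child)

root = ET.fromstring("<a>first<b>middle</b>last</a>")
remove_keep_tail(root, root[0])
result = ET.tostring(root, encoding="unicode")
print(result)
```

Whether the tail should survive the removal is exactly the point under dispute in this thread, so the helper makes that choice explicit rather than hiding it in the library.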
Re: lxml/ElementTree and .tail
Stefan Behnel wrote:

[Remove an element, remove following nodes]

> Yes, it is. Just look at the API. It's an attribute of an Element, isn't it? What other API do you know where removing an element from a data structure leaves part of the element behind?

I guess it depends on what you regard an element to be...

[...]

> IMHO, DOM has a pretty significant mismatch with Python.

...in the DOM or otherwise:

http://www.w3.org/TR/2006/REC-xml-20060816/#sec-logical-struct

Paul
Re: lxml/ElementTree and .tail
Paul Boddie wrote:
>> Yes, it is. Just look at the API. It's an attribute of an Element, isn't it? What other API do you know where removing an element from a data structure leaves part of the element behind?
>
> I guess it depends on what you regard an element to be...

Stefan said Element, not element. Element is a class in the ElementTree module, which can be used to *represent* an XML element in an XML infoset, including all the data *inside* the XML element, and any data *between* that XML element and the next one (which is always character data, of course).

It's not very difficult, really; especially if you, as Stefan said, think in "infoset" terms rather than "a sequence of little piggies" terms.

/F
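The description of the Element class above can be tried out directly with the standard library's xml.etree.ElementTree. A minimal sketch, using a small made-up fragment:

```python
import xml.etree.ElementTree as ET

a = ET.fromstring("<a>first<b>middle</b>last</a>")
b = a[0]

# Text inside an element, before its first child, lives in .text.
# Text after an element's end tag, up to the next tag, lives in its .tail.
print(repr(a.text), repr(b.text), repr(b.tail), repr(a.tail))

# The tail is part of the Element object, so it is serialized with it...
b_serialized = ET.tostring(b, encoding="unicode")

# ...and it leaves the tree together with the element.
a.remove(b)
a_after = ET.tostring(a, encoding="unicode")

print(b_serialized, a_after)
```

So "last" belongs to the Element object for `b`, even though it sits outside the `<b>` element in the serialized document.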
Re: lxml/ElementTree and .tail
Fredrik Lundh wrote:
> It's not very difficult, really; especially if you, as Stefan said, think in "infoset" terms rather than "a sequence of little piggies" terms.

Are piggies part of the infoset too? Does the Piggie class represent a piggie from the infoset plus a stretch of the road to the market? ;-)

Paul
Re: lxml/ElementTree and .tail
Thanks for the comments and thoughts. I must admit that I have an overwhelming feeling of having just stepped into the middle of a complex, heated conversation without having heard the preamble. (FYI, this reply is only an attempt to help those that come afterwards -- I'm not looking to advocate much of anything here.) Fredrik's invocation of the infoset term led me to a couple of quick searches that clarified the state of play. Here he sets the stage for the .tail behaviour that I originally posted about: http://effbot.org/zone/element-infoset.htm And it looks like there have been tussles over other mismatches in expectations before, specifically around how namespaces are handled: http://groups.google.com/group/comp.lang.python/browse_thread/thread/ 31b2e9f4a8f7338c http://nixforums.org/ntopic43901.html From what I can see, there are more than a few people that have stumbled with ElementTree's API because of their preexisting expectations, which others have probably correctly bucketed as implementation details. This comes as quite a shock to those who have stumbled (including myself) who have, lo these many years, come to view those details as the only standard that matters (perhaps simply because those details have been so consistent in our experience). Which, in my view, is just fine -- different strokes for different folks, and all that. When I originally started poking around the python xml world, I was somewhat confused as to why 4suite/Domlette existed, as it seemed pretty clear that ElementTree had crystallized a lot of mindshare, and has a very attractive API to boot. Thankfully, I can now see its appeal, and am very glad it's around, as it seems to have all of those comfortable implementation details that I've been looking for. :-) As for the infoset vs. sequence of piggies nut: if ElementTree's infoset approach is technically correct, then wouldn't it also be correct to use a .head attribute instead of a .tail attribute? 
Example: <a>first<b>middle</b>last</a> might be represented as: Element a: head='', text='last' Element b: head='first', text='middle' If I'm wrong, just chalk it up to the fact that this is the first time I've ever looked at the Infoset spec, and I'm simply confused. If that IS a technically-valid way to represent the above xml fragment . . . then I guess I'll make sure to tread more carefully in the future around tools that work in infoset terms. For me, it turns out that sequences of piggies really are important, at least in contexts where XML is merely a means to an end (either because of the attractiveness of the toolsets or because we must cope with what we're provided as input) and where consistency with existing tools (like those that adhere to DOM level 2/3) and expectations are critical. I think this is what Paul was nodding towards with his original response to Stefan's response. Cheers, - Chas On Nov 16, 2006, at 5:11 AM, Fredrik Lundh wrote: Paul Boddie wrote: Yes, it is. Just look at the API. It's an attribute of an Element, isn't it? What other API do you know where removing an element from a data structure leaves part of the element behind? I guess it depends on what you regard an element to be... Stefan said Element, not element. Element is a class in the ElementTree module, which can be used to *represent* an XML element in an XML infoset, including all the data *inside* the XML element, and any data *between* that XML element and the next one (which is always character data, of course). It's not very difficult, really; especially if you, as Stefan said, think in infoset terms rather than a sequence of little piggies terms. /F -- http://mail.python.org/mailman/listinfo/python-list
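For readers following along, here is a minimal sketch of how the fragment from Chas's example is actually laid out under the .text/.tail model. It uses the standard library's xml.etree.ElementTree rather than lxml (an assumption made only so the snippet has no third-party dependency); the variable names are illustrative.

```python
import xml.etree.ElementTree as ET

# The fragment from the message above.
a = ET.XML("<a>first<b>middle</b>last</a>")
b = a[0]

# Text before the first child is stored on the parent as .text.
a_text = a.text      # 'first'
# Text inside <b> is stored on <b> as .text.
b_text = b.text      # 'middle'
# Text between </b> and the next tag is stored on <b> as .tail.
b_tail = b.tail      # 'last'
# The root element has nothing following it, so its tail is None.
a_tail = a.tail      # None

print(a_text, b_text, b_tail, a_tail)
```

So the character data "last" belongs to the Element for b, not to a, which is the opposite of the hypothetical .head layout.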
Re: lxml/ElementTree and .tail
Chas Emerick wrote: might be represented as: Element a: head='', text='last' Element b: head='first', text='middle' sure, and you could use a text subtype instead that kept track of the elements above it, and let the elements be sequences of their siblings instead of their children, and perhaps stuff everything in a dictionary. such a construct would also be able to hold the same data, and be very hard to use in most normal situations. If I'm wrong, just chalk it up to the fact that this is the first time I've ever looked at the Infoset spec, and I'm simply confused. the Infoset spec *is* the essence of XML; if you don't realize that an XML document is just a serialization of a very simple data model, you're bound to be fighting with XML all the time. but ET doesn't implement the Infoset spec as it is, of course: it uses a *simplified* model, carefully optimized for the large percentage of all XML formats that simply doesn't use mixed content. if you're doing document-style processing, you sometimes need to add an extra assignment or two, but unless you're doing *only* document-style processing, ET's API gives you a net win. (and even if you're doing only document-style processing, ET's speed and memory footprint gives you a net win over most competing technologies). /F -- http://mail.python.org/mailman/listinfo/python-list
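As an illustration of the "extra assignment or two" for mixed content, the sketch below builds a small document-style fragment with the standard library's xml.etree.ElementTree (assumed here in place of lxml; the element names are made up for the example). Text that follows a child element is assigned to that child's .tail instead of being appended to the parent.

```python
import xml.etree.ElementTree as ET

# Build <p>Hello <em>world</em>!</p> piece by piece.
p = ET.Element("p")
p.text = "Hello "            # text before the first child
em = ET.SubElement(p, "em")
em.text = "world"            # text inside <em>
em.tail = "!"                # text after </em>: the extra assignment

serialized = ET.tostring(p, encoding="unicode")
# itertext() walks .text and .tail in document order.
plain = "".join(p.itertext())

print(serialized)
print(plain)
```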
Re: lxml/ElementTree and .tail
On Nov 16, 2006, at 7:25 AM, Fredrik Lundh wrote: If I'm wrong, just chalk it up to the fact that this is the first time I've ever looked at the Infoset spec, and I'm simply confused. the Infoset spec *is* the essence of XML; if you don't realize that an XML document is just a serialization of a very simple data model, you're bound to be fighting with XML all the time. The principle and the practice diverge significantly in our neck of the woods. The current project involves consuming and making sense of extraordinarily (and typically unnecessarily) complex XHTML. Of course, as you say, those documents are still serializations of a simple data model, but the types of manipulations we do happen to butt up very uncomfortably with the way ET does things. but ET doesn't implement the Infoset spec as it is, of course: it uses a *simplified* model, carefully optimized for the large percentage of all XML formats that simply doesn't use mixed content. if you're doing document-style processing, you sometimes need to add an extra assignment or two, but unless you're doing *only* document-style processing, ET's API gives you a net win. (and even if you're doing only document-style processing, ET's speed and memory footprint gives you a net win over most competing technologies). Yeah, documents are all we do -- XML just happens to be a pleasant intermediate format, and something we need to consume. The notion of a nicely-formatted XML is entirely foreign to the work that we do -- in fact, our current focus is (in part) dragging decidedly unstructured data out of those XHTML documents (among other source formats) and putting them into a reasonable, useful structure. I took some time last night to bang out some functions that squeezed ET's model (via lxml) into doing what we need, and it ended up requiring a lot more BD than I like. At that point, I swung over to 4suite, which dropped into place quite nicely. 
*shrug* I guess we're just in the minority with regard to our API requirements -- we happen to live in the corner cases. I'm certainly glad to have made the detour on a different path for a bit though. - Chas -- http://mail.python.org/mailman/listinfo/python-list
Re: lxml/ElementTree and .tail
Chas Emerick wrote: The principle and the practice diverge significantly in our neck of the woods. The current project involves consuming and making sense of extraordinarily (and typically unnecessarily) complex XHTML. wasn't your original complaint that ET didn't do the right thing when you removed elements from a mixed-content tree? (something that can be trivially handled with a 2-line helper function) why mutate the tree if all you want is to extract information from it? doesn't sound very efficient to me... /F -- http://mail.python.org/mailman/listinfo/python-list
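The thread never shows the helper Fredrik has in mind, so the following is only one possible sketch of it, written against the standard library's xml.etree.ElementTree. The function name remove_keep_tail is hypothetical, and because stdlib elements have no parent pointer the parent is passed in explicitly. It is a little longer than two lines since it also handles the case where the removed element has a preceding sibling.

```python
import xml.etree.ElementTree as ET

def remove_keep_tail(parent, child):
    """Remove child from parent, leaving child's tail text in the tree."""
    tail = child.tail
    if tail:
        index = list(parent).index(child)
        if index == 0:
            # No preceding sibling: the tail joins the parent's text.
            parent.text = (parent.text or "") + tail
        else:
            # Otherwise it joins the preceding sibling's tail.
            prev = parent[index - 1]
            prev.tail = (prev.tail or "") + tail
    parent.remove(child)

frag = ET.XML("<a>head<b>inside</b>tail</a>")
remove_keep_tail(frag, frag.find("b"))
result = ET.tostring(frag, encoding="unicode")
print(result)
```

The detached element still carries its own .tail here; clearing it is a design choice left to the caller.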
Re: lxml/ElementTree and .tail
On Nov 16, 2006, at 8:12 AM, Fredrik Lundh wrote: Chas Emerick wrote: The principle and the practice diverge significantly in our neck of the woods. The current project involves consuming and making sense of extraordinarily (and typically unnecessarily) complex XHTML. wasn't your original complaint that ET didn't do the right thing when you removed elements from a mixed-content tree? (something that can be trivially handled with a 2-line helper function) Yes, that was the initial issue, but the delta between Elements and DOM-style elements leads to other issues. There's no doubt that the needed helpers are simple, but all things being equal, not having to carry them around anywhere we're doing DOM manipulations is a big plus. why mutate the tree if all you want is to extract information from it? doesn't sound very efficient to me... Because we're far from doing anything that is regular or one-off in nature. We're systematizing the extraction of data from functionally unstructured content, and it's flatly necessary to normalize the XHTML into something that can be easily consumed by the processes we've built that can do that content-data extraction/conversion from plain text, XML, PDF, and now XHTML. Remember, corner cases. :-) - Chas -- http://mail.python.org/mailman/listinfo/python-list
Re: lxml/ElementTree and .tail
Fredrik Lundh wrote: Stefan Behnel wrote: If you want to copy part of the removed element back into the tree, feel free to do so. and that can of course be done with a short helper function. Oh, and obviously with a custom Element class in lxml that does this automatically for you behind the scenes. http://codespeak.net/lxml/element_classes.html http://codespeak.net/lxml/element_classes.html#default-class-lookup Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: lxml/ElementTree and .tail
Chas Emerick wrote: the delta between Elements and DOM-style elements leads to other issues. There's no doubt that the needed helpers are simple, but all things being equal, not having to carry them around anywhere we're doing DOM manipulations is a big plus. Because we're far from doing anything that is regular or one-off in nature. We're systematizing the extraction of data from functionally unstructured content, and it's flatly necessary to normalize the XHTML into something that can be easily consumed by the processes we've built that can do that content-data extraction/conversion from plain text, XML, PDF, and now XHTML. Remember, corner cases. :-) Hmm, then I really don't get why you didn't just write a customised XHTML API on top of lxml's custom Element classes feature. Hiding XML language specific behaviour directly in the Element classes really helps in getting your code clean, especially in larger code bases. Stefan -- http://mail.python.org/mailman/listinfo/python-list
Re: lxml/ElementTree and .tail
Paul Boddie wrote: It's not very difficult, really; especially if you, as Stefan said, think in infoset terms rather than a sequence of little piggies terms. Are piggies part of the infoset too? Does the Piggie class represent a piggie from the infoset plus a stretch of the road to the market? ;-) no, they just appear in serialized XML. if you want concrete piggies, you have to wrap ET's iterparse function, or perhaps the XMLParser class. /F -- http://mail.python.org/mailman/listinfo/python-list
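A rough sketch of the XMLParser route mentioned above: the standard library's xml.etree.ElementTree.XMLParser accepts a target object whose start, end, data and close methods receive the document as a flat sequence of events. The EventCollector class and flat_events function are hypothetical names for this illustration, not part of ElementTree.

```python
import xml.etree.ElementTree as ET

class EventCollector:
    """Parser target that records tags and text in document order."""

    def __init__(self):
        self.events = []

    def start(self, tag, attrib):
        self.events.append(("start", tag))

    def end(self, tag):
        self.events.append(("end", tag))

    def data(self, text):
        # The parser may deliver character data in several chunks,
        # so merge a chunk into the preceding text event if there is one.
        if self.events and self.events[-1][0] == "text":
            self.events[-1] = ("text", self.events[-1][1] + text)
        else:
            self.events.append(("text", text))

    def close(self):
        return self.events

def flat_events(xml_text):
    parser = ET.XMLParser(target=EventCollector())
    parser.feed(xml_text)
    return parser.close()  # returns whatever the target's close() returns

events = flat_events("<a>head<b>inside</b>tail</a>")
for event in events:
    print(event)
```

In this view "tail" is just another text event between the end of b and the end of a, with no owner.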
Re: lxml/ElementTree and .tail
Hi, Chas Emerick wrote: I looked around for an ElementTree-specific mailing list, but found none -- my apologies if this is too broad a forum for this question. The lxml mailing list is always happy to receive feedback, but it's fine to ask here if it's not lxml specific. I've been using the lxml variant of the ElementTree API. It shares the use of a .tail attribute. I ran headlong into this aspect of the API while doing some DOM manipulations, and it's got me pretty confused. Example:

    >>> from lxml import etree as ET
    >>> frag = ET.XML('<a>head<b>inside</b>tail</a>')
    >>> b = frag.xpath('//b')[0]
    >>> b
    <Element b at 71cbe8>
    >>> b.text
    'inside'
    >>> b.tail
    'tail'
    >>> frag.remove(b)
    >>> ET.tostring(frag)
    '<a>head</a>'

As you can see, the .tail text is removed as part of the b element -- but it IS NOT part of the b element. Yes, it is. Just look at the API. It's an attribute of an Element, isn't it? What other API do you know where removing an element from a data structure leaves part of the element behind? If you want to copy part of the removed element back into the tree, feel free to do so. Performing the same operations with the Java DOM api (Sorry for the Java comparison, but that's where I first cut my teeth on XML, and that's where my expectations were formed.) That's a pretty significant mismatch in functionality. IMHO, DOM has a pretty significant mismatch with Python. I ran this issue past a few people I know who've worked with and written about ElementTree, and their response to this apparent divergence between the ET DOM API and standard DOM APIs was roughly: that's just the way it is. It's just a matter of understanding (or getting used to) the API. You might want to stop thinking in terms of '' and '' and rather embrace the API itself as a way to work with the XML Infoset (rather than the XML DOM). Stefan -- http://mail.python.org/mailman/listinfo/python-list
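The behaviour under discussion can be reproduced without lxml. The sketch below assumes the standard library's xml.etree.ElementTree, which shares the .tail model, and uses find() in place of the lxml-only xpath() call from the original post.

```python
import xml.etree.ElementTree as ET

frag = ET.XML("<a>head<b>inside</b>tail</a>")
b = frag.find("b")

# Removing the Element takes its tail along with it.
frag.remove(b)
result = ET.tostring(frag, encoding="unicode")
print(result)

# The detached element still owns the text that followed it.
print(b.text, b.tail)
```

A DOM-style API would instead leave "tail" behind as a separate text node under a, which is the mismatch the original post describes.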