[issue2124] xml.sax and xml.dom fetch DTDs by default
Raynard Sandwick added the comment: I have opened issue #17318 to try to specify the problem better. While I do think that catalogs are the correct fix for the validation use case (and thus would like to see something more out-of-the-box in that vein), the real trouble is that users are often unaware that they're sending requests to DTD URIs, so some combination of fixes in default behavior and/or documentation is definitely needed. The external_ges feature does help, in a way, but is poorly communicated to new users, and moreover does not respect the difference between external DTD subsets and external general entities (there's a reason DOCTYPE isn't spelled ENTITY). The default behavior is not well documented, and the constraining behavior of DTDs is frequently unnecessary. Either a user should have to explicitly enable validation, or it should be irrevocably obvious to a user that validation is the default behavior, and in both cases it should be blatantly documented that validation may cause network side effects. I think the input has been reasonable all around, and yet I find it rather insane that this issue didn't eventually at least result in a documentation fix, thanks to what looks like push-back for push-back's sake, though I will gladly admit the conclusion that it was underspecified is entirely valid. Anyway, further info in the new issue... -- nosy: +rsandwick3 ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Brian Visel aeon.descrip...@gmail.com added the comment: ..still an issue. -- nosy: +Brian.Visel ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Martin v. Löwis mar...@v.loewis.de added the comment: And my position still remains the same: this is not a bug. Applications affected by this need to use the APIs that are in place precisely to deal with this issue. So I propose to close this report as invalid. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Brian Visel aeon.descrip...@gmail.com added the comment: Of course, you can do as you like. http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/ -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Martin v. Löwis mar...@v.loewis.de added the comment: Well, the issue is clearly underspecified, and different people read different things into it. I take your citation of the W3C blog entry that you are asking that caching should be employed. I read the issue entirely different, namely that by default no attempt to download the DTD should be made, or that the DOM loaders should provide better customization in that matter, or that catalogs shall be used. Given that the issue was underspecified to begin with, I'm now closing it. Anybody who still has an issue here, please open a new issue and report your specific problem, preferably also proposing a solution. If you need to follow up to this message, please do so in private email (mar...@v.loewis.de). -- resolution: - rejected status: open - closed ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Paul Boddie p...@boddie.org.uk added the comment: Note that Python 3 provided a good opportunity for doing the minimal amount of work here - just stop things from accessing remote DTDs - but I imagine that even elementary standard library improvements of this kind weren't made (let alone the more extensive standard library changes I advocated), so there's going to be a backwards compatibility situation regardless of which Python series is involved now, unfortunately. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Changes by A.M. Kuchling li...@amk.ca: -- assignee: akuchling - ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Mark Lawrence breamore...@yahoo.co.uk added the comment: Does anybody know if users are still experiencing problems with this issue? -- nosy: +BreamoreBoy versions: +Python 2.7, Python 3.1, Python 3.2 -Python 2.6 ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Jean-Paul Calderone exar...@twistedmatrix.com added the comment: Yes. -- ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Damien Neil ne...@misago.org added the comment: I just ran into this problem. I was very surprised to realize that every time the code I was working on parsed a docbook file, it generated several HTTP requests to oasis-open.org to fetch the docbook DTDs. I attempted to fix the issue by adding an EntityResolver that would cache fetched DTDs. (The documentation on how to do this is not, by the way, very clear.) Unfortunately, this proves to not be possible. The main docbook DTD includes subsidiary DTDs using relative system identifiers. For example, the main DTD at: publicId: -//OASIS//DTD DocBook V4.1//EN systemId: http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd ...includes this second DTD: publicId: -//OASIS//ENTITIES DocBook Notations V4.4//EN systemId: dbnotnx.mod The EntityResolver's resolveEntity() method is not, however, passed the base path to resolve the relative systemId from. This makes it impossible to properly implement a parser which caches fetched DTDs. -- nosy: +damien ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Changes by Jean-Paul Calderone exar...@divmod.com: -- nosy: +exarkun ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Jean-Paul Calderone exar...@divmod.com added the comment: Though it's inconvenient to do so, you can arrange to have the locator available from the entity resolver. The content handler's setDocumentLocator method will be called early on with the locator object. So you can give your entity resolver a reference to your content handler and save a reference to the document locator in the content handler. Then in the entity resolver's resolveEntity method you can reach over into the content handler and grab the document locator to call its getSystemId method. Note that you have to be careful with the InputStreams you return from resolveEntity. I wasn't aware of this before (and perhaps I've misinterpreted some observer), but I just noticed that if you return an InputSource based on a file object, the file object's name will be used as the document id! This is quite not what you want. InputStream has a setSystemId method, but even if you call it before you call setByteStream, the system id will be the name of the file object passed to setByteStream. Perhaps calling these two methods in the opposite order will fix this, I'm not sure, I haven't tried. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Martin v. Löwis mar...@v.loewis.de added the comment: EntityResolver.resolveEntity() is called with the publicId and systemId as arguments. It does not receive a locator. Sure. But ContentHandler.setDocumentLocator receives it, and you are supposed to store it for the entire parse, to always know what entity is being processed if you want to. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Damien Neil ne...@misago.org added the comment: On Feb 3, 2009, at 1:42 PM, Martin v. Löwis wrote: Sure. But ContentHandler.setDocumentLocator receives it, and you are supposed to store it for the entire parse, to always know what entity is being processed if you want to. Where in the following sequence am I supposed to receive the document locator? parser = xml.sax.make_parser() parser.setEntityResolver(CachingEntityResolver()) doc = xml.dom.minidom.parse('file.xml', parser) The content handler is being created deep inside xml.dom. It does, in fact, store the document locator, but not in any place that I can easily access without breaking several layers of abstraction. Or, as a more general question: How can I get a DOM tree that includes external entities? If there's an easy way to do it, the documentation does not make it clear at all. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Damien Neil ne...@misago.org added the comment: I just discovered another really fun wrinkle in this. Let's say I want to have my entity resolver return a reference to my local copy of a DTD. I write: source = xml.sax.InputSource() source.setPublicId(publicId) source.setSystemId(systemId) source.setCharacterStream(file(path_to_local_copy)) return source This will appear to work. However, the parser will still silently fetch the DTD over the network! I needed to call source.setByteStream()--character streams are silently ignored. I'd never have noticed this if I hadn't used strace on my process and noticed a slew of recvfrom() calls that shouldn't have been there. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Jean-Paul Calderone exar...@divmod.com added the comment: It's indeed possible to provide that as a third-party module; one would have to implement an EntityResolver, and applications would have to use it. If there was a need for such a thing, somebody would have done it years ago. I don't think this is true, for several reasons. First, most people never notice that they are writing or using an application which has this behavior. This is because the behavior is transparent in almost all cases, manifesting only as a slowdown. Often, no one is paying close attention to whether a function takes 0.1s or 0.5s. So code gets written which fetches resources from the network by accident. Similarly, users generally don't have any idea that this kind of defect is possible, or they don't think it's unusual behavior. In general, they're not equipped to understand why this is a bad thing. At best, they may decide a program is slow and be upset, but out of the myriad reasons a program might be slow, they have no particular reason to settle on this one as the real cause. Second, it is *difficult* to implement the non-network behavior. Seriously, seriously difficult. The documentation for these APIs is obscure and incomplete in places. It takes a long time to puzzle out what it means and how to achieve the desired behavior. I wouldn't be surprised if many people simply gave up and either switched to another parser or decided they could live with the slowdown (perhaps not realizing that it could be arbitrarily long and might add a network dependency to a program which doesn't already have one). Third, there are several pitfalls on the way to a correct implementation of the non-network behavior which may lead a developer to decide they have succeeded when they have actually failed. The most obvious is that simply turning off the external-general-entities feature appears to solve the problem but actually changes the parser's behavior so that it will silently drop named character entities. This is quite surprising behavior to anyone who hasn't spent a lot of time with the XML specification. So I think it would be a significant improvement if there were a simple, documented way to switch from network retrieval to local retrieval from a cache. I also think that the current default behavior is wrong. The default should not be to go out to the network, even if there is a well-behaved HTTP caching client involved. So the current behavior should be deprecated. After a sufficient period of time, the local-only behavior should be made the default. I don't see any problem with making it easy to re-enable the old behavior, though. -1 on issuing a warning. I really cannot see much of a problem in this entire issue. XML was designed to be straightforwardly usable over the Internet (XML rec., section 1.1), and this issue is a direct consequence of that design decision. You might just as well warn people against using XML in the first place. Quoting part of the XML design goals isn't a strong argument for the current behavior. Transparently requesting network resources in order to process local data isn't a necessary consequence of the straightforwardly usable over the internet goal. Allowing this behavior to be explicitly enabled, but not enabled by default, easily meets this goal. Straightforwardly supporting a local cache of DTDs is even better, since it improves application performance and removes a large number of of security concerns. With the general disfavor of DTDs (in favor of other validation techniques, such as relax-ng) and the general disfavor of named character entities (basically only XHTML uses them), I find it extremely difficult to justify Python's current default behavior. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Martin v. Löwis mar...@v.loewis.de added the comment: Where in the following sequence am I supposed to receive the document locator? parser = xml.sax.make_parser() parser.setEntityResolver(CachingEntityResolver()) doc = xml.dom.minidom.parse('file.xml', parser) This is DOM parsing, not SAX parsing. The content handler is being created deep inside xml.dom. It does, in fact, store the document locator, but not in any place that I can easily access without breaking several layers of abstraction. So break layers of abstraction, then. Or else, use dom.expatbuilder, and ignore SAX/pulldom for DOM parsing. Or, as a more general question: How can I get a DOM tree that includes external entities? If there's an easy way to do it, the documentation does not make it clear at all. This tracker is really not the place to ask questions; use python-list for that. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Damien Neil ne...@misago.org added the comment: On Feb 3, 2009, at 11:23 AM, Martin v. Löwis wrote: I don't think this is actually the case. Did you try calling getSystemId on the locator? EntityResolver.resolveEntity() is called with the publicId and systemId as arguments. It does not receive a locator. - Damien ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Martin v. Löwis mar...@v.loewis.de added the comment: The EntityResolver's resolveEntity() method is not, however, passed the base path to resolve the relative systemId from. This makes it impossible to properly implement a parser which caches fetched DTDs. I don't think this is actually the case. Did you try calling getSystemId on the locator? ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Damien Neil ne...@misago.org added the comment: On Feb 3, 2009, at 3:12 PM, Martin v. Löwis wrote: This is DOM parsing, not SAX parsing. 1) The title of this ticket begins with xml.sax and xml.dom 2) I am creating a SAX parser and passing it to xml.dom, which uses it. So break layers of abstraction, then. Or else, use dom.expatbuilder, and ignore SAX/pulldom for DOM parsing. Is that really the answer? Read the source code to xml.dom.*, and write hacks based on what I find there? Note also that xml.dom.expatbuilder does not appear to be an external API--there is no mention of it in the documentation for xml.dom.*. This tracker is really not the place to ask questions; use python-list for that. That was a rhetorical question. The answer is, as best I can tell, You can't do that. ___ Python tracker rep...@bugs.python.org http://bugs.python.org/issue2124 ___ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
A.M. Kuchling added the comment: The solution of adding caching, If-Modified-Since, etc. is a good one, but I quail in fear at the prospect of expanding the saxutils resolver into a fully caching HTML agent that uses a cache across processes. We should really be encouraging people to use more capable libraries such as httplib2 (http://code.google.com/p/httplib2/), but this is slightly at war with the batteries-included philosophy. So, I propose we: * add warnings to the urllib, urllib2, saxutil module docs that parsing can retrieve arbitrary resources over the network, and encourage the user to use a smarter library such as httplib2. * update the urllib2 HOWTO to mention this. I'm willing to do the necessary writing. -- assignee: - akuchling priority: urgent - normal __ Tracker [EMAIL PROTECTED] http://bugs.python.org/issue2124 __ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Martin v. Löwis added the comment: I may have lost track somewhere: what does have urllib* to do with this issue? __ Tracker [EMAIL PROTECTED] http://bugs.python.org/issue2124 __ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Virgil Dupras added the comment: -1 on the systematic warnings too, but what I was talking about is a warning that would say The server you are trying to fetch your resource from is refusing the connection. Don't cha think you misbehave? only on 5xx and 4xx responses, not on every remote resource fetching. __ Tracker [EMAIL PROTECTED] http://bugs.python.org/issue2124 __ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
ajaksu added the comment: Martin, I agree that simply not resolving DTDs is an unreasonable request (and said so in the blog post). But IMHO there are lots of possible optimizations, and the most valuable would be those darn easy for newcomers to understand and use. In Python, a winning combo would be an arbitrary (and explicit) FS dtdcache that people could use with simple a drop-in import (from a third-party module?). Perhaps the cache lives in a pickled dictionary with IDs, checksums and DTDs. Could also be a sqlite DB, if updating the dict becomes problematic. In that scenario, AMK could save latter W3C hits with: from xml.sax import make_parser from dtdcache.sax.saxutils import prepare_input_source # - dtdcache parser = make_parser() inp = prepare_input_source('file:file.xhtml', cache=/tmp/xmlcache) It might be interesting to have read-only, force-write and read-write modes. Not sure how to map that on EntityResolver and DTD consumers (I'm no XML user myself). Regarding the std-lib, I believe effective caching hooks for DTDs trump implementing in-memory or sqlite/FS. IMNSHO, correct, accessible support for catalogs shouldn't be the only change, as caching should give better performance on both ends. -- nosy: +ajaksu2 __ Tracker [EMAIL PROTECTED] http://bugs.python.org/issue2124 __ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Virgil Dupras added the comment: The blog page talked about 503 responses. What about issuing a warning on these responses? Maybe it would be enough to make developers aware of the problem? Or what about in-memory caching of the DTDs? Sure, it wouldn't be as good as a catalog or anything, but it might help for the worst cases? -- nosy: +vdupras __ Tracker [EMAIL PROTECTED] http://bugs.python.org/issue2124 __ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Paul Boddie added the comment: (Andrew, thanks for making a bug, and apologies for not reporting this in a timely fashion.) Although an in-memory caching solution might seem to be sufficient, if one considers things like CGI programs, it's clear that such programs aren't going to benefit from such a solution. It would be interesting to know what widely deployed software does use the affected parsers, though. A Google code search might be helpful. I think that the nicest compatible solution would be to have some kind of filesystem cache for the downloaded resources, but I don't recall any standard library caching solution of this nature. Things like being able to write to a known directory, perhaps using the temporary file APIs which should work even as a very unprivileged user, would be useful properties of such a solution. -- nosy: +pboddie __ Tracker [EMAIL PROTECTED] http://bugs.python.org/issue2124 __ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
A.M. Kuchling added the comment: What if we just tried to make the remote accesses apparent to the user, by making a warning.warn() call in the default implementation that was deactivated by a setFeature() call. With a warning, code will continue to run but the user will at least be aware they're hitting a remote resource, and can think about it, even if they decide to suppress the warning. We should also modify the docs to point this out; it's not likely to help very much, but it's still worth doing. __ Tracker [EMAIL PROTECTED] http://bugs.python.org/issue2124 __ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Martin v. Löwis added the comment: -1 on issuing a warning. I really cannot see much of a problem in this entire issue. XML was designed to be straightforwardly usable over the Internet (XML rec., section 1.1), and this issue is a direct consequence of that design decision. You might just as well warn people against using XML in the first place. __ Tracker [EMAIL PROTECTED] http://bugs.python.org/issue2124 __ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
New submission from A.M. Kuchling: The W3C posted an item at http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic describing how their DTDs are being fetched up to 130M times per day. The Python parsers are part of the problem, as noted by Paul Boddie on the python-advocacy list: There are two places which stand out: xml/dom/xmlbuilder.py xml/sax/saxutils.py What gives them away is the way as the cause of the described problem is that they are both fetching things which are given as system identifiers - the things you get in the document type declaration at the top of an XML document which look like a URL. If you then put some trace statements into the code and then try and parse something using, for example, the xml.sax API, it becomes evident that by default the parser attempts to fetch lots of DTD-related resources, not helped by the way that stuff like XHTML is now modular and thus employs lots of separate files in the DTD. This is obvious because you get something like this printed to the terminal: saxutils: opened http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlstyle-1.mod saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-framework-1.mod saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-datatypes-1.mod saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-qname-1.mod saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-events-1.mod saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-attribs-1.mod saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml11-model-1.mod saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-charent-1.mod saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-lat1.ent saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-symbol.ent saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-special.ent saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-text-1.mod saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlstruct-1.mod saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlphras-1.mod saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-blkstruct-1.mod saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-blkphras-1.mod Of course, the best practice with APIs like SAX is that you define your own resolver or handler classes which don't go and fetch DTDs from the W3C all the time, but this isn't the out of the box behaviour. Instead, implementers have chosen the most convenient behaviour which arguably involves the least effort in telling people how to get hold of DTDs so that they may validate their documents, but which isn't necessarily the right thing in terms of network behaviour. Naturally, since defining specific resolvers/handlers involves a lot of boilerplate (and you should try it in Java!) then a lot of developers just incur the penalty of having the default behaviour, instead of considering the finer points of the various W3C specifications (which is never really any fun). Anyway, I posted a comment saying much the same on the blog referenced at the start of this thread, but we should be aware that this is default standard library behaviour, not rogue application developer behaviour. -- components: XML messages: 62430 nosy: akuchling severity: normal status: open title: xml.sax and xml.dom fetch DTDs by default versions: Python 2.6 __ Tracker [EMAIL PROTECTED] http://bugs.python.org/issue2124 __ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Changes by A.M. Kuchling: -- type: - resource usage __ Tracker [EMAIL PROTECTED] http://bugs.python.org/issue2124 __ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
A.M. Kuchling added the comment: Here's a simple test to demonstrate the problem: from xml.sax import make_parser from xml.sax.saxutils import prepare_input_source parser = make_parser() inp = prepare_input_source('file:file.xhtml') parser.parse(inp) file.xhtml contains: ?xml version=1.0 encoding=UTF-8? !DOCTYPE html PUBLIC -//W3C//DTD XHTML 1.1//EN http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd; html xmlns=http://www.w3.org/1999/xhtml; / If you insert a debug print into saxutils.prepare_input_source, in the branch which uses urllib.urlopen(), you get the above list of inputs accessed: the XHTML 1.1 DTD, which is nicely modular and pulls in all those other files. I don't see a good way to fix this without breaking backward compatibility to some degree. The external-general-entities features defaults to 'on', which enables this fetching; we could change the default to 'off', which would save the parsing effort, but would also mean that entities like eacute; weren't defined. If we had catalog support, we could ship the XHTML 1.1 DTDs and any other DTDs of wide usage, but we don't. __ Tracker [EMAIL PROTECTED] http://bugs.python.org/issue2124 __ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Changes by A.M. Kuchling: -- priority: - urgent __ Tracker [EMAIL PROTECTED] http://bugs.python.org/issue2124 __ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com
[issue2124] xml.sax and xml.dom fetch DTDs by default
Martin v. Löwis added the comment: On systems that support catalogs, the parsers should be changed to support public identifiers, using local copies of these DTDs. However, I see really no way how the library could avoid resolving the DTDs altogether. The blog is WRONG in claiming that the system identifier is not a downloadable URL, but a mere identification (the namespace is a mere identification, but our parsers would never try to download anything). The parser needs the DTDs in case external entities occur in the document (of course, download could be delayed until the first reference occurs). -- nosy: +loewis __ Tracker [EMAIL PROTECTED] http://bugs.python.org/issue2124 __ ___ Python-bugs-list mailing list Unsubscribe: http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com