[issue2124] xml.sax and xml.dom fetch DTDs by default

2013-02-27 Thread Raynard Sandwick


Raynard Sandwick added the comment:

I have opened issue #17318 to try to specify the problem better. While I do 
think that catalogs are the correct fix for the validation use case (and thus 
would like to see something more out-of-the-box in that vein), the real trouble 
is that users are often unaware that they're sending requests to DTD URIs, so 
some combination of fixes in default behavior and/or documentation is 
definitely needed.

The external_ges feature does help, in a way, but is poorly communicated to new 
users, and moreover does not respect the difference between external DTD 
subsets and external general entities (there's a reason DOCTYPE isn't spelled 
ENTITY).
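
For reference, a minimal sketch of what toggling that feature does with the stock expat-based SAX parser. The `RecordingResolver` class and `requested_entities` helper are names invented for this illustration; the resolver answers with an empty local stream so nothing is fetched. Recent Python releases (3.7.1 onward, according to the xml.sax documentation) ship with the feature off by default, so it is enabled explicitly here for comparison.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource

DOC = (
    b'<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    b'"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
    b'<html xmlns="http://www.w3.org/1999/xhtml"/>'
)


class RecordingResolver(EntityResolver):
    """Hypothetical helper: record each lookup, answer with an empty stream."""

    def __init__(self):
        self.requests = []

    def resolveEntity(self, publicId, systemId):
        self.requests.append((publicId, systemId))
        source = InputSource(systemId)
        source.setByteStream(io.BytesIO(b""))  # local stand-in, no network
        return source


def requested_entities(external_ges):
    """Parse DOC and return the (publicId, systemId) pairs the parser asked for."""
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, external_ges)
    resolver = RecordingResolver()
    parser.setEntityResolver(resolver)
    parser.setContentHandler(ContentHandler())
    parser.parse(io.BytesIO(DOC))
    return resolver.requests
```

Note that the one switch governs the external DTD subset as well as external general entities, which is the conflation complained about above.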

The default behavior is not well documented, and the constraining behavior of 
DTDs is frequently unnecessary. Either a user should have to explicitly enable 
validation, or it should be irrevocably obvious to a user that validation is 
the default behavior, and in both cases it should be blatantly documented that 
validation may cause network side effects. I think the input has been 
reasonable all around, and yet I find it rather insane that this issue didn't 
eventually at least result in a documentation fix, thanks to what looks like 
push-back for push-back's sake, though I will gladly admit the conclusion that 
it was underspecified is entirely valid.

Anyway, further info in the new issue...

--
nosy: +rsandwick3

___
Python tracker rep...@bugs.python.org
http://bugs.python.org/issue2124
___
___
Python-bugs-list mailing list
Unsubscribe: 
http://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

2012-01-13 Thread Brian Visel


Brian Visel aeon.descrip...@gmail.com added the comment:

..still an issue.

--
nosy: +Brian.Visel


2012-01-13 Thread Martin v . Löwis


Martin v. Löwis mar...@v.loewis.de added the comment:

And my position still remains the same: this is not a bug. Applications 
affected by this need to use the APIs that are in place precisely to deal with 
this issue.

So I propose to close this report as invalid.


2012-01-13 Thread Brian Visel


Brian Visel aeon.descrip...@gmail.com added the comment:

Of course, you can do as you like.

http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic/


2012-01-13 Thread Martin v . Löwis


Martin v. Löwis mar...@v.loewis.de added the comment:

Well, the issue is clearly underspecified, and different people read different 
things into it. I take your citation of the W3C blog entry to mean that you are 
asking that caching should be employed. I read the issue entirely differently, 
namely that by default no attempt to download the DTD should be made, or that 
the DOM loaders should provide better customization in that matter, or that 
catalogs shall be used.

Given that the issue was underspecified to begin with, I'm now closing it. 
Anybody who still has an issue here, please open a new issue and report your 
specific problem, preferably also proposing a solution.

If you need to follow up to this message, please do so in private email 
(mar...@v.loewis.de).

--
resolution:  -> rejected
status: open -> closed


2012-01-13 Thread Paul Boddie


Paul Boddie p...@boddie.org.uk added the comment:

Note that Python 3 provided a good opportunity for doing the minimal amount of 
work here - just stop things from accessing remote DTDs - but I imagine that 
even elementary standard library improvements of this kind weren't made (let 
alone the more extensive standard library changes I advocated), so there's 
going to be a backwards compatibility situation regardless of which Python 
series is involved now, unfortunately.


2010-11-12 Thread A.M. Kuchling


Changes by A.M. Kuchling li...@amk.ca:


--
assignee: akuchling -> 


2010-07-20 Thread Mark Lawrence


Mark Lawrence breamore...@yahoo.co.uk added the comment:

Does anybody know if users are still experiencing problems with this issue?

--
nosy: +BreamoreBoy
versions: +Python 2.7, Python 3.1, Python 3.2 -Python 2.6


2010-07-20 Thread Jean-Paul Calderone


Jean-Paul Calderone exar...@twistedmatrix.com added the comment:

Yes.


2009-02-03 Thread Damien Neil


Damien Neil ne...@misago.org added the comment:

I just ran into this problem.  I was very surprised to realize that
every time the code I was working on parsed a docbook file, it generated
several HTTP requests to oasis-open.org to fetch the docbook DTDs.

I attempted to fix the issue by adding an EntityResolver that would
cache fetched DTDs.  (The documentation on how to do this is not, by the
way, very clear.)

Unfortunately, this proves to not be possible.  The main docbook DTD
includes subsidiary DTDs using relative system identifiers.  For
example, the main DTD at:

publicId: -//OASIS//DTD DocBook V4.1//EN
systemId: http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd

...includes this second DTD:

publicId: -//OASIS//ENTITIES DocBook Notations V4.4//EN
systemId: dbnotnx.mod

The EntityResolver's resolveEntity() method is not, however, passed the
base path to resolve the relative systemId from.

This makes it impossible to properly implement a parser which caches
fetched DTDs.

--
nosy: +damien


2009-02-03 Thread Jean-Paul Calderone


Changes by Jean-Paul Calderone exar...@divmod.com:


--
nosy: +exarkun


2009-02-03 Thread Jean-Paul Calderone


Jean-Paul Calderone exar...@divmod.com added the comment:

Though it's inconvenient to do so, you can arrange to have the locator
available from the entity resolver.  The content handler's
setDocumentLocator method will be called early on with the locator
object.  So you can give your entity resolver a reference to your
content handler and save a reference to the document locator in the
content handler.  Then in the entity resolver's resolveEntity method you
can reach over into the content handler and grab the document locator to
call its getSystemId method.

Note that you have to be careful with the InputSources you return from
resolveEntity.  I wasn't aware of this before (and perhaps I've
misinterpreted some observation), but I just noticed that if you return an
InputSource based on a file object, the file object's name will be used
as the document id!  This is not at all what you want.  InputSource has a
setSystemId method, but even if you call it before you call
setByteStream, the system id will be the name of the file object passed
to setByteStream.  Perhaps calling these two methods in the opposite
order will fix this; I'm not sure, as I haven't tried.
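
The locator arrangement described above can be sketched roughly as follows. The class names, the `resolve_relative_dtd` helper and the example.invalid URLs are made up for this illustration, and the empty byte stream stands in for whatever cached copy a real resolver would return.

```python
import io
import xml.sax
from urllib.parse import urljoin
from xml.sax.handler import ContentHandler, EntityResolver, feature_external_ges
from xml.sax.xmlreader import InputSource


class LocatorKeepingHandler(ContentHandler):
    """Content handler that keeps the locator handed over by the parser."""

    def __init__(self):
        super().__init__()
        self.locator = None

    def setDocumentLocator(self, locator):
        self.locator = locator


class LocatorAwareResolver(EntityResolver):
    """Resolver that reaches into the content handler for the current base."""

    def __init__(self, handler):
        self.handler = handler
        self.resolved = []

    def resolveEntity(self, publicId, systemId):
        # The locator reports the system id of the entity being parsed,
        # which serves as the base for a relative system identifier.
        base = self.handler.locator.getSystemId() or ""
        absolute = urljoin(base, systemId)
        self.resolved.append(absolute)
        source = InputSource(absolute)
        source.setByteStream(io.BytesIO(b""))  # stand-in for a cached copy
        return source


def resolve_relative_dtd():
    handler = LocatorKeepingHandler()
    resolver = LocatorAwareResolver(handler)
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, True)
    parser.setContentHandler(handler)
    parser.setEntityResolver(resolver)

    doc = InputSource("http://example.invalid/docs/file.xml")
    doc.setByteStream(io.BytesIO(b'<!DOCTYPE book SYSTEM "dtd/main.dtd"><book/>'))
    parser.parse(doc)
    return resolver.resolved
```

This only works when the application controls the content handler, which is the limitation raised elsewhere in this thread for xml.dom.minidom.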


2009-02-03 Thread Martin v. Löwis


Martin v. Löwis mar...@v.loewis.de added the comment:

> EntityResolver.resolveEntity() is called with the publicId and systemId as 
> arguments. It does not receive a locator.

Sure. But ContentHandler.setDocumentLocator receives it, and you are
supposed to store it for the entire parse, to always know what entity
is being processed if you want to.


2009-02-03 Thread Damien Neil


Damien Neil ne...@misago.org added the comment:

On Feb 3, 2009, at 1:42 PM, Martin v. Löwis wrote:
> Sure. But ContentHandler.setDocumentLocator receives it, and you are
> supposed to store it for the entire parse, to always know what entity
> is being processed if you want to.

Where in the following sequence am I supposed to receive the document 
locator?

parser = xml.sax.make_parser()
parser.setEntityResolver(CachingEntityResolver())
doc = xml.dom.minidom.parse('file.xml', parser)

The content handler is being created deep inside xml.dom.  It does, in 
fact, store the document locator, but not in any place that I can easily 
access without breaking several layers of abstraction.

Or, as a more general question: How can I get a DOM tree that includes 
external entities?  If there's an easy way to do it, the documentation 
does not make it clear at all.


2009-02-03 Thread Damien Neil


Damien Neil ne...@misago.org added the comment:

I just discovered another really fun wrinkle in this.

Let's say I want to have my entity resolver return a reference to my 
local copy of a DTD.  I write:

source = xml.sax.InputSource()
source.setPublicId(publicId)
source.setSystemId(systemId)
source.setCharacterStream(file(path_to_local_copy))
return source

This will appear to work.

However, the parser will still silently fetch the DTD over the network!  
I needed to call source.setByteStream()--character streams are silently 
ignored.

I'd never have noticed this if I hadn't used strace on my process and 
noticed a slew of recvfrom() calls that shouldn't have been there.
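
A minimal sketch of the byte-stream variant that the comment above found to be honoured. The `local_input_source` helper is a hypothetical name; it reads the local copy into memory so that no open file object (and no file name) is handed to the parser.

```python
import io
from pathlib import Path
from xml.sax.xmlreader import InputSource


def local_input_source(publicId, systemId, path_to_local_copy):
    """Build an InputSource backed by the bytes of a local DTD copy."""
    source = InputSource()
    source.setPublicId(publicId)
    source.setSystemId(systemId)
    # Byte stream rather than character stream, per the observation above.
    data = Path(path_to_local_copy).read_bytes()
    source.setByteStream(io.BytesIO(data))
    return source
```

An entity resolver's resolveEntity() method could return the result of this helper directly.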


2009-02-03 Thread Jean-Paul Calderone


Jean-Paul Calderone exar...@divmod.com added the comment:

> It's indeed possible to provide that as a third-party module; one
> would have to implement an EntityResolver, and applications would
> have to use it. If there was a need for such a thing, somebody would
> have done it years ago.

I don't think this is true, for several reasons.

First, most people never notice that they are writing or using an
application which has this behavior.  This is because the behavior is
transparent in almost all cases, manifesting only as a slowdown.  Often,
no one is paying close attention to whether a function takes 0.1s or
0.5s.  So code gets written which fetches resources from the network by
accident.  Similarly, users generally don't have any idea that this kind
of defect is possible, or they don't think it's unusual behavior.  In
general, they're not equipped to understand why this is a bad thing.  At
best, they may decide a program is slow and be upset, but out of the
myriad reasons a program might be slow, they have no particular reason
to settle on this one as the real cause.

Second, it is *difficult* to implement the non-network behavior. 
Seriously, seriously difficult.  The documentation for these APIs is
obscure and incomplete in places.  It takes a long time to puzzle out
what it means and how to achieve the desired behavior.  I wouldn't be
surprised if many people simply gave up and either switched to another
parser or decided they could live with the slowdown (perhaps not
realizing that it could be arbitrarily long and might add a network
dependency to a program which doesn't already have one).

Third, there are several pitfalls on the way to a correct implementation
of the non-network behavior which may lead a developer to decide they
have succeeded when they have actually failed.  The most obvious is that
simply turning off the external-general-entities feature appears to
solve the problem but actually changes the parser's behavior so that it
will silently drop named character entities.  This is quite surprising
behavior to anyone who hasn't spent a lot of time with the XML
specification.
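
To make that pitfall concrete, here is a small sketch, assuming the stock expat-based parser. The `SkipTracker` handler and the helper function are names invented for this illustration. With the feature off, the reference to the DTD-defined entity is reported through skippedEntity() rather than expanded, so a handler that ignores that callback loses the character without any error.

```python
import io
import xml.sax
from xml.sax.handler import ContentHandler, feature_external_ges

DOC = (
    b'<!DOCTYPE p PUBLIC "-//W3C//DTD XHTML 1.1//EN" '
    b'"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">'
    b"<p>caf&eacute;</p>"
)


class SkipTracker(ContentHandler):
    """Collect character data and the names of entities the parser skipped."""

    def __init__(self):
        super().__init__()
        self.text = []
        self.skipped = []

    def characters(self, content):
        self.text.append(content)

    def skippedEntity(self, name):
        self.skipped.append(name)


def parse_without_external_entities():
    handler = SkipTracker()
    parser = xml.sax.make_parser()
    parser.setFeature(feature_external_ges, False)  # DTD is never read
    parser.setContentHandler(handler)
    parser.parse(io.BytesIO(DOC))
    return "".join(handler.text), handler.skipped
```

Calling parse_without_external_entities() returns the collected text and the list of skipped entity names, which lets an application at least detect the loss.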

So I think it would be a significant improvement if there were a simple,
documented way to switch from network retrieval to local retrieval from
a cache.  I also think that the current default behavior is wrong.  The
default should not be to go out to the network, even if there is a
well-behaved HTTP caching client involved.  So the current behavior
should be deprecated.  After a sufficient period of time, the local-only
behavior should be made the default.  I don't see any problem with
making it easy to re-enable the old behavior, though.

> -1 on issuing a warning. I really cannot see much of a problem in
> this entire issue. XML was designed to be straightforwardly usable
> over the Internet (XML rec., section 1.1), and this issue is a
> direct consequence of that design decision. You might just as well
> warn people against using XML in the first place.

Quoting part of the XML design goals isn't a strong argument for the
current behavior.  Transparently requesting network resources in order
to process local data isn't a necessary consequence of the
"straightforwardly usable over the internet" goal.  Allowing this
behavior to be explicitly enabled, but not enabled by default, easily
meets this goal.  Straightforwardly supporting a local cache of DTDs is
even better, since it improves application performance and removes a
large number of security concerns.  With the general disfavor of DTDs
(in favor of other validation techniques, such as relax-ng) and the
general disfavor of named character entities (basically only XHTML uses
them), I find it extremely difficult to justify Python's current default
behavior.


2009-02-03 Thread Martin v. Löwis


Martin v. Löwis mar...@v.loewis.de added the comment:

> Where in the following sequence am I supposed to receive the document 
> locator?
>
> parser = xml.sax.make_parser()
> parser.setEntityResolver(CachingEntityResolver())
> doc = xml.dom.minidom.parse('file.xml', parser)

This is DOM parsing, not SAX parsing.

> The content handler is being created deep inside xml.dom.  It does, in 
> fact, store the document locator, but not in any place that I can easily 
> access without breaking several layers of abstraction.

So break layers of abstraction, then. Or else, use dom.expatbuilder,
and ignore SAX/pulldom for DOM parsing.

> Or, as a more general question: How can I get a DOM tree that includes 
> external entities?  If there's an easy way to do it, the documentation 
> does not make it clear at all.

This tracker is really not the place to ask questions; use python-list
for that.


2009-02-03 Thread Damien Neil


Damien Neil ne...@misago.org added the comment:

On Feb 3, 2009, at 11:23 AM, Martin v. Löwis wrote:
> I don't think this is actually the case. Did you try calling getSystemId
> on the locator?

EntityResolver.resolveEntity() is called with the publicId and systemId as 
arguments. It does not receive a locator.

- Damien


2009-02-03 Thread Martin v. Löwis


Martin v. Löwis mar...@v.loewis.de added the comment:

> The EntityResolver's resolveEntity() method is not, however, passed the
> base path to resolve the relative systemId from.
>
> This makes it impossible to properly implement a parser which caches
> fetched DTDs.

I don't think this is actually the case. Did you try calling getSystemId
on the locator?


2009-02-03 Thread Damien Neil


Damien Neil ne...@misago.org added the comment:

On Feb 3, 2009, at 3:12 PM, Martin v. Löwis wrote:
> This is DOM parsing, not SAX parsing.

1) The title of this ticket begins with "xml.sax and xml.dom"
2) I am creating a SAX parser and passing it to xml.dom, which uses it.

> So break layers of abstraction, then. Or else, use dom.expatbuilder,
> and ignore SAX/pulldom for DOM parsing.

Is that really the answer?

Read the source code to xml.dom.*, and write hacks based on what I find 
there?  Note also that xml.dom.expatbuilder does not appear to be an 
external API--there is no mention of it in the documentation for 
xml.dom.*.

> This tracker is really not the place to ask questions; use python-list
> for that.

That was a rhetorical question.

The answer is, as best I can tell, "You can't do that."


2008-02-23 Thread A.M. Kuchling


A.M. Kuchling added the comment:

The solution of adding caching, If-Modified-Since, etc. is a good one,
but I quail in fear at the prospect of expanding the saxutils resolver
into a fully caching HTTP agent that uses a cache across processes.  We
should really be encouraging people to use more capable libraries such
as httplib2 (http://code.google.com/p/httplib2/), but this is slightly
at war with the batteries-included philosophy.

So, I propose we:

* add warnings to the urllib, urllib2, saxutils module docs that parsing
can retrieve arbitrary resources over the network, and encourage the
user to use a smarter library such as httplib2.
* update the urllib2 HOWTO to mention this.

I'm willing to do the necessary writing.

--
assignee:  -> akuchling
priority: urgent -> normal


2008-02-23 Thread Martin v. Löwis


Martin v. Löwis added the comment:

I may have lost track somewhere: what does urllib* have to do with this
issue?


2008-02-17 Thread Virgil Dupras


Virgil Dupras added the comment:

-1 on the systematic warnings too, but what I was talking about is a 
warning that would say "The server you are trying to fetch your resource 
from is refusing the connection. Don't cha think you misbehave?" only on 
5xx and 4xx responses, not on every remote resource fetching.


2008-02-17 Thread ajaksu


ajaksu added the comment:

Martin, I agree that simply not resolving DTDs is an unreasonable
request (and said so in the blog post). But IMHO there are lots of
possible optimizations, and the most valuable would be those darn easy
for newcomers to understand and use.

In Python, a winning combo would be an arbitrary (and explicit) FS
dtdcache that people could use with a simple drop-in import (from a
third-party module?). Perhaps the cache lives in a pickled dictionary
with IDs, checksums and DTDs. Could also be a sqlite DB, if updating the
dict becomes problematic.

In that scenario, AMK could save later W3C hits with:

from xml.sax import make_parser
from dtdcache.sax.saxutils import prepare_input_source # <- dtdcache
parser = make_parser()
inp = prepare_input_source('file:file.xhtml', cache="/tmp/xmlcache")

It might be interesting to have read-only, force-write and read-write
modes. Not sure how to map that on EntityResolver and DTD consumers (I'm
no XML user myself).

Regarding the std-lib, I believe effective caching hooks for DTDs trump
implementing in-memory or sqlite/FS. IMNSHO, correct, accessible support
for catalogs shouldn't be the only change, as caching should give better
performance on both ends.

--
nosy: +ajaksu2


2008-02-16 Thread Virgil Dupras


Virgil Dupras added the comment:

The blog page talked about 503 responses. What about issuing a warning 
on these responses? Maybe it would be enough to make developers aware of 
the problem?

Or what about in-memory caching of the DTDs? Sure, it wouldn't be as 
good as a catalog or anything, but it might help for the worst cases?

--
nosy: +vdupras


2008-02-16 Thread Paul Boddie


Paul Boddie added the comment:

(Andrew, thanks for making a bug, and apologies for not reporting this
in a timely fashion.)

Although an in-memory caching solution might seem to be sufficient, if
one considers things like CGI programs, it's clear that such programs
aren't going to benefit from such a solution. It would be interesting to
know what widely deployed software does use the affected parsers,
though. A Google code search might be helpful.

I think that the nicest compatible solution would be to have some kind
of filesystem cache for the downloaded resources, but I don't recall any
standard library caching solution of this nature. Things like being able
to write to a known directory, perhaps using the temporary file APIs
which should work even as a very unprivileged user, would be useful
properties of such a solution.
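
A rough sketch of such a filesystem cache, offered as one possible shape rather than an existing standard library facility. `FileCachingResolver`, `fetch_over_network` and the hashed file naming are all assumptions made for this illustration, and system identifiers are assumed to be absolute URLs. The fetch function is a parameter so that the download step can be swapped out.

```python
import hashlib
import io
import tempfile
import urllib.request
from pathlib import Path
from xml.sax.handler import EntityResolver
from xml.sax.xmlreader import InputSource


def fetch_over_network(url):
    """Default fetcher: download the resource once."""
    with urllib.request.urlopen(url) as response:
        return response.read()


class FileCachingResolver(EntityResolver):
    """Hypothetical resolver keeping downloaded DTDs in a directory."""

    def __init__(self, cache_dir=None, fetch=fetch_over_network):
        # A temporary directory works even for a very unprivileged user.
        self.cache_dir = Path(cache_dir or tempfile.mkdtemp(prefix="dtdcache-"))
        self.fetch = fetch

    def _cache_path(self, systemId):
        digest = hashlib.sha256(systemId.encode("utf-8")).hexdigest()
        return self.cache_dir / digest

    def resolveEntity(self, publicId, systemId):
        path = self._cache_path(systemId)
        if not path.exists():
            # Cache miss: fetch once and keep the bytes on disk.
            path.write_bytes(self.fetch(systemId))
        source = InputSource(systemId)
        source.setPublicId(publicId)
        source.setByteStream(io.BytesIO(path.read_bytes()))
        return source
```

Because the cache lives on disk, short-lived processes such as CGI programs would share it when pointed at the same directory. Expiry and revalidation are left out of this sketch.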

--
nosy: +pboddie


2008-02-16 Thread A.M. Kuchling


A.M. Kuchling added the comment:

What if we just tried to make the remote accesses apparent to the user,
by making a warning.warn() call in the default implementation that was
deactivated by a setFeature() call.  With a warning, code will continue
to run but the user will at least be aware they're hitting a remote
resource, and can think about it, even if they decide to suppress the
warning.

We should also modify the docs to point this out; it's not likely to
help very much, but it's still worth doing.
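
As an illustration of the idea only, the warning could be prototyped in application code with a resolver like the one below. `WarningResolver` and `REMOTE_SCHEMES` are hypothetical names, and suppression here relies on the ordinary warnings filters rather than on a setFeature() call as proposed above.

```python
import warnings
from urllib.parse import urlsplit
from xml.sax.handler import EntityResolver

# Schemes treated as remote for the purpose of this sketch.
REMOTE_SCHEMES = {"http", "https", "ftp"}


class WarningResolver(EntityResolver):
    """Warn before the parser goes off to fetch a remote entity."""

    def resolveEntity(self, publicId, systemId):
        if urlsplit(systemId).scheme in REMOTE_SCHEMES:
            warnings.warn(
                "XML parser is about to fetch %r over the network" % systemId,
                UserWarning,
                stacklevel=2,
            )
        # Same result as the default resolver: the parser opens the system id.
        return systemId
```

Installing it with parser.setEntityResolver(WarningResolver()) leaves parsing behaviour unchanged and only makes the remote access visible.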


2008-02-16 Thread Martin v. Löwis


Martin v. Löwis added the comment:

-1 on issuing a warning. I really cannot see much of a problem in this
entire issue. XML was designed to be straightforwardly usable over the
Internet (XML rec., section 1.1), and this issue is a direct
consequence of that design decision. You might just as well warn people
against using XML in the first place.


2008-02-15 Thread A.M. Kuchling


New submission from A.M. Kuchling:

The W3C posted an item at
http://www.w3.org/blog/systeam/2008/02/08/w3c_s_excessive_dtd_traffic
describing how their DTDs are being fetched up to 130M times per day.  

The Python parsers are part of the problem, as 
noted by Paul Boddie on the python-advocacy list:

There are two places which stand out:

xml/dom/xmlbuilder.py
xml/sax/saxutils.py

What gives them away as the cause of the described problem is that they are
both fetching things which are given as system identifiers - the things you
get in the document type declaration at the top of an XML document which
look like a URL.

If you then put some trace statements into the code and then try and parse 
something using, for example, the xml.sax API, it becomes evident that by 
default the parser attempts to fetch lots of DTD-related resources, not 
helped by the way that stuff like XHTML is now modular and thus employs 
lots of separate files in the DTD. This is obvious because you get something 
like this printed to the terminal:

saxutils: opened http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlstyle-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-framework-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-datatypes-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-qname-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-events-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-attribs-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml11-model-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-charent-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-lat1.ent
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-symbol.ent
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-special.ent
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-text-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlstruct-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-inlphras-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-blkstruct-1.mod
saxutils: opened http://www.w3.org/MarkUp/DTD/xhtml-blkphras-1.mod

Of course, the best practice with APIs like SAX is that you define your own 
resolver or handler classes which don't go and fetch DTDs from the W3C all 
the time, but this isn't the out of the box behaviour. Instead, 
implementers have chosen the most convenient behaviour which arguably 
involves the least effort in telling people how to get hold of DTDs so that 
they may validate their documents, but which isn't necessarily the right 
thing in terms of network behaviour. Naturally, since defining specific 
resolvers/handlers involves a lot of boilerplate (and you should try it in 
Java!) then a lot of developers just incur the penalty of having the default 
behaviour, instead of considering the finer points of the various W3C 
specifications (which is never really any fun).

Anyway, I posted a comment saying much the same on the blog referenced at the 
start of this thread, but we should be aware that this is default standard 
library behaviour, not rogue application developer behaviour.

--
components: XML
messages: 62430
nosy: akuchling
severity: normal
status: open
title: xml.sax and xml.dom fetch DTDs by default
versions: Python 2.6


2008-02-15 Thread A.M. Kuchling


Changes by A.M. Kuchling:


--
type:  -> resource usage


2008-02-15 Thread A.M. Kuchling


A.M. Kuchling added the comment:

Here's a simple test to demonstrate the problem:

from xml.sax import make_parser
from xml.sax.saxutils import prepare_input_source
parser = make_parser()
inp = prepare_input_source('file:file.xhtml')
parser.parse(inp)

file.xhtml contains:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.1//EN"
"http://www.w3.org/TR/xhtml11/DTD/xhtml11.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" />

If you insert a debug print into saxutils.prepare_input_source, 
in the branch which uses urllib.urlopen(), you get the above list of
inputs accessed: the XHTML 1.1 DTD, which is nicely modular and pulls in
all those other files.

I don't see a good way to fix this without breaking backward
compatibility to some degree.  The 
external-general-entities feature defaults to 'on', which enables this
fetching; we could change the default to 'off', which would save the
parsing effort, but would also mean that entities like &eacute; weren't
defined.

If we had catalog support, we could ship the XHTML 1.1 DTDs and any
other DTDs of wide usage, but we don't.


2008-02-15 Thread A.M. Kuchling


Changes by A.M. Kuchling:


--
priority:  -> urgent


2008-02-15 Thread Martin v. Löwis


Martin v. Löwis added the comment:

On systems that support catalogs, the parsers should be changed to
support public identifiers, using local copies of these DTDs. 

However, I see really no way how the library could avoid resolving the
DTDs altogether. The blog is WRONG in claiming that the system
identifier is not a downloadable URL, but a mere identification (the
namespace is a mere identification, but our parsers would never try to
download anything). The parser needs the DTDs in case external entities
occur in the document (of course, download could be delayed until the
first reference occurs).
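
Lacking real catalog support, the public-identifier lookup can be approximated in application code. The sketch below is an assumption-laden stand-in, not an implementation of the OASIS catalog format: `CatalogResolver` is a hypothetical name, the catalog is a plain dict of public identifier to local path, and unknown identifiers are answered with an empty stream so that nothing is downloaded.

```python
import io
from pathlib import Path
from xml.sax.handler import EntityResolver
from xml.sax.xmlreader import InputSource


class CatalogResolver(EntityResolver):
    """Map public identifiers to local DTD copies; never touch the network."""

    def __init__(self, catalog):
        # catalog: {public identifier: path of the local copy}
        self.catalog = {pub: Path(path) for pub, path in catalog.items()}

    def resolveEntity(self, publicId, systemId):
        source = InputSource(systemId)
        source.setPublicId(publicId)
        local = self.catalog.get(publicId)
        # Design choice for this sketch: an unknown DTD becomes empty
        # content instead of a download.
        data = local.read_bytes() if local is not None else b""
        source.setByteStream(io.BytesIO(data))
        return source
```

Whether an unknown identifier should fall back to fetching, fail loudly, or resolve to nothing is exactly the policy question this thread argues about; the empty-stream fallback is only one option.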

--
nosy: +loewis
