New submission from Larry Trammell <[email protected]>:
Issue 43483 was posted as a "bug" but retracted. Though the problem is real,
it is tricky to declare an UNSPECIFIED behavior to be a bug. See that issue
page for more discussion and a test case. A brief overview is repeated here.
SCENARIO - XML PARSING LOSES DATA (or not)
The parsing attempts to capture text consisting of very tiny quoted strings. A
typical content line reads something like this:
<p>Colchuck</p>
The parser implements a scheme presented at various tutorial Web sites, using
two member functions.
# Note the name attribute of the current tag group
def element_handler(self, tagname, attrs) :
self.CurrentTag = tagname
# Record the content from each "p" tag when encountered
def characters(self, content):
if self.CurrentTag == "p":
self.name = content
...
> print(parser.name)
"Colchuck"
But then, after successfully extracting content from perhaps hundreds of
thousands of XML tag sets in this way, the parsing suddenly "drops" a few
characters of content.
> print(parser.name)
"lchuck"
While this problem was observed with a SAX parser, it can affect expat parsers
as well. It affects 32-bit and 64-bit implementations the same, over several
major releases of the Python 3 system.
SPECIFIED BEHAVIOR (or not)
The "xml.sax.handler" page in the Python 3.9.2 Documentation for the Python
Standard Library (and many prior versions) states:
-----------
ContentHandler.characters(content) -- The Parser will call this method to
report each chunk of character data. SAX parsers may return all contiguous
character data in a single chunk, or they may split it into several chunks...
-----------
If it happens that the content is delivered in two chunks instead of one, the
characters() method shown above overwrites the first part of the text with the
second part, and some content seems lost. This completely explains the
observed behavior.
EXPECTED BEHAVIOR (or not)
Even though the behavior is unspecified, users can have certain expectations
about what a reasonable parser should do. Among these:
-- EFFICIENCY: the parser should do simple things simply, and complicated
things as simply as possible
-- CONSISTENCY: the parser behavior should be repeatable and dependable
The design can be considered "poor" if thorough testing cannot identify what
the actual behaviors are going to be, because those behaviors are rare and
unpredictable.
The obvious "simple thing," from the user perspective, is that the parser
should return each tiny text string as one tiny text chunk. In fact, this is
precisely what it does... 99.999% of the time. But then, suddenly, it doesn't.
One hypothesis is that when the parsing scan of raw input text reaches the end
of a large internal text buffer, it is easier from the implementer's
perspective to flush any text remaining in the old buffer prior to fetching a
new one, even if that produces a fragmented chunk with only a couple of
characters.
IMPROVEMENTS REQUIRED
Review the code to determine whether the text buffer scenario is in fact the
primary cause of inconsistent behavior. Modify the data handling to defer
delivery of content fragments that are small, carrying over a small amount of
previously scanned text so that small contiguous text chunks are recombined
rather than reported as multiple fragments. If the length of the content text
to carry over is greater than some configurable
xml.sax.handler.ContiguousChunkLength, the parser can go ahead and deliver it
as a fragment.
DOCUMENTING THE IMPROVEMENTS
Strictly speaking: none required. Undefined behaviors are undefined, whether
consistent or otherwise. But after the improvements are implemented, it would
be helpful to modify documentation to expose the new performance guarantees,
making users more aware of the possible hazards. For example, a new
description in the "xml.sax.handler" page might read as follows:
-----------
ContentHandler.characters(content) -- The Parser will call this method to
report chunks of character data. In general, character data may be reported as
a single chunk or as sequence of chunks; but character data sequences with
fewer than xml.sax.handler.ContiguousChunkLength characters, when
uninterrupted any other xml.sax.handler.ContentHandler event, are guaranteed to
be delivered as a single chunk...
-----------
----------
components: XML
messages: 389108
nosy: ridgerat1611
priority: normal
severity: normal
status: open
title: Modify SAX/expat parsing to avoid fragmentation of already-tiny content
chunks
type: enhancement
versions: Python 3.7, Python 3.8, Python 3.9
_______________________________________
Python tracker <[email protected]>
<https://bugs.python.org/issue43560>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com