On 26/10/2017 01:27, Jeremy Nicoll - ml gip wrote:
On 2017-10-26 00:51, RS wrote:
The corruption he refers to is a few spurious NUL characters in
<head><metadata>. The subtitles themselves are in <body> and they are
intact.
But you're a human looking at the file. XML files have a tightly defined
syntax (defined by a formal grammar called a DTD). When a program tries
to extract data from an XML file it does so using standard code that knows
what the structure of the file is because it has also read the DTD.
Anyway for a program to be able to parse an XML file the parser reads
the file character by character and at every point it knows (from the
grammar definition) exactly what could come next and can classify it
as required.
By definition an XML file is only an XML file if it entirely matches
the grammar that is defined. As soon as a parser finds a character
that makes no sense, the whole file is classed as corrupt, not an XML
file after all.
Much much more at: https://en.wikipedia.org/wiki/XML
I don't agree with you about the approach to parsing. The key exercise
is to match pairs of tags and to associate what is between the matched
pairs with keywords in the tags, but that is not relevant to this
discussion. The Wikipedia article you refer to says in 3.1
"The code point U+0000 (Null) is the only character that is not
permitted in any XML 1.0 or 1.1 document." so you are right to that extent.
That is not the end of the story. The parser has to decide what to do
when it finds an invalid character. It appears (I am guessing) that
XML::LibXML rejects the entire document even to the extent of rejecting
tag content which does not include any invalid character. It also
appears (and again I am guessing) that XML::Simple takes a different
approach and ignores invalid characters. Whether it ignores invalid
characters anywhere in the document or only if, as is the case here,
they are outside the desired tag pair (<body> ... </body>) I am not able
to say on the evidence I have seen.
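
To make the two guessed behaviours concrete, here is a minimal sketch in
Python rather than Perl. It does not use XML::LibXML or XML::Simple and
says nothing about how either module really works. The sample document,
its tag names and the helper names (strict_parse, lenient_parse,
body_lines) are all invented for illustration. It assumes the file is
in a single-byte-compatible encoding such as UTF-8, so that dropping NUL
bytes does not damage anything else.

```python
import xml.etree.ElementTree as ET

# Hypothetical subtitle file: stray NULs inside <head><metadata>,
# while the <body> content is intact.
DAMAGED = (
    b"<tt><head><metadata>title\x00\x00</metadata></head>"
    b"<body><p>Hello</p><p>World</p></body></tt>"
)


def strict_parse(data):
    """Reject the whole document if any part of it is not well-formed."""
    try:
        return ET.fromstring(data)
    except ET.ParseError:
        return None


def lenient_parse(data):
    """Drop NUL bytes first, then parse what is left (assumes UTF-8-like input)."""
    return ET.fromstring(data.replace(b"\x00", b""))


def body_lines(root):
    """Collect the text of every <p> inside <body>."""
    return [p.text for p in root.find("body").iter("p")]


if __name__ == "__main__":
    print("strict :", strict_parse(DAMAGED))
    print("lenient:", body_lines(lenient_parse(DAMAGED)))
```

The strict path loses the subtitles even though the damage is confined
to <metadata>; the lenient path recovers them by discarding the invalid
characters before the parser ever sees them.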
It is then up to the calling script (get_iplayer.pl) to decide what
action to take in response to the action taken by the parser. It is not
adequate just to allow XML::LibXML to display "parser error" and take no
further action. My knowledge of Perl is not sufficient to understand
how get_iplayer.pl interacts with XML::LibXML.
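
As a sketch only of what "further action" by a calling script might look
like, again in Python and not taken from get_iplayer.pl: the caller
catches the parser's error, reports it, retries once with NUL bytes
removed, and tells the user which of the three outcomes occurred. The
function name load_subtitles and the status strings are assumptions made
for this example.

```python
import sys
import xml.etree.ElementTree as ET


def load_subtitles(data):
    """Return (root, status) where status is 'ok', 'repaired' or 'failed'."""
    try:
        return ET.fromstring(data), "ok"
    except ET.ParseError as err:
        # Report the problem instead of stopping silently at "parser error".
        print(f"warning: subtitle XML rejected: {err}", file=sys.stderr)

    # Retry once with NUL bytes removed (assumes UTF-8-like input).
    cleaned = data.replace(b"\x00", b"")
    if cleaned == data:
        # No NULs were present, so this repair cannot help.
        return None, "failed"
    try:
        return ET.fromstring(cleaned), "repaired"
    except ET.ParseError as err:
        print(f"error: still not well-formed after cleanup: {err}",
              file=sys.stderr)
        return None, "failed"


if __name__ == "__main__":
    root, status = load_subtitles(b"<tt>\x00<body><p>Hi</p></body></tt>")
    print(status)
```

Keeping the status separate from the parsed tree lets the caller choose
between carrying on, warning the user that the file was repaired, or
skipping subtitles for that programme.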
I said that similar errors in subtitles were rare and so not worth
bothering with. That was before I became aware of the v3.02 and v3.03
changes to cease use of XML::Simple and to require version 1.91 of
XML::LibXML. In the past any similar errors will have been masked by
XML::Simple.
Best wishes
Richard