On 2017-10-27 11:33, RS wrote:
On 26/10/2017 01:27, Jeremy Nicoll - ml gip wrote:
On 2017-10-26 00:51, RS wrote:
The corruption he refers to is a few spurious NUL characters in
<head><metadata>. The subtitles themselves are in <body> and they are
intact.
But you're a human looking at the file. XML files have a tightly defined
syntax (defined by a formal grammar called a DTD). When a program tries
to extract data from an XML file it does so using standard code that knows
what the structure of the file is because it has also read the DTD.
Anyway, for a program to be able to parse an XML file, the parser reads
the file character by character and at every point it knows (from the
grammar definition) exactly what could come next and can classify it
as required.
By definition an XML file is only an XML file if it entirely matches
the grammar that is defined. As soon as a parser finds a character
that makes no sense, the whole file is classed as corrupt, not an XML
file after all.
Much much more at: https://en.wikipedia.org/wiki/XML
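
To make the strictness concrete, here is a minimal sketch in Python (not Perl, and not get_iplayer code). It assumes only the standard library's xml.etree.ElementTree; the sample document and the helper name is_well_formed are made up for illustration. A single NUL inside <metadata> is enough for the parser to reject the whole document, even though <body> is untouched.

```python
import xml.etree.ElementTree as ET

# Hypothetical subtitle-like document; the element names are illustrative only.
GOOD = "<tt><head><metadata>title</metadata></head><body><p>Hello</p></body></tt>"

# The same document with one spurious NUL character inside <metadata>.
BAD = GOOD.replace("title", "ti\x00tle")


def is_well_formed(text):
    """Return True if the parser accepts the whole document, False otherwise."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser gives up at the first character that breaks the grammar.
        return False
```

Calling is_well_formed(GOOD) and is_well_formed(BAD) shows the difference: the parser never reaches <body> in the second case.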
I don't agree with you about the approach to parsing. The key
exercise is to match pairs of tags and to associate what is between
the matched pairs with keywords in the tags, but that is not relevant
to this discussion. The Wikipedia article you refer to says in 3.1
"The code point U+0000 (Null) is the only character that is not
permitted in any XML 1.0 or 1.1 document." so you are right to that
extent.
That is not the end of the story. The parser has to decide what to do
when it finds an invalid character.
The point you seem to be missing is that for XML parsing, the parser
does not have to decide. The XML /standard/ is (however inconvenient
it is) that any error means the parse stops.
Read the Wikipedia page's section on "Well-formedness and error-handling".
What you're really arguing for is for g_ip's author NOT to use an XML
parser to parse possibly badly-formed XML pages.
Maybe some sort of regex-based text extraction could in this specific
case find the text fields in a well-formed or maybe only a little
badly-formed XML document.
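
A rough sketch of what such regex-based extraction might look like, again in Python for illustration. The sample input, the <p begin=... end=...> layout and the name extract_cues are assumptions; real subtitle files may be laid out differently. It ignores everything outside the <p> elements, so NULs in <head> do no harm, but it is not an XML parser: it does not decode entities, handle CDATA, or cope with a ">" inside an attribute value.

```python
import re

# Hypothetical input: NULs in <head>, intact subtitles in <body>.
DOC = (
    "<tt><head><metadata>ti\x00\x00tle</metadata></head>"
    '<body><div><p begin="00:00:01" end="00:00:03">Hello there</p>'
    '<p begin="00:00:04" end="00:00:06">Second <span>line</span></p>'
    "</div></body></tt>"
)

P_RE = re.compile(r"<p\b([^>]*)>(.*?)</p>", re.DOTALL)  # one <p>...</p> element
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')                # name="value" pairs
TAG_RE = re.compile(r"<[^>]+>")                         # any nested tag


def extract_cues(text):
    """Return a list of (begin, end, text) tuples, one per <p> element."""
    cues = []
    for attrs, inner in P_RE.findall(text):
        found = dict(ATTR_RE.findall(attrs))
        plain = TAG_RE.sub("", inner).strip()  # drop nested tags such as <span>
        cues.append((found.get("begin"), found.get("end"), plain))
    return cues
```

The trade-off is the one described above: this tolerates a slightly damaged file, but it no longer gives any guarantee that the input was XML at all.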
It is then up to the calling script (get_iplayer.pl) to decide what
action to take in response to the action taken by the parser. It is not
adequate just to allow XML::LibXML to display "parser error" and take
no further action.
Even though that's what the XML standard says IS the correct action?
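
For illustration only, here is a sketch of the caller deciding what to do after the parser has stopped. It is Python rather than the Perl of get_iplayer.pl, it does not use XML::LibXML, and the function name load_subtitles and its return values are hypothetical. The parser stays strict; the cleanup and retry happen in the calling code, which is a deliberate step outside what the XML standard describes.

```python
import xml.etree.ElementTree as ET


def load_subtitles(text):
    """Return (root, note). root is None if the document could not be used."""
    try:
        return ET.fromstring(text), "parsed as-is"
    except ET.ParseError as first_error:
        # The parser has stopped, as it should. What follows is the
        # caller's own policy, not part of XML parsing.
        cleaned = text.replace("\x00", "")
        if cleaned == text:
            # No NULs to remove, so the failure has some other cause.
            return None, f"rejected: {first_error}"
        try:
            return ET.fromstring(cleaned), "parsed after removing NUL characters"
        except ET.ParseError as second_error:
            return None, f"rejected even after cleanup: {second_error}"
```

Either way the caller ends up with something it can report to the user, rather than only a bare "parser error".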
--
Jeremy Nicoll - my opinions are my own