On Fri, 20 May 2016, Joseph Naegele wrote:
I introduced a regression in the HtmlParser in TIKA-1938, which added the ability to emit parsed <script src="..."> tags found in the HTML <head>. <script> is not currently included in the list of valid <head> child elements in XHTMLContentHandler.java, so when the first <script> tag is parsed the <head> is immediately closed. After correcting this, because my patch treats <script> in the same manner as <base> and <link>, empty <script> tags are emitted as <script src="..." />, which is invalid (empty <script> elements must have both opening and closing tags, e.g. <script src="..."></script>). Unfortunately I haven't yet found an easy fix, so
I'd suggest opening a new JIRA issue and attaching a small JUnit test that shows the problem (using an existing test HTML file if possible, otherwise a new one). We can then take a look.
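
For reference, the self-closing problem described above can be sketched in plain Java without any Tika classes. This is only an illustration, not the actual XHTMLContentHandler code: the class name ScriptTagSketch, the renderEmpty helper, and the contents of the SELF_CLOSING set are all made up for the example. The idea is that empty <base> and <link> elements may be written in the short form, while an empty <script> needs an explicit closing tag.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not Tika code: decides how an empty element is written out.
final class ScriptTagSketch {
    // Assumed list of elements that may use the short "<name ... />" form.
    private static final Set<String> SELF_CLOSING =
            new HashSet<>(Arrays.asList("base", "link", "meta"));

    // Renders an empty element with a single attribute.
    // The attribute value is not escaped; this is only meant for simple test values.
    static String renderEmpty(String name, String attr, String value) {
        String open = "<" + name + " " + attr + "=\"" + value + "\"";
        if (SELF_CLOSING.contains(name)) {
            return open + " />";
        }
        // Everything else, including <script>, gets separate open and close tags.
        return open + "></" + name + ">";
    }

    public static void main(String[] args) {
        System.out.println(renderEmpty("link", "href", "style.css"));
        System.out.println(renderEmpty("script", "src", "app.js"));
    }
}
```

A unit test attached to the issue could assert the same distinction against the real parser output, i.e. that an empty <script> in <head> comes out with both an opening and a closing tag.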
Nick
