On Fri, 20 May 2016, Joseph Naegele wrote:
I introduced a regression in the HtmlParser in TIKA-1938, which added the
ability to emit parsed <script src="..."> tags found in the HTML <head>.
<script> is not currently included in the list of valid <head> child
elements in XHTMLContentHandler.java, so when the first <script> tag is
parsed the <head> is immediately closed. After correcting this, because my
patch treats <script> in the same manner as <base> and <link>, empty
<script> tags are emitted as <script src="..." />, which is invalid (empty
<script> elements must have both opening and closing tags, e.g. <script
src="..."></script>). Unfortunately I haven't yet found an easy fix, so

I'd suggest opening a new JIRA issue and attaching a small JUnit test that shows the problem (using an existing test HTML file if possible, otherwise a new one). We can then take a look.
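
For anyone following the thread, here is a minimal sketch of the empty-element behaviour described above. It uses only the JDK's built-in SAX serializer, not Tika's XHTMLContentHandler or HtmlParser, so the class name EmptyScriptDemo, the serialize helper, and the a.js value are all hypothetical and only illustrate the general issue. It sends the same SAX events for an empty <script src="..."> element through the "xml" and "html" output methods so the two serialized forms can be compared.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical demo class; not part of Tika.
public class EmptyScriptDemo {

    // Serializes a single empty <script src="a.js"> element using the given
    // output method ("xml" or "html") and returns the resulting markup.
    static String serialize(String method) throws Exception {
        // Assumes the default factory supports SAX, as the JDK's built-in one does.
        SAXTransformerFactory factory =
                (SAXTransformerFactory) TransformerFactory.newInstance();
        TransformerHandler handler = factory.newTransformerHandler();

        Transformer transformer = handler.getTransformer();
        transformer.setOutputProperty(OutputKeys.METHOD, method);
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");

        StringWriter out = new StringWriter();
        handler.setResult(new StreamResult(out));

        AttributesImpl attrs = new AttributesImpl();
        attrs.addAttribute("", "src", "src", "CDATA", "a.js");

        // Same events in both cases: start and end with no content in between.
        handler.startDocument();
        handler.startElement("", "script", "script", attrs);
        handler.endElement("", "script", "script");
        handler.endDocument();
        return out.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("xml : " + serialize("xml"));
        System.out.println("html: " + serialize("html"));
    }
}
```

With the "xml" method the serializer is free to collapse the empty element into the self-closing form, while the "html" method should write an explicit closing tag. A JUnit test attached to the new issue could make the same kind of check against Tika's actual output.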

Nick
