[ https://issues.apache.org/jira/browse/TIKA-895?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ray Gauss II resolved TIKA-895.
-------------------------------
Resolution: Fixed
Fix Version/s: 1.3
When a {{TransformerHandler}} is used, the actual writing of the final elements
is delegated to an XML serializer such as {{ToHTMLStream}}, which extends
{{ToStream}}.
When {{ToStream.characters}} is called with a length of zero it returns
immediately without closing the start tag of the current element, and
{{ToStream.endElement}} checks whether the start tag is still open to decide
whether to write the element as {{<title/>}} or {{<title></title>}}.
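The interaction described above can be illustrated with a toy model. This is
not the actual {{ToStream}} source: the class {{StartTagModel}} and its
{{serialize}} helper are hypothetical names, and the logic is only an assumed
simplification of the early return in {{characters}} and the open-start-tag
check in {{endElement}}.

```java
// Toy model of the behaviour described above; NOT the real ToStream code.
// StartTagModel and serialize are hypothetical names used for illustration.
class StartTagModel {
    private final StringBuilder out = new StringBuilder();
    private boolean startTagOpen;

    void startElement(String name) {
        out.append('<').append(name); // the closing '>' is deferred
        startTagOpen = true;
    }

    void characters(char[] ch, int start, int length) {
        if (length == 0) {
            return; // early return: the start tag is left open
        }
        closeStartTag();
        out.append(ch, start, length);
    }

    void endElement(String name) {
        if (startTagOpen) {
            out.append("/>"); // start tag never closed: collapse the element
            startTagOpen = false;
        } else {
            out.append("</").append(name).append('>');
        }
    }

    private void closeStartTag() {
        if (startTagOpen) {
            out.append('>');
            startTagOpen = false;
        }
    }

    // Serializes one element with the given text content.
    static String serialize(String name, String text) {
        StartTagModel model = new StartTagModel();
        char[] chars = text.toCharArray();
        model.startElement(name);
        model.characters(chars, 0, chars.length);
        model.endElement(name);
        return model.out.toString();
    }

    public static void main(String[] args) {
        System.out.println(serialize("title", ""));       // <title/>
        System.out.println(serialize("title", "Report")); // <title>Report</title>
    }
}
```

In this model an empty title collapses to {{<title/>}} only because the
zero-length {{characters}} call never reaches the code that closes the start
tag.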
It seems the code brought over from the Xalan project to the JDK was locked
down quite a bit during the transition. When using Xalan directly, an alternate
XML serializer can be specified via XSLT or other means [1], but in the JDK
that functionality seems to have been removed, as
{{TransletOutputHandlerFactory.getSerializationHandler}} has {{ToHTMLStream}}
hard-coded.
Additionally, {{ToHTMLStream}} is declared final, and most of the classes one
would normally extend in order to use a different
{{TransletOutputHandlerFactory}} are internal, so a proper solution would
likely involve depending on Xalan directly or duplicating a large amount of
code, neither of which is ideal.
As a workaround, an {{ExpandedTitleContentHandler}} content handler decorator
was added. It looks for the previous fix for this issue, a call to
{{characters(new char[0], 0, 0)}} for the title element, and if present changes
the length to 1, then catches the expected {{ArrayIndexOutOfBoundsException}}
thrown by {{ToStream.characters}}.
The result is that the title start tag is closed, since the zero-length check
no longer returns early, and no characters are actually written.
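A minimal sketch of the decorator idea, using only JDK SAX classes. This is
not the Tika source: {{TitleExpandingSketch}} and {{SerializerStandIn}} are
hypothetical names, the case-insensitive match on {{title}} is an assumption,
and the stand-in only imitates a serializer that closes the start tag before
reading any characters.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical decorator; XMLFilterImpl forwards each event to the wrapped handler.
class TitleExpandingSketch extends XMLFilterImpl {
    private boolean titleOpen;

    TitleExpandingSketch(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        super.startElement(uri, localName, qName, atts);
        // Assumption: match the title element case-insensitively.
        titleOpen = "title".equalsIgnoreCase(localName);
    }

    @Override
    public void endElement(String uri, String localName, String qName)
            throws SAXException {
        super.endElement(uri, localName, qName);
        titleOpen = false;
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        if (titleOpen && length == 0) {
            try {
                // Claim a length of 1 so the downstream zero-length check is skipped.
                super.characters(new char[0], 0, 1);
            } catch (ArrayIndexOutOfBoundsException expected) {
                // Expected: the start tag was closed before the read failed.
            }
        } else {
            super.characters(ch, start, length);
        }
    }

    // Feeds an empty title through the decorator into the stand-in serializer.
    static boolean closesEmptyTitle() {
        SerializerStandIn sink = new SerializerStandIn();
        TitleExpandingSketch handler = new TitleExpandingSketch(sink);
        try {
            handler.startElement("", "title", "title", new AttributesImpl());
            handler.characters(new char[0], 0, 0);
            handler.endElement("", "title", "title");
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sink.startTagClosed;
    }

    public static void main(String[] args) {
        System.out.println("start tag closed: " + closesEmptyTitle());
    }
}

// Stand-in for a serializer: closes the start tag, then reads the characters.
class SerializerStandIn extends DefaultHandler {
    boolean startTagClosed;
    final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        if (length == 0) {
            return; // same early return as described for ToStream.characters
        }
        startTagClosed = true;
        for (int i = start; i < start + length; i++) {
            text.append(ch[i]); // fails on an empty array, as the decorator expects
        }
    }
}
```

The sketch relies on the downstream handler closing the start tag before it
touches the character array; a serializer that behaves differently would need
a different workaround.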
{{TikaCLI}} was modified to wrap the transformer handler returned by
{{SAXTransformerFactory}} for the {{html}} output method, so only handling of
the {{title}} tag for HTML output will be affected by the change.
In the event that this approach has adverse effects for those using XML
serializers other than those present in the JDK, the change to {{TikaCLI}} can
be reverted or made an option.
Those calling Tika programmatically will need to wrap their transformer
handlers in an {{ExpandedTitleContentHandler}} as well, for example:
{code}
...
SAXTransformerFactory factory =
        (SAXTransformerFactory) SAXTransformerFactory.newInstance();
TransformerHandler handler = factory.newTransformerHandler();
handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
handler.getTransformer().setOutputProperty(OutputKeys.INDENT, indent);
handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, encoding);
handler.setResult(new StreamResult(output));
return new ExpandedTitleContentHandler(handler);
{code}
Resolved in r1423538.
[1] http://xml.apache.org/xalan-j/usagepatterns.html
> Empty title element makes Tika-generated HTML documents not open
> ----------------------------------------------------------------
>
> Key: TIKA-895
> URL: https://issues.apache.org/jira/browse/TIKA-895
> Project: Tika
> Issue Type: Bug
> Components: metadata
> Affects Versions: 1.1
> Environment: Windows 7
> Reporter: Benoit MAGGI
> Assignee: Ray Gauss II
> Priority: Trivial
> Labels: newbie
> Fix For: 1.3
>
>
> I tried to transform an empty docx into an HTML file.
> Ex: java -jar tika-app-1.1.jar -x example.docx > t.html
> The HTML file can't be opened with Firefox, Internet Explorer, or Chrome.
> The main point is that <title/> seems to be forbidden by the HTML
> specification (I couldn't find the corresponding rule for HTML5):
> bq. http://www.w3.org/TR/html401/struct/global.html#h-7.4.2
> bq. 7.4.2 The TITLE element
> bq. <!-- The TITLE element is not considered part of the flow of text.
> bq. It should be displayed, for example as the page header or
> bq. window title. Exactly one title is required per document.
> bq. -->
> bq. <!ELEMENT TITLE - - (#PCDATA) -(%head.misc;) -- document title -->
> bq. <!ATTLIST TITLE %i18n >
> bq. *Start tag: required, End tag: required*
> For information, the same bug occurred with xls:
> https://issues.apache.org/jira/browse/TIKA-725
> The simple solution would be to provide an empty title by default.