[
https://issues.apache.org/jira/browse/ANY23-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433871#comment-13433871
]
Peter Ansell commented on ANY23-26:
-----------------------------------
After the upgrade to Tika 1.2, the Xerces DOM parser appears to be corrupted by
the data: URI in the test document for testObjectDataDataUri. This did not
happen before the upgrade, even though the Xerces version did not change: it is
2.9.1 both before and after. The cause may be the underlying JDOM and DOM4J
dependency changes, but I am not sure how to proceed with debugging that.
Before the upgrade, the full DOM is visible in the debugger with a breakpoint in
TagSoupParser.getDOM(). After the upgrade, the BODY element has a single child:
a text node containing unrecognisable characters.
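
The same check can be made outside the debugger by listing the children of the
body element programmatically. The sketch below is only an illustration under
stated assumptions: it uses the JDK's built-in DocumentBuilder rather than
Any23's TagSoupParser, the class name BodyChildDump and the method
describeBodyChildren are hypothetical, and SAMPLE is a small well-formed
stand-in, not the attached 19-object-data-data-uri.html. A healthy parse should
report an element child; the symptom described above would show up as a single
TEXT entry instead.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class BodyChildDump {
    // Hypothetical stand-in document with an object element carrying a data: URI.
    // It is NOT the real test attachment from the issue.
    static final String SAMPLE =
        "<html><body><object data=\"data:text/plain;base64,SGVsbG8=\"/></body></html>";

    // Returns one "<KIND> <nodeName>" entry per direct child of the first body element.
    static List<String> describeBodyChildren(String xml) throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        Document doc = builder.parse(new InputSource(new StringReader(xml)));

        List<String> out = new ArrayList<>();
        NodeList bodies = doc.getElementsByTagName("body");
        if (bodies.getLength() == 0) {
            return out; // no body element found
        }

        NodeList kids = bodies.item(0).getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            String kind;
            if (n.getNodeType() == Node.ELEMENT_NODE) {
                kind = "ELEMENT";
            } else if (n.getNodeType() == Node.TEXT_NODE) {
                kind = "TEXT";
            } else {
                kind = "OTHER";
            }
            out.add(kind + " " + n.getNodeName());
        }
        return out;
    }

    public static void main(String[] args) throws Exception {
        for (String line : describeBodyChildren(SAMPLE)) {
            System.out.println(line);
        }
    }
}
```

To compare the before and after behaviour in Any23 itself, the same loop could
be run over the Document returned at the TagSoupParser.getDOM() breakpoint; how
that document is obtained in a test is left out here because it depends on the
Any23 version in use.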
> Upgrade dependency to Apache Tika 1.1
> -------------------------------------
>
> Key: ANY23-26
> URL: https://issues.apache.org/jira/browse/ANY23-26
> Project: Apache Any23
> Issue Type: Improvement
> Affects Versions: 0.7.0
> Reporter: Lewis John McGibbney
> Fix For: 0.8.0
>
> Attachments: 14-img-src-data-url.html, 19-object-data-data-uri.html,
> ANY23-26.patch, org.apache.any23.extractor.html.HCardExtractorTest.txt,
> tika-1.2-dependency-tree-compare.txt
>
>
> Upgrading to Apache Tika will hopefully provide a wealth of benefits for the
> project. This issue should act as an umbrella issue to track these changes.
> It would be great to delegate as much as possible to Tika if deemed suitable
> to enhance functionality and to reduce our dependencies on external projects.