[ 
https://issues.apache.org/jira/browse/ANY23-26?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13433871#comment-13433871
 ] 

Peter Ansell commented on ANY23-26:
-----------------------------------

The Xerces DOM parser output seems to be corrupted by the data: URI in the test 
document for testObjectDataDataUri after the upgrade to Tika 1.2, but not before 
it. This happens even though the Xerces version did not change: it is 2.9.1 both 
before and after the upgrade. It may have something to do with the JDOM and DOM4J 
dependency changes underneath, but I am not sure how to proceed with debugging 
that.

Before the upgrade, the full DOM is visible in the debugger at a breakpoint in 
TagSoupParser.getDOM(). After the upgrade, the BODY element's first and only 
child is a text node containing unrecognisable characters.
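
To compare the two states without a debugger, the direct children of BODY 
could be dumped from the resulting DOM. The following is only a minimal sketch 
under assumptions: the class BodyChildDump, its methods and the sample markup 
are hypothetical and not part of Any23, and it uses the JDK's built-in XML 
parser on well-formed input rather than the TagSoupParser pipeline.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

// Hypothetical helper, not part of Any23.
public class BodyChildDump {

    /**
     * Returns one "type:name" entry per direct child of the first element
     * with the given tag name. The lookup is case-sensitive, so pass "BODY"
     * if the parser in use upper-cases HTML element names.
     */
    public static List<String> children(Document doc, String tagName) {
        List<String> out = new ArrayList<>();
        NodeList matches = doc.getElementsByTagName(tagName);
        if (matches.getLength() == 0) {
            return out; // no such element in this DOM
        }
        NodeList kids = matches.item(0).getChildNodes();
        for (int i = 0; i < kids.getLength(); i++) {
            Node n = kids.item(i);
            String type = n.getNodeType() == Node.TEXT_NODE ? "text"
                        : n.getNodeType() == Node.ELEMENT_NODE ? "element"
                        : "other";
            out.add(type + ":" + n.getNodeName());
        }
        return out;
    }

    /** Parses a well-formed XML string with the JDK default DOM parser. */
    public static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
            .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
    }

    public static void main(String[] args) throws Exception {
        // Assumed sample input; the real test document is 19-object-data-data-uri.html.
        String xml = "<html><body>"
            + "<object data=\"data:text/plain;base64,SGVsbG8=\"/>"
            + "</body></html>";
        System.out.println(children(parse(xml), "body"));
    }
}
```

A healthy DOM should list the expected element children, whereas the broken 
state described above would show a single text entry.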
                
> Upgrade dependency to Apache Tika 1.1
> -------------------------------------
>
>                 Key: ANY23-26
>                 URL: https://issues.apache.org/jira/browse/ANY23-26
>             Project: Apache Any23
>          Issue Type: Improvement
>    Affects Versions: 0.7.0
>            Reporter: Lewis John McGibbney
>             Fix For: 0.8.0
>
>         Attachments: 14-img-src-data-url.html, 19-object-data-data-uri.html, 
> ANY23-26.patch, org.apache.any23.extractor.html.HCardExtractorTest.txt, 
> tika-1.2-dependency-tree-compare.txt
>
>
> Upgrading to Apache Tika will hopefully provide a wealth of benefits for the 
> project. This issue should act as an umbrella issue to track these changes. 
> It would be great to delegate as much as possible to Tika if deemed suitable 
> to enhance functionality and to reduce our dependencies on external projects.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
