I saw it on the 2.x branch, but now that you mention it's also happening in trunk, I think I see the issue. The change to the PDFParser adds dependencies on the javax.xml.stream package, and the tika-bundle currently marks that package as optional:

javax.xml.stream;version="[1.0,2)";resolution:=optional,

This means that the bundle will start even when this package is unavailable. However, the PDFParser now requires it, so my guess is that the PDFParser is not instantiating correctly and parsing is falling through to the JournalParser, which is also coded to handle PDFs. The JournalParser suffers a similar fate because org.apache.cxf.jaxrs.ext.multipart is an optional import for the GrobidRESTParser, which gets instantiated in the parse method.
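To check this kind of guess, a small probe can ask a class loader whether the classes behind the optional imports are actually visible. The following is only a rough sketch: the class name OptionalImportProbe and the helper isLoadable are made up for illustration, and javax.xml.stream.XMLInputFactory is simply one well-known class from the javax.xml.stream package. Inside the OSGi container the probe would have to be run from code that lives in the bundle, so that the bundle's class loader is the one being asked.

```java
// Hypothetical helper, not part of Tika: reports whether a class can be
// resolved through a given class loader without initializing it.
class OptionalImportProbe {

    static boolean isLoadable(String className, ClassLoader loader) {
        try {
            // initialize=false: only resolve the class, do not run static initializers
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // An unresolved optional import typically shows up here at first use
            return false;
        }
    }

    public static void main(String[] args) {
        // Inside a bundle this is the bundle class loader; on a plain JVM
        // it is the application class loader.
        ClassLoader loader = OptionalImportProbe.class.getClassLoader();
        String[] classNames = {
            "javax.xml.stream.XMLInputFactory",
            "org.apache.cxf.jaxrs.ext.multipart.ContentDisposition"
        };
        for (String name : classNames) {
            System.out.println(name + " loadable: " + isLoadable(name, loader));
        }
    }
}
```

The point of the sketch is that an optional import does not fail at bundle start; the failure only appears when something first tries to load a class from the missing package, which matches the ClassNotFoundException in the test output.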

So I tried removing:
javax.xml.stream;version="[1.0,2)";resolution:=optional,
javax.xml.stream.events;version="[1.0,2)";resolution:=optional,
javax.xml.stream.util;version="[1.0,2)";resolution:=optional,
from tika-bundle/pom.xml, and it worked! Since javax.xml.stream is provided by the JDK, I'm a bit curious what those statements were doing there to begin with. Does anyone know?
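For anyone who wants to audit which other imports are marked optional, here is a minimal sketch that pulls the optional packages out of an Import-Package style string. It is an illustration only, not how the bundle plugin or the OSGi framework parses headers: the class ImportPackageAudit and its methods are hypothetical, org.example.required is a made-up entry, and clauses that list several packages before one shared set of parameters are not handled.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Tika or any OSGi tooling.
class ImportPackageAudit {

    // Splits on commas that are outside double quotes, because version
    // ranges such as "[1.0,2)" contain a comma of their own.
    static List<String> splitClauses(String header) {
        List<String> clauses = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        boolean inQuotes = false;
        for (char c : header.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;
            }
            if (c == ',' && !inQuotes) {
                clauses.add(current.toString().trim());
                current.setLength(0);
            } else {
                current.append(c);
            }
        }
        clauses.add(current.toString().trim());
        clauses.removeIf(String::isEmpty); // tolerate a trailing comma
        return clauses;
    }

    // Returns the package names whose clause carries resolution:=optional.
    static List<String> optionalPackages(String header) {
        List<String> optional = new ArrayList<>();
        for (String clause : splitClauses(header)) {
            String[] parts = clause.split(";");
            for (int i = 1; i < parts.length; i++) {
                String param = parts[i].replace(" ", "").replace("\"", "");
                if (param.equals("resolution:=optional")) {
                    optional.add(parts[0].trim());
                    break;
                }
            }
        }
        return optional;
    }

    public static void main(String[] args) {
        String header = "javax.xml.stream;version=\"[1.0,2)\";resolution:=optional,"
                + "javax.xml.stream.events;version=\"[1.0,2)\";resolution:=optional,"
                + "org.example.required;version=\"[1.0,2)\","; // made-up required import
        System.out.println(optionalPackages(header));
    }
}
```

Each package reported this way is a candidate for the same failure mode: the bundle resolves without it, and the error only surfaces when a parser first touches a class from that package.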

- Bob

On 3/2/2016 6:26 AM, Allison, Timothy B. wrote:
Anyone have an idea why trunk is now failing?  I couldn't find any changes 
between the last successful build and last night's failures that would explain 
this.


Test set: org.apache.tika.bundle.BundleIT
-------------------------------------------------------------------------------
Tests run: 9, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 21.997 sec <<< FAILURE!
testTikaBundle(org.apache.tika.bundle.BundleIT)  Time elapsed: 2.374 sec  <<< ERROR!
java.lang.ClassNotFoundException: org.apache.cxf.jaxrs.ext.multipart.ContentDisposition not found by org.apache.tika.bundle [17]
        at org.apache.felix.framework.BundleWiringImpl.findClassOrResourceByDelegation(BundleWiringImpl.java:1558)
        at org.apache.felix.framework.BundleWiringImpl.access$400(BundleWiringImpl.java:79)
        at org.apache.felix.framework.BundleWiringImpl$BundleClassLoader.loadClass(BundleWiringImpl.java:1998)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
        at org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:69)
        at org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:60)
        at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)


-----Original Message-----
From: Hudson (JIRA) [mailto:[email protected]]
Sent: Tuesday, March 01, 2016 9:59 PM
To: [email protected]
Subject: [jira] [Commented] (TIKA-1857) Enhance PDFParser to extract text from XFA forms


     [ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174937#comment-15174937 ]

Hudson commented on TIKA-1857:
------------------------------

UNSTABLE: Integrated in tika-trunk-jdk1.7 #916 (See [https://builds.apache.org/job/tika-trunk-jdk1.7/916/])
TIKA-1857: add basic XFA extraction support via Pascal Essiembre. (tallison: rev dbefe9830b26d05f9ce53503565a069bcc63d7c1)
* tika-parsers/src/test/resources/test-documents/testPDF_XFA_govdocs1_258578.pdf
* tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java
TIKA-1857: add basic XFA extraction support via Pascal Essiembre. (tallison: rev 7c245fa87507cf0887838001c54c65b79b7e7cbc)
* CHANGES.txt


Enhance PDFParser to extract text from XFA forms
------------------------------------------------

                 Key: TIKA-1857
                 URL: https://issues.apache.org/jira/browse/TIKA-1857
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Pascal Essiembre
              Labels: patch
             Fix For: 1.13

         Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip, xfa_in_govdocs1.txt


Extract text from PDF Forms (XFA).  Information about XFA: https://en.wikipedia.org/wiki/XFA


--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
