Also, as a follow-up: this means that the JournalParser would have
never worked in tika-bundle, since the org.apache.cxf.jaxrs.ext.multipart
package is required for the GrobidRESTParser to run. Is there a reason
this was not included? I'm guessing the cxf-rt-rs-client dependency may
have caused problems with other parsers.
Now that the parsers are broken out into projects in the 2.x branch,
we could create a bundle for each of them, which would allow the
JournalParser to have org.apache.cxf.jaxrs.ext.multipart embedded
without impacting the other parsers. I've stubbed out what this might
look like in the 2.x branch under the tika-parsers-bundle folder. Each
bundle has its dependencies embedded and inlined (similar to
tika-bundle). I've also provided tests to make sure each bundle starts
and has a service registered for each parser. Thoughts on this
approach? Tracking this in:
https://issues.apache.org/jira/browse/TIKA-1860
- Bob
On 3/2/2016 7:46 AM, Bob Paulin wrote:
I saw it on the 2.x branch but now that you mention it's also
happening in trunk I think I see the issue. The change to the
PDFParser includes adding dependencies in the javax.xml.stream
package. The tika-bundle currently has that package marked optional:
javax.xml.stream;version="[1.0,2)";resolution:=optional,
This means that the bundle will start without this package. However,
it's now required for the PDFParser to work, so my guess is that the
PDFParser is not instantiating correctly and parsing is dropping into
the JournalParser, which is also coded to handle PDFs. The JournalParser
suffers a similar fate because org.apache.cxf.jaxrs.ext.multipart is
also marked optional, yet it is needed by the GrobidRESTParser, which
gets instantiated in the parse method.
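To make the failure mode above concrete, here is a minimal sketch of
probing whether a class is visible to a given class loader. ClassProbe
and isVisible are hypothetical names used only for illustration; they
are not part of Tika. The sketch uses the plain application class
loader, whereas inside an OSGi framework the bundle's own class loader
would be the one to check, so treat this as an approximation of what
the bundle sees rather than a reproduction of it.

```java
// Hypothetical helper, not part of Tika: reports whether a class can be
// loaded through a given class loader, without initializing the class.
final class ClassProbe {

    static boolean isVisible(String className, ClassLoader loader) {
        try {
            // initialize=false: locate and load the class only, do not
            // run its static initializers.
            Class.forName(className, false, loader);
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // The class (or something it links against) is not reachable
            // from this loader.
            return false;
        }
    }

    public static void main(String[] args) {
        ClassLoader loader = ClassProbe.class.getClassLoader();
        String[] names = {
            // Needed by the PDFParser after the XFA change.
            "javax.xml.stream.XMLStreamReader",
            // Needed by the GrobidRESTParser; only present if CXF is on
            // the classpath.
            "org.apache.cxf.jaxrs.ext.multipart.ContentDisposition"
        };
        for (String name : names) {
            System.out.println(name + " visible: " + isVisible(name, loader));
        }
    }
}
```

Because the missing class is only referenced inside a method body, the
JVM does not try to load it until that method runs, which would be
consistent with the bundle starting cleanly and the error only showing
up at parse time.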
So I tried removing:
javax.xml.stream;version="[1.0,2)";resolution:=optional,
javax.xml.stream.events;version="[1.0,2)";resolution:=optional,
javax.xml.stream.util;version="[1.0,2)";resolution:=optional,
from tika-bundle/pom.xml, and it worked! Since javax.xml.stream is
provided by the JDK, I'm a bit curious what those statements were
doing there to begin with. Anyone know?
- Bob
On 3/2/2016 6:26 AM, Allison, Timothy B. wrote:
Anyone have an idea why trunk is now failing? I couldn't find any
changes between the last successful build and last night's failures
that would explain this.
Test set: org.apache.tika.bundle.BundleIT
-------------------------------------------------------------------------------
Tests run: 9, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 21.997 sec <<< FAILURE!
testTikaBundle(org.apache.tika.bundle.BundleIT) Time elapsed: 2.374 sec <<< ERROR!
java.lang.ClassNotFoundException: org.apache.cxf.jaxrs.ext.multipart.ContentDisposition not found by org.apache.tika.bundle [17]
    at org.apache.felix.framework.BundleWiringImpl.findClassOrResourceByDelegation(BundleWiringImpl.java:1558)
    at org.apache.felix.framework.BundleWiringImpl.access$400(BundleWiringImpl.java:79)
    at org.apache.felix.framework.BundleWiringImpl$BundleClassLoader.loadClass(BundleWiringImpl.java:1998)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
    at org.apache.tika.parser.journal.GrobidRESTParser.parse(GrobidRESTParser.java:69)
    at org.apache.tika.parser.journal.JournalParser.parse(JournalParser.java:60)
    at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
-----Original Message-----
From: Hudson (JIRA) [mailto:[email protected]]
Sent: Tuesday, March 01, 2016 9:59 PM
To: [email protected]
Subject: [jira] [Commented] (TIKA-1857) Enhance PDFParser to extract
text from XFA forms
[ https://issues.apache.org/jira/browse/TIKA-1857?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15174937#comment-15174937 ]
Hudson commented on TIKA-1857:
------------------------------
UNSTABLE: Integrated in tika-trunk-jdk1.7 #916 (See
[https://builds.apache.org/job/tika-trunk-jdk1.7/916/])
TIKA-1857: add basic XFA extraction support via Pascal Essiembre.
(tallison: rev dbefe9830b26d05f9ce53503565a069bcc63d7c1)
* tika-parsers/src/test/resources/test-documents/testPDF_XFA_govdocs1_258578.pdf
* tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParserConfig.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDF2XHTML.java
* tika-parsers/src/main/resources/org/apache/tika/parser/pdf/PDFParser.properties
* tika-parsers/src/main/java/org/apache/tika/parser/pdf/XFAExtractor.java
TIKA-1857: add basic XFA extraction support via Pascal Essiembre.
(tallison: rev 7c245fa87507cf0887838001c54c65b79b7e7cbc)
* CHANGES.txt
Enhance PDFParser to extract text from XFA forms
------------------------------------------------
Key: TIKA-1857
URL: https://issues.apache.org/jira/browse/TIKA-1857
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Pascal Essiembre
Labels: patch
Fix For: 1.13
Attachments: 041617_filled_out.pdf, govdocs1_xfas.zip,
xfa_in_govdocs1.txt
Extract text from PDF Forms (XFA). Information about XFA:
https://en.wikipedia.org/wiki/XFA