[
https://issues.apache.org/jira/browse/TIKA-3319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Richard Kraus updated TIKA-3319:
--------------------------------
Description:
01 Tika-1.24.1.jar and 1.24 python module have been running well for months on
my machine.
02 Then I installed Tesseract and a couple of other things to integrate with it.
03 Then I upgraded Python from 3.8.2 to 3.9.2.
04 I have always set the Windows 10 $env: variable to something like
TIKA_SERVER_JAR="<yourpath>/tika-server.jar"
05 Then I ran the tika Python module and got this urllib error:
urllib.error.URLError: <urlopen error unknown url type: c>
06 Supposedly this is fixed by setting the $env: variable to something like
TIKA_SERVER_JAR="file:///<yourpath>/tika-server.jar"
07 I tried this and several variations; no dice.
08 So then I tried running Tika from PowerShell:
java -jar "C:\PATH\TO\tika-app-1.24.1.jar" --gui
This brings up the GUI, but it now gives me these warnings:
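A note on steps 04 to 06: the "unknown url type: c" message is consistent with the drive letter of a bare Windows path being read as a URL scheme. The following is only a minimal sketch of that effect and of building the file:/// form programmatically. The jar location C:/tools/tika-server.jar is a made-up placeholder, and the idea that the tika module reads TIKA_SERVER_JAR when it is imported is an assumption, not something confirmed here.

```python
import os
from pathlib import PureWindowsPath
from urllib.parse import quote, urlparse

# Hypothetical location; substitute the real path to tika-server.jar.
jar_path = PureWindowsPath("C:/tools/tika-server.jar")

# A bare Windows path parses as a URL whose "scheme" is the drive letter,
# which matches the "unknown url type: c" wording in the error.
bare_scheme = urlparse(str(jar_path)).scheme

# Build a file:/// URI by hand: forward slashes, with characters such as
# spaces percent-encoded ("/" and ":" are left untouched).
jar_uri = "file:///" + quote(jar_path.as_posix(), safe="/:")

# Assumption: the tika module picks up TIKA_SERVER_JAR at import time,
# so the variable is set before any "import tika".
os.environ["TIKA_SERVER_JAR"] = jar_uri

print(bare_scheme)
print(jar_uri)
```

This only illustrates where the error message probably comes from. It does not explain why the file:/// form still failed in step 07.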
{quote}Mar 14, 2021 10:33:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]
for optional dependencies.
Mar 14, 2021 10:33:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Mar 14, 2021 10:33:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
{quote}
09 Now when I use the GUI to parse a file I have parsed before, it shows this message:
{quote}Apache Tika was unable to parse the document at C:\CODING\Apache Tika\Test03.pdf.
The full exception stack trace is included below:
org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@473cb131
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:293)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:358)
at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:309)
at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:267)
at java.desktop/javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1967)
at java.desktop/javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2308)
at java.desktop/javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:405)
at java.desktop/javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:262)
at java.desktop/javax.swing.AbstractButton.doClick(AbstractButton.java:369)
at java.desktop/javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:1020)
at java.desktop/javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:1064)
at java.desktop/java.awt.Component.processMouseEvent(Component.java:6636)
at java.desktop/javax.swing.JComponent.processMouseEvent(JComponent.java:3342)
at java.desktop/java.awt.Component.processEvent(Component.java:6401)
at java.desktop/java.awt.Container.processEvent(Container.java:2263)
at java.desktop/java.awt.Component.dispatchEventImpl(Component.java:5012)
at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2321)
at java.desktop/java.awt.Component.dispatchEvent(Component.java:4844)
at java.desktop/java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4919)
at java.desktop/java.awt.LightweightDispatcher.processMouseEvent(Container.java:4548)
at java.desktop/java.awt.LightweightDispatcher.dispatchEvent(Container.java:4489)
at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2307)
at java.desktop/java.awt.Window.dispatchEventImpl(Window.java:2764)
at java.desktop/java.awt.Component.dispatchEvent(Component.java:4844)
at java.desktop/java.awt.EventQueue.dispatchEventImpl(EventQueue.java:772)
at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:721)
at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:715)
at java.base/java.security.AccessController.doPrivileged(AccessController.java:391)
at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85)
at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:95)
at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:745)
at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:743)
at java.base/java.security.AccessController.doPrivileged(AccessController.java:391)
at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85)
at java.desktop/java.awt.EventQueue.dispatchEvent(EventQueue.java:742)
at java.desktop/java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:203)
at java.desktop/java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:124)
at java.desktop/java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:113)
at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:109)
at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
at java.desktop/java.awt.EventDispatchThread.run(EventDispatchThread.java:90)
Caused by: java.lang.NullPointerException
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractXMPXFA(AbstractPDF2XHTML.java:209)
at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:678)
at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:267)
at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96)
at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:174)
at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
... 44 more
{quote}
10 Most notably these lines:
{quote}A) org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@473cb131
B) Caused by: java.lang.NullPointerException
{quote}
11 Here is the output of java -jar tika-app-1.24.1.jar --dump-current-config:
{quote}Mar 14, 2021 10:15:23 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]
for optional dependencies.
Mar 14, 2021 10:15:24 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
you've excluded the TesseractOCRParser from the default parser.
Tesseract may dramatically slow down content extraction (TIKA-2359).
As of Tika 1.15 (and prior versions), Tesseract is automatically called.
In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.
Mar 14, 2021 10:15:24 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
WARNING: org.xerial's sqlite-jdbc is not loaded.
Please provide the jar on your classpath to parse sqlite files.
See tika-parsers/pom.xml for the correct version.
<?xml version="1.0" encoding="UTF-8" standalone="no"?>
<properties>
  <!--for example: <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"/>-->
  <service-loader dynamic="true" loadErrorHandler="IGNORE"/>
  <encodingDetectors>
    <encodingDetector class="org.apache.tika.detect.DefaultEncodingDetector"/>
  </encodingDetectors>
  <translator class="org.apache.tika.language.translate.DefaultTranslator"/>
  <detectors>
    <detector class="org.apache.tika.detect.DefaultDetector"/>
  </detectors>
  <parsers>
    <parser class="org.apache.tika.parser.DefaultParser"/>
  </parsers>
</properties>
{quote}
12 Any help would be greatly appreciated.
13A The odd thing is that when I run something like
java -jar tika-app-1.24.1.jar -t Test03.pdf output.txt
13B it prints the document text in PowerShell and then prints this below it (which I have never seen before):
{quote}Exception in thread "main" java.net.MalformedURLException: no protocol: output.txt
at java.base/java.net.URL.<init>(URL.java:672)
at java.base/java.net.URL.<init>(URL.java:568)
at java.base/java.net.URL.<init>(URL.java:515)
at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:488)
at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)
{quote}
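A note on 13A/13B: the trace goes from TikaCLI.process straight into java.net.URL, which suggests that tika-app treats every non-option argument as an input. Test03.pdf exists as a file, while output.txt apparently gets parsed as a URL and fails with "no protocol". If that reading is right, the text has to be captured from standard output instead of naming an output file. Below is a minimal sketch in Python, assuming java is on PATH. The helper names build_text_command and extract_text are made up for illustration and are not part of Tika or the tika module.

```python
import subprocess
from pathlib import Path


def build_text_command(jar: str, document: str) -> list[str]:
    """Build the tika-app command line for plain-text extraction (-t)."""
    # Only the input document is passed; no output file name is given,
    # because extra arguments appear to be treated as further inputs.
    return ["java", "-jar", jar, "-t", document]


def extract_text(jar: str, document: str, output: str) -> None:
    """Run tika-app and save whatever it prints on stdout to `output`."""
    result = subprocess.run(
        build_text_command(jar, document),
        capture_output=True,
        check=True,  # raise CalledProcessError on a non-zero exit code
    )
    Path(output).write_bytes(result.stdout)
```

A call such as extract_text("tika-app-1.24.1.jar", "Test03.pdf", "output.txt") would then write the extracted text to output.txt. In PowerShell, redirecting the output of the same java command to a file should have a similar effect.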
So, in sum:
1) It somehow doesn't "point" to a parser? (But it kind of does...)
2) It says that I'm excluding Tesseract from Tika; I don't know how this happened to begin with.
3) And now urllib in Python suddenly can't figure out that Tika exists.
Please assist. Thank you.
> Caused by: java.lang.NullPointerException (and more!)
> -----------------------------------------------------
>
> Key: TIKA-3319
> URL: https://issues.apache.org/jira/browse/TIKA-3319
> Project: Tika
> Issue Type: Bug
> Components: general
> Affects Versions: 1.24.1
> Environment: Windows 10
> Tika 1.24.1.jar
> Tika 1.24 python module
> python 3.9.2
> tesseract-ocr-w64-setup-v5.0.0-alpha.20201127
> (anything else that may be relevant?)
> Reporter: Richard Kraus
> Priority: Major
>
--
This message was sent by Atlassian Jira
(v8.3.4#803005)