[ 
https://issues.apache.org/jira/browse/TIKA-3319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Richard Kraus updated TIKA-3319:
--------------------------------
    Description: 
01 Tika-1.24.1.jar and the 1.24 Python module had been running well for months 
on my machine.
 02 Then I installed Tesseract and a couple of other things to integrate with it.
 03 Then I upgraded Python from 3.8.2 to 3.9.2.
 04 I have always set the Windows 10 $env: variable to something like 
TIKA_SERVER_JAR="<yourpath>/tika-server.jar"
 05 Then I ran the tika Python module and got this urllib problem:
 urllib.error.URLError: <urlopen error unknown url type: c>
 06 Supposedly this is fixed by setting the $env: variable to something like:
 TIKA_SERVER_JAR="file:///<yourpath>/tika-server.jar"
 07 I did this and messed around with it; no dice.
 08 So then I tried running Tika from PowerShell:
 java -jar "C:\PATH\TO\tika-app-1.24.1.jar" --gui
 This brings up the GUI, but it now gives me these warnings:

 
{quote}Mar 14, 2021 10:33:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
 WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
 See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]
 for optional dependencies.

Mar 14, 2021 10:33:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
 WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
 you've excluded the TesseractOCRParser from the default parser.
 Tesseract may dramatically slow down content extraction (TIKA-2359).
 As of Tika 1.15 (and prior versions), Tesseract is automatically called.
 In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.

Mar 14, 2021 10:33:27 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
 WARNING: org.xerial's sqlite-jdbc is not loaded.
 Please provide the jar on your classpath to parse sqlite files.
 See tika-parsers/pom.xml for the correct version.
{quote}
09 Now when I use the GUI to parse a file I have parsed before, it shows this 
message:

 
{quote}Apache Tika was unable to parse the document at C:\CODING\Apache Tika\Test03.pdf.
 The full exception stack trace is included below:
 org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@473cb131
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:293)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
 at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
 at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
 at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
 at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:358)
 at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:309)
 at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:267)
 at java.desktop/javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1967)
 at java.desktop/javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2308)
 at java.desktop/javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:405)
 at java.desktop/javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:262)
 at java.desktop/javax.swing.AbstractButton.doClick(AbstractButton.java:369)
 at java.desktop/javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:1020)
 at java.desktop/javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:1064)
 at java.desktop/java.awt.Component.processMouseEvent(Component.java:6636)
 at java.desktop/javax.swing.JComponent.processMouseEvent(JComponent.java:3342)
 at java.desktop/java.awt.Component.processEvent(Component.java:6401)
 at java.desktop/java.awt.Container.processEvent(Container.java:2263)
 at java.desktop/java.awt.Component.dispatchEventImpl(Component.java:5012)
 at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2321)
 at java.desktop/java.awt.Component.dispatchEvent(Component.java:4844)
 at java.desktop/java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4919)
 at java.desktop/java.awt.LightweightDispatcher.processMouseEvent(Container.java:4548)
 at java.desktop/java.awt.LightweightDispatcher.dispatchEvent(Container.java:4489)
 at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2307)
 at java.desktop/java.awt.Window.dispatchEventImpl(Window.java:2764)
 at java.desktop/java.awt.Component.dispatchEvent(Component.java:4844)
 at java.desktop/java.awt.EventQueue.dispatchEventImpl(EventQueue.java:772)
 at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:721)
 at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:715)
 at java.base/java.security.AccessController.doPrivileged(AccessController.java:391)
 at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85)
 at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:95)
 at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:745)
 at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:743)
 at java.base/java.security.AccessController.doPrivileged(AccessController.java:391)
 at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85)
 at java.desktop/java.awt.EventQueue.dispatchEvent(EventQueue.java:742)
 at java.desktop/java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:203)
 at java.desktop/java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:124)
 at java.desktop/java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:113)
 at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:109)
 at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
 at java.desktop/java.awt.EventDispatchThread.run(EventDispatchThread.java:90)
 Caused by: java.lang.NullPointerException
 at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractXMPXFA(AbstractPDF2XHTML.java:209)
 at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:678)
 at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:267)
 at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96)
 at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:174)
 at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
 ... 44 more
{quote}
10 Most notably these lines:
{quote}A) org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@473cb131
 B) Caused by: java.lang.NullPointerException
{quote}
11 Here is the output of java -jar tika-app-1.24.1.jar --dump-current-config:
{quote}Mar 14, 2021 10:15:23 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
 WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
 See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]
 for optional dependencies.

Mar 14, 2021 10:15:24 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
 WARNING: Tesseract OCR is installed and will be automatically applied to image files unless
 you've excluded the TesseractOCRParser from the default parser.
 Tesseract may dramatically slow down content extraction (TIKA-2359).
 As of Tika 1.15 (and prior versions), Tesseract is automatically called.
 In future versions of Tika, users may need to turn the TesseractOCRParser on via TikaConfig.

Mar 14, 2021 10:15:24 PM org.apache.tika.config.InitializableProblemHandler$3 handleInitializableProblem
 WARNING: org.xerial's sqlite-jdbc is not loaded.
 Please provide the jar on your classpath to parse sqlite files.
 See tika-parsers/pom.xml for the correct version.

 <?xml version="1.0" encoding="UTF-8" standalone="no"?>
 <properties>
   <!--for example: <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"/>-->
   <service-loader dynamic="true" loadErrorHandler="IGNORE"/>
   <encodingDetectors>
     <encodingDetector class="org.apache.tika.detect.DefaultEncodingDetector"/>
   </encodingDetectors>
   <translator class="org.apache.tika.language.translate.DefaultTranslator"/>
   <detectors>
     <detector class="org.apache.tika.detect.DefaultDetector"/>
   </detectors>
   <parsers>
     <parser class="org.apache.tika.parser.DefaultParser"/>
   </parsers>
 </properties>
{quote}
12 Any help would be greatly appreciated.
 13A The odd thing is that when I run something like:
 java -jar tika-app-1.24.1.jar -t Test03.pdf output.txt

13B it prints the document text in PowerShell, then prints this below it 
(which I have never gotten before):
{quote}Exception in thread "main" java.net.MalformedURLException: no protocol: output.txt
 at java.base/java.net.URL.<init>(URL.java:672)
 at java.base/java.net.URL.<init>(URL.java:568)
 at java.base/java.net.URL.<init>(URL.java:515)
 at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:488)
 at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)
{quote}
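
One possible reading of the trace in 13B: TikaCLI.process seems to treat 
output.txt as another input and, since it is not an existing file, tries to 
open it as a URL. A minimal sketch, assuming that reading is right, that 
captures the extracted text by redirecting standard output instead of passing 
an output filename. The helper names and the file paths are hypothetical.

```python
import subprocess
from pathlib import Path
from typing import List


def build_tika_text_command(jar: str, document: str) -> List[str]:
    """Build the tika-app command that prints a document's plain text to stdout."""
    # -t is the plain-text flag used in step 13A; no output filename is passed.
    return ["java", "-jar", jar, "-t", document]


def extract_text_to_file(jar: str, document: str, output: str) -> None:
    """Run tika-app and write whatever it prints on stdout to `output`."""
    with Path(output).open("wb") as sink:
        # check=True raises CalledProcessError if java exits with a non-zero status.
        subprocess.run(build_tika_text_command(jar, document), stdout=sink, check=True)


# Example call (requires java and the jar to be present):
# extract_text_to_file("tika-app-1.24.1.jar", "Test03.pdf", "output.txt")
```

From PowerShell, java -jar tika-app-1.24.1.jar -t Test03.pdf > output.txt 
should have a similar effect, although Windows PowerShell may write the 
redirected file as UTF-16.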
So, in sum:
1) It somehow doesn't "point" to a parser? (But it kind of does...)
2) It says that I'm excluding Tesseract from Tika. I don't know how this 
happened to begin with.
3) And now urllib in Python suddenly can't figure out that Tika exists.
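
On point 3, the "unknown url type: c" message from step 05 is consistent with 
urllib reading the drive letter of a bare Windows path as a URL scheme. A 
minimal sketch of that behaviour and of building a file: URI with pathlib. The 
jar location C:\Tika\tika-server.jar is a hypothetical stand-in for 
<yourpath>, and whether the tika module accepts the resulting value has not 
been verified here.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlparse

# Hypothetical location; substitute the real path to tika-server.jar.
jar_path = PureWindowsPath(r"C:\Tika\tika-server.jar")

# A bare Windows path parses with the drive letter as its scheme.
bare_scheme = urlparse(str(jar_path)).scheme

# as_uri() produces a file: URI with forward slashes and percent-escaping.
jar_uri = jar_path.as_uri()
uri_scheme = urlparse(jar_uri).scheme

print(bare_scheme)  # c
print(jar_uri)      # file:///C:/Tika/tika-server.jar
print(uri_scheme)   # file
```

The printed URI is the form that step 06 describes for the TIKA_SERVER_JAR 
$env: variable.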

Please assist. Thank you. 


> Caused by: java.lang.NullPointerException (and more!)
> -----------------------------------------------------
>
>                 Key: TIKA-3319
>                 URL: https://issues.apache.org/jira/browse/TIKA-3319
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.24.1
>         Environment: Windows 10
> Tika 1.24.1.jar
> Tika 1.24 python module
> python 3.9.2
> tesseract-ocr-w64-setup-v5.0.0-alpha.20201127
> (anything else that may be relevant?)
>            Reporter: Richard Kraus
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
