[ 
https://issues.apache.org/jira/browse/TIKA-3319?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17303226#comment-17303226
 ] 

Richard Kraus edited comment on TIKA-3319 at 3/17/21, 9:32 AM:
---------------------------------------------------------------

[~tilman]

I only got to spend a little time on this today.

Installing Tika has been kind of problematic for me. I feel pretty bad at this.

01 I have Maven installed (via Eclipse)
02 I went to the Maven Repository and got a bunch of files I'm not really sure 
are necessary to do what I want (parse PDFs) but I got them anyway (core, 
langdetect, parsers, xmp) (Maven Repository link: 
[https://mvnrepository.com/artifact/org.apache.tika] )
03 I unzipped tika-app-1.25 into its own folder
04 So now it's in...PATH\Tika-1.25\tika-app-1.25\(all the Tika .jar files)
05 So my understanding is I need to open the Windows command line in admin mode 
and do something like this: mvn install -DskipTests (because the tests seemed 
to trip up the computer somehow when I installed Tika 1.24)
06 So command line Maven gives me a Build Failure message 
{quote}[ERROR] The goal you specified requires a project to execute but there 
is no POM in this directory (C:\CODING\Apache Tika\Tika-1.25). Please verify 
you invoked Maven from the correct directory. -> [Help 1]
 [ERROR]
 [ERROR] To see the full stack trace of the errors, re-run Maven with the -e 
switch.
 [ERROR] Re-run Maven using the -X switch to enable full debug logging.
 [ERROR]
 [ERROR] For more information about the errors and possible solutions, please 
read the following articles:
 [ERROR] [Help 1] 
[http://cwiki.apache.org/confluence/display/MAVEN/MissingProjectException]
{quote}
07 Then I move to the PATH\Tika-1.25\tika-app-1.25\(all the Tika .jar files) 
folder and execute the same command....same thing
08 Then I read the Help 1 page referenced in the ERROR messages
09 It says...
{quote}This error indicates that you tried to execute a goal which requires a 
POM but Maven didn't find a {{pom.xml}} file in the directory you invoked it 
from. In most cases, fixing this is merely a matter of changing the current 
directory in the command prompt to point at your project's base directory. If 
you don't want to change the current directory, you have to tell Maven 
explicitly where the POM to build resides like this:
 mvn -f path/to/pom.xml <goals> ...
{quote}
 
10 So then I'm like...which pom.xml? There's like...20
11 I read somewhere that you want to go on the Maven Repository and get some 
.xml file out of the modules there (which is partly why I did what I did at 
step 02), but I couldn't find anything that looked like an install .xml file 
just hanging out showing itself, and I definitely didn't see any pom.xml files 
in the Maven Repository.
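
On the "which pom.xml?" question in step 10, here is a minimal Python sketch for listing the candidates. It assumes the folder being searched is a multi-module source tree, in which case the shallowest pom.xml usually marks the "project's base directory" that the Maven help page mentions. That is only a heuristic, and the names find_poms and likely_base_dir are made up for illustration.

```python
from pathlib import Path


def find_poms(root):
    """Return every pom.xml under root, shallowest first."""
    root = Path(root)
    poms = [p for p in root.rglob("pom.xml") if p.is_file()]
    # Sort by nesting depth relative to root, then by name for stable output.
    return sorted(poms, key=lambda p: (len(p.relative_to(root).parts), str(p)))


def likely_base_dir(root):
    """Guess the Maven base directory: the folder holding the shallowest pom.xml.

    Returns None when no pom.xml exists under root at all.
    """
    poms = find_poms(root)
    return poms[0].parent if poms else None
```

Calling likely_base_dir on the folder from step 04 would give either a directory to pass to `mvn -f <that directory>\pom.xml install -DskipTests`, or None. A None result would mean there is nothing for Maven to build in that folder.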

 



> Caused by: java.lang.NullPointerException (and more!)
> -----------------------------------------------------
>
>                 Key: TIKA-3319
>                 URL: https://issues.apache.org/jira/browse/TIKA-3319
>             Project: Tika
>          Issue Type: Bug
>          Components: general
>    Affects Versions: 1.24.1
>         Environment: Windows 10
> Tika 1.24.1.jar
> Tika 1.24 python module
> python 3.9.2
> tesseract-ocr-w64-setup-v5.0.0-alpha.20201127
> (anything else that may be relevant?)
>            Reporter: Richard Kraus
>            Priority: Major
>
> So...in sum
>  1) it somehow doesn't "point" to a parser? (but it kinda does...)
>  2) it says that I'm excluding tesseract from tika....I don't know how this 
> happened to begin with
>  3) and now...urllib in python by using the tika package suddenly can't 
> figure out tika exists...
> Please assist. Thank you in advance. 
> 01 Tika-1.24.1.jar and 1.24 python module have been running well for months 
> on my machine.
>  02 Then I get tesseract and a couple other things to integrate with it.
>  03 Then I upgrade python from 3.8.2 to 3.9.2
>  04 So I have always set the windows 10 $env: variable to something like 
> TIKA_SERVER_JAR="<yourpath>/tika-server.jar"
>  05 Then I run the tika python module. I get this urllib problem....
>  urllib.error.URLError: <urlopen error unknown url type: c>
>  06 Supposedly this is fixed by setting the $env: variable to something 
> like...
>  TIKA_SERVER_JAR="file:///<yourpath>/tika-server.jar"
>  07 So I do this and mess around with it; no dice.
>  08 So then I'm trying to run Tika on powershell right?
>  java -jar "C:\PATH\TO\tika-app-1.24.1.jar" --gui
>  brings up the gui but it gives me these "Warnings" now...
>  
> {quote}Mar 14, 2021 10:33:27 PM 
> org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
>  WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
>  See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]
>  for optional dependencies.
> Mar 14, 2021 10:33:27 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
>  WARNING: Tesseract OCR is installed and will be automatically applied to 
> image files unless
>  you've excluded the TesseractOCRParser from the default parser.
>  Tesseract may dramatically slow down content extraction (TIKA-2359).
>  As of Tika 1.15 (and prior versions), Tesseract is automatically called.
>  In future versions of Tika, users may need to turn the TesseractOCRParser on 
> via TikaConfig.
>  Mar 14, 2021 10:33:27 PM 
> org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
>  WARNING: org.xerial's sqlite-jdbc is not loaded.
>  Please provide the jar on your classpath to parse sqlite files.
>  See tika-parsers/pom.xml for the correct version.
> {quote}
> 09 so now when I try to use the --gui to parse a file I have parsed before it 
> shows this message...
>  
> {quote}Apache Tika was unable to parse the document at 
> C:\CODING\Apache Tika\Test03.pdf.
>  The full exception stack trace is included below:
>  org.apache.tika.exception.TikaException: Unexpected RuntimeException from org.apache.tika.parser.pdf.PDFParser@473cb131
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:293)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  at org.apache.tika.parser.AutoDetectParser.parse(AutoDetectParser.java:143)
>  at org.apache.tika.parser.ParserDecorator.parse(ParserDecorator.java:188)
>  at org.apache.tika.parser.DigestingParser.parse(DigestingParser.java:84)
>  at org.apache.tika.gui.TikaGUI.handleStream(TikaGUI.java:358)
>  at org.apache.tika.gui.TikaGUI.openFile(TikaGUI.java:309)
>  at org.apache.tika.gui.TikaGUI.actionPerformed(TikaGUI.java:267)
>  at java.desktop/javax.swing.AbstractButton.fireActionPerformed(AbstractButton.java:1967)
>  at java.desktop/javax.swing.AbstractButton$Handler.actionPerformed(AbstractButton.java:2308)
>  at java.desktop/javax.swing.DefaultButtonModel.fireActionPerformed(DefaultButtonModel.java:405)
>  at java.desktop/javax.swing.DefaultButtonModel.setPressed(DefaultButtonModel.java:262)
>  at java.desktop/javax.swing.AbstractButton.doClick(AbstractButton.java:369)
>  at java.desktop/javax.swing.plaf.basic.BasicMenuItemUI.doClick(BasicMenuItemUI.java:1020)
>  at java.desktop/javax.swing.plaf.basic.BasicMenuItemUI$Handler.mouseReleased(BasicMenuItemUI.java:1064)
>  at java.desktop/java.awt.Component.processMouseEvent(Component.java:6636)
>  at java.desktop/javax.swing.JComponent.processMouseEvent(JComponent.java:3342)
>  at java.desktop/java.awt.Component.processEvent(Component.java:6401)
>  at java.desktop/java.awt.Container.processEvent(Container.java:2263)
>  at java.desktop/java.awt.Component.dispatchEventImpl(Component.java:5012)
>  at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2321)
>  at java.desktop/java.awt.Component.dispatchEvent(Component.java:4844)
>  at java.desktop/java.awt.LightweightDispatcher.retargetMouseEvent(Container.java:4919)
>  at java.desktop/java.awt.LightweightDispatcher.processMouseEvent(Container.java:4548)
>  at java.desktop/java.awt.LightweightDispatcher.dispatchEvent(Container.java:4489)
>  at java.desktop/java.awt.Container.dispatchEventImpl(Container.java:2307)
>  at java.desktop/java.awt.Window.dispatchEventImpl(Window.java:2764)
>  at java.desktop/java.awt.Component.dispatchEvent(Component.java:4844)
>  at java.desktop/java.awt.EventQueue.dispatchEventImpl(EventQueue.java:772)
>  at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:721)
>  at java.desktop/java.awt.EventQueue$4.run(EventQueue.java:715)
>  at java.base/java.security.AccessController.doPrivileged(AccessController.java:391)
>  at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85)
>  at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:95)
>  at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:745)
>  at java.desktop/java.awt.EventQueue$5.run(EventQueue.java:743)
>  at java.base/java.security.AccessController.doPrivileged(AccessController.java:391)
>  at java.base/java.security.ProtectionDomain$JavaSecurityAccessImpl.doIntersectionPrivilege(ProtectionDomain.java:85)
>  at java.desktop/java.awt.EventQueue.dispatchEvent(EventQueue.java:742)
>  at java.desktop/java.awt.EventDispatchThread.pumpOneEventForFilters(EventDispatchThread.java:203)
>  at java.desktop/java.awt.EventDispatchThread.pumpEventsForFilter(EventDispatchThread.java:124)
>  at java.desktop/java.awt.EventDispatchThread.pumpEventsForHierarchy(EventDispatchThread.java:113)
>  at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:109)
>  at java.desktop/java.awt.EventDispatchThread.pumpEvents(EventDispatchThread.java:101)
>  at java.desktop/java.awt.EventDispatchThread.run(EventDispatchThread.java:90)
>  Caused by: java.lang.NullPointerException
>  at org.apache.tika.parser.pdf.AbstractPDF2XHTML.extractXMPXFA(AbstractPDF2XHTML.java:209)
>  at org.apache.tika.parser.pdf.AbstractPDF2XHTML.endDocument(AbstractPDF2XHTML.java:678)
>  at org.apache.pdfbox.text.PDFTextStripper.writeText(PDFTextStripper.java:267)
>  at org.apache.tika.parser.pdf.PDF2XHTML.process(PDF2XHTML.java:96)
>  at org.apache.tika.parser.pdf.PDFParser.parse(PDFParser.java:174)
>  at org.apache.tika.parser.CompositeParser.parse(CompositeParser.java:280)
>  ... 44 more
> {quote}
> 10 most notably these lines...
> {quote}A) org.apache.tika.exception.TikaException: Unexpected 
> RuntimeException from org.apache.tika.parser.pdf.PDFParser@473cb131
>  B) Caused by: java.lang.NullPointerException
> {quote}
> 11 now here's my java -jar tika-app-1.24.1.jar --dump-current-config
> {quote}Mar 14, 2021 10:15:23 PM 
> org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
>  WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
>  See [https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io]
>  for optional dependencies.
> Mar 14, 2021 10:15:24 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
>  WARNING: Tesseract OCR is installed and will be automatically applied to 
> image files unless
>  you've excluded the TesseractOCRParser from the default parser.
>  Tesseract may dramatically slow down content extraction (TIKA-2359).
>  As of Tika 1.15 (and prior versions), Tesseract is automatically called.
>  In future versions of Tika, users may need to turn the TesseractOCRParser on 
> via TikaConfig.
>  Mar 14, 2021 10:15:24 PM 
> org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
>  WARNING: org.xerial's sqlite-jdbc is not loaded.
>  Please provide the jar on your classpath to parse sqlite files.
>  See tika-parsers/pom.xml for the correct version.
>  <?xml version="1.0" encoding="UTF-8" standalone="no"?>
>  <properties>
>    <!--for example: <mimeTypeRepository resource="/org/apache/tika/mime/tika-mimetypes.xml"/>-->
>    <service-loader dynamic="true" loadErrorHandler="IGNORE"/>
>    <encodingDetectors>
>      <encodingDetector class="org.apache.tika.detect.DefaultEncodingDetector"/>
>    </encodingDetectors>
>    <translator class="org.apache.tika.language.translate.DefaultTranslator"/>
>    <detectors>
>      <detector class="org.apache.tika.detect.DefaultDetector"/>
>    </detectors>
>    <parsers>
>      <parser class="org.apache.tika.parser.DefaultParser"/>
>    </parsers>
>  </properties>
> {quote}
> 12 any help would be greatly appreciated. 
>  13A the odd thing is when I run something like...
>  java -jar tika-app-1.24.1.jar -t Test03.pdf output.txt
> 13B it will print the document text in powershell then print this below it 
> (which I have never gotten before)...
> {quote}Exception in thread "main" java.net.MalformedURLException: no 
> protocol: output.txt
>  at java.base/java.net.URL.<init>(URL.java:672)
>  at java.base/java.net.URL.<init>(URL.java:568)
>  at java.base/java.net.URL.<init>(URL.java:515)
>  at org.apache.tika.cli.TikaCLI.process(TikaCLI.java:488)
>  at org.apache.tika.cli.TikaCLI.main(TikaCLI.java:149)
> {quote}
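
The trace in 13B suggests that tika-app treated output.txt as a second input document rather than as an output file: TikaCLI.process tried to build a URL from it and failed. A minimal sketch of one alternative, assuming the extracted text is written to standard output (which matches what step 13B describes), is to capture that output and write the file from the calling side. The function names below are hypothetical; only the `-t` flag and the file names come from the steps above.

```python
import subprocess


def tika_text_command(jar, document):
    """Build the tika-app command line for plain-text extraction (-t)."""
    return ["java", "-jar", str(jar), "-t", str(document)]


def extract_text(jar, document, output):
    """Run tika-app and save whatever it prints on stdout into `output`.

    Raises subprocess.CalledProcessError if java exits with a non-zero status.
    """
    with open(output, "wb") as sink:
        subprocess.run(tika_text_command(jar, document), stdout=sink, check=True)
```

For example, `extract_text("tika-app-1.24.1.jar", "Test03.pdf", "output.txt")` passes only the PDF to Tika, so no stray argument is left for it to interpret as a URL.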
>  
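
For steps 05 and 06, here is a small sketch of why a bare Windows path produces "unknown url type: c", and of what a file URI for the same path looks like. The jar location is a hypothetical example built from the folder shown in the error messages. Whether the tika Python module then accepts the URI is a separate question that this sketch does not answer.

```python
from pathlib import PureWindowsPath
from urllib.parse import urlparse


def scheme_seen_by_urllib(value):
    """Return the URL scheme urllib parses out of a string."""
    return urlparse(value).scheme


def windows_path_to_file_uri(path):
    """Convert an absolute Windows path to a file:/// URI."""
    return PureWindowsPath(path).as_uri()


# Hypothetical jar location, for illustration only.
jar = r"C:\CODING\Apache Tika\tika-server.jar"

# The drive letter before the colon is read as the URL scheme.
print(scheme_seen_by_urllib(jar))     # c

# Backslashes become forward slashes and the space is percent-encoded.
print(windows_path_to_file_uri(jar))  # file:///C:/CODING/Apache%20Tika/tika-server.jar
```

Note the percent-encoded space: a hand-written `file:///` value that keeps a literal space in "Apache Tika" is not the same string as the one produced here.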



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
