[jira] [Comment Edited] (TIKA-3035) Tika-app --extract mode outputs to stderr instead of stdout

Tim Allison (Jira) Tue, 25 Feb 2020 07:30:48 -0800


    [ 
https://issues.apache.org/jira/browse/TIKA-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17044573#comment-17044573
 ]


Tim Allison edited comment on TIKA-3035 at 2/25/20 3:29 PM:
------------------------------------------------------------

Alright, IIUC from private communication, I shouldn't have accepted the initial 
patch.  

# I didn't realize it was even possible to combine attachment extraction 
({{-z}}) and full recursive parsing ({{-J}}) in the same command!  
Congratulations!!!
# Let's not do that.

Unless there are objections, I'll revert the {{-z}} behavior to print progress 
to STDOUT.
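
For illustration only, a minimal sketch of the split being restored here: progress lines go to one stream (normally STDOUT) and warnings go to another (normally STDERR), so that {{2> /dev/null}} hides the noise but keeps the "Extracting ..." lines. The class {{ExtractReporter}} and its methods are hypothetical names made up for this sketch; this is not the actual TikaCLI code.

```java
import java.io.PrintStream;

// Hypothetical sketch, NOT TikaCLI's real implementation.
// Keeps user-facing progress and diagnostics on separate streams.
class ExtractReporter {
    private final PrintStream progress;     // normally System.out
    private final PrintStream diagnostics;  // normally System.err

    ExtractReporter(PrintStream progress, PrintStream diagnostics) {
        this.progress = progress;
        this.diagnostics = diagnostics;
    }

    // Progress line in the same shape as the output quoted in this issue.
    void extracting(String name, String mediaType, String target) {
        progress.println("Extracting '" + name + "' (" + mediaType + ") to " + target);
    }

    // Warnings never touch the progress stream.
    void warn(String message) {
        diagnostics.println("WARNING: " + message);
    }

    public static void main(String[] args) {
        ExtractReporter reporter = new ExtractReporter(System.out, System.err);
        reporter.warn("J2KImageReader not loaded.");
        reporter.extracting("image0.jpg", "image/jpeg", "tika-test/out/image0.jpg");
    }
}
```

Run with {{2> /dev/null}}, a program like this would still print the "Extracting ..." line, which is the 1.22 behavior described in the report.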

Longer term, e.g. for Tika 2.0, we really ought to fix up the commandline 
processing and maybe use an actual commandline parser library, e.g. commons-cli.
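
As a rough sketch of what stricter commandline processing could look like, the plain-Java check below rejects a command that combines {{-z}} and {{-J}}. The class {{ModeCheck}} and the method {{requireAtMostOne}} are hypothetical names for this example, and the sketch assumes the two flags should be mutually exclusive. With commons-cli, an {{OptionGroup}} could presumably play a similar role, since it is designed for mutually exclusive options.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, NOT how TikaCLI parses its arguments today.
class ModeCheck {
    // Throws if more than one of the mutually exclusive flags is present.
    static void requireAtMostOne(List<String> args, String... exclusive) {
        String seen = null;
        for (String flag : exclusive) {
            if (args.contains(flag)) {
                if (seen != null) {
                    throw new IllegalArgumentException(
                            seen + " and " + flag + " cannot be combined");
                }
                seen = flag;
            }
        }
    }

    public static void main(String[] args) {
        try {
            requireAtMostOne(Arrays.asList(args), "-z", "-J");
        } catch (IllegalArgumentException e) {
            // Usage errors are diagnostics, so they go to stderr.
            System.err.println(e.getMessage());
            System.exit(1);
        }
    }
}
```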

FTR: I accept serious fault for introducing some true nuttiness in the 
commandline, but here we are...


was (Author: [email protected]):
Alright, IIUC from private communication, I shouldn't have accepted the initial 
patch.  

# I didn't realize it was even possible to extract attachments {{-z}} and 
{{-J}} in the same command!  Congratulations!!!
# Let's not do that.

Unless there are objections, I'll revert the {{-z}} behavior to print progress 
to STDOUT.

Longer term, e.g. for Tika 2.0, we really ought to fix up the commandline 
processing and maybe use an actual commandline parser library, e.g. commons-cli.

FTR: I accept serious fault for introducing some true nuttiness in the 
commandline, but here we are...

> Tika-app --extract mode outputs to stderr instead of stdout
> -----------------------------------------------------------
>
>                 Key: TIKA-3035
>                 URL: https://issues.apache.org/jira/browse/TIKA-3035
>             Project: Tika
>          Issue Type: Bug
>          Components: app
>    Affects Versions: 1.23
>            Reporter: Soren Daugaard
>            Priority: Major
>              Labels: app, extract
>         Attachments: testPDF_childAttachments.pdf
>
>
> In version 1.23 of Tika I am noticing a problem using the extract 
> functionality. When extracting items from a file the "Extracting ... to ... " 
> output goes to {{stderr}} instead of {{stdout}}.  
> This problem is observed using the runnable jar {{tika-app-1.23.jar}}.
> _*Example to re-create problem:*_
> Here we explode {{testPDF_childAttachments.pdf}} and redirect standard error 
> to {{/dev/null}}:
> {code:java}
> $ java -jar tika-app-1.23.jar --extract-dir=tika-test/out/ -z 
> testPDF_childAttachments.pdf 2> /dev/null
> {code}
> If I do not redirect stderr I see:
> {code:java}
> $ java -jar tika-app-1.23.jar --extract-dir=tika-test/out/ -z 
> testPDF_childAttachments.pdf
> INFO  As a convenience, TikaCLI has turned on extraction of
> inline images for the PDFParser (TIKA-2374).
> Aside from the -z option, this is not the default behavior
> in Tika generally or in tika-server.
> Jan 31, 2020 8:06:01 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> Jan 31, 2020 8:06:01 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: Tesseract OCR is installed and will be automatically applied to 
> image files unless
> you've excluded the TesseractOCRParser from the default parser.
> Tesseract may dramatically slow down content extraction (TIKA-2359).
> As of Tika 1.15 (and prior versions), Tesseract is automatically called.
> In future versions of Tika, users may need to turn the TesseractOCRParser on 
> via TikaConfig.
> Jan 31, 2020 8:06:01 PM org.apache.tika.config.InitializableProblemHandler$3 
> handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> Extracting 'image0.jpg' (image/jpeg) to 
> tika-test/out/3975acae-089c-43ae-a3bc-04e4987a0282-image0.jpg
> Extracting 'image1.tif' (image/tiff) to 
> tika-test/out/8d11e4e3-735b-4b0b-9441-3ed4332c2f53-image1.tif
> WARN  No Unicode mapping for f_i (31) in font SCZFMD+HelveticaNeueLTStd-Roman
> Extracting 'Press Quality(1).joboptions' (text/plain) to 
> tika-test/out/28c3fb48-30ea-403b-8a35-252c8f692305-Press Quality(1).joboptions
> Extracting 'Unit10.doc' (application/msword) to 
> tika-test/out/008b9157-75f3-453b-bdfd-d5403c56891c-Unit10.doc
> {code}
> Using 1.22 I correctly see the extracted files in {{stdout}} when redirecting 
> {{stderr}}:
> {code:java}
> $ java -jar tika-app-1.22.jar --extract-dir=tika-test/out/ -z 
> testPDF_childAttachments.pdf 2> /dev/null
> Extracting 'image0.jpg' (image/jpeg) to 
> tika-test/out/4ec61a12-4e5f-4de3-bee8-fa15521c374a-image0.jpg
> Extracting 'image1.tif' (image/tiff) to 
> tika-test/out/004fbeb5-4b0e-4d35-8c50-23a420dccc99-image1.tif
> Extracting 'Press Quality(1).joboptions' (text/plain) to 
> tika-test/out/8f6174d1-f0c7-4143-990d-a922c2e9513a-Press Quality(1).joboptions
> Extracting 'Unit10.doc' (application/msword) to 
> tika-test/out/b2508bee-745d-4051-b927-0f5c31b97c1e-Unit10.doc
> {code}
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
