[ https://issues.apache.org/jira/browse/TIKA-3035?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17043802#comment-17043802 ]
Soren Daugaard commented on TIKA-3035:
--------------------------------------
I feel this should be fixed, since listing the embedded files that were
extracted is valid output. It is not an error or diagnostic information. In my
opinion, stderr should be reserved for errors and diagnostic messages, not for
usable program output.
If the concern is that stdout should contain only clean JSON, one solution
could be to add a JSON format option for listing the embedded files.
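To illustrate the separation I have in mind, here is a minimal sketch. It is not TikaCLI's actual code: the class name ExtractReport, both method names, and the wording of the diagnostic line are hypothetical. Only the "Extracting ... to ..." line format is taken from the output in the issue below.

```java
import java.io.PrintStream;

// Hypothetical sketch, not Tika code: keeps the extraction listing (program
// output) separate from diagnostics.
class ExtractReport {

    // Builds a listing line in the same shape as the one shown in the issue.
    static String listingLine(String name, String mediaType, String target) {
        return "Extracting '" + name + "' (" + mediaType + ") to " + target;
    }

    // Writes the diagnostic to err and the listing to out, so that
    // redirecting stderr does not hide the list of extracted files.
    static void report(PrintStream out, PrintStream err,
                       String name, String mediaType, String target) {
        err.println("INFO processing embedded resource " + name); // assumed diagnostic text
        out.println(listingLine(name, mediaType, target));
    }

    public static void main(String[] args) {
        report(System.out, System.err,
               "image0.jpg", "image/jpeg", "tika-test/out/image0.jpg");
    }
}
```

With this split, running the program with 2> /dev/null would still print the listing line, which is the behaviour I see with 1.22.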
> Tika-app --extract mode outputs to stderr instead of stdout
> -----------------------------------------------------------
>
> Key: TIKA-3035
> URL: https://issues.apache.org/jira/browse/TIKA-3035
> Project: Tika
> Issue Type: Bug
> Components: app
> Affects Versions: 1.23
> Reporter: Soren Daugaard
> Priority: Major
> Labels: app, extract
> Attachments: testPDF_childAttachments.pdf
>
>
> In version 1.23 of Tika I am noticing a problem using the extract
> functionality. When extracting items from a file the "Extracting ... to ... "
> output goes to {{stderr}} instead of {{stdout}}.
> This problem is observed using the runnable jar {{tika-app-1.23.jar}}.
> _*Example to re-create problem:*_
> Here we explode {{testPDF_childAttachments.pdf}} and redirect standard error
> to {{/dev/null}}:
> {code:java}
> $ java -jar tika-app-1.23.jar --extract-dir=tika-test/out/ -z
> testPDF_childAttachments.pdf 2> /dev/null
> {code}
> If I do not redirect stderr I see:
> {code:java}
> $ java -jar tika-app-1.23.jar --extract-dir=tika-test/out/ -z
> testPDF_childAttachments.pdf
> INFO As a convenience, TikaCLI has turned on extraction of
> inline images for the PDFParser (TIKA-2374).
> Aside from the -z option, this is not the default behavior
> in Tika generally or in tika-server.
> Jan 31, 2020 8:06:01 PM org.apache.tika.config.InitializableProblemHandler$3
> handleInitializableProblem
> WARNING: J2KImageReader not loaded. JPEG2000 files will not be processed.
> See https://pdfbox.apache.org/2.0/dependencies.html#jai-image-io
> for optional dependencies.
> Jan 31, 2020 8:06:01 PM org.apache.tika.config.InitializableProblemHandler$3
> handleInitializableProblem
> WARNING: Tesseract OCR is installed and will be automatically applied to
> image files unless
> you've excluded the TesseractOCRParser from the default parser.
> Tesseract may dramatically slow down content extraction (TIKA-2359).
> As of Tika 1.15 (and prior versions), Tesseract is automatically called.
> In future versions of Tika, users may need to turn the TesseractOCRParser on
> via TikaConfig.
> Jan 31, 2020 8:06:01 PM org.apache.tika.config.InitializableProblemHandler$3
> handleInitializableProblem
> WARNING: org.xerial's sqlite-jdbc is not loaded.
> Please provide the jar on your classpath to parse sqlite files.
> See tika-parsers/pom.xml for the correct version.
> Extracting 'image0.jpg' (image/jpeg) to
> tika-test/out/3975acae-089c-43ae-a3bc-04e4987a0282-image0.jpg
> Extracting 'image1.tif' (image/tiff) to
> tika-test/out/8d11e4e3-735b-4b0b-9441-3ed4332c2f53-image1.tif
> WARN No Unicode mapping for f_i (31) in font SCZFMD+HelveticaNeueLTStd-Roman
> Extracting 'Press Quality(1).joboptions' (text/plain) to
> tika-test/out/28c3fb48-30ea-403b-8a35-252c8f692305-Press Quality(1).joboptions
> Extracting 'Unit10.doc' (application/msword) to
> tika-test/out/008b9157-75f3-453b-bdfd-d5403c56891c-Unit10.doc
> {code}
> Using 1.22 I correctly see the extracted files in {{stdout}} when redirecting
> {{stderr}}:
> {code:java}
> $ java -jar tika-app-1.22.jar --extract-dir=tika-test/out/ -z
> testPDF_childAttachments.pdf 2> /dev/null
> Extracting 'image0.jpg' (image/jpeg) to
> tika-test/out/4ec61a12-4e5f-4de3-bee8-fa15521c374a-image0.jpg
> Extracting 'image1.tif' (image/tiff) to
> tika-test/out/004fbeb5-4b0e-4d35-8c50-23a420dccc99-image1.tif
> Extracting 'Press Quality(1).joboptions' (text/plain) to
> tika-test/out/8f6174d1-f0c7-4143-990d-a922c2e9513a-Press Quality(1).joboptions
> Extracting 'Unit10.doc' (application/msword) to
> tika-test/out/b2508bee-745d-4051-b927-0f5c31b97c1e-Unit10.doc
> {code}
>
>