[
https://issues.apache.org/jira/browse/TIKA-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17847909#comment-17847909
]
Hudson commented on TIKA-4256:
------------------------------
SUCCESS: Integrated in Jenkins build Tika » tika-main-jdk11 #1634 (See
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk11/1634/])
TIKA-4256 -- allow inlining of ocr'd content in the RecursiveParserWrapper
(#1762) (github:
[https://github.com/apache/tika/commit/7a03331f87e44548b30970b66d24a81823bc68ab])
* (add)
tika-core/src/main/java/org/apache/tika/extractor/ParentContentHandler.java
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRParser.java
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-ocr-module/src/main/java/org/apache/tika/parser/ocr/TesseractOCRConfig.java
* (edit)
tika-parsers/tika-parsers-standard/tika-parsers-standard-package/src/test/java/org/apache/tika/parser/ocr/TesseractOCRParserTest.java
* (edit)
tika-core/src/main/java/org/apache/tika/parser/RecursiveParserWrapper.java
> Allow inlining of ocr'd text in container document
> --------------------------------------------------
>
> Key: TIKA-4256
> URL: https://issues.apache.org/jira/browse/TIKA-4256
> Project: Tika
> Issue Type: Task
> Reporter: Tim Allison
> Priority: Major
>
> For legacy tika, we're inlining all content from embedded files including ocr
> content of embedded images.
> However, for the RecursiveParserWrapper (/rmeta, the -J option), users have
> to stitch inlined image ocr text back into the container file's content.
> For example, if a docx has an image in it and tesseract is invoked, the
> structure will notionally be:
> [
> { "type":"docx", "content": "main content of the file"},
> { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
> ]
> It would be useful to allow an option to inline the extracted text in the
> parent document. I think we want to keep the embedded inline object so that
> we don't lose metadata from it. So I propose this kind of output:
> [
> { "type":"docx", "content": "<body>main content of the file <div type=\"ocr\">ocr'd content</div></body>"},
> { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
> ]
> This proposal includes the ocr'd content marked by <div/> in the container
> file, and it includes the ocr'd text in the embedded image.
> For now this proposal does not include inlining ocr'd text from thumbnails.
> We can do that on a later ticket if desired.
> This will allow a more intuitive search for users who are not doing file
> forensics, and it will be more similar to what we're doing with rendering a
> page -> ocr in PDFs when that is configured.
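
As a rough illustration of the stitching described above, here is a minimal sketch in plain Java with no Tika dependency. It is not the code from the commit. The class name InlineOcrSketch, the method inlineOcr, and the map keys "type", "content", and "embeddedType" are hypothetical and only mirror the notional JSON in the description. The sketch assumes the first entry is the container, appends the OCR'd text at the end of the container's content rather than at the image's position, and does no XML escaping.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class InlineOcrSketch {

    // Hypothetical keys that mirror the notional JSON in the ticket;
    // they are not real Tika metadata keys.
    static final String CONTENT = "content";
    static final String EMBEDDED_TYPE = "embeddedType";

    /**
     * Returns a copy of the entries in which the container (assumed to be the
     * first entry) also carries the content of every INLINE embedded entry,
     * each wrapped in a div marked type="ocr". Embedded entries are kept
     * unchanged so that their metadata is not lost.
     */
    static List<Map<String, String>> inlineOcr(List<Map<String, String>> entries) {
        List<Map<String, String>> out = new ArrayList<>();
        if (entries.isEmpty()) {
            return out;
        }
        StringBuilder body = new StringBuilder("<body>");
        body.append(entries.get(0).getOrDefault(CONTENT, ""));
        for (int i = 1; i < entries.size(); i++) {
            Map<String, String> embedded = entries.get(i);
            if ("INLINE".equals(embedded.get(EMBEDDED_TYPE))) {
                body.append(" <div type=\"ocr\">")
                    .append(embedded.getOrDefault(CONTENT, ""))
                    .append("</div>");
            }
        }
        body.append("</body>");

        // Copy the container and replace only its content.
        Map<String, String> container = new LinkedHashMap<>(entries.get(0));
        container.put(CONTENT, body.toString());
        out.add(container);

        // Copy the embedded entries as they are.
        for (int i = 1; i < entries.size(); i++) {
            out.add(new LinkedHashMap<>(entries.get(i)));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> docx = new LinkedHashMap<>();
        docx.put("type", "docx");
        docx.put(CONTENT, "main content of the file");

        Map<String, String> jpeg = new LinkedHashMap<>();
        jpeg.put("type", "jpeg");
        jpeg.put(CONTENT, "ocr'd content");
        jpeg.put(EMBEDDED_TYPE, "INLINE");

        List<Map<String, String>> entries = new ArrayList<>();
        entries.add(docx);
        entries.add(jpeg);

        for (Map<String, String> entry : inlineOcr(entries)) {
            System.out.println(entry);
        }
    }
}
```

With the two entries built in main, the container's content becomes the string shown in the proposed output, and the jpeg entry is left as it was.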
--
This message was sent by Atlassian Jira
(v8.20.10#820010)