[ https://issues.apache.org/jira/browse/TIKA-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tim Allison updated TIKA-4256:
------------------------------
    Description:
For legacy Tika, we inline all content from embedded files, including OCR content of embedded images.

However, with the RecursiveParserWrapper (the /rmeta endpoint and the -J option), users have to stitch inlined image OCR text back into the container file's content.

For example, if a docx has an image in it and tesseract is invoked, the structure will notionally be:

[
  { "type":"docx", "content": "main content of the file"},
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

It would be useful to allow an option to inline the extracted text in the parent document. I think we want to keep the embedded inline object so that we don't lose metadata from it. So I propose this kind of output:

[
  { "type":"docx", "content": "<body>main content of the file <div type=\"ocr\">ocr'd content</div></body>"},
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

This proposal includes the OCR'd content, marked by a <div type="ocr"> element, in the container file, and it also keeps the OCR'd text in the embedded image.

This will allow more intuitive search for users outside file forensics, and it will be more similar to what we do for PDFs when rendering a page and running OCR on it is configured.

  was:
For legacy Tika, we inline all content from embedded files, including OCR content of embedded images.

However, with the RecursiveParserWrapper (the /rmeta endpoint and the -J option), users have to stitch inlined image OCR text back into the container file's content.

For example, if a docx has an image in it and tesseract is invoked, the structure will notionally be:

[
  { "type":"docx", "content": "main content of the file"},
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

It would be useful to allow an option to inline the extracted text in the parent document. I think we want to keep the embedded inline object so that we don't lose metadata from it. So I propose this kind of output:

[
  { "type":"docx", "content": "<body>main content of the file <div type=\"ocr\">ocr'd content</div></body>"},
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

This will allow more intuitive search for users outside file forensics, and it will be more similar to what we do for PDFs when rendering a page and running OCR on it is configured.


> Allow inlining of ocr'd text in container document
> --------------------------------------------------
>
>                 Key: TIKA-4256
>                 URL: https://issues.apache.org/jira/browse/TIKA-4256
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
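
The proposed output in the description can be illustrated with a small post-processing sketch. This is not Tika code and not how the option would be implemented inside Tika; it only mimics the notional structure from the issue. The function name `inline_ocr_text` is hypothetical, the keys `type`, `content`, and `embeddedType` are the notional ones used in the description (not Tika's actual metadata keys), and the sketch assumes that the first record is the container and that every INLINE child belongs directly to it.

```python
from html import escape


def inline_ocr_text(records):
    """Return a copy of `records` with INLINE children's text added to the container.

    Assumptions (hypothetical, for illustration only):
    - records[0] is the container document;
    - every record with embeddedType == "INLINE" is a direct child of it.
    """
    if not records:
        return []

    container, *embedded = records

    # Wrap the text of each inline child in the <div type="ocr"> marker
    # proposed in the issue. Only &, <, and > are escaped (quote=False),
    # so apostrophes such as the one in "ocr'd" are left untouched.
    ocr_divs = [
        '<div type="ocr">{}</div>'.format(escape(child["content"], quote=False))
        for child in embedded
        if child.get("embeddedType") == "INLINE" and child.get("content")
    ]

    merged = dict(container)
    parts = [escape(container.get("content", ""), quote=False)] + ocr_divs
    merged["content"] = "<body>{}</body>".format(" ".join(parts))

    # The embedded records are kept unchanged so their metadata is not lost.
    return [merged] + [dict(child) for child in embedded]


records = [
    {"type": "docx", "content": "main content of the file"},
    {"type": "jpeg", "content": "ocr'd content", "embeddedType": "INLINE"},
]

result = inline_ocr_text(records)
print(result[0]["content"])
```

Keeping the embedded record in the list alongside the merged container mirrors the intent stated in the description: the OCR text becomes searchable in the parent, while the image's own metadata remains available.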