[ 
https://issues.apache.org/jira/browse/TIKA-4256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated TIKA-4256:
------------------------------
    Description: 
For legacy Tika, we inline all content from embedded files, including the OCR 
text of embedded images.

However, for the RecursiveParserWrapper (the /rmeta endpoint and the -J 
option), users have to stitch the OCR text of inlined images back into the 
container file's content themselves.

For example, if a docx has an image in it and Tesseract is invoked, the 
structure will notionally be:
[
  { "type":"docx", "content": "main content of the file"},
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]
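
To illustrate the stitching step that users currently have to do themselves, 
here is a minimal Python sketch. It assumes the notional keys from the example 
above ("type", "content", "embeddedType"), which are not the real Tika 
metadata keys, and it assumes the first item in the list is the container 
document. The helper name stitch_inline_text is hypothetical.

```python
def stitch_inline_text(items):
    """Append the text of INLINE embedded items to the container's text.

    `items` is a list of dicts shaped like the notional example in this
    issue; the key names are illustrative, not real Tika metadata keys.
    """
    if not items:
        return ""
    # Assumption: the container document is the first entry in the list.
    container, embedded = items[0], items[1:]
    parts = [container.get("content", "")]
    for item in embedded:
        # Only inline objects with extracted text are stitched back in.
        if item.get("embeddedType") == "INLINE" and item.get("content"):
            parts.append(item["content"])
    return "\n".join(parts)


rmeta = [
    {"type": "docx", "content": "main content of the file"},
    {"type": "jpeg", "content": "ocr'd content", "embeddedType": "INLINE"},
]
print(stitch_inline_text(rmeta))
```

Note that this sketch can only append the OCR text at the end of the 
container's text, because the notional structure carries no information about 
where the image sat in the container.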

It would be useful to allow an option to inline the extracted text in the 
parent document. I think we want to keep the embedded inline object so that we 
don't lose metadata from it. So I propose this kind of output:

[
  { "type":"docx", "content": "<body>main content of the file <div type=\"ocr\">ocr'd content</div></body>"},
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

With this proposal, the OCR'd text appears in the container file's content, 
marked by a <div type="ocr"> element, and it still appears in the embedded 
image's own entry.
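
As a rough illustration of the proposed output shape (not of how Tika would 
implement it), the following Python sketch builds it from the notional 
structure. The function name inline_ocr is hypothetical, the keys are the 
notional ones used in this issue, and the sketch assumes that every INLINE 
item's content is OCR text and that the first item is the container.

```python
import copy
from html import escape


def inline_ocr(items):
    """Return a copy of `items` with inline text added to the container.

    Key names follow the notional example in this issue; they are not
    real Tika metadata keys.
    """
    result = copy.deepcopy(items)  # leave the caller's list untouched
    if not result:
        return result
    container = result[0]  # assumption: container comes first
    # Assumption: every INLINE item's content is OCR text.
    divs = [
        '<div type="ocr">' + escape(item["content"], quote=False) + "</div>"
        for item in result[1:]
        if item.get("embeddedType") == "INLINE" and item.get("content")
    ]
    body = escape(container.get("content", ""), quote=False)
    if divs:
        body += " " + " ".join(divs)
    container["content"] = "<body>" + body + "</body>"
    # Embedded items are kept as-is so their metadata is not lost.
    return result


rmeta = [
    {"type": "docx", "content": "main content of the file"},
    {"type": "jpeg", "content": "ocr'd content", "embeddedType": "INLINE"},
]
out = inline_ocr(rmeta)
print(out[0]["content"])
```

The embedded jpeg entry is returned unchanged, which mirrors the intent above 
of keeping the inline object so that its metadata is preserved.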

This will make search more intuitive for users outside file forensics and will 
be closer to what we already do for PDFs, where each page is rendered and then 
OCR'd when that is configured.



  was:
For legacy Tika, we inline all content from embedded files, including the OCR 
text of embedded images.

However, for the RecursiveParserWrapper (the /rmeta endpoint and the -J 
option), users have to stitch the OCR text of inlined images back into the 
container file's content themselves.

For example, if a docx has an image in it and Tesseract is invoked, the 
structure will notionally be:
[
  { "type":"docx", "content": "main content of the file"},
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

It would be useful to allow an option to inline the extracted text in the 
parent document. I think we want to keep the embedded inline object so that we 
don't lose metadata from it. So I propose this kind of output:

[
  { "type":"docx", "content": "<body>main content of the file <div type=\"ocr\">ocr'd content</div></body>"},
  { "type":"jpeg", "content": "ocr'd content", "embeddedType":"INLINE"}
]

This will make search more intuitive for users outside file forensics and will 
be closer to what we already do for PDFs, where each page is rendered and then 
OCR'd when that is configured.


> Allow inlining of ocr'd text in container document
> --------------------------------------------------
>
>                 Key: TIKA-4256
>                 URL: https://issues.apache.org/jira/browse/TIKA-4256
>             Project: Tika
>          Issue Type: Task
>            Reporter: Tim Allison
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
