[
https://issues.apache.org/jira/browse/TIKA-1294?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13997641#comment-13997641
]
Tim Allison commented on TIKA-1294:
-----------------------------------
Ha. Glad to hear that the issue I'm seeing isn't just on my docs. I've ruled
out the other image types; the issue seems to happen only with JPEGs. I created
a dummy doc with a bunch of JPEGs, and it doesn't trigger the issue. When I look
at the structure via PDFBox's PDFDebugger, the only difference I see is
that there is a Mask:Stream node in the problematic document and no mask node
in my dummy doc. [~rpialum], have you seen this issue? Do you know if PDFBox
2.0 will fix it?
Yes, that's what I was trying to describe in option 1 above...clearly not clearly
enough. I like your proposed terms. I'd also want to add THUMBNAIL for cases
where an image file represents an attachment that is actually present, as
happens with at least DOCX and RTF (but see TIKA-1283).
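
To make option 1 concrete, here is a minimal sketch of client-side filtering on a resource-type value. Everything in it is hypothetical: the class names, the method, and the INLINE and ATTACHMENT labels are assumptions for illustration (only THUMBNAIL is named above), and none of it is Tika API.

```java
import java.util.ArrayList;
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Sketch only: every type and name here is hypothetical, not Tika API.
class EmbeddedTypeFilterSketch {

    // Assumed labels for how an embedded resource relates to its container.
    enum ResourceType { INLINE, ATTACHMENT, THUMBNAIL }

    // Stand-in for an embedded resource and the metadata value a parser would set.
    static final class EmbeddedResource {
        final String name;
        final ResourceType type;

        EmbeddedResource(String name, ResourceType type) {
            this.name = name;
            this.type = type;
        }
    }

    // Returns the resources whose type is not in the skip set, in input order.
    static List<EmbeddedResource> withoutTypes(List<EmbeddedResource> resources,
                                               Set<ResourceType> skip) {
        List<EmbeddedResource> kept = new ArrayList<>();
        for (EmbeddedResource r : resources) {
            if (!skip.contains(r.type)) {
                kept.add(r);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<EmbeddedResource> found = new ArrayList<>();
        found.add(new EmbeddedResource("page1-image.jpg", ResourceType.INLINE));
        found.add(new EmbeddedResource("report.xlsx", ResourceType.ATTACHMENT));
        found.add(new EmbeddedResource("report-preview.png", ResourceType.THUMBNAIL));

        // A client that only wants real attachments skips the other two types.
        Set<ResourceType> skip = EnumSet.of(ResourceType.INLINE, ResourceType.THUMBNAIL);
        for (EmbeddedResource r : withoutTypes(found, skip)) {
            System.out.println(r.name + " (" + r.type + ")");
        }
    }
}
```

With this shape the parser still extracts everything, and the decision about what to process stays with the client.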
> Add ability to turn off extraction of PDXObjectImages (TIKA-1268) from PDFs
> ---------------------------------------------------------------------------
>
> Key: TIKA-1294
> URL: https://issues.apache.org/jira/browse/TIKA-1294
> Project: Tika
> Issue Type: Improvement
> Reporter: Tim Allison
> Priority: Trivial
> Attachments: TIKA-1294.patch
>
>
> TIKA-1268 added the capability to extract embedded images as regular embedded
> resources...a great feature!
> However, for some use cases, it might not be desirable to extract those types
> of embedded resources. I see two ways of allowing the client to choose
> whether or not to extract those images:
> 1) set a value in the metadata for the extracted images that identifies them
> as embedded PDXObjectImages vs regular image attachments. The client can
> then choose not to process embedded resources with a given metadata value.
> 2) allow the client to set a parameter in the PDFConfig object.
> My initial proposal is to go with option 2, and I'll attach a patch shortly.
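
For comparison, option 2 in the description above could look roughly like the following. This is a sketch under stated assumptions, not the attached patch: the `Config` class, the `extractInlineImages` flag name, its default of true, and the `extractEmbedded` helper are all hypothetical stand-ins rather than the actual PDFConfig API.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: hypothetical stand-ins, not the PDFConfig object or the attached patch.
class InlineImageConfigSketch {

    static final class Config {
        // Assumed default: keep the TIKA-1268 behavior unless the client opts out.
        private boolean extractInlineImages = true;

        boolean getExtractInlineImages() {
            return extractInlineImages;
        }

        void setExtractInlineImages(boolean extractInlineImages) {
            this.extractInlineImages = extractInlineImages;
        }
    }

    // Pretends to walk one PDF: attachments are always reported,
    // image XObjects only when the flag is on.
    static List<String> extractEmbedded(List<String> attachments,
                                        List<String> imageXObjects,
                                        Config config) {
        List<String> out = new ArrayList<>(attachments);
        if (config.getExtractInlineImages()) {
            out.addAll(imageXObjects);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> attachments = new ArrayList<>();
        attachments.add("report.xlsx");
        List<String> images = new ArrayList<>();
        images.add("page1-image.jpg");

        Config config = new Config();
        config.setExtractInlineImages(false); // client opts out of image extraction
        System.out.println(extractEmbedded(attachments, images, config));
    }
}
```

The practical difference from option 1 is where the decision is made: with a config flag the parser can skip the image work entirely, while a metadata value leaves the images extracted and lets the client ignore them afterwards.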