[ https://issues.apache.org/jira/browse/TIKA-1232?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13894376#comment-13894376 ]
Andrew Jackson commented on TIKA-1232:
--------------------------------------
Great!
For (1), I'm very happy for that code to go to PDFBox. I'm fairly sure PDFBox
doesn't already do anything along those lines, but I am not all that familiar
with that codebase, so it's worth checking first.
As for (2), I've only tested on a fairly small number of PDFs, because only the
more recent versions of the Adobe tools actually write Adobe extension levels,
and even then only when necessary. I ran that code against a web archive corpus
containing around 2 billion resources, including many millions of PDFs, but
because that dataset only runs up to 2010, I found a grand total of eight PDFs
that used Adobe Extension Level 3. It worked fine on those!
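For readers who have not met extension levels before, here is a rough sketch of where the two values live in a file. It is not the code discussed above and it is not PDFBox or Tika API; the class PdfVersionSniffer and its methods are hypothetical names. It assumes the header has the usual %PDF-M.m form and that the catalog's /Extensions dictionary is stored uncompressed, so it will miss an /ExtensionLevel entry held inside a compressed object stream.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; a real parser should walk
// the document catalog rather than scan raw bytes.
final class PdfVersionSniffer {
    // Header comment such as "%PDF-1.7".
    private static final Pattern HEADER = Pattern.compile("%PDF-(\\d\\.\\d)");
    // Entry such as "/ExtensionLevel 3" inside the /Extensions dictionary.
    private static final Pattern EXTENSION_LEVEL =
            Pattern.compile("/ExtensionLevel\\s+(\\d{1,4})");

    // Returns the first header version found, or null if there is none.
    static String headerVersion(byte[] data) {
        Matcher m = HEADER.matcher(latin1(data));
        return m.find() ? m.group(1) : null;
    }

    // Returns the first extension level found, or -1 if there is none.
    static int extensionLevel(byte[] data) {
        Matcher m = EXTENSION_LEVEL.matcher(latin1(data));
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    // ISO-8859-1 maps every byte to one char, so binary content survives.
    private static String latin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String sample = "%PDF-1.7\n1 0 obj\n<< /Type /Catalog /Extensions "
                + "<< /ADBE << /BaseVersion /1.7 /ExtensionLevel 3 >> >> >>\nendobj\n";
        byte[] data = sample.getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(headerVersion(data));  // 1.7
        System.out.println(extensionLevel(data)); // 3
    }
}
```

A byte scan like this is only good enough for a quick corpus survey; the version in the header can also be overridden by a /Version entry in the catalog, which this sketch ignores.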
Finally, on the metadata property scheme, I feel the 'right place' is as a
parameter on the Content-Type, but I accept that may confuse client code (i.e.
people assuming type.equals("application/pdf") should always work, even though
that check already fails for types like HTML, which carry a charset parameter).
Note that the parameter approach also allows you to do version detection in
Tika's
[custom-mimetypes.xml|https://github.com/openplanets/nanite/blob/master/nanite-core/src/main/resources/org/apache/tika/mime/custom-mimetypes.xml#L357],
which I find rather handy. Of course, you are also welcome to take any of
those signatures if they are of interest.
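To make the client-code concern concrete, here is a minimal sketch of comparing on the base type rather than the full string. ContentTypeParams is a hypothetical helper, not a Tika class, and the parameter name "version" is assumed for illustration. It does not handle quoted parameter values that contain a semicolon.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Tika: splits a Content-Type string
// into its base type and its parameters.
final class ContentTypeParams {

    // Returns the type without parameters, e.g. "application/pdf".
    static String baseType(String contentType) {
        int semi = contentType.indexOf(';');
        String base = semi < 0 ? contentType : contentType.substring(0, semi);
        return base.trim().toLowerCase(Locale.ROOT);
    }

    // Returns the parameters in order of appearance, e.g. {version=1.4}.
    static Map<String, String> parameters(String contentType) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        String[] parts = contentType.split(";");
        for (int i = 1; i < parts.length; i++) {
            int eq = parts[i].indexOf('=');
            if (eq < 0) {
                continue; // skip malformed parameters
            }
            String key = parts[i].substring(0, eq).trim().toLowerCase(Locale.ROOT);
            String value = parts[i].substring(eq + 1).trim();
            params.put(key, value);
        }
        return params;
    }

    public static void main(String[] args) {
        String type = "application/pdf; version=1.4"; // assumed example value
        System.out.println(type.equals("application/pdf"));           // false
        System.out.println(baseType(type).equals("application/pdf")); // true
        System.out.println(parameters(type).get("version"));          // 1.4
    }
}
```

Client code that compares base types keeps working whether or not a version parameter is present, which is the same discipline already needed for text/html with a charset.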
> Add PDF version to PDFParser output
> -----------------------------------
>
> Key: TIKA-1232
> URL: https://issues.apache.org/jira/browse/TIKA-1232
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 1.5
> Environment: JDK6
> Reporter: William Palmer
> Assignee: Tim Allison
> Priority: Minor
> Attachments: pdfversion.patch
>
>
> I'd like to identify the PDF version of files. This is not currently reported
> by the PDFParser, although the information is available via PDFBox. I have
> attached a patch that adds the format version to the Metadata object.
> However, I am not familiar enough with the Tika source to know if an
> alternative metadata key should be used, or this new one added.
> Comments welcome.