[
https://issues.apache.org/jira/browse/TIKA-2559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16346160#comment-16346160
]
Matt Sheppard commented on TIKA-2559:
-------------------------------------
Ah,
[http://itaccessibility.arizona.edu/sites/itaccessibility.arizona.edu/files/documents/acrobat-xi-pdf-accessibility-overview.pdf]
is a publicly accessible example document that shows the problem (though its
recorded language is 'en' rather than 'en-US' as in my original example).
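
Because the recorded value differs between the two documents ('en' here,
'en-US' in my original example), a test written against this public file would
have to expect 'en'. As a rough sketch of how the two values relate, the
following uses only java.util.Locale from the JDK; the class and method names
are made up for illustration and are not part of Tika or PDFBox.

```java
import java.util.Locale;

public class LanguageTagSketch {

    // Hypothetical helper: returns the primary language subtag of a
    // BCP 47 language tag, so "en" and "en-US" both yield "en".
    static String primaryLanguage(String tag) {
        return Locale.forLanguageTag(tag).getLanguage();
    }

    public static void main(String[] args) {
        // Value recorded in the public example document.
        System.out.println(primaryLanguage("en"));
        // Value recorded in the original (private) example document.
        System.out.println(primaryLanguage("en-US"));
    }
}
```

This only illustrates that both documents declare English; the proposed change
itself would pass the recorded string through unmodified.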
> Expose language metadata from PDF documents
> -------------------------------------------
>
> Key: TIKA-2559
> URL: https://issues.apache.org/jira/browse/TIKA-2559
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Affects Versions: 2.0
> Reporter: Matt Sheppard
> Priority: Major
>
> Tika does not currently return the language from a PDF's metadata (at least
> for an example PDF that I'm seeking permission to share with you, and perhaps
> for all PDFs). It would be useful to me (and I imagine others) if it could do so.
> ----
> The example PDF I have does get a language when processed with exiftool...
> {noformat}
> $ exiftool -X /tmp/my-example.pdf |grep -i lang
> <PDF:Language>en-US</PDF:Language>
> {noformat}
> whereas it does not with Tika.
>
> I looked briefly into the PDF parsing code, and it appears that the language
> value in question is available within PDFBox's document catalog, so I can
> pass it through with a change such as...
> {code:java}
> diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
> index b2a15cab6..66b1c9343 100644
> --- a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
> +++ b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
> @@ -224,7 +224,10 @@ public class PDFParser extends AbstractParser implements Initializable {
> metadata.set(AccessPermissions.CAN_PRINT_DEGRADED,
> Boolean.toString(ap.canPrintDegraded()));
> -
> + if (document.getDocumentCatalog().getLanguage() != null) {
> + metadata.set(Metadata.CONTENT_LANGUAGE, document.getDocumentCatalog().getLanguage());
> + }
> +
> //now go for the XMP
> Document dom = loadDOM(document.getDocumentCatalog().getMetadata(),
> metadata, context);
> diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
> index 93966e4f2..7b7ba14fe 100644
> --- a/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
> +++ b/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
> @@ -1310,6 +1310,14 @@ public class PDFParserTest extends TikaTest {
> assertContains("Tika - Content", content);
> }
> + @Test
> + public void testMissingLanguage() throws Exception {
> + Metadata metadata = getXML("my-example.pdf").metadata;
> + System.out.println(metadata);
> + assertEquals("application/pdf", metadata.get(Metadata.CONTENT_TYPE));
> + assertEquals("en-US", metadata.get(Metadata.CONTENT_LANGUAGE));
> + }
> +
> @Test
> public void testConfiguringMoreParams() throws Exception {
> try (InputStream configIs = getClass().getResourceAsStream("/org/apache/tika/parser/pdf/tika-inline-config.xml")) {
> {code}
>
> It's my first time looking at this code, so that change may be a bit naive,
> but hopefully shows what I'm getting at.