Matt Sheppard created TIKA-2559:
-----------------------------------
Summary: Expose language metadata from PDF documents
Key: TIKA-2559
URL: https://issues.apache.org/jira/browse/TIKA-2559
Project: Tika
Issue Type: Improvement
Components: parser
Affects Versions: 2.0
Reporter: Matt Sheppard
Tika does not currently return the language from a PDF's metadata (at least
for an example PDF that I'm seeking permission to share with you, and perhaps
for all PDFs). It would be useful to me (and I imagine others) if it could do so.
----
The example PDF I have does get a language when processed with exiftool...
{noformat}
$ exiftool -X /tmp/my-example.pdf |grep -i lang
<PDF:Language>en-US</PDF:Language>
{noformat}
whereas it does not with Tika.
I looked briefly into the PDF parsing code, and it appears that the language
value in question is available within PDFBox's document catalog, so I can pass
it through with a change such as...
{code:java}
diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
index b2a15cab6..66b1c9343 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/pdf/PDFParser.java
@@ -224,7 +224,10 @@ public class PDFParser extends AbstractParser implements Initializable {
metadata.set(AccessPermissions.CAN_PRINT_DEGRADED,
Boolean.toString(ap.canPrintDegraded()));
-
+ if (document.getDocumentCatalog().getLanguage() != null) {
+ metadata.set(Metadata.CONTENT_LANGUAGE, document.getDocumentCatalog().getLanguage());
+ }
+
//now go for the XMP
Document dom = loadDOM(document.getDocumentCatalog().getMetadata(),
metadata, context);
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
index 93966e4f2..7b7ba14fe 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/pdf/PDFParserTest.java
@@ -1310,6 +1310,14 @@ public class PDFParserTest extends TikaTest {
assertContains("Tika - Content", content);
}
+ @Test
+ public void testMissingLanguage() throws Exception {
+ Metadata metadata = getXML("my-example.pdf").metadata;
+ System.out.println(metadata);
+ assertEquals("application/pdf", metadata.get(Metadata.CONTENT_TYPE));
+ assertEquals("en-US", metadata.get(Metadata.CONTENT_LANGUAGE));
+ }
+
@Test
public void testConfiguringMoreParams() throws Exception {
try (InputStream configIs = getClass().getResourceAsStream("/org/apache/tika/parser/pdf/tika-inline-config.xml")) {
{code}
It's my first time looking at this code, so that change may be a bit naive, but
it hopefully shows what I'm getting at.
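
For illustration only, here is a minimal, self-contained sketch of the same guard with the catalog language read once and blank values skipped. It does not use Tika or PDFBox: the Catalog interface is a hypothetical stand-in for the single PDDocumentCatalog method the diff calls, a plain Map stands in for Tika's Metadata, and the "Content-Language" key is assumed to be the string behind Metadata.CONTENT_LANGUAGE.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LanguageMetadataSketch {

    // Hypothetical stand-in for the one PDDocumentCatalog method the patch calls.
    public interface Catalog {
        String getLanguage();
    }

    // Assumed to match the string behind Tika's Metadata.CONTENT_LANGUAGE.
    public static final String CONTENT_LANGUAGE = "Content-Language";

    // Copies the catalog language into the metadata map, skipping null or blank values.
    public static void copyLanguage(Catalog catalog, Map<String, String> metadata) {
        String language = catalog.getLanguage(); // read once instead of twice
        if (language != null && !language.trim().isEmpty()) {
            metadata.put(CONTENT_LANGUAGE, language.trim());
        }
    }

    public static void main(String[] args) {
        Map<String, String> withLanguage = new LinkedHashMap<>();
        copyLanguage(() -> "en-US", withLanguage);
        System.out.println(withLanguage); // prints {Content-Language=en-US}

        Map<String, String> withoutLanguage = new LinkedHashMap<>();
        copyLanguage(() -> null, withoutLanguage);
        System.out.println(withoutLanguage); // prints {}
    }
}
```

Whether blank values should be skipped, rather than passed through as-is, is a judgment call for the real parser; the sketch only shows one option.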