[
https://issues.apache.org/jira/browse/PDFBOX-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14980765#comment-14980765
]
Tilman Hausherr commented on PDFBOX-3068:
-----------------------------------------
Not a regression: it has never worked. I had forgotten that older versions used
the old parser. I tested with the non-sequential parser down to 1.8.1 and got
null for the title.
From debugging I suspected that the cause is that the /Info elements are
indirect objects.
What does help (although I don't know if this is the best solution) is to add
this at the end of PDFParser.initialParse():
{code}
COSBase infoBase = trailer.getDictionaryObject(COSName.INFO);
if (infoBase instanceof COSDictionary)
{
    parseDictObjects((COSDictionary) infoBase, (COSName[]) null);
}
{code}
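To picture why indirect /Info entries could produce a null title while the keys are still present, here is a minimal toy model in plain Java. It is only a sketch under assumptions: the class IndirectInfoSketch, the Ref type and both helper methods are hypothetical and are not PDFBox classes, and whether PDFBox behaves exactly like this internally is a guess based on the debugging observation above.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: every name here is hypothetical, none of it is PDFBox API.
public class IndirectInfoSketch {

    /** Stand-in for an indirect reference such as "12 0 R". */
    static final class Ref {
        final int objectNumber;

        Ref(int objectNumber) {
            this.objectNumber = objectNumber;
        }
    }

    /** Naive getter: returns the value only if it is stored directly as a string. */
    static String getStringNaive(Map<String, Object> dict, String key) {
        Object value = dict.get(key);
        return (value instanceof String) ? (String) value : null;
    }

    /** Replaces each Ref in the dictionary with the object it points to, if known. */
    static void resolveReferences(Map<String, Object> dict, Map<Integer, Object> objects) {
        for (Map.Entry<String, Object> entry : dict.entrySet()) {
            if (entry.getValue() instanceof Ref) {
                Object target = objects.get(((Ref) entry.getValue()).objectNumber);
                if (target != null) {
                    entry.setValue(target);
                }
            }
        }
    }

    public static void main(String[] args) {
        // Objects stored elsewhere in the (imaginary) file, keyed by object number.
        Map<Integer, Object> objects = new HashMap<>();
        objects.put(12, "Annual Report");

        // An info dictionary whose Title is an indirect reference.
        Map<String, Object> info = new HashMap<>();
        info.put("Title", new Ref(12));
        info.put("Producer", "ExampleProducer");

        // The key exists, but the naive getter sees an unresolved reference.
        System.out.println(getStringNaive(info, "Title")); // prints "null"

        resolveReferences(info, objects);
        System.out.println(getStringNaive(info, "Title")); // prints "Annual Report"
    }
}
```

In this model the key count is unaffected by unresolved references, which would be consistent with a null title alongside a non-empty key set. The resolveReferences step plays the role that the extra parsing call is meant to play in the snippet above.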
> Null metadata in some files that had metadata in 1.8.10
> -------------------------------------------------------
>
> Key: PDFBOX-3068
> URL: https://issues.apache.org/jira/browse/PDFBOX-3068
> Project: PDFBox
> Issue Type: Sub-task
> Components: Parsing
> Affects Versions: 2.0.0
> Reporter: Tim Allison
> Fix For: 2.0.0
>
> Attachments: NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU
>
>
> Tilman's observation on 'Microsoft' below revealed 1) that we should use our
> BodyContentHandler so that title metadata doesn't slip into the body content
> and 2) the title and all metadata values from PDDocumentInformation are null
> for at least: NZ/NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU
> {code}
> Path p = Paths.get("..NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU");
> PDDocument d = PDDocument.load(p.toFile());
> assertNull(d.getDocumentInformation().getTitle());
> assertEquals(8, d.getDocumentInformation().getMetadataKeys().size());
> {code}
> Manually reviewing a handful of documents in the
> metadata/metadata_value_count_diffs.csv file
> [here|https://github.com/tballison/share/blob/master/pdfbox_comparisons/pdfbox_1_8_10V2_0_20151023.zip],
> this looks to be quite pervasive...unless I'm botching the right way to load
> the documents and metadata.