[
https://issues.apache.org/jira/browse/PDFBOX-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Tim Allison updated PDFBOX-3068:
--------------------------------
Description:
Tilman's observation on 'Microsoft' below revealed that 1) we should use our
BodyContentHandler so that title metadata doesn't slip into the body content,
and 2) the title and all metadata values from PDDocumentInformation are null
for at least: NZ/NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU
{code}
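Path p = Paths.get("..NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU");
// try-with-resources so the document is closed after the checks
try (PDDocument d = PDDocument.load(p.toFile())) {
    // the title comes back null...
    assertNull(d.getDocumentInformation().getTitle());
    // ...even though the info dictionary reports 8 metadata keys
    assertEquals(8, d.getDocumentInformation().getMetadataKeys().size());
}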
{code}
Manually reviewing a handful of documents in the
metadata/metadata_value_count_diffs.csv file
[here|https://github.com/tballison/share/blob/master/pdfbox_comparisons/pdfbox_1_8_10V2_0_20151023.zip],
the problem looks to be quite pervasive, unless I'm loading the documents and
metadata the wrong way.
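To see which keys are affected in a given file, rather than checking only the title, a small helper along these lines might help. It is a minimal sketch, not part of PDFBox or Tika: the class name NullMetadataReport and the method keysWithNullValues are hypothetical, and the helper is written against plain java.util types so it can be tried without PDFBox on the classpath.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper (not part of PDFBox): lists metadata keys whose value is null.
class NullMetadataReport {

    /**
     * Returns, in sorted order, every key for which the lookup function yields null.
     *
     * @param keys   the metadata keys to check
     * @param lookup maps a key to its string value, or null if no value is available
     */
    static List<String> keysWithNullValues(Set<String> keys,
                                           Function<String, String> lookup) {
        List<String> missing = new ArrayList<>();
        for (String key : keys) {
            if (lookup.apply(key) == null) {
                missing.add(key);
            }
        }
        // Sort so the report is stable regardless of the set's iteration order.
        Collections.sort(missing);
        return missing;
    }
}
```

With a loaded PDDocument d, this could be called as keysWithNullValues(info.getMetadataKeys(), info::getCustomMetadataValue), where info is d.getDocumentInformation(). If the list contains all 8 keys for the attached file, that would match the behavior described above.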
was:
Tilman's observation on 'Microsoft' below revealed that 1) we should use our
BodyContentHandler so that title metadata doesn't slip into the body content,
and 2) the title and all metadata values from PDDocumentInformation are null
for at least: NZ/NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU
{code}
Path p = Paths.get("..NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU");
PDDocument d = PDDocument.load(p.toFile());
assertNull(d.getDocumentInformation().getTitle());
assertEquals(8, d.getDocumentInformation().getMetadataKeys().size());
{code}
> Null metadata in some files that had metadata in 1.8.10
> -------------------------------------------------------
>
> Key: PDFBOX-3068
> URL: https://issues.apache.org/jira/browse/PDFBOX-3068
> Project: PDFBox
> Issue Type: Sub-task
> Reporter: Tim Allison
> Attachments: NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU
>
>
> Tilman's observation on 'Microsoft' below revealed that 1) we should use our
> BodyContentHandler so that title metadata doesn't slip into the body content,
> and 2) the title and all metadata values from PDDocumentInformation are null
> for at least: NZ/NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU
> {code}
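> Path p = Paths.get("..NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU");
> // try-with-resources so the document is closed after the checks
> try (PDDocument d = PDDocument.load(p.toFile())) {
>     // the title comes back null...
>     assertNull(d.getDocumentInformation().getTitle());
>     // ...even though the info dictionary reports 8 metadata keys
>     assertEquals(8, d.getDocumentInformation().getMetadataKeys().size());
> }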
> {code}
> Manually reviewing a handful of documents in the
> metadata/metadata_value_count_diffs.csv file
> [here|https://github.com/tballison/share/blob/master/pdfbox_comparisons/pdfbox_1_8_10V2_0_20151023.zip],
> the problem looks to be quite pervasive, unless I'm loading the documents and
> metadata the wrong way.