[ 
https://issues.apache.org/jira/browse/PDFBOX-3068?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Allison updated PDFBOX-3068:
--------------------------------
    Description: 
Tilman's observation on 'Microsoft' below revealed that 1) we should use our 
BodyContentHandler so that title metadata doesn't slip into the body content, 
and 2) the title and all metadata values from PDDocumentInformation are null for 
at least: NZ/NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU

{code}
        Path p = Paths.get("..NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU");
        PDDocument d = PDDocument.load(p.toFile());
        assertNull(d.getDocumentInformation().getTitle());
        assertEquals(8, d.getDocumentInformation().getMetadataKeys().size());
{code} 

Based on a manual review of a handful of documents listed in the 
metadata/metadata_value_count_diffs.csv file 
[here|https://github.com/tballison/share/blob/master/pdfbox_comparisons/pdfbox_1_8_10V2_0_20151023.zip],
 this looks to be quite pervasive, unless I'm loading the documents and 
metadata incorrectly.

  was:
Tilman's observation on 'Microsoft' below revealed that 1) we should use our 
BodyContentHandler so that title metadata doesn't slip into the body content, 
and 2) the title and all metadata values from PDDocumentInformation are null for 
at least: NZ/NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU

{code}
        Path p = Paths.get("..NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU");
        PDDocument d = PDDocument.load(p.toFile());
        assertNull(d.getDocumentInformation().getTitle());
        assertEquals(8, d.getDocumentInformation().getMetadataKeys().size());
{code} 


> Null metadata in some files that had metadata in 1.8.10
> -------------------------------------------------------
>
>                 Key: PDFBOX-3068
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3068
>             Project: PDFBox
>          Issue Type: Sub-task
>            Reporter: Tim Allison
>         Attachments: NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU
>
>
> Tilman's observation on 'Microsoft' below revealed that 1) we should use our 
> BodyContentHandler so that title metadata doesn't slip into the body content, 
> and 2) the title and all metadata values from PDDocumentInformation are null 
> for at least: NZ/NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU
> {code}
>         Path p = Paths.get("..NZAZKTQYKDD2HSBCSJJN6XSEA4KJEONU");
>         PDDocument d = PDDocument.load(p.toFile());
>         assertNull(d.getDocumentInformation().getTitle());
>         assertEquals(8, d.getDocumentInformation().getMetadataKeys().size());
> {code} 
> Based on a manual review of a handful of documents listed in the 
> metadata/metadata_value_count_diffs.csv file 
> [here|https://github.com/tballison/share/blob/master/pdfbox_comparisons/pdfbox_1_8_10V2_0_20151023.zip],
>  this looks to be quite pervasive, unless I'm loading the documents and 
> metadata incorrectly.
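
To narrow down which keys are affected, a small helper along these lines might be useful. This is only a sketch: `MetadataNullCheck` and `nullValuedKeys` are hypothetical names, not part of PDFBox, and the data in `main` is a stand-in map rather than a real document. The assumption is that, with PDFBox on the classpath, one would pass `info.getMetadataKeys()` and `info::getCustomMetadataValue` from a `PDDocumentInformation` instead.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical helper (not part of PDFBox): lists the metadata keys
// whose value lookup returns null.
final class MetadataNullCheck {

    // keys:   the metadata key names reported for a document
    // lookup: resolves a key to its value; may return null
    static List<String> nullValuedKeys(Set<String> keys, Function<String, String> lookup) {
        List<String> missing = new ArrayList<>();
        for (String key : keys) {
            if (lookup.apply(key) == null) {
                missing.add(key);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // Stand-in data only. Assumption: with a loaded PDDocument, the
        // arguments would be info.getMetadataKeys() and
        // info::getCustomMetadataValue.
        Map<String, String> fake = new LinkedHashMap<>();
        fake.put("Title", null);
        fake.put("Producer", "example");
        System.out.println(nullValuedKeys(fake.keySet(), fake::get));
    }
}
```

If every key comes back in the returned list for a file that reported values under 1.8.10, that would match the behaviour described above: the keys are present but the values are null.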



