[ 
https://issues.apache.org/jira/browse/PDFBOX-1792?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13837907#comment-13837907
 ] 

Maruan Sahyoun commented on PDFBOX-1792:
----------------------------------------

Could you share how you do the extraction, or point us to the source? Using 
the ExtractText command line tool, both parser options produce the same 
result: the text within the annotation is not extracted. 

In addition, the following code

        PDDocument document = PDDocument.loadNonSeq(new File("testAnnotations.pdf"), null);
        PDDocumentInformation docInfo = document.getDocumentInformation();
        PDDocumentCatalog catalog = document.getDocumentCatalog();
        List<PDAnnotation> la = ((PDPage)catalog.getAllPages().get(0)).getAnnotations();
        String annotationText = la.get(0).getContents();

gives you the same content using the NonSequentialPDFParser and the 'classic' 
parser, i.e. 'Here is a comment'.

All tests were done using pdfbox-1.8.3.

BR
Maruan 

> Metadata not completely extracted with NonSequentialPDFParser on some 
> documents
> -------------------------------------------------------------------------------
>
>                 Key: PDFBOX-1792
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-1792
>             Project: PDFBox
>          Issue Type: Bug
>          Components: PDModel
>    Affects Versions: 1.8.3
>            Reporter: Tim Allison
>            Priority: Minor
>         Attachments: PDFBOX-1792.tar.gz
>
>
> The traditional parser is able to extract metadata from the Annotation test 
> document from TIKA-738.  The NonSequentialPDFParser is not able to extract 
> metadata.



--
This message was sent by Atlassian JIRA
(v6.1#6144)
