[ 
https://issues.apache.org/jira/browse/PDFBOX-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985094#comment-16985094
 ] 

Maruan Sahyoun commented on PDFBOX-4692:
----------------------------------------

[~cowwoc] A general comment first - PDF was not designed to be easy to 
convert back into any other format. It's typically an end format meant for 
electronic publishing, archiving, printing ... where the goal is a faithful 
visual representation of the content. PDF does provide a mechanism to tag 
content so that blocks, paragraphs and tables are marked - that's called a 
tagged PDF. If the PDF is tagged you can use the tagging information to get 
the structural information of the PDF content. **But** many (most) PDFs are 
not tagged, so the visual content has to be interpreted to recover the 
information you are looking for.

The file you have seems to be a very bad example for getting this 
information because it's missing a lot of the hints other PDF files would 
provide - such as a different font format for the text content, which would 
provide more help, or the (in your case missing) font descriptor.
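
When the font descriptor is missing, one possible fallback for vertical extents is the font bounding box scaled by the font matrix. A Type 3 font, like the one in the report, defines both a FontBBox and a FontMatrix in its font dictionary. The sketch below is only a heuristic: the class and method names are hypothetical, the input values are assumed to have been read from the font already, and the bounding box encloses all glyphs, so the result is an upper bound rather than a true ascent/descent.

```java
/** Hypothetical sketch: estimates vertical text extents when no font descriptor is available. */
final class VerticalExtentFallback {

    /**
     * Scales the font bounding box from glyph space to text space.
     *
     * @param bboxLowerY   lower y of the font bounding box, in glyph space
     * @param bboxUpperY   upper y of the font bounding box, in glyph space
     * @param matrixScaleY vertical scale of the font matrix (0.001 for most
     *                     font types, arbitrary for Type 3 fonts)
     * @param fontSize     font size taken from the text state
     * @return a two-element array {descent, ascent}; descent is usually <= 0
     */
    static double[] estimate(double bboxLowerY, double bboxUpperY,
                             double matrixScaleY, double fontSize) {
        if (bboxUpperY <= bboxLowerY) {
            // Some files carry an all-zero box; this heuristic cannot help there.
            throw new IllegalArgumentException("empty or inverted bounding box");
        }
        double scale = matrixScaleY * fontSize;
        double a = bboxLowerY * scale;
        double b = bboxUpperY * scale;
        // A negative vertical scale flips the box, so order the results explicitly.
        return new double[] { Math.min(a, b), Math.max(a, b) };
    }

    public static void main(String[] args) {
        // Assumed example values: a 1000-unit glyph space and a 10 pt font.
        double[] extents = estimate(-200, 800, 0.001, 10);
        System.out.println("descent=" + extents[0] + ", ascent=" + extents[1]);
    }
}
```

If the bounding box is empty, the only remaining option is to measure the glyph outlines themselves, which is considerably more work.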

You can take a look at (and use/extend) PDFTextStripper.java and/or 
LegacyPDFStreamEngine.java. These contain a lot of hacks to deal with real 
world PDFs to get lines of text from PDFs. **But** the code is very hacky as it 
has grown over years and it's very ambitious to rewrite it without breaking the 
current text extraction (which is tested against several thousand PDFs). You 
can also take a look at tabula-java, which is based on PDFBox and which may or 
may not give you a better starting point.

To summarize: reinterpreting a PDF as text content, tables etc. depends 
very much on the PDFs themselves. If you happen to have a completely tagged PDF 
the task can be straightforward. In your case - as a lot of the potential 
information seems to be missing - there is no other way than using some 
heuristics and tailoring these to your needs.
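
As an illustration of the kind of heuristic meant here, the following sketch groups positioned text fragments into lines by comparing their baselines against a tolerance. It does not use PDFBox: the `LineGrouper` and `Fragment` names, the coordinates and the tolerance are all assumptions made for the example, and in practice the fragments would come from whatever the text extraction hands you (for instance the text positions a `PDFTextStripper` subclass receives). It is far simpler than what PDFTextStripper actually does.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical sketch: groups positioned text fragments into lines by baseline proximity. */
final class LineGrouper {

    /** A piece of text with the position of its baseline origin (y grows downward). */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Sorts fragments top to bottom, starts a new line whenever a baseline is
     * further than yTolerance from the first baseline of the current line,
     * then orders each line left to right and joins it with single spaces.
     */
    static List<String> toLines(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> lines = new ArrayList<>();
        List<Fragment> current = new ArrayList<>();
        double lineY = 0;
        for (Fragment f : sorted) {
            if (!current.isEmpty() && Math.abs(f.y - lineY) > yTolerance) {
                lines.add(current);
                current = new ArrayList<>();
            }
            if (current.isEmpty()) {
                lineY = f.y; // the first fragment anchors the line's baseline
            }
            current.add(f);
        }
        if (!current.isEmpty()) {
            lines.add(current);
        }

        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up fragments, deliberately out of reading order.
        List<Fragment> input = Arrays.asList(
                new Fragment("World", 60, 100.4),
                new Fragment("Hello", 10, 100.0),
                new Fragment("Total", 10, 120.0),
                new Fragment("42.00", 80, 119.8));
        for (String line : toLines(input, 1.0)) {
            System.out.println(line);
        }
    }
}
```

The tolerance is the part that needs tailoring: too small and superscripts or slightly misaligned glyphs split a line, too large and tightly spaced lines merge.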

IMHO we should close the issue as it's about documenting '... if and when 
PDFont.getFontDescriptor() may return null ...', which is covered in detail 
in the PDF specification (and as noted above also changes over PDF spec 
versions). But feel free to ask further questions on the users mailing list 
https://pdfbox.apache.org/mailinglists.html. You have taken on a complex 
topic.

 

> Document if and when PDFont.getFontDescriptor() may return null
> ---------------------------------------------------------------
>
>                 Key: PDFBOX-4692
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4692
>             Project: PDFBox
>          Issue Type: Improvement
>          Components: Documentation
>    Affects Versions: 2.0.17
>         Environment: Windows 10.0.18362.418
>            Reporter: Gili
>            Priority: Major
>         Attachments: image-2019-11-16-22-03-15-015.png
>
>
> Please document under which conditions {{PDFont.getFontDescriptor()}} may 
> return null and what can be done to calculate the text ascent/descent. 
> Clearly, this should be possible to calculate as the text ends up getting 
> rendered.
> Background information:
> I have a PDF file (credit card statement, so it cannot be shared easily) that 
> contains an embedded {{PDType3Font}} called "C0EX06Q0". When I invoke 
> {{PDFont.getFontDescriptor()}} I get null. 
> I have a screenshot of what it looks like. 
> !image-2019-11-16-22-03-15-015.png|thumbnail! 



