[
https://issues.apache.org/jira/browse/PDFBOX-4692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16985094#comment-16985094
]
Maruan Sahyoun commented on PDFBOX-4692:
----------------------------------------
[~cowwoc] A general comment first - PDF was not designed to be easy to convert
back into any other format. It's typically an end format meant for electronic
publishing, archiving, printing ... ensuring a proper visual representation of
the content. But PDF also provides a mechanism to tag content so that blocks,
paragraphs and tables are marked - that's called a tagged PDF. If the PDF is
tagged you can use the tagging information to get the structural information
of the PDF content. **But** many (most) PDFs are not tagged, so the visual
content has to be interpreted to get the information you are looking for.
The file you have seems to be a particularly bad case for this, because it is
missing many of the hints other PDF files would provide - such as a different
font format for the text content, which would give more help, or the (in your
case missing) font descriptor.
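For illustration only, here is a minimal sketch of one possible fallback when
no font descriptor is available: estimating ascent and descent from the font's
bounding box, scaled by the vertical component of the font matrix. It is plain
Java and does not call PDFBox. The class VerticalExtentSketch and the method
fromBoundingBox are hypothetical names, and the caller is assumed to have
already read the bounding box y values and the font matrix scale from the font.
The result is only a rough outer bound on the glyphs, and for Type 3 fonts the
bounding box may be all zeros, in which case this approach yields nothing.

```java
import java.util.Optional;

// Hypothetical sketch, not part of PDFBox.
class VerticalExtentSketch {

    /**
     * Estimates {ascent, descent} in text space units (multiply by the font
     * size to get the extent for a given piece of text).
     *
     * @param bboxLowerY lower y of the font bounding box, in glyph space
     * @param bboxUpperY upper y of the font bounding box, in glyph space
     * @param yScale     vertical scale of the font matrix (assumed unskewed);
     *                   0.001 for most font types, arbitrary for Type 3 fonts
     * @return empty if the bounding box carries no vertical information
     */
    static Optional<double[]> fromBoundingBox(double bboxLowerY,
                                              double bboxUpperY,
                                              double yScale) {
        if (bboxLowerY == bboxUpperY || yScale == 0) {
            // Degenerate box (for example all zeros): nothing to estimate from.
            return Optional.empty();
        }
        double a = bboxUpperY * yScale;
        double b = bboxLowerY * yScale;
        // A negative yScale flips the box, so order the two values explicitly.
        return Optional.of(new double[] {Math.max(a, b), Math.min(a, b)});
    }

    public static void main(String[] args) {
        Optional<double[]> extent = fromBoundingBox(-200, 800, 0.001);
        extent.ifPresent(e ->
            System.out.println("ascent=" + e[0] + " descent=" + e[1]));
    }
}
```

The Optional return value makes the "no usable information" case explicit, so
the caller has to decide on a further fallback (for example a fixed fraction
of the font size) instead of silently working with zeros.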
You can take a look at (and use/extend) PDFTextStripper.java and/or
LegacyPDFStreamEngine.java. These contain a lot of hacks for dealing with real
world PDFs to get lines of text out of them. **But** the code is very hacky as
it has grown over the years, and it would be very ambitious to rewrite it
without breaking the current text extraction (which is tested against several
thousand PDFs). You can also take a look at tabula-java, which is based on
PDFBox and may or may not give you a better starting point.
To summarize: reinterpreting a PDF as text content, tables etc. depends very
much on the PDF itself. If you happen to have a completely tagged PDF the task
can be straightforward. In your case - as a lot of the potential information
seems to be missing - there is no other way than using some heuristics and
tailoring them to your needs.
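To give an idea of what such a heuristic can look like, here is a minimal
sketch that groups positioned text fragments into lines by comparing their
baseline y coordinates against a tolerance. It is plain Java and is not the
algorithm PDFTextStripper uses. The names LineGroupingSketch, Fragment and
groupIntoLines are hypothetical, y is assumed to grow downwards, and the
tolerance is something you would have to tune for your own files.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch, not part of PDFBox.
class LineGroupingSketch {

    /** A piece of text with the position of its baseline start. */
    static final class Fragment {
        final String text;
        final double x;
        final double y;

        Fragment(String text, double x, double y) {
            this.text = text;
            this.x = x;
            this.y = y;
        }
    }

    /**
     * Groups fragments whose y differs by at most yTolerance from the first
     * fragment of a line, then orders each line from left to right.
     */
    static List<String> groupIntoLines(List<Fragment> fragments, double yTolerance) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.y));

        List<List<Fragment>> lines = new ArrayList<>();
        for (Fragment f : sorted) {
            List<Fragment> last = lines.isEmpty() ? null : lines.get(lines.size() - 1);
            if (last != null && Math.abs(f.y - last.get(0).y) <= yTolerance) {
                last.add(f);            // close enough: same line
            } else {
                List<Fragment> line = new ArrayList<>();
                line.add(f);            // too far away: start a new line
                lines.add(line);
            }
        }

        List<String> result = new ArrayList<>();
        for (List<Fragment> line : lines) {
            line.sort(Comparator.comparingDouble((Fragment f) -> f.x));
            StringBuilder sb = new StringBuilder();
            for (Fragment f : line) {
                if (sb.length() > 0) {
                    sb.append(' ');
                }
                sb.append(f.text);
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

A fixed tolerance is the simplest choice; real documents with mixed font sizes,
superscripts or rotated text will likely need something relative to the font
size, which is exactly the kind of tailoring meant above.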
IMHO we should close the issue as it's about documenting '... if and when
PDFont.getFontDescriptor() may return null ...', which is covered in detail
in the PDF specification (and, as noted above, also changes between PDF spec
versions). But feel free to ask further questions on the users mailing list
https://pdfbox.apache.org/mailinglists.html. You have taken on a complex
topic.
> Document if and when PDFont.getFontDescriptor() may return null
> ---------------------------------------------------------------
>
> Key: PDFBOX-4692
> URL: https://issues.apache.org/jira/browse/PDFBOX-4692
> Project: PDFBox
> Issue Type: Improvement
> Components: Documentation
> Affects Versions: 2.0.17
> Environment: Windows 10.0.18362.418
> Reporter: Gili
> Priority: Major
> Attachments: image-2019-11-16-22-03-15-015.png
>
>
> Please document under which conditions {{PDFont.getFontDescriptor()}} may
> return null and what can be done to calculate the text ascent/descent.
> Clearly, this should be possible to calculate as the text ends up getting
> rendered.
> Background information:
> I have a PDF file (credit card statement, so it cannot be shared easily) that
> contains an embedded {{PDType3Font}} called "C0EX06Q0". When I invoke
> {{PDFont.getFontDescriptor()}} I get null.
> I have a screenshot of what it looks like.
> !image-2019-11-16-22-03-15-015.png|thumbnail!