Hi,

On 30.04.2013 23:25, Jinder Aujla wrote:
Hi

Apologies if this is the wrong email to use. I am trying to understand if
and how well PDFBox supports extraction of text from a pdf document that
contains type 3 fonts. It's taken a while to understand the reason behind
the apparent failure in parsing.
It depends on the PDF, but most likely those PDFs don't provide a character
mapping for their Type 3 fonts, so the text can't be extracted.
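
To illustrate what a missing mapping looks like: a Type 3 font usually needs a
/ToUnicode entry (or meaningful glyph names) before its character codes can be
turned into text. The Python sketch below is only a crude heuristic and is not
how PDFBox works internally. It scans the raw file bytes for Type 3 font
dictionaries and checks whether each one has a /ToUnicode key. The function name
count_type3_fonts is made up for this example, and the scan will miss fonts that
are stored inside compressed object streams.

```python
import re

# Matches "/Subtype /Type3" (whitespace optional) as it appears in a font dictionary.
TYPE3 = re.compile(rb"/Subtype\s*/Type3\b")


def count_type3_fonts(pdf_bytes):
    """Return (type3_total, type3_with_tounicode).

    Assumption: font dictionaries are stored uncompressed, so they are
    visible in the raw bytes. This is a heuristic, not a PDF parser.
    """
    total = 0
    mapped = 0
    # Splitting on "endobj" gives chunks that each hold roughly one indirect object.
    for chunk in pdf_bytes.split(b"endobj"):
        if TYPE3.search(chunk):
            total += 1
            if b"/ToUnicode" in chunk:
                mapped += 1
    return total, mapped
```

Call it with the file contents, e.g. count_type3_fonts(open("file.pdf", "rb").read()).
If the first number is larger than the second, at least one Type 3 font has no
/ToUnicode entry, which makes failed extraction more likely.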

Before I go further I thought it would be better to ask. In addition, I did
find this ticket in JIRA, but I wasn't sure if it was still relevant.

https://issues.apache.org/jira/browse/PDFBOX-124

I can use pdftotext. It's not completely successful, but it does extract to
some degree. Any guidance is greatly appreciated.
It is quite easy to determine whether the text of a PDF can be extracted.
Just perform the Adobe test [1]. If Adobe can't extract the text, PDFBox won't
be able to either.

Thanks
Jinder


BR
Andreas Lehmkühler

[1] http://pdfbox.apache.org/userguide/faq.html#no_text_extraction
