[ https://issues.apache.org/jira/browse/PDFBOX-4881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tilman Hausherr closed PDFBOX-4881.
-----------------------------------
Resolution: Won't Do
Closing this one, as proper text extraction from this file is almost
impossible. An ideal solution would be to reconstruct /ToUnicode by running OCR
on the individual glyphs or, better, by using the OCR results to build a large
database of glyph outlines. This would be a nice project for companies offering
OCR as a service, because it would give better text extraction than "pure" OCR.
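
To make the glyph-outline database idea a bit more concrete, here is a minimal
Java sketch. It is only an illustration under stated assumptions: the class
GlyphOutlineIndex and its methods are hypothetical and not part of PDFBox, the
outlines are assumed to be already available as flat arrays of x/y points, and
the Unicode values are assumed to come from some OCR step that is not shown.
Matching is exact after normalizing position and size; a real database would
likely need tolerant matching and a canonical point order.

```java
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; none of these names exist in PDFBox.
class GlyphOutlineIndex {
    // Normalized outline key -> Unicode text learned from OCR results.
    private final Map<String, String> unicodeByOutline = new HashMap<>();

    // Builds a key that ignores position and size.
    // Points are given as {x0, y0, x1, y1, ...}.
    static String outlineKey(double[] pts) {
        double minX = Double.MAX_VALUE, minY = Double.MAX_VALUE;
        double maxX = -Double.MAX_VALUE, maxY = -Double.MAX_VALUE;
        for (int i = 0; i + 1 < pts.length; i += 2) {
            minX = Math.min(minX, pts[i]);
            maxX = Math.max(maxX, pts[i]);
            minY = Math.min(minY, pts[i + 1]);
            maxY = Math.max(maxY, pts[i + 1]);
        }
        double scale = Math.max(maxX - minX, maxY - minY);
        if (scale == 0) {
            scale = 1; // degenerate outline (single point)
        }
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i + 1 < pts.length; i += 2) {
            // Shift to the origin and scale the larger side to 100 units.
            long x = Math.round((pts[i] - minX) / scale * 100);
            long y = Math.round((pts[i + 1] - minY) / scale * 100);
            sb.append(x).append(',').append(y).append(';');
        }
        return sb.toString();
    }

    // Records that OCR recognized this outline as the given Unicode text.
    void learn(double[] outline, String unicode) {
        unicodeByOutline.put(outlineKey(outline), unicode);
    }

    // Rebuilds a character code -> Unicode map for a font that lacks
    // /ToUnicode. Codes whose outline is unknown are left out.
    Map<Integer, String> reconstruct(Map<Integer, double[]> outlinesByCode) {
        Map<Integer, String> result = new LinkedHashMap<>();
        for (Map.Entry<Integer, double[]> e : outlinesByCode.entrySet()) {
            String unicode = unicodeByOutline.get(outlineKey(e.getValue()));
            if (unicode != null) {
                result.put(e.getKey(), unicode);
            }
        }
        return result;
    }
}
```

The map returned by reconstruct() plays the role of the missing /ToUnicode
table for one font; getting the outlines out of the PDF and the OCR step itself
are the hard parts and are deliberately left out here.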
> Is it possible to properly extract text from this pdf?
> ------------------------------------------------------
>
> Key: PDFBOX-4881
> URL: https://issues.apache.org/jira/browse/PDFBOX-4881
> Project: PDFBox
> Issue Type: Wish
> Reporter: Alfred
> Priority: Trivial
> Attachments: Farsi.pdf
>
>
> This PDF has Farsi characters, but the character codes are probably wrong and
> there is probably no mapping table.
> If there's any work to be done to support Farsi, I would be happy to do it
> myself; I just need a pointer in the right direction.
>
> Thank you!
--
This message was sent by Atlassian Jira
(v8.3.4#803005)