[jira] [Closed] (PDFBOX-4881) Is it possible to properly extract text from this pdf?

Tilman Hausherr (Jira) Fri, 19 Jun 2020 01:39:18 -0700


     [ https://issues.apache.org/jira/browse/PDFBOX-4881?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Tilman Hausherr closed PDFBOX-4881.
-----------------------------------
    Resolution: Won't Do

Closing this one, as proper text extraction from this file is almost 
impossible. An ideal solution would be to reconstruct /ToUnicode by running 
OCR on the individual glyphs or, better, by using the OCR results to build a 
large database of glyph outlines. That would be a nice project for companies 
offering OCR as a service, because it would yield better text extraction than 
"pure" OCR.
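
To make the idea concrete, here is a minimal sketch of the last step only: 
once each glyph has somehow been recognized, turn the results into the bfchar 
section of a /ToUnicode CMap. Everything here is an assumption for 
illustration: the class and method names are hypothetical, the "glyph key" 
stands for whatever fingerprint of the outline a glyph database would use, 
character codes are assumed to be two bytes, and no PDFBox API is involved. 
The OCR step itself is not shown.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical helper, not part of PDFBox.
public class ToUnicodeSketch {

    /**
     * Joins the font's code-to-glyph table with a glyph database.
     * codeToGlyphKey: character code -> fingerprint of the glyph outline.
     * glyphKeyToText: fingerprint -> text recognized for that glyph (e.g. by OCR).
     * Codes whose glyph is not in the database are left unmapped.
     */
    public static Map<Integer, String> mapCodesToUnicode(
            Map<Integer, String> codeToGlyphKey,
            Map<String, String> glyphKeyToText) {
        Map<Integer, String> result = new TreeMap<>();
        for (Map.Entry<Integer, String> e : codeToGlyphKey.entrySet()) {
            String text = glyphKeyToText.get(e.getValue());
            if (text != null) {
                result.put(e.getKey(), text);
            }
        }
        return result;
    }

    /**
     * Formats the mapping as one bfchar block: "<code> <UTF-16BE hex>".
     * Assumes two-byte character codes. A complete CMap also needs a header
     * and a codespace range, which this sketch omits.
     */
    public static String toBfChar(Map<Integer, String> codeToUnicode) {
        StringBuilder sb = new StringBuilder();
        sb.append(codeToUnicode.size()).append(" beginbfchar\n");
        for (Map.Entry<Integer, String> e : codeToUnicode.entrySet()) {
            sb.append(String.format(Locale.ROOT, "<%04X> <", e.getKey()));
            // Java chars are UTF-16 code units, so this emits UTF-16BE hex.
            for (char c : e.getValue().toCharArray()) {
                sb.append(String.format(Locale.ROOT, "%04X", (int) c));
            }
            sb.append(">\n");
        }
        sb.append("endbfchar\n");
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<Integer, String> codeToGlyphKey = new LinkedHashMap<>();
        codeToGlyphKey.put(1, "outline-a");
        codeToGlyphKey.put(2, "outline-b");

        Map<String, String> glyphKeyToText = new LinkedHashMap<>();
        glyphKeyToText.put("outline-a", "\u0633");       // ARABIC LETTER SEEN
        glyphKeyToText.put("outline-b", "\u0644\u0627"); // LAM + ALEF ligature

        System.out.print(toBfChar(mapCodesToUnicode(codeToGlyphKey, glyphKeyToText)));
    }
}
```

A glyph that represents a ligature maps to more than one Unicode character, 
which is why the destination side is a string rather than a single code point.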

> Is it possible to properly extract text from this pdf?
> ------------------------------------------------------
>
>                 Key: PDFBOX-4881
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4881
>             Project: PDFBox
>          Issue Type: Wish
>            Reporter: Alfred
>            Priority: Trivial
>         Attachments: Farsi.pdf
>
>
> This PDF has Farsi characters, but the character codes are probably wrong 
> and there is probably no mapping table.
> If there is any work to be done to support Farsi, I would be happy to do it 
> myself; I just need a pointer in the right direction.
>  
> Thank you!


