[ 
https://issues.apache.org/jira/browse/PDFBOX-4806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17074634#comment-17074634
 ] 

Tilman Hausherr commented on PDFBOX-4806:
-----------------------------------------

If all the files look like this one, you could use ExtractTextByArea. You 
would have to work out the coordinates of every field yourself, because this 
file has no AcroForm (neither does the official form). If you just want good 
text extraction, then you'll have to use OCR, e.g. render the PDF as an image 
and run tesseract on it (Apache Tika does that if configured to).

Deciding whether to use OCR or not is the tricky part.
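
One possible heuristic for that decision, as a rough sketch only: run the
normal text extraction first and fall back to OCR when most of the result is
control characters or replacement characters, as in the first sample in the
report below. The class name, method names and threshold are hypothetical and
not part of PDFBox or Tika. Note that this check would not flag the second
sample ("(LQJDQJVVWHPSHO"), which is wrong but consists of printable
characters, so it can only be one signal among several.

```java
// Hypothetical helper, not part of PDFBox: guesses whether already-extracted
// text is garbage and OCR should be tried instead.
class OcrHeuristic {

    // Returns the fraction of non-blank characters that are control
    // characters, undefined code points, or U+FFFD (replacement character).
    // Text with no non-blank characters at all yields 1.0.
    static double suspiciousRatio(String text) {
        int total = 0;
        int suspicious = 0;
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            // Skip ordinary blanks only. Character.isWhitespace is avoided
            // here because it also accepts control codes such as U+001C.
            if (cp == ' ' || cp == '\t' || cp == '\n' || cp == '\r') {
                continue;
            }
            total++;
            if (Character.isISOControl(cp) || cp == 0xFFFD || !Character.isDefined(cp)) {
                suspicious++;
            }
        }
        return total == 0 ? 1.0 : (double) suspicious / total;
    }

    // The threshold is an assumption to be tuned on real documents.
    static boolean shouldUseOcr(String extractedText, double threshold) {
        return suspiciousRatio(extractedText) > threshold;
    }
}
```

The "Invalid ToUnicode CMap" warnings in the log could serve as an additional
signal that the extracted text should not be trusted.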

> Trying to extract the text from this PDF, getting unicodes. 
> ------------------------------------------------------------
>
>                 Key: PDFBOX-4806
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4806
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.19
>         Environment: Java
>            Reporter: Cherry Sri
>            Priority: Blocker
>         Attachments: ESt_1_A_2019.pdf
>
>
> Trying to extract the text from this PDF, getting unicodes.. Need help
>  
> ,"word_identifier":"\u0015\u0013\u0014\u001c"} 
> "word_identifier":"(LQJDQJVVWHPSHO"}
>  
> Apr 02, 2020 4:39:05 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNING: Invalid ToUnicode CMap in font DVWIYK+font00000000242bcd5a
> Apr 02, 2020 4:39:05 PM org.apache.pdfbox.pdmodel.font.PDFont loadUnicodeCmap
> WARNING: Invalid ToUnicode CMap in font BBTJPM+font00000000242bcd5a


