[jira] [Commented] (PDFBOX-4280) PDFbox extracts checkboxes as question marks '?'

Tilman Hausherr (JIRA) Mon, 30 Jul 2018 09:47:51 -0700


    [ 
https://issues.apache.org/jira/browse/PDFBOX-4280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16562124#comment-16562124
 ]


Tilman Hausherr commented on PDFBOX-4280:
-----------------------------------------

I was able to extract the text from the file on the command line with the 1.8 
version by using "-encoding utf8". If you are doing this from Java code, use 
the {{PDFTextStripper("utf8")}} constructor.

Does this work for you?
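
A minimal, JDK-only sketch of why the encoding probably matters here. It 
assumes (this is not confirmed in the issue) that the '?' comes from writing 
the extracted text in a charset such as ISO-8859-1, which has no mapping for 
☒ or ☐. The class and method names are made up for illustration:

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CheckboxEncodingDemo {

    // Encodes the text with the given charset and decodes it again,
    // roughly what happens when extracted text is written out and read back.
    static String roundTrip(String text, Charset charset) {
        return new String(text.getBytes(charset), charset);
    }

    public static void main(String[] args) {
        String boxes = "\u2612 \u2610"; // BALLOT BOX WITH X and BALLOT BOX

        // ISO-8859-1 cannot represent either character, so the encoder
        // substitutes its replacement byte, which is '?'.
        System.out.println(roundTrip(boxes, StandardCharsets.ISO_8859_1));

        // UTF-8 can represent both, so the text survives the round trip.
        System.out.println(roundTrip(boxes, StandardCharsets.UTF_8));
    }
}
```

Whether the second line displays correctly also depends on the encoding of 
your console, so it is safer to check the result in a file opened as UTF-8.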

Re 2.0.11: If your file is not too big, call {{new PDFParser(new 
RandomAccessBuffer(IOUtils.toByteArray(inputStream)))}}, where inputStream 
stands for your own input stream. Don't forget to close the input stream after 
that. Or make your life easy and just call {{PDDocument.load()}} on your stream 
instead of constructing a PDFParser yourself. Doing that has been obsolete for 
years, but it still appears in many third-party "tutorials".
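
A sketch of the simpler route, assuming PDFBox 2.0.x is on the classpath. The 
class name, the extractText helper and the path taken from the command line 
are illustrative and not part of PDFBox:

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.text.PDFTextStripper;

public class ExtractTextFromStream {

    // Loads the PDF with PDDocument.load() and returns its text.
    // The document is closed here; the caller remains responsible
    // for closing the input stream.
    static String extractText(InputStream in) throws IOException {
        try (PDDocument document = PDDocument.load(in)) {
            return new PDFTextStripper().getText(document);
        }
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a PDF file, e.g. the attached test.pdf.
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            System.out.println(extractText(in));
        }
    }
}
```

The try-with-resources blocks close both the document and the stream, even 
if text extraction throws.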

> PDFbox extracts checkboxes as question marks '?'
> ------------------------------------------------
>
>                 Key: PDFBOX-4280
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-4280
>             Project: PDFBox
>          Issue Type: Bug
>          Components: FontBox, Parsing, Text extraction
>    Affects Versions: 1.8.11
>            Reporter: fayaz baig
>            Priority: Major
>         Attachments: Apache pdfbox issue.docx, test.pdf
>
>
> Hello,
> When I try to extract the checkbox details from the PDF, they are extracted 
> as ? instead of ☒ or ☐.
> Attached document contains the details.
>  
> Please write to [[email protected]|mailto:[email protected]] if any further 
> clarification is required.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

