[
https://issues.apache.org/jira/browse/PDFBOX-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029213#comment-16029213
]
Maruan Sahyoun commented on PDFBOX-3814:
----------------------------------------
[~rscharpf] these are the bits extracted by Adobe Reader
{quote}
3FMFBTF
"QSJM
%BUB%JSFDU
$POOFDU4FSJFTGPS0%#$
6TFST(VJEFBOE3FGFSFODF
¦%BUB%JSFDU5FDIOPMPHJFT$PSQ"MMSJHIUTSFTFSWFE1SJOUFEJOUIF64"
%BUB%JSFDU
{quote}
The text is not extracted correctly because the ToUnicode values for the
characters are missing. The fact that the PDF is displayed correctly is not
relevant for text extraction; see https://pdfbox.apache.org/2.0/faq.html#textorder
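
As an illustration of what a missing ToUnicode mapping looks like in practice:
in the sample quoted above, the extracted characters appear to sit a constant
31 code points below the intended text. This is only an observation about these
few strings, not something PDFBox or Adobe Reader does, and it may not hold for
the rest of the document. The class and method names below are hypothetical; a
minimal sketch under that assumption:

```java
public class GlyphOffsetDemo {

    // Assumption: every extracted char is the intended char minus a fixed
    // offset. Inferred from the sample strings only, not from the PDF itself.
    public static String shift(String extracted, int offset) {
        StringBuilder sb = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            // Move each code unit up by the assumed offset.
            sb.append((char) (extracted.charAt(i) + offset));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // Strings copied from the Adobe Reader extraction quoted above.
        System.out.println(shift("3FMFBTF", 31));              // Release
        System.out.println(shift("%BUB%JSFDU", 31));           // DataDirect
        System.out.println(shift("$POOFDU4FSJFTGPS0%#$", 31)); // ConnectSeriesforODBC
    }
}
```

The word spaces are already absent from the extracted strings, so the shift
does not bring them back. A fixed shift like this is a guess that happens to
fit the sample; without ToUnicode entries there is no reliable way to recover
the text in general.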
> PDFTextStripper extracts garbadge
> ---------------------------------
>
> Key: PDFBOX-3814
> URL: https://issues.apache.org/jira/browse/PDFBOX-3814
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 2.0.5, 2.0.6
> Environment: Windows 7 64-bit, Java
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
> Reporter: Robert Scharpf
> Labels: patch
> Attachments: DataDirect Connect for ODBC User's Guide and
> Reference.pdf
>
>
> Adobe Reader shows no problems with the attached PDF "DataDirect Connect for
> ODBC User's Guide and Reference.pdf".
> First 256 characters of extracted text (char + hex code) from PDFTextStripper:
> 000d
> 000d
> 000d
> 000d
> 000d
> 000d
> 000d
> 000d
> 000d 0001 B 0042 O 004f E 0045 0001 4 0034 F 0046 R 0052 V 0056 F 0046 -
> 002d J 004a O 004f L 004c 0001 B 0042 S 0053 F 0046 0001 S 0053 F 0046 H
> 0048 J 004a T 0054 U 0055 F 0046
> I have a few more PDFs with the same symptom.