[jira] [Commented] (PDFBOX-3814) PDFTextStripper extracts garbadge

Maruan Sahyoun (JIRA) Tue, 30 May 2017 02:31:20 -0700

    [ 
https://issues.apache.org/jira/browse/PDFBOX-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16029213#comment-16029213
 ]


Maruan Sahyoun commented on PDFBOX-3814:
----------------------------------------

[~rscharpf] these are the bits extracted by Adobe Reader:

{quote}
3FMFBTF
"QSJM
%BUB%JSFDU
$POOFDU4FSJFTGPS0%#$
6TFST(VJEFBOE3FGFSFODF
¦%BUB%JSFDU5FDIOPMPHJFT$PSQ"MMSJHIUTSFTFSWFE1SJOUFEJOUIF64"
%BUB%JSFDU
{quote}

The text is not extracted correctly because the ToUnicode values for the 
characters are missing. That the PDF is displayed correctly is not relevant 
for text extraction; see https://pdfbox.apache.org/2.0/faq.html#textorder
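
As an illustration of what "missing ToUnicode values" means in practice: in the 
ASCII-range lines quoted above, each extracted character appears to sit exactly 
0x1F below the intended one, so the extractor seems to be returning raw font 
codes rather than Unicode. The sketch below is plain Java, not PDFBox API. The 
class name GlyphShiftDemo and the constant ASSUMED_OFFSET are hypothetical, and 
the offset is only inferred from these samples. It is specific to this font 
subset and is not a general fix.

```java
public class GlyphShiftDemo {
    // Assumption inferred from the quoted samples only:
    // extracted code + 0x1F == intended ASCII character.
    static final int ASSUMED_OFFSET = 0x1F;

    // Shift every character of the raw extracted string by the assumed offset.
    public static String shift(String raw) {
        StringBuilder sb = new StringBuilder(raw.length());
        for (int i = 0; i < raw.length(); i++) {
            sb.append((char) (raw.charAt(i) + ASSUMED_OFFSET));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // ASCII-range lines from the Adobe Reader extraction quoted above.
        String[] samples = {
            "3FMFBTF",
            "\"QSJM",
            "%BUB%JSFDU",
            "$POOFDU4FSJFTGPS0%#$",
            "6TFST(VJEFBOE3FGFSFODF"
        };
        for (String s : samples) {
            System.out.println(s + " -> " + shift(s));
        }
    }
}
```

With these inputs the shifted strings read Release, April, DataDirect, 
ConnectSeriesforODBC and UsersGuideandReference. The same offset turns the code 
0001 in the reporter's dump into a space. Without a ToUnicode map (or another 
usable encoding) in the PDF, an extractor has no reliable way to know this 
mapping.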


> PDFTextStripper extracts garbadge
> ---------------------------------
>
>                 Key: PDFBOX-3814
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-3814
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 2.0.5, 2.0.6
>         Environment: Windows 7 64-bit, Java 
> Java(TM) SE Runtime Environment (build 1.8.0_121-b13)
>            Reporter: Robert Scharpf
>              Labels: patch
>         Attachments: DataDirect Connect for ODBC User's Guide and 
> Reference.pdf
>
>
> Adobe Reader shows no problems with the attached PDF "DataDirect Connect for 
> ODBC User's Guide and Reference.pdf". 
> First 256 characters of extracted text (char + hex code) from PDFTextStripper:
>  000d 
>  000d 
>  000d 
>  000d 
>  000d 
>  000d 
>  000d 
>  000d 
>  000d  0001 B 0042 O 004f E 0045  0001 4 0034 F 0046 R 0052 V 0056 F 0046 - 
> 002d J 004a O 004f L 004c  0001 B 0042 S 0053 F 0046  0001 S 0053 F 0046 H 
> 0048 J 004a T 0054 U 0055 F 0046 
> I have a few more PDFs with the same symptom.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

