[jira] Updated: (PDFBOX-534) PDF file created with LaTeX is bad parsed

Thomas Fischer (JIRA) Sun, 16 May 2010 05:13:10 -0700

     [ 
https://issues.apache.org/jira/browse/PDFBOX-534?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Thomas Fischer updated PDFBOX-534:
----------------------------------

    Attachment: amapn19_03.pdf
                amapn19_03.txt

Here is another example, where the extracted text looks like
a0a2a1a4a3a6a5a8a7a9a5a10a3a12a11a14a13a16a15a17a5a10a3a19a18a20a1a21a5a17a0a2a1a22a5a24a23a24...
But in this case, the file was created using ESP Ghostscript 7.05 and the text 
cannot be retrieved using either Acrobat Reader or Preview. The former produces 
text that looks like

243658799:;0<>=:
?...@9
A>A>ACBEDGF%HIJBK4D4LM>N*O4PQ/RSI
TVUXWTVUYZUX[]\^'\"acb[dbfegWTfbhjilkm\nYpo]YZ\"UXqrYsbf[t\"u]U
a[]\"UXTvw^yx{z9/|}~
ZJ>

, the latter like

JJ
ö8 􏰙8Gö. Öö􏰳é0ö#(64õ(<􏰲ö0éö#( 4õ􏱍óF(+ÆÈ/􏰬Ì*9􏰲DõöE(ó[
G .ö F
􏰛HöI

Either is completely unintelligible.

> PDF file created with LaTeX is bad parsed
> -----------------------------------------
>
>                 Key: PDFBOX-534
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-534
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Parsing
>    Affects Versions: 0.8.0-incubator
>         Environment: Linux/Ubuntu 9
>            Reporter: Ernesto De Santis
>         Attachments: amapn19_03.pdf, amapn19_03.txt, kvfs-PDFKit.txt, 
> kvfs.pdf, kvfs.txt
>
>
> I'm getting unexpected behavior when parsing a PDF file.
> I'm trying to get the clean body text of a file, and I get a lot of aXX 
> strings, where each X is a digit. They appear to be the character codes of 
> the real characters, but I don't really know.
> My code is very simple:
>           String[] args = {"/home/ernesto/tesis/documento/kvfs.pdf"};
>           ExtractText.main(args);
> I used PDFBox version 0.8.0-incubator, built on 20/9/2009.
> The output I get is:
> a73a109a112a108a101a109a101a110a116a97a110a100a111 a97a99a99a101a115a111 a97 
> a115a105a115a116a101a109a97a115 a100a101
> a97a114a99a104a105a118a111a115 a118a105a114a116a117a97a108a101a115 
> a112a97a114a97 a108a97 a104a101a114a114a97a109a105a101a110a116a97
> a100a101 a98a250a115a113a117a101a100a97 a75a110a101a111a98a97a115a101
> and more ......
> The PDF file was generated by the pdflatex command on Ubuntu 9.
> The PDF properties are:
> producer: pdfTeX-1.40.3
> format: PDF-1.4
> security: NO
> optimized: NO
> paper: A4, vertical (210 x 297 mm)
> When I run the PDFBox test, I get this on the console:
> 0 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled 
> operation: d
> INFO  [main]: unsupported/disabled operation: d
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled 
> operation: J
> INFO  [main]: unsupported/disabled operation: J
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled 
> operation: m
> INFO  [main]: unsupported/disabled operation: m
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled 
> operation: l
> INFO  [main]: unsupported/disabled operation: l
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - unsupported/disabled 
> operation: S
> INFO  [main]: unsupported/disabled operation: S
> 272 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - 
> unsupported/disabled operation: re
> INFO  [main]: unsupported/disabled operation: re
> 272 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - 
> unsupported/disabled operation: f
> INFO  [main]: unsupported/disabled operation: f
> 1274 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - 
> unsupported/disabled operation: rg
> INFO  [main]: unsupported/disabled operation: rg
> 1275 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - 
> unsupported/disabled operation: RG
> INFO  [main]: unsupported/disabled operation: RG
> 1536 [main] INFO org.apache.pdfbox.util.PDFStreamEngine  - 
> unsupported/disabled operation: f*
> INFO  [main]: unsupported/disabled operation: f*
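
The reporter's guess that each aNN token in the kvfs.pdf output carries the
character code of the real character can be checked with a short sketch. The
class and method names below (GlyphNameDecoder, decode) are hypothetical and
not part of PDFBox; the sketch only post-processes the extracted string, and it
assumes that the number after "a" is a Unicode/Latin-1 code point. That
assumption appears to hold for the kvfs.pdf sample, but probably not for the
amapn19_03.pdf output (a0a2a1a4...), where the numbers look like glyph indices
rather than character codes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not a PDFBox class: turns "a73a109..." back into text,
// assuming each number is the code point of the intended character.
class GlyphNameDecoder {
    // An "a" followed by one to three decimal digits, e.g. a97 or a250.
    private static final Pattern TOKEN = Pattern.compile("a(\\d{1,3})");

    static String decode(String extracted) {
        Matcher m = TOKEN.matcher(extracted);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy anything between tokens (spaces, line breaks) unchanged.
            out.append(extracted, last, m.start());
            out.appendCodePoint(Integer.parseInt(m.group(1)));
            last = m.end();
        }
        out.append(extracted.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        // First words of the output quoted in the issue description.
        String sample = "a73a109a112a108a101a109a101a110a116a97a110a100a111 "
                + "a97a99a99a101a115a111 a97";
        System.out.println(decode(sample)); // prints "Implementando acceso a"
    }
}
```

If the guess is right, this points at a missing glyph-name-to-Unicode mapping
during extraction rather than at a damaged file, since the codes themselves
decode to readable Spanish text.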

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
