[
https://issues.apache.org/jira/browse/PDFBOX-534?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789786#action_12789786
]
Ernesto De Santis commented on PDFBOX-534:
------------------------------------------
I found the cause of the problem, but not the solution.
The font encoding is either detected incorrectly or not supported.
While debugging the PDFBox classes, I found the lines that encode the characters,
which is where the character is read incorrectly. Look at these lines:
Class PDFont, method String encode( byte[] c, int offset, int length ), line
438.
438 Encoding encoding = getEncoding();
439 if( encoding != null)
440 {
441 retval = encoding.getCharacter( getCodeFromArray( c, offset, length ) );
442 }
443 if( retval == null )
444 {
445 retval = getStringFromArray( c, offset, length );
446 }
On the first line, getEncoding() returns an
org.apache.pdfbox.encoding.DictionaryEncoding, so execution enters the if on
line 439 and getCharacter returns an aXX string. Because retval is then not
null, the second if (line 443) is skipped. However, when I evaluated
getStringFromArray by hand, it returned a normal character such as 'i'.
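To illustrate what these aXX strings seem to contain, here is a minimal sketch that turns them back into text. It assumes that each token is the letter 'a' followed by a decimal character code, which matches the output in the original report but is not confirmed by the PDFBox source. The class GlyphNameDecoder and its decode method are hypothetical helpers, not part of PDFBox.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class GlyphNameDecoder {
    // Assumption: a token is 'a' followed by a decimal character code, e.g. "a105".
    private static final Pattern CODE_NAME = Pattern.compile("a(\\d{1,3})");

    /** Replaces every aNNN token with the character whose code is NNN. */
    public static String decode(String text) {
        Matcher m = CODE_NAME.matcher(text);
        StringBuilder out = new StringBuilder();
        int last = 0;
        while (m.find()) {
            // Copy any text between tokens (for example the spaces) unchanged.
            out.append(text, last, m.start());
            out.append((char) Integer.parseInt(m.group(1)));
            last = m.end();
        }
        out.append(text.substring(last));
        return out.toString();
    }

    public static void main(String[] args) {
        // Sample taken from the extraction output quoted in the issue.
        System.out.println(decode("a100a101 a98a250a115a113a117a101a100a97"));
    }
}
```

This only post-processes the extracted text. It would also rewrite any genuine text that happens to look like "a" followed by digits, so it is a diagnostic aid and not a fix for the encoding lookup itself.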
Then I tried two things. First, I tried to understand what is wrong with my
font encoding and what produces it. My PDF is generated by LaTeX, and I found
that the package \usepackage[T1]{fontenc} is used for European accented
characters; I am using it. I removed this line from my LaTeX source file and
generated the PDF again. When I ran the PDFBox test again, I got a better
result:
Implementando acceso a sistemas de
archivos virtuales para la herramienta
de b usqueda Kneobase
Alumno: Ernesto De Santis
Director: Pablo Ernesto Mart nez L opez
But WITHOUT the accented characters.
Second, I tried using getStringFromArray instead of encoding.getCharacter in
the PDFBox source, after restoring the LaTeX source to its original form. The
result was similar, with broken accented characters:
Implementando acceso a sistemas de
archivos virtuales para la herramienta
de b�squeda Kneobase
Alumno: Ernesto De Santis
Director: Pablo Ernesto Mart�nez L�pez
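One possible explanation for the broken accents, which I have not verified against kvfs.pdf, is a charset mismatch: the accented letter arrives as a single byte (250 for 'ú' in the output quoted in the issue), and that byte is not valid on its own in UTF-8. The sketch below only demonstrates that effect with plain Java strings. The class CharsetMismatchDemo and its helper methods are hypothetical and do not use any PDFBox API.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class CharsetMismatchDemo {
    // "búsqueda" as single-byte codes; 0xFA (250) is 'ú' in ISO-8859-1.
    public static byte[] sampleBytes() {
        return new byte[] { 0x62, (byte) 0xFA, 0x73, 0x71, 0x75, 0x65, 0x64, 0x61 };
    }

    /** Decodes the bytes with the given charset; malformed input is replaced, not rejected. */
    public static String decode(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] raw = sampleBytes();
        // Read as ISO-8859-1, byte 0xFA maps to U+00FA.
        System.out.println(decode(raw, StandardCharsets.ISO_8859_1));
        // Read as UTF-8, the lone byte 0xFA is malformed and becomes U+FFFD.
        System.out.println(decode(raw, StandardCharsets.UTF_8));
    }
}
```

If this is what happens here, the byte values themselves would be fine and only the charset used to turn them into a String (or to print them) would need to change. That remains an assumption until checked in the debugger.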
> PDF file created with LaTeX is bad parsed
> -----------------------------------------
>
> Key: PDFBOX-534
> URL: https://issues.apache.org/jira/browse/PDFBOX-534
> Project: PDFBox
> Issue Type: Bug
> Components: Parsing
> Affects Versions: 0.8.0-incubator
> Environment: Linux/Ubuntu 9
> Reporter: Ernesto De Santis
> Attachments: kvfs.pdf, kvfs.txt
>
>
> I'm getting unexpected behavior when parsing a PDF file.
> I'm trying to get the clean body text of a file, and I get a lot of aXX
> strings, where each X is a digit. The number appears to be the character
> code of the real character, but I don't know for sure.
> My code is very simple:
> String[] args = {"/home/ernesto/tesis/documento/kvfs.pdf"};
> ExtractText.main(args);
> I used PDFBox version 0.8.0-incubator, built on 20/9/2009.
> The output I get is:
> a73a109a112a108a101a109a101a110a116a97a110a100a111 a97a99a99a101a115a111 a97
> a115a105a115a116a101a109a97a115 a100a101
> a97a114a99a104a105a118a111a115 a118a105a114a116a117a97a108a101a115
> a112a97a114a97 a108a97 a104a101a114a114a97a109a105a101a110a116a97
> a100a101 a98a250a115a113a117a101a100a97 a75a110a101a111a98a97a115a101
> and more ......
> The PDF file was generated by the pdflatex command on Ubuntu 9.
> The pdf properties are:
> producer: pdfTeX-1.40.3
> format: PDF-1.4
> security: NO
> optimized: NO
> paper: A4, vertical (210 x 297 mm)
> When I run the PDFBox test, I get this on the console:
> 0 [main] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled
> operation: d
> INFO [main]: unsupported/disabled operation: d
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled
> operation: J
> INFO [main]: unsupported/disabled operation: J
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled
> operation: m
> INFO [main]: unsupported/disabled operation: m
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled
> operation: l
> INFO [main]: unsupported/disabled operation: l
> 7 [main] INFO org.apache.pdfbox.util.PDFStreamEngine - unsupported/disabled
> operation: S
> INFO [main]: unsupported/disabled operation: S
> 272 [main] INFO org.apache.pdfbox.util.PDFStreamEngine -
> unsupported/disabled operation: re
> INFO [main]: unsupported/disabled operation: re
> 272 [main] INFO org.apache.pdfbox.util.PDFStreamEngine -
> unsupported/disabled operation: f
> INFO [main]: unsupported/disabled operation: f
> 1274 [main] INFO org.apache.pdfbox.util.PDFStreamEngine -
> unsupported/disabled operation: rg
> INFO [main]: unsupported/disabled operation: rg
> 1275 [main] INFO org.apache.pdfbox.util.PDFStreamEngine -
> unsupported/disabled operation: RG
> INFO [main]: unsupported/disabled operation: RG
> 1536 [main] INFO org.apache.pdfbox.util.PDFStreamEngine -
> unsupported/disabled operation: f*
> INFO [main]: unsupported/disabled operation: f*