[ 
https://issues.apache.org/jira/browse/PDFBOX-568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andreas Lehmkühler resolved PDFBOX-568.
---------------------------------------

    Fix Version/s: 1.3.0
       Resolution: Fixed

Revision 992066 fixes the text extraction issue with 
sample_fonts_solidconvertor.pdf and cweb.pdf from our test arena.

To achieve that, I rearranged and improved the code that handles the 
encoding. The next step will hopefully be adding support for CID-coded fonts.

> testextract failure on Linux and Mac OS X
> -----------------------------------------
>
>                 Key: PDFBOX-568
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-568
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 0.8.0-incubator
>            Reporter: Jukka Zitting
>             Fix For: 1.3.0
>
>
> As discussed on the mailing list, the extraction test case seems to fail on 
> non-Windows platforms.
> The troublesome test file is sample_fonts_solidconvertor.pdf, and the 
> textextract.log file says the following (^@ is U+0000 and � is U+FFFD):
> Lines differ at index expected:46-253 actual:46-65533
> FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected 
> line: 8 at actual line: 8
>   expected line was: "^...@v^@e...@r^@d...@a^@n...@a^@:^@ ^...@t^@o...@t^@o^@ 
> ^...@j^@e^@ ^...@p^@o...@k^@u...@s^@n...@ý^@ ^...@t^@e...@x^@t^@ ^...@s^@ ^A"
>   actual line was:   "^...@v^@e...@r^@d...@a^@n...@a^@:^@ ^...@t^@o...@t^@o^@ 
> ^...@j^@e^@ ^...@p^@o...@k^@u...@s^@n...@�^@ ^...@t^@e...@x^@t^@ ^...@s^@ ^A"
> Lines differ at index expected:4-253 actual:4-65533
> FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected 
> line: 10 at actual line: 10
>   expected line was: "^ay^...@ý^@�...@í^@é"
>   actual line was:   "^ay^...@�^@�...@�^@�"
> Lines differ at index expected:52-253 actual:52-65533
> FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected 
> line: 11 at actual line: 11
>   expected line was: "^...@s^@a...@n^@s^@ ^...@s^@e...@r^@i...@f^@:^@ 
> ^...@t^@o...@t^@o^@ ^...@j^@e^@ ^...@p^@o...@k^@u...@s^@n...@ý^@ 
> ^...@t^@e...@x^@t^@ ^...@s^@ ^A"
>   actual line was:   "^...@s^@a...@n^@s^@ ^...@s^@e...@r^@i...@f^@:^@ 
> ^...@t^@o...@t^@o^@ ^...@j^@e^@ ^...@p^@o...@k^@u...@s^@n...@�^@ 
> ^...@t^@e...@x^@t^@ ^...@s^@ ^A"
> Lines differ at index expected:4-253 actual:4-65533
> FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected 
> line: 13 at actual line: 13
>   expected line was: "^ay^...@ý^@�...@í^@é"
>   actual line was:   "^ay^...@�^@�...@�^@�"
> Preparing to parse sample_fonts_solidconvertor.pdf for sorted test
> Lines differ at index expected:46-253 actual:46-65533
> FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected 
> line: 8 at actual line: 8
>   expected line was: "^...@v^@e...@r^@d...@a^@n...@a^@:^@ ^...@t^@o...@t^@o^@ 
> ^...@j^@e^@ ^...@p^@o...@k^@u...@s^@n...@ý^@ ^...@t^@e...@x^@t^@ ^...@s^@ ^A"
>   actual line was:   "^...@v^@e...@r^@d...@a^@n...@a^@:^@ ^...@t^@o...@t^@o^@ 
> ^...@j^@e^@ ^...@p^@o...@k^@u...@s^@n...@�^@ ^...@t^@e...@x^@t^@ ^...@s^@ ^A"
> Lines differ at index expected:0-253 actual:0-65533
> FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected 
> line: 10 at actual line: 10
>   expected line was: "^...@ý^@�...@í^@é"
>   actual line was:   "^...@�^@�...@�^@�"
> Lines differ at index expected:52-253 actual:52-65533
> FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected 
> line: 11 at actual line: 11
>   expected line was: "^...@s^@a...@n^@s^@ ^...@s^@e...@r^@i...@f^@:^@ 
> ^...@t^@o...@t^@o^@ ^...@j^@e^@ ^...@p^@o...@k^@u...@s^@n...@ý^@ 
> ^...@t^@e...@x^@t^@ ^...@s^@ ^A"
>   actual line was:   "^...@s^@a...@n^@s^@ ^...@s^@e...@r^@i...@f^@:^@ 
> ^...@t^@o...@t^@o^@ ^...@j^@e^@ ^...@p^@o...@k^@u...@s^@n...@�^@ 
> ^...@t^@e...@x^@t^@ ^...@s^@ ^A"
> Lines differ at index expected:4-253 actual:4-65533
> FAILURE: Line mismatch for file sample_fonts_solidconvertor.pdf at expected 
> line: 13 at actual line: 13
>   expected line was: "^a~^...@ý^@�...@í^@é"
>   actual line was:   "^a~^...@�^@�...@�^@�"
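
The "expected:…-253 actual:…-65533" pairs in the log above mean the test expected U+00FD (ý) and got U+FFFD (the replacement character). One way such a mismatch can arise, offered here as an assumption and not as a diagnosis of the PDFBox code, is that the byte 0xFD is decoded with a charset that depends on the platform. The sketch below uses only the standard Java String constructor; the class name DecodeDemo and the helper decodeByte are hypothetical and do not come from PDFBox.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class DecodeDemo {
    // Hypothetical helper: decode one byte with the given charset and
    // return the first code point of the resulting string.
    static int decodeByte(int value, Charset charset) {
        byte[] raw = { (byte) value };
        return new String(raw, charset).codePointAt(0);
    }

    public static void main(String[] args) {
        // In ISO-8859-1 the byte 0xFD maps directly to U+00FD.
        System.out.println(decodeByte(0xFD, StandardCharsets.ISO_8859_1));
        // In UTF-8 a lone 0xFD is malformed, so Java substitutes U+FFFD.
        System.out.println(decodeByte(0xFD, StandardCharsets.UTF_8));
        // The platform default differs between systems, which is why
        // code that omits the charset can behave differently per OS.
        System.out.println(Charset.defaultCharset());
    }
}
```

Passing the charset explicitly, instead of relying on Charset.defaultCharset(), keeps the decoded result the same on Windows, Linux and Mac OS X.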

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.