[
https://issues.apache.org/jira/browse/PDFBOX-949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12986565#action_12986565
]
Andreas Lehmkühler commented on PDFBOX-949:
-------------------------------------------
I extracted the text using the current trunk version (see attachment). There
are still some issues with the mathematical formulas and the text within the
diagrams, but the body text itself looks quite good.
> ExtractText returns junk
> ------------------------
>
> Key: PDFBOX-949
> URL: https://issues.apache.org/jira/browse/PDFBOX-949
> Project: PDFBox
> Issue Type: Bug
> Components: Text extraction
> Affects Versions: 1.4.0
> Environment: Ubuntu Linux 10.10, Sun Java 1.6.0_22
> Reporter: Nikhil Chhaochharia
> Priority: Minor
> Fix For: 1.5.0
>
> Attachments: PDFBOX945-NIPS2010_0566.pdf, PDFBOX945-NIPS2010_0566.txt
>
>
> The PDF file at http://books.nips.cc/papers/files/nips23/NIPS2010_0566.pdf
> returns some weird characters given below. No exceptions are thrown.
> The command used was "java -jar pdfbox-app-1.4.0.jar ExtractText -sort
> -console NIPS2010_0566.pdf"
> 1 1 1 1
> '—;˜: :'¸s ; s :; s˜ :
> h ` s ˆ ; s ;:s ¸ˆ:
> h ` s , s —
> [ ' : o[':p t
> u ˜
> s s
> u t
> t
> u `
> [': u
> 6
> [ ' : fi
> u — s
> u ' s u
> ˜ [': u ˜
> u
> — s s
> s s
> u ˜ u / s
> - - s s s s s
> u ˆ s
> s s
> t 1 u / s
> s o
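
Because the command quoted above throws no exception, output like this can
only be caught by inspecting the extracted text. Below is a minimal sketch of
one possible check, assuming a simple token-based heuristic. The class
JunkTextCheck, its methods, and the 0.3 threshold are illustrative
assumptions and not part of PDFBox.

```java
public class JunkTextCheck {

    // Counts the letters in one whitespace-separated token.
    static int letterCount(String token) {
        int letters = 0;
        for (int i = 0; i < token.length(); ) {
            int cp = token.codePointAt(i);
            if (Character.isLetter(cp)) {
                letters++;
            }
            i += Character.charCount(cp);
        }
        return letters;
    }

    // Fraction of tokens that contain at least three letters.
    // Returns 0.0 for empty or whitespace-only input.
    static double wordLikeRatio(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return 0.0;
        }
        String[] tokens = trimmed.split("\\s+");
        int wordLike = 0;
        for (String token : tokens) {
            if (letterCount(token) >= 3) {
                wordLike++;
            }
        }
        return (double) wordLike / tokens.length;
    }

    // Assumed threshold (hypothetical, not tuned on real data): ordinary prose
    // is expected to score well above 0.3, while output made of isolated
    // single characters scores near 0. Empty output is flagged as well.
    static boolean looksLikeJunk(String text) {
        return wordLikeRatio(text) < 0.3;
    }
}
```

Passing the extracted text to JunkTextCheck.looksLikeJunk would flag output
made up mostly of isolated characters, as in the sample above. The heuristic
is crude: a page that legitimately consists of short tokens, such as a table
of numbers, would be flagged too.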
--
This message is automatically generated by JIRA.