[ 
https://issues.apache.org/jira/browse/PDFBOX-861?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12921524#action_12921524
 ] 

Reinhard Schwab commented on PDFBOX-861:
----------------------------------------

Yes, I can confirm that this seems to be fixed. I no longer get the extra
spaces in my test case.

Regarding the umlauts: there are other special Unicode characters in the
text as well, used not only for umlauts but also to mark list items.

example:
 je/KOUS schoner die Spatzen singen, desto/KON spater ist es.9
 je/KOUS spater der Abend, um/APPR so/ADV schoner die Gaste.
 je/KOUS spater der Abend, umso/KON schoner die Gaste.

Do these characters need a special mapping?

Some of them I don't understand yet:

Der Begri\u000B  Wortform

in Zi\u000Bern, Satzzeichen
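
To illustrate what such a special mapping could look like, here is a minimal
sketch of a post-processing step applied to the extracted text. It assumes
that \u000B in the two samples above stands for the "ff" ligature (so that
the words read "Begriff" and "Ziffern"); that is a guess from the context,
not something confirmed by PDFBox. The class name LigatureFix and the method
fix are hypothetical and are not part of the PDFBox API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class LigatureFix {
    // Assumption: U+000B represents the "ff" ligature in this document.
    // Only this one mapping is suggested by the samples; others would
    // have to be added once they are understood.
    private static final Map<Character, String> REPLACEMENTS = new LinkedHashMap<>();
    static {
        REPLACEMENTS.put((char) 0x0B, "ff");
    }

    // Replaces every mapped character in the extracted text and leaves
    // all other characters untouched.
    public static String fix(String extracted) {
        StringBuilder sb = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            String replacement = REPLACEMENTS.get(c);
            if (replacement != null) {
                sb.append(replacement);
            } else {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String sample = "in Zi" + (char) 0x0B + "ern, Satzzeichen";
        System.out.println(fix(sample));
    }
}
```

A fix inside the extractor itself would be preferable if the font's encoding
can be resolved there; this sketch only shows the idea of a mapping table.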


> german umlaute are not recognized
> ---------------------------------
>
>                 Key: PDFBOX-861
>                 URL: https://issues.apache.org/jira/browse/PDFBOX-861
>             Project: PDFBox
>          Issue Type: Bug
>          Components: Text extraction
>    Affects Versions: 1.3.0
>         Environment: tika-0.8
>            Reporter: Reinhard Schwab
>         Attachments: stts-guide.pdf
>
>
> German umlauts are not recognized in this document:
> http://www.computing.dcu.ie/~irehbein/SS08/uebung1/stts-guide.pdf
> Guidelines f
> 
> ur das Tagging deutscher Textcorpora

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
