[ 
https://issues.apache.org/jira/browse/TIKA-392?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12930726#action_12930726
 ] 

Thiago Souza commented on TIKA-392:
-----------------------------------

The extra space appears in words that contain accented characters, because 
the insertString method is invoked separately for each accented letter.

For example, the following phrase inside an RTF document (in Portuguese):

        "GOVERNO DO ESTADO DO ESPÍRITO SANTO"

is extracted as:

        "GOVERNO DO ESTADO DO ESP Í RITO SANTO"

because insertString is invoked three times, with "GOVERNO DO ESTADO DO ESP", 
"Í" and "RITO SANTO".
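
The behaviour described above can be reproduced outside of Tika with a 
minimal sketch. It assumes the document class appends a separator space 
after every insertString call; the SpacingDocument class and the extract 
helper are hypothetical names used only for illustration and are not 
Tika's actual code. Only the standard javax.swing.text classes are used.

```java
import javax.swing.text.AttributeSet;
import javax.swing.text.BadLocationException;
import javax.swing.text.DefaultStyledDocument;

public class InsertStringSpacingDemo {

    /**
     * Hypothetical document (an assumption, not Tika's implementation)
     * that appends one space after every inserted fragment.
     */
    static class SpacingDocument extends DefaultStyledDocument {
        @Override
        public void insertString(int offs, String str, AttributeSet a)
                throws BadLocationException {
            super.insertString(offs, str + " ", a);
        }
    }

    /** Inserts each fragment at the end of the document and returns the trimmed text. */
    static String extract(String... fragments) {
        SpacingDocument doc = new SpacingDocument();
        try {
            for (String fragment : fragments) {
                doc.insertString(doc.getLength(), fragment, null);
            }
            return doc.getText(0, doc.getLength()).trim();
        } catch (BadLocationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The accented letter arrives as its own fragment, as described above.
        // \u00CD is the letter I with an acute accent.
        System.out.println(
                extract("GOVERNO DO ESTADO DO ESP", "\u00CD", "RITO SANTO"));
    }
}
```

Under that assumption, a word that is split into fragments around an 
accented letter ends up with a space on each side of the letter, while a 
phrase delivered in a single call is left untouched.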

I don't know whether the problem lies in RTFEditorKit or in RTFParser.

Any workaround?

> RTF parser smashes words together in subsequent table cells
> -----------------------------------------------------------
>
>                 Key: TIKA-392
>                 URL: https://issues.apache.org/jira/browse/TIKA-392
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 0.7
>
>
> I have an RTF document with the following snippet of content (it's an export 
> of a private phone book so I can't share the full document):
> {\rtlch\fcs1 \af0\afs24 \ltrch\fcs0 
> \f0\fs24\lang2055\langfe2055\langfenp2055\insrsid9461491\charrsid9461491 Fax 
> / Phone Station\cell Fax / Phone #\cell }
> The extracted text is:
> Fax / Phone StationFax / Phone
> Note how the cell boundary between "Station" and "Fax" is lost.
