[ https://issues.apache.org/jira/browse/TIKA-2838?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16792001#comment-16792001 ]
Karl Wright commented on TIKA-2838:
-----------------------------------
Not sure the <b>/</b> tags are needed, but there is certainly a need to
separate text chunks from different sources. What's happening now is this:
{code}
What Is Apache ManifoldCFFZFarago ZoltanWhen a comment is extracted, there is
no separator.?
{code}
... and that makes it impossible to tokenize.
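
To illustrate why the missing separator matters downstream, here is a minimal sketch. It is not Tika code: the chunk boundaries are my guess from the sample above, and join_chunks and tokenize are hypothetical helpers standing in for the parser output and a real analyzer.

```python
# Hypothetical chunks; the boundaries are assumed from the sample output above.
chunks = [
    "What Is Apache ManifoldCF",
    "FZ",
    "Farago Zoltan",
    "When a comment is extracted, there is no separator.",
]


def join_chunks(parts, separator=""):
    """Concatenate extracted text chunks, optionally with a separator."""
    return separator.join(parts)


def tokenize(text):
    """Naive whitespace tokenizer, standing in for a real analyzer."""
    return text.split()


# No separator: mimics the behaviour reported for RTF comments.
glued_tokens = tokenize(join_chunks(chunks))

# Single space between chunks: the behaviour the report asks for.
spaced_tokens = tokenize(join_chunks(chunks, " "))

print(glued_tokens)
print(spaced_tokens)
```

Without a separator, the end of one chunk and the start of the next fuse into a single token, so a search for "ManifoldCF" or "Zoltan" as a whole term would miss the document.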
> RTF document processing glues comment fields together with text without
> whitespace
> ----------------------------------------------------------------------------------
>
> Key: TIKA-2838
> URL: https://issues.apache.org/jira/browse/TIKA-2838
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.17, 1.19
> Reporter: Karl Wright
> Priority: Major
>
> See ManifoldCF ticket CONNECTORS-1591 for a sample document and a description
> of the problem. Basically, comment fields for RTF documents are glued
> together with no whitespace between them, while other document formats
> properly put in a space (e.g. .docx).