[ https://issues.apache.org/jira/browse/TIKA-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375439#comment-14375439 ]
Shai Erera commented on TIKA-1131:
----------------------------------
Hi [~tpalsulich], thanks for getting back to me, but I have since replaced my
laptop and no longer have that sample file. I can close the issue for now,
and if I run into it again I'll report back. OK?
> Output sentence-break "hints" for files such as PPT/X
> -----------------------------------------------------
>
> Key: TIKA-1131
> URL: https://issues.apache.org/jira/browse/TIKA-1131
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Shai Erera
> Priority: Minor
>
> Spinoff from here: http://tika.markmail.org/thread/xk5sclapbeonifzr. I
> believe these files usually contain text that does not end with the usual
> sentence-ending punctuation. As I showed in the email, the parser seems to
> separate e.g. different bullets by inserting explicit '\n' characters, but
> that is not enough per the sentence segmentation rules of UAX#29.
> It would be better if the parser output a clearer marker which the user could
> then replace with a true sentence break (e.g. \u2029), rather than
> arbitrarily replacing every '\n', which I think is not a good general
> solution.
> BTW, I parsed Impress files and it seems the parser does output some hints (I
> think <p> tags).
> I'll upload an isolated test that reproduces the output I included in the email.
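
To illustrate the idea in the description, here is a minimal sketch of a SAX
handler that appends a paragraph separator (\u2029) at the end of every <p>
element instead of rewriting every '\n'. It uses only the JDK's SAX classes.
The class name ParagraphBreakHandler, the extract() helper, and the sample
XHTML are hypothetical, and the sketch assumes the parser emits one <p> per
bullet, which is the very thing this issue questions for PPT/X.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler: collects text and marks the end of each <p>
// with U+2029 (PARAGRAPH SEPARATOR) so a downstream sentence
// segmenter sees an explicit break.
public class ParagraphBreakHandler extends DefaultHandler {
    private static final char PARAGRAPH_SEPARATOR = '\u2029';
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Copy character data through unchanged, including any '\n'.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // localName is set by namespace-aware parsers, qName otherwise.
        if ("p".equals(localName) || "p".equals(qName)) {
            text.append(PARAGRAPH_SEPARATOR);
        }
    }

    public String getText() {
        return text.toString();
    }

    // Convenience helper (an assumption for this sketch): parse an
    // XHTML string with the JDK's default SAX parser.
    static String extract(String xhtml) throws Exception {
        ParagraphBreakHandler handler = new ParagraphBreakHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample input, not actual parser output.
        String xhtml =
            "<html><body><p>First bullet</p><p>Second bullet</p></body></html>";
        // Show the separator as '|' so it is visible on a console.
        System.out.println(extract(xhtml).replace(PARAGRAPH_SEPARATOR, '|'));
        // prints "First bullet|Second bullet|"
    }
}
```

Since the handler only implements the standard SAX ContentHandler callbacks,
it could in principle be passed to any parser that reports XHTML events, but
that wiring is not shown here.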