[ https://issues.apache.org/jira/browse/TIKA-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362109#comment-14362109 ]
Tyler Palsulich commented on TIKA-1131:
---------------------------------------
Hi [~shaie]. Sorry no one responded to this! Can you upload a file with the
bullets (and *s) you described in your email? Thanks!
> Output sentence-break "hints" for files such as PPT/X
> -----------------------------------------------------
>
> Key: TIKA-1131
> URL: https://issues.apache.org/jira/browse/TIKA-1131
> Project: Tika
> Issue Type: Improvement
> Components: parser
> Reporter: Shai Erera
> Priority: Minor
>
> Spinoff from this thread: http://tika.markmail.org/thread/xk5sclapbeonifzr. I
> believe that text in these files usually does not end with the usual sentence
> breaks. As I showed in the email, the parser seems to mark e.g. different
> bullets by inserting explicit '\n' characters, but that's not enough per the
> sentence segmentation rules of UAX#29.
> It would be better if the parser output a clearer marker which the user could
> then replace with a true sentence break (e.g. \u2029), rather than
> arbitrarily replacing every '\n', which I think is not a good general
> solution.
> BTW, I parsed Impress files and it seems the parser does output some hints (I
> think <p> tags).
> I'll upload an isolated test that reproduces the output I included in the email.
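
For illustration only, here is a minimal sketch of the post-processing step the
description asks for, assuming the parser emitted some dedicated marker between
bullets. The marker (a private-use code point), the class name and the method
names are all hypothetical; Tika does not define any of them. The sketch swaps
the marker for U+2029 and leaves ordinary '\n' characters untouched.

```java
import java.util.ArrayList;
import java.util.List;

public class SentenceHints {
    // Hypothetical marker: a private-use code point that a parser might emit
    // between bullets. This is an assumption, not something Tika outputs.
    static final String HINT = "\uE000";

    // U+2029 PARAGRAPH SEPARATOR, the "true" break suggested in the issue.
    static final String PARAGRAPH_SEPARATOR = "\u2029";

    // Replace only the explicit hints; plain newlines are left as they are.
    static String toParagraphSeparators(String extracted) {
        return extracted.replace(HINT, PARAGRAPH_SEPARATOR);
    }

    // Split on the paragraph separator and drop empty pieces.
    static List<String> segments(String text) {
        List<String> out = new ArrayList<>();
        for (String part : text.split(PARAGRAPH_SEPARATOR)) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                out.add(trimmed);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up slide text: three bullets, the second wrapped with a '\n'.
        String extracted = "First bullet" + HINT
                + "Second bullet\nstill second" + HINT
                + "Third bullet";
        for (String s : segments(toParagraphSeparators(extracted))) {
            System.out.println("[" + s.replace('\n', ' ') + "]");
        }
    }
}
```

Because only the marker is replaced, a soft line wrap inside a bullet stays in
the same segment. That is the difference from replacing every '\n'.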
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)