[ 
https://issues.apache.org/jira/browse/TIKA-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14362109#comment-14362109
 ] 

Tyler Palsulich commented on TIKA-1131:
---------------------------------------

Hi [~shaie]. Sorry that no one responded to this! Can you upload a file with 
the bullets (and asterisks) you described in your email? Thanks!

> Output sentence-break "hints" for files such as PPT/X
> -----------------------------------------------------
>
>                 Key: TIKA-1131
>                 URL: https://issues.apache.org/jira/browse/TIKA-1131
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Shai Erera
>            Priority: Minor
>
> Spinoff from here: http://tika.markmail.org/thread/xk5sclapbeonifzr. I 
> believe that files such as PPT/PPTX usually contain text that does not end 
> with the usual sentence breaks. As I've shown in the email, the parser seems 
> to separate e.g. different bullets by inserting explicit '\n' characters, but 
> that's not enough per the sentence segmentation rules of UAX#29.
> It would be better if the parser output a clearer marker which the user could 
> then replace with a true sentence break (e.g. \u2029), rather than having the 
> user arbitrarily replace every '\n', which I think is not a good general 
> solution.
> BTW, I parsed Impress files and it seems the parser does output some hints (I 
> think <p> tags).
> I'll upload an isolated test which generates the output as I put in the email.
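
For illustration only, here is a minimal Java sketch of the two post-processing options contrasted in the description above. It is not Tika code: the class name, the method names, and the HINT marker (a private-use code point, U+E000) are hypothetical stand-ins, since the issue does not say what marker the parser would actually emit.

```java
public final class SentenceBreakHints {
    // Hypothetical marker; Tika does not emit this. A private-use code point
    // (U+E000) stands in for whatever hint the parser might output.
    public static final String HINT = "\uE000";

    // U+2029 PARAGRAPH SEPARATOR, the "true sentence break" named in the issue.
    public static final String PARAGRAPH_SEPARATOR = "\u2029";

    // Option 1: replace only the explicit hints; ordinary line feeds are untouched.
    public static String replaceHints(String extracted) {
        return extracted.replace(HINT, PARAGRAPH_SEPARATOR);
    }

    // Option 2: the blunt workaround, which turns every line feed into a break,
    // including line feeds that merely wrap text inside one bullet.
    public static String replaceNewlines(String extracted) {
        return extracted.replace("\n", PARAGRAPH_SEPARATOR);
    }

    // Helper for the demo: counts how many paragraph separators a string contains.
    static long countSeparators(String s) {
        return s.chars().filter(c -> c == 0x2029).count();
    }

    public static void main(String[] args) {
        // Two bullets, each terminated by a hint; the second has an internal line feed.
        String withHints = "First bullet" + HINT + "Second bullet,\nwrapped" + HINT;
        // The same bullets as they come out today, separated only by line feeds.
        String withNewlines = "First bullet\nSecond bullet,\nwrapped\n";

        System.out.println("breaks from hints:    " + countSeparators(replaceHints(withHints)));
        System.out.println("breaks from newlines: " + countSeparators(replaceNewlines(withNewlines)));
    }
}
```

With a dedicated marker, the line feed inside the second bullet survives as plain whitespace, whereas the blanket '\n' replacement also splits that bullet in two.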



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
