Shai Erera created TIKA-1131:
--------------------------------

             Summary: Output sentence-break "hints" for files such as PPT/X
                 Key: TIKA-1131
                 URL: https://issues.apache.org/jira/browse/TIKA-1131
             Project: Tika
          Issue Type: Improvement
          Components: parser
            Reporter: Shai Erera
            Priority: Minor


Spinoff from this thread: http://tika.markmail.org/thread/xk5sclapbeonifzr. I 
believe that text in these files usually does not end with the usual 
sentence-ending punctuation. As I showed in the email, the parser seems to 
separate e.g. individual bullets by inserting '\n' characters, but that is not 
enough for sentence segmentation under the rules of UAX#29.

It would be better if the parser output a clearer marker which the user could 
then replace with a true sentence break (e.g. \u2029), rather than arbitrarily 
replacing every '\n', which I think is not a good general solution.
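
To make the idea concrete, here is a rough sketch of what the consuming side 
could look like. It is only an illustration under assumptions: the HINT marker 
(a private-use character) is hypothetical and is not something Tika emits 
today, and the class and method names are made up for this example. The 
segmentation step uses the JDK's java.text.BreakIterator, whose behavior may 
differ from other UAX#29 implementations.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

final class SentenceBreakHints {
    // Hypothetical marker the parser would emit; NOT an existing Tika feature.
    static final String HINT = "\uE000";
    // U+2029 PARAGRAPH SEPARATOR, the "true" break the user substitutes.
    static final String PARAGRAPH_SEPARATOR = "\u2029";

    // Replace only the explicit hints; ordinary '\n' characters are untouched.
    static String applyHints(String extracted) {
        return extracted.replace(HINT, PARAGRAPH_SEPARATOR);
    }

    // Split text into segments at the boundaries the JDK sentence iterator reports.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        List<String> result = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            result.add(text.substring(start, end));
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up parser output: two bullets, the second wrapped over two lines.
        String extracted = "First bullet" + HINT + "Second bullet\nwrapped line" + HINT;
        for (String s : sentences(applyHints(extracted))) {
            System.out.println("[" + s.trim() + "]");
        }
    }
}
```

The point of the sketch is that the wrapped line keeps its plain '\n', while 
only the positions the parser explicitly flagged become sentence breaks.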

BTW, I parsed Impress files and it seems the parser does output some hints (I 
think <p> tags).

I'll upload an isolated test that reproduces the output I included in the email.

