Shai Erera created TIKA-1131:
--------------------------------
Summary: Output sentence-break "hints" for files such as PPT/X
Key: TIKA-1131
URL: https://issues.apache.org/jira/browse/TIKA-1131
Project: Tika
Issue Type: Improvement
Components: parser
Reporter: Shai Erera
Priority: Minor
Spinoff from this thread: http://tika.markmail.org/thread/xk5sclapbeonifzr. I
believe that files such as PPT/X usually contain text that does not end with the
usual sentence-ending punctuation. As I showed in the email, the parser seems to
separate e.g. different bullets by inserting '\n' characters, but that is not
enough per the sentence segmentation rules of UAX#29.
It would be better if the parser emitted a more explicit marker that the user
could then replace with a true sentence break (e.g. \u2029), rather than blindly
replacing every '\n', which I think is not a good general solution.
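
A minimal sketch of how a consumer might use such a marker, under stated assumptions: the marker character (U+001E here) and the class and method names are hypothetical, since Tika does not define a sentence-break hint. The marker is swapped for U+2029 and the result is segmented with the JDK's java.text.BreakIterator, whose sentence rules may differ in detail from UAX#29.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

class SentenceHintSketch {
    // Hypothetical hint marker (U+001E); not something Tika emits today.
    static final char HINT = '\u001E';
    // U+2029 PARAGRAPH SEPARATOR, the "true" break suggested in this issue.
    static final char PARAGRAPH_SEPARATOR = '\u2029';

    // Replace only the explicit hints; ordinary line feeds are left alone.
    static String replaceHints(String text) {
        return text.replace(HINT, PARAGRAPH_SEPARATOR);
    }

    // Split text into the segments reported by the JDK sentence iterator.
    static List<String> sentences(String text) {
        BreakIterator it = BreakIterator.getSentenceInstance(Locale.ROOT);
        it.setText(text);
        List<String> out = new ArrayList<>();
        int start = it.first();
        for (int end = it.next(); end != BreakIterator.DONE; start = end, end = it.next()) {
            out.add(text.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        // Simulated parser output: hints between bullets, a plain line feed inside one.
        String parsed = "Agenda" + HINT + "First bullet" + HINT
                + "Second bullet\ncontinued on the next line";
        for (String s : sentences(replaceHints(parsed))) {
            System.out.println("[" + s.replace(PARAGRAPH_SEPARATOR, ' ').trim() + "]");
        }
    }
}
```

The point of the design is that only positions the parser flagged become breaks, so a '\n' that merely wraps a line inside a bullet is not promoted to a sentence boundary.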
BTW, I parsed Impress files and it seems the parser does output some hints (I
think <p> tags).
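
If the hints really are <p> elements, a consumer could turn them into breaks while collecting the text. The sketch below is only an illustration under that assumption: it uses the JDK's SAX classes on a hand-written XHTML snippet rather than real Tika output, and the class name ParagraphBreakHandler is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

class ParagraphBreakHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Without namespace processing localName is empty, so fall back to qName.
        String name = (localName == null || localName.isEmpty()) ? qName : localName;
        if ("p".equals(name)) {
            text.append('\u2029'); // end of paragraph becomes U+2029
        }
    }

    String text() {
        return text.toString();
    }

    // Parse an XHTML string and return its text with U+2029 after each <p>.
    static String extract(String xhtml) {
        ParagraphBreakHandler handler = new ParagraphBreakHandler();
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xhtml)), handler);
        } catch (Exception e) {
            throw new IllegalStateException("could not parse input", e);
        }
        return handler.text();
    }

    public static void main(String[] args) {
        // Hand-written stand-in for parser output; real output may differ.
        String out = extract("<body><p>First bullet</p><p>Second bullet</p></body>");
        System.out.println(out.replace('\u2029', '|'));
    }
}
```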
I'll upload an isolated test that reproduces the output I posted in the email.