[jira] [Commented] (TIKA-1131) Output sentence-break "hints" for files such as PPT/X

Shai Erera (JIRA) Sun, 22 Mar 2015 23:02:33 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-1131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14375439#comment-14375439
 ]


Shai Erera commented on TIKA-1131:
----------------------------------

Hi [~tpalsulich], thanks for getting back to me, but I've since replaced my 
laptop and I don't have that sample file anymore. I can close the issue for now, 
and if I run into it again I'll report back. OK?

> Output sentence-break "hints" for files such as PPT/X
> -----------------------------------------------------
>
>                 Key: TIKA-1131
>                 URL: https://issues.apache.org/jira/browse/TIKA-1131
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>            Reporter: Shai Erera
>            Priority: Minor
>
> Spinoff from here: http://tika.markmail.org/thread/xk5sclapbeonifzr. I 
> believe that these files usually contain text that does not end with the 
> usual sentence breaks. As I've shown in the email, the parser seems to mark 
> e.g. separate bullets by inserting explicit '\n' characters, but that's not 
> enough per the sentence segmentation rules of UAX#29.
> It would be better if the parser output a clearer marker which the user could 
> then replace with a true sentence break (e.g. \u2029), rather than 
> arbitrarily replacing every '\n', which I think is not a good general 
> solution.
> BTW, I parsed Impress files and it seems the parser does output some hints (I 
> think <p> tags).
> I'll upload an isolated test which reproduces the output I put in the email.
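
To illustrate the kind of post-processing the description has in mind, here is a minimal sketch of a SAX handler that appends \u2029 whenever a <p> element closes. It assumes the parser emits <p> elements for the blocks in question, which the description only suspects for Impress files and which may not hold for PPT/X. The class name ParagraphBreakHandler and the extract() helper are hypothetical and exist only for this example; only JDK classes are used.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper, not part of Tika: collects character data and
// appends U+2029 (PARAGRAPH SEPARATOR) each time a <p> element ends.
class ParagraphBreakHandler extends DefaultHandler {
    private static final char PARAGRAPH_SEPARATOR = '\u2029';
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // A parser that is not namespace-aware may leave localName empty,
        // so check the qualified name as well.
        if ("p".equals(localName) || "p".equals(qName)) {
            text.append(PARAGRAPH_SEPARATOR);
        }
    }

    public String getText() {
        return text.toString();
    }

    // Convenience method for trying the handler on an XHTML string.
    static String extract(String xhtml) throws Exception {
        ParagraphBreakHandler handler = new ParagraphBreakHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.getText();
    }
}
```

Since the handler is an ordinary org.xml.sax.ContentHandler, it could presumably be passed to a Tika parser in place of a plain text handler. Whether the result segments well still depends on the parser emitting a <p> per bullet or text block, which is exactly the open question in this issue.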



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
