[jira] [Commented] (TIKA-711) Word parser doesn't extract optional hyphen correctly

Michael McCandless (Commented) (JIRA) Sat, 01 Oct 2011 13:52:58 -0700

    [ https://issues.apache.org/jira/browse/TIKA-711?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13118905#comment-13118905 ]


Michael McCandless commented on TIKA-711:
-----------------------------------------

Curiously, POI's WordToTextConverter command-line tool produces U+200B
(ZERO WIDTH SPACE) for the optional hyphen, which I think is at least
better than ASCII 31.  I'm still not sure whether there's a POI option we
can set to get this character out as U+00AD.
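
Failing a POI option, one possible workaround is to post-process the extracted
text.  The sketch below is only an illustration, not code from Tika or POI: the
class and method names are hypothetical, and it assumes the text handed back for
a DOC file still contains the raw ASCII 31 (U+001F) character wherever Word
stored an optional hyphen.

```java
// Hypothetical helper for illustration; not part of Tika or POI.
final class OptionalHyphenNormalizer {

    // Assumption: the DOC text contains ASCII 31 where Word stored an optional hyphen.
    static final char WORD_OPTIONAL_HYPHEN = '\u001F';

    // Unicode SOFT HYPHEN, the character the issue suggests we should emit.
    static final char SOFT_HYPHEN = '\u00AD';

    // Replaces every ASCII 31 in the extracted text with U+00AD.
    static String normalize(String extracted) {
        return extracted.replace(WORD_OPTIONAL_HYPHEN, SOFT_HYPHEN);
    }

    public static void main(String[] args) {
        String raw = "opt" + WORD_OPTIONAL_HYPHEN + "ional";
        String fixed = normalize(raw);
        // Print the code point that now sits at the hyphenation position.
        System.out.printf("U+%04X%n", (int) fixed.charAt(3)); // prints U+00AD
    }
}
```

This deliberately leaves U+200B alone, since a ZERO WIDTH SPACE in the output
might be legitimate content rather than a converted optional hyphen.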
                
> Word parser doesn't extract optional hyphen correctly
> -----------------------------------------------------
>
>                 Key: TIKA-711
>                 URL: https://issues.apache.org/jira/browse/TIKA-711
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>            Reporter: Michael McCandless
>         Attachments: TIKA-711.patch, testOptionalHyphen.doc, 
> testOptionalHyphen.docx, testOptionalHyphen.pdf, testOptionalHyphen.ppt, 
> testOptionalHyphen.pptx, testOptionalHyphen.rtf
>
>
> We seem not to extract the optional hyphen character correctly in
> the Word parser.
> You can create this char in Word by pressing Ctrl and -.  It's normally
> hidden; you have to turn on display of formatting marks to see it.
> Ideally we'd get U+00AD (Unicode soft hyphen), I think.
> DOC produces a Unicode replacement char, which is wrong.
> DOCX and PDF drop the char (which seems acceptable).  RTF produces
> U+2027 (hyphenation point) which also seems OK (in TIKA-683 it will
> produce U+00AD).
> PPT and PPTX work correctly (U+00AD).
> So DOC is the only bug I think -- I haven't dug into what's wrong
> yet...

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
