[jira] [Commented] (TIKA-1468) Symbol character handling in WordExtractor

Moritz Dorka (JIRA) Sat, 15 Nov 2014 03:46:10 -0800

    [ 
https://issues.apache.org/jira/browse/TIKA-1468?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14213535#comment-14213535
 ]


Moritz Dorka commented on TIKA-1468:
------------------------------------

So here is a JUnit test case which relies on the special handling of characters 
from the "Symbol" font. The Microsoft specs describe a case where these 
"special characters" already arrive in their Unicode representation (thus 
triggering the fallback in [^WordExtractor.patch]). However, I have no idea how 
to create a Word file that actually exhibits this behavior...
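
For illustration only, here is a minimal sketch of the kind of mapping under 
discussion. It is not the code from the attached patch. The class and method 
names are hypothetical, the table is deliberately partial, and it assumes that 
symbol-font characters are stored either in the private use range 
U+F020..U+F0FF (font code plus 0xF000) or as the plain font code. Anything not 
recognised is passed through unchanged, which is how the "already Unicode" 
case would fall out here.

```java
import java.util.Map;

/** Hypothetical helper for illustration; not part of Tika or POI. */
final class SymbolFontSketch {

    // Partial table: character code in the "Symbol" font -> Unicode character.
    // Only a handful of Greek letters are listed to keep the sketch short.
    private static final Map<Character, Character> SYMBOL_TO_UNICODE = Map.of(
            'a', '\u03B1',  // alpha
            'b', '\u03B2',  // beta
            'g', '\u03B3',  // gamma
            'd', '\u03B4',  // delta
            'p', '\u03C0'   // pi
    );

    /**
     * Maps a character from a run whose font is "Symbol" to Unicode.
     * Characters from other fonts, and characters without a table entry
     * (including ones that are already proper Unicode), are returned as-is.
     */
    static char toUnicode(char c, String fontName) {
        if (!"Symbol".equalsIgnoreCase(fontName)) {
            return c;
        }
        char code = c;
        // Assumption: private-use characters carry the font code + 0xF000.
        if (c >= 0xF020 && c <= 0xF0FF) {
            code = (char) (c - 0xF000);
        }
        Character mapped = SYMBOL_TO_UNICODE.get(code);
        return mapped != null ? mapped : c;
    }
}
```

With this sketch, both {{toUnicode((char) 0xF061, "Symbol")}} and 
{{toUnicode('a', "Symbol")}} yield the Greek small letter alpha, while a 
character that is already U+03B1 is returned untouched.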

Regarding where this logic should live: Does Tika actually make use of POI's 
{{AbstractWordConverter}}?


> Symbol character handling in WordExtractor
> ------------------------------------------
>
>                 Key: TIKA-1468
>                 URL: https://issues.apache.org/jira/browse/TIKA-1468
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser
>    Affects Versions: 1.6
>            Reporter: Moritz Dorka
>            Priority: Minor
>         Attachments: WordExtractor.patch, WordParserTest.patch, 
> testWORD_specialcharacters.tar.bz2
>
>
> Attached is a patch to allow for proper handling of _symbol characters_ in 
> *.doc files (i.e. stuff which can be inserted via Insert->Symbol in Word).
> Side note: I am a little unsure where exactly the boundary between the scope 
> of TIKA and POI lies here. Theoretically one could add that patch to 
> {{org.apache.poi.hwpf.converter.AbstractWordConverter.processSymbol(HWPFDocument,
>  CharacterRun, Element)}} as well.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
