[ 
https://issues.apache.org/jira/browse/TIKA-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tyler Palsulich closed TIKA-765.
--------------------------------
    Resolution: Won't Fix

Closing as Won't Fix since the Persian character issues seem to be solved.

> add icu dependency
> ------------------
>
>                 Key: TIKA-765
>                 URL: https://issues.apache.org/jira/browse/TIKA-765
>             Project: Tika
>          Issue Type: Improvement
>          Components: general
>    Affects Versions: 0.10
>            Reporter: Robert Muir
>
> Spinoff of TIKA-713.
> PDFBox uses reflection to detect whether ICU is available on the classpath.
> If it is, PDFBox can use ICU's BiDi support to properly extract right-to-left
> text; otherwise, the text is returned "backwards". This is because the JDK
> does not provide the functionality needed for this inverse BiDi reordering
> and Arabic unshaping.
> It would be nice to depend on ICU properly, so that these languages work out
> of the box. We do this in Apache Solr's Tika integration
> (contrib/extraction), for example.
> Unlike the charset detection code from ICU that Tika "includes", including
> BiDi support would be trickier, because it uses data files built from
> Unicode. (These change over time and would be a hassle to maintain.)
> Additionally, as a note: Tika has some forked charset code from ICU. Long
> term, it would be great to get those changes into ICU as well.
> Finally, as an optimization, it is possible to reduce the icu4j jar size if
> needed with http://apps.icu-project.org/datacustom/, but maybe as a start
> we could just depend upon the "whole" ICU?
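
For readers unfamiliar with the reflection-based detection described in the issue, here is a minimal sketch of the general technique. It is not PDFBox's actual code: the class name IcuProbe and its methods are hypothetical, and the only outside name it relies on is com.ibm.icu.text.Bidi, the BiDi class shipped in ICU4J.

```java
// Hypothetical helper, for illustration only; not taken from PDFBox or Tika.
final class IcuProbe {

    // Fully qualified name of ICU4J's BiDi implementation.
    static final String ICU_BIDI_CLASS = "com.ibm.icu.text.Bidi";

    private IcuProbe() {
    }

    // Returns true if the named class can be located on the classpath.
    // The class is looked up by name only, so there is no compile-time
    // dependency on the library that provides it.
    static boolean isClassAvailable(String className) {
        try {
            // 'false' skips static initialization; we only need to know
            // whether the class can be found.
            Class.forName(className, false, IcuProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    // Returns true if ICU4J's BiDi class is on the classpath.
    static boolean isIcuBidiAvailable() {
        return isClassAvailable(ICU_BIDI_CLASS);
    }

    public static void main(String[] args) {
        if (isIcuBidiAvailable()) {
            System.out.println("ICU found: BiDi reordering can be used");
        } else {
            System.out.println("ICU not found: right-to-left text may come out reversed");
        }
    }
}
```

With a hard dependency on icu4j, as the issue proposes, a probe like this would always succeed, so right-to-left extraction would no longer depend on what the user happens to have on the classpath.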



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
