[ https://issues.apache.org/jira/browse/TIKA-765?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Tyler Palsulich closed TIKA-765.
--------------------------------
Resolution: Won't Fix
Closing as Won't Fix since the Persian character issues seem to be solved.
> add icu dependency
> ------------------
>
> Key: TIKA-765
> URL: https://issues.apache.org/jira/browse/TIKA-765
> Project: Tika
> Issue Type: Improvement
> Components: general
> Affects Versions: 0.10
> Reporter: Robert Muir
>
> Spinoff of TIKA-713.
> In PDFBox, reflection is used to detect whether ICU is available on the classpath:
> if it is, PDFBox can use ICU's BiDi support
> to properly extract right-to-left text. Otherwise, the text is returned
> "backwards". This is because the JDK does not
> provide the functionality needed to do this inverse BiDi reordering /
> Arabic unshaping.
> It would be nice to properly depend on this, so that these languages work out
> of the box... we do this in Apache Solr's
> Tika integration (contrib/extraction), for example.
> Unlike the charset detection code from ICU that Tika "includes", including
> BiDi support would be trickier, because it uses
> data files built from Unicode (these change over time and would be a hassle to
> maintain).
> Additionally, as a note: Tika has some forked charset code from ICU... long
> term it would be great to get those changes
> into ICU as well.
> Finally, as an optimization, it's possible to reduce the icu4j jar size if
> needed with http://apps.icu-project.org/datacustom/,
> but maybe as a start we could just depend upon the 'whole' ICU?
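
For readers unfamiliar with the pattern, the reflection-based detection described in the issue can be sketched roughly as below. This is only an illustration, not PDFBox's actual code: the class name `IcuProbe` and the method `isClassAvailable` are hypothetical, and probing for `com.ibm.icu.text.Bidi` (ICU4J's BiDi class) is an assumption about which class such a check would look for.

```java
// Hypothetical helper, not taken from PDFBox or Tika.
class IcuProbe {
    // ICU4J's BiDi implementation; assumed here to be the class worth probing for.
    static final String ICU_BIDI_CLASS = "com.ibm.icu.text.Bidi";

    /**
     * Returns true if the named class can be loaded from the classpath.
     * The class is looked up without being initialized, so the probe
     * has no side effects beyond class loading.
     */
    static boolean isClassAvailable(String className) {
        try {
            Class.forName(className, false, IcuProbe.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            // Missing jar or a broken/incompatible one: treat both as "not available".
            return false;
        }
    }

    public static void main(String[] args) {
        if (isClassAvailable(ICU_BIDI_CLASS)) {
            System.out.println("ICU found: BiDi reordering can be delegated to ICU.");
        } else {
            System.out.println("ICU not found: right-to-left text may come out reversed.");
        }
    }
}
```

With a hard dependency on icu4j, as the issue proposes, the first branch would always be taken and the fallback path would no longer matter in practice.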