[jira] [Resolved] (TIKA-322) Improve encoding detection speed and accuracy

Jukka Zitting (JIRA) Sat, 07 Jul 2012 12:46:36 -0700

     [ https://issues.apache.org/jira/browse/TIKA-322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Jukka Zitting resolved TIKA-322.
--------------------------------

       Resolution: Fixed
    Fix Version/s: 1.2
         Assignee: Jukka Zitting

I integrated the juniversalchardet library into TXTParser in revision 1358624. 
Encoding detection still falls back to the ICU4J code when juniversalchardet 
cannot determine the character encoding, so the risk of regressions should be 
pretty low.
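
For anyone who wants to picture the arrangement, here is a minimal sketch of 
a "try the primary detector, then fall back" chain. It is not the code from 
the revision above. Every name in it (FallbackDetection, Guesser, 
BOM_GUESSER, LAST_RESORT, detect) is hypothetical, and the two guessers are 
simple stand-ins rather than juniversalchardet or ICU4J calls, so that the 
example runs on the JDK alone.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: none of these names are Tika, juniversalchardet,
// or ICU4J APIs. They only illustrate the fallback ordering.
class FallbackDetection {

    /** A guesser returns a charset, or empty when it cannot decide. */
    interface Guesser {
        Optional<Charset> guess(byte[] data);
    }

    /**
     * Stand-in for the primary detector: it only recognizes the UTF-8 and
     * UTF-16 byte order marks (UTF-32 is ignored) and gives up otherwise.
     */
    static final Guesser BOM_GUESSER = data -> {
        if (data.length >= 3 && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB && (data[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFE
                && (data[1] & 0xFF) == 0xFF) {
            return Optional.of(StandardCharsets.UTF_16BE);
        }
        if (data.length >= 2 && (data[0] & 0xFF) == 0xFF
                && (data[1] & 0xFF) == 0xFE) {
            return Optional.of(StandardCharsets.UTF_16LE);
        }
        return Optional.empty();
    };

    /**
     * Stand-in for the fallback detector: pure 7-bit input is reported as
     * US-ASCII, anything else is guessed to be ISO-8859-1.
     */
    static final Guesser LAST_RESORT = data -> {
        for (byte b : data) {
            if (b < 0) { // high bit set, so not 7-bit ASCII
                return Optional.of(StandardCharsets.ISO_8859_1);
            }
        }
        return Optional.of(StandardCharsets.US_ASCII);
    };

    /** Asks each guesser in order and returns the first definite answer. */
    static Charset detect(byte[] data, List<Guesser> chain, Charset fallback) {
        for (Guesser guesser : chain) {
            Optional<Charset> result = guesser.guess(data);
            if (result.isPresent()) {
                return result.get();
            }
        }
        return fallback; // no guesser could decide
    }

    public static void main(String[] args) {
        List<Guesser> chain = Arrays.asList(BOM_GUESSER, LAST_RESORT);
        byte[] plain = "plain text".getBytes(StandardCharsets.US_ASCII);
        System.out.println(detect(plain, chain, StandardCharsets.UTF_8));
    }
}
```

Because the older detector stays last in the chain, input that the new 
detector cannot classify is handled exactly as before, which is why this 
ordering keeps the regression risk low.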
                
> Improve encoding detection speed and accuracy
> ---------------------------------------------
>
>                 Key: TIKA-322
>                 URL: https://issues.apache.org/jira/browse/TIKA-322
>             Project: Tika
>          Issue Type: Improvement
>          Components: mime
>            Reporter: Jukka Zitting
>            Assignee: Jukka Zitting
>            Priority: Minor
>             Fix For: 1.2
>
>
> The encoding detection code we took from ICU4J is not very efficient and 
> sometimes produces odd results when more than one encoding matches the given 
> input data. It would be good to refactor the code to be faster for 
> easy-to-detect encodings and to have better heuristics in case multiple 
> matches are found.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
