[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15399911#comment-15399911
]
Tim Allison edited comment on TIKA-2038 at 7/29/16 7:48 PM:
------------------------------------------------------------
||Subdirectory||Detected by Tika||Count||Percent||
|GBK| GBK |323| 77.1%|
|GBK| GB2312| 77| |
|GBK| GB18030| 13| |
|GBK| UTF-8| 3| |
|GBK| windows-1252| 3| |
|Shift_JIS| Shift_JIS| 639| 99.8%|
|Shift_JIS| windows-1252| 1| |
|UTF-8| UTF-8| 642| 97.7%|
|UTF-8| ISO-8859-1| 11| |
|UTF-8| windows-1252| 4| |
|Windows-1251| windows-1251| 313| 99.7%|
|Windows-1251| UTF-8| 1| |
|Windows-1256| windows-1256| 597| 92.6%|
|Windows-1256| windows-1252| 24 | |
|Windows-1256| ISO-8859-1| 10 | |
|Windows-1256| UTF-8| 7 | |
|Windows-1256| x-MacCyrillic| 5| |
|Windows-1256| IBM866| 1 | |
|Windows-1256| ISO-8859-5| 1| |
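
The Percent column above appears to be the share of files in each subdirectory whose detected charset matches the subdirectory name (so GB2312 and GB18030 count as misses for GBK). A minimal sketch that reproduces it from the counts in the table; the class and method names are made up for illustration and are not part of Tika:

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class DetectionAccuracy {

    // Percentage of files detected as the expected charset.
    // "misses" holds the counts for every other detected charset.
    static double percentCorrect(int correct, int... misses) {
        int total = correct;
        for (int m : misses) {
            total += m;
        }
        return 100.0 * correct / total;
    }

    public static void main(String[] args) {
        // Counts copied from the table above, one entry per subdirectory.
        Map<String, Double> percentBySubdir = new LinkedHashMap<>();
        percentBySubdir.put("GBK", percentCorrect(323, 77, 13, 3, 3));
        percentBySubdir.put("Shift_JIS", percentCorrect(639, 1));
        percentBySubdir.put("UTF-8", percentCorrect(642, 11, 4));
        percentBySubdir.put("Windows-1251", percentCorrect(313, 1));
        percentBySubdir.put("Windows-1256", percentCorrect(597, 24, 10, 7, 5, 1, 1));

        for (Map.Entry<String, Double> e : percentBySubdir.entrySet()) {
            System.out.printf(Locale.ROOT, "%-13s %.1f%%%n", e.getKey(), e.getValue());
        }
    }
}
```

Rounded to one decimal place, this gives the same figures as the Percent column for all five subdirectories.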
was (Author: [email protected]):
||Subdirectory||Detected by Tika||Count||Percent||
|GBK| GBK |323| 77.1%|
|GBK| GB2312| 77| |
|GBK| GB18030| 13| |
|GBK| UTF-8| 3| |
|GBK| windows-1252| 3| |
|Shift_JIS| Shift_JIS| 639| 99.8%|
|Shift_JIS| windows-1252| 1| |
|UTF-8| UTF-8| 642| 97.7%|
|UTF-8| ISO-8859-1| 11| |
|UTF-8| windows-1252| 4| |
|Windows-1251| windows-1251| 313| 99.7%|
|Windows-1251| UTF-8| 1| |
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j to detect the charset encoding of HTML documents
> as well as other naturally text-based documents. However, the accuracy of
> encoding detection tools, including icu4j, is significantly lower for HTML
> documents than for other text documents. Hence, in our project I developed
> a library that works quite well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as
> Nutch, Lucene, and Solr, and these projects deal heavily with HTML
> documents, having such a facility in Tika should also help them become
> more accurate.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)