[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706496#comment-16706496 ]
Shabanali Faghani commented on TIKA-2038:
-----------------------------------------

{quote}UTF-8 currently makes up over 92% of the web.{quote}
You have a good understanding of the context; I noticed that you pointed to this in TIKA-2771 as well. So I decided to provide some further information about it, based on the _results_ table in _AGGREGATED-RESULTS.db_ (in the _fine-grained_ folder of [^lang-wise-eval_results.zip]):
{code:sql}
WITH X AS (
  SELECT language, count(1) AS total_docs
  FROM RESULTS
  GROUP BY language
)
-- Per-language breakdown. Note: the integer arithmetic here (100 * ... / X.total_docs)
-- truncates these percentages to whole numbers; the X-ALL branch below uses 100.0
-- and therefore keeps two decimals.
SELECT R.language AS Language,
       count(1) AS Total_Docs,
       round(100 * count(CASE WHEN httpcharset = 'UTF-8'        THEN 1 END) / X.total_docs, 2) AS UTF_8,
       round(100 * count(CASE WHEN httpcharset = 'ISO-8859-1'   THEN 1 END) / X.total_docs, 2) AS ISO_8859_1,
       round(100 * count(CASE WHEN httpcharset = 'Windows-1256' THEN 1 END) / X.total_docs, 2) AS Windows_1256,
       round(100 * count(CASE WHEN httpcharset = 'Windows-1252' THEN 1 END) / X.total_docs, 2) AS Windows_1252,
       round(100 * count(CASE WHEN httpcharset = 'ISO-8859-15'  THEN 1 END) / X.total_docs, 2) AS ISO_8859_15,
       round(100 * count(CASE WHEN httpcharset = 'GB2312'       THEN 1 END) / X.total_docs, 2) AS GB2312,
       round(100 * count(CASE WHEN httpcharset = 'EUC-KR'       THEN 1 END) / X.total_docs, 2) AS EUC_KR,
       round(100 * count(CASE WHEN httpcharset = 'ISO-8859-9'   THEN 1 END) / X.total_docs, 2) AS ISO_8859_9,
       round(100 * count(CASE WHEN httpcharset = 'GBK'          THEN 1 END) / X.total_docs, 2) AS GBK,
       round(100 * count(CASE WHEN httpcharset = 'GB18030'      THEN 1 END) / X.total_docs, 2) AS GB18030,
       round(100 * count(CASE WHEN httpcharset = 'EUC-JP'       THEN 1 END) / X.total_docs, 2) AS EUC_JP,
       round(100 * count(CASE WHEN httpcharset = 'Shift_JIS'    THEN 1 END) / X.total_docs, 2) AS Shift_JIS,
       round(100 * count(CASE WHEN httpcharset = 'ISO-2022-JP'  THEN 1 END) / X.total_docs, 2) AS ISO_2022_JP,
       round(100 * count(CASE WHEN httpcharset = 'US-ASCII'     THEN 1 END) / X.total_docs, 2) AS US_ASCII,
       round(100 * count(CASE WHEN httpcharset = 'ISO-8859-2'   THEN 1 END) / X.total_docs, 2) AS ISO_8859_2,
       round(100 * count(CASE WHEN httpcharset = 'Windows-1251' THEN 1 END) / X.total_docs, 2) AS Windows_1251,
       round(100 * count(CASE WHEN httpcharset = 'KOI8-R'       THEN 1 END) / X.total_docs, 2) AS KOI8_R
FROM RESULTS R
INNER JOIN X ON X.language = R.language
GROUP BY R.language
UNION ALL
-- Overall row across all languages
SELECT 'X-ALL',
       count(1),
       round(100.0 * count(CASE WHEN httpcharset = 'UTF-8'        THEN 1 END) / count(1), 2) AS UTF_8,
       round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-1'   THEN 1 END) / count(1), 2) AS ISO_8859_1,
       round(100.0 * count(CASE WHEN httpcharset = 'Windows-1256' THEN 1 END) / count(1), 2) AS Windows_1256,
       round(100.0 * count(CASE WHEN httpcharset = 'Windows-1252' THEN 1 END) / count(1), 2) AS Windows_1252,
       round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-15'  THEN 1 END) / count(1), 2) AS ISO_8859_15,
       round(100.0 * count(CASE WHEN httpcharset = 'GB2312'       THEN 1 END) / count(1), 2) AS GB2312,
       round(100.0 * count(CASE WHEN httpcharset = 'EUC-KR'       THEN 1 END) / count(1), 2) AS EUC_KR,
       round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-9'   THEN 1 END) / count(1), 2) AS ISO_8859_9,
       round(100.0 * count(CASE WHEN httpcharset = 'GBK'          THEN 1 END) / count(1), 2) AS GBK,
       round(100.0 * count(CASE WHEN httpcharset = 'GB18030'      THEN 1 END) / count(1), 2) AS GB18030,
       round(100.0 * count(CASE WHEN httpcharset = 'EUC-JP'       THEN 1 END) / count(1), 2) AS EUC_JP,
       round(100.0 * count(CASE WHEN httpcharset = 'Shift_JIS'    THEN 1 END) / count(1), 2) AS Shift_JIS,
       round(100.0 * count(CASE WHEN httpcharset = 'ISO-2022-JP'  THEN 1 END) / count(1), 2) AS ISO_2022_JP,
       round(100.0 * count(CASE WHEN httpcharset = 'US-ASCII'     THEN 1 END) / count(1), 2) AS US_ASCII,
       round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-2'   THEN 1 END) / count(1), 2) AS ISO_8859_2,
       round(100.0 * count(CASE WHEN httpcharset = 'Windows-1251' THEN 1 END) / count(1), 2) AS Windows_1251,
       round(100.0 * count(CASE WHEN httpcharset = 'KOI8-R'       THEN 1 END) / count(1), 2) AS KOI8_R
FROM RESULTS;
-- Without a PIVOT operator, rows-to-columns queries tend to become verbose!
-- (A compact unpivoted alternative is sketched after the results table below.)
{code}
||Language||Total_Docs||UTF_8||ISO_8859_1||Windows_1256||Windows_1252||ISO_8859_15||GB2312||EUC_KR||ISO_8859_9||GBK||GB18030||EUC_JP||Shift_JIS||ISO_2022_JP||US_ASCII||ISO_8859_2||Windows_1251||KOI8_R||
|Arabic|1168|95|1|2|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|Chinese|3860|85|0|0|0|0|6|0|0|6|0|0|0|0|0|0|0|0|
|English|13582|96|3|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|French|8712|93|4|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|German|24884|91|7|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|
|Indian|6158|97|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|Italian|7371|94|4|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|Japanese|7736|89|0|0|0|0|0|0|0|0|0|3|5|0|0|0|0|0|
|Korean|735|78|0|0|0|0|0|20|0|0|0|0|0|0|0|0|0|0|
|Persian|4003|99|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|Russian|28229|80|0|0|0|0|0|0|0|0|0|0|0|0|0|0|19|0|
|Spanish|6294|93|5|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|Turkish|1561|95|0|0|0|0|0|0|4|0|0|0|0|0|0|0|0|0|
|Vietnamese|1463|99|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|X-ALL|115756|90.19|3.1|0.03|0.14|0.41|0.23|0.13|0.06|0.23|0|0.26|0.39|0|0.01|0.01|4.75|0.06|

This table might be useful when deciding whether _stripping_ should be the default once we make it configurable.
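On the verbosity point flagged at the end of the query: a minimal unpivoted sketch over the same _results_ table yields the same percentages as one row per (language, charset) pair instead of one column per charset. It assumes the database is opened with a SQLite build that supports window functions (3.25+); beyond that it relies only on the _language_ and _httpcharset_ columns already used above.
{code:sql}
-- Compact, unpivoted alternative: one row per (language, charset) pair.
-- Assumes SQLite 3.25+ for window functions; uses only the language and
-- httpcharset columns referenced by the pivot-style query above.
SELECT language,
       httpcharset,
       count(1) AS docs,
       round(100.0 * count(1) / sum(count(1)) OVER (PARTITION BY language), 2) AS pct
FROM RESULTS
GROUP BY language, httpcharset
ORDER BY language, docs DESC;
{code}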
{quote}This probably wasn't the case even 5 years ago{quote}
I don't agree, if this implies that IUST wouldn't outperform the other algorithms without the widespread use of UTF-8: even if UTF-8's share were only 50%, IUST would still outperform them. I developed it 6 years ago! :)

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, lang-wise-eval_source_code.zip, proposedTLDSampling.csv, tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx, tld_text_html_plus_H_column.xlsx
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents as well as other text documents. But the accuracy of encoding detection tools, including icu4j, on HTML documents is meaningfully lower than on other text documents. Hence, in our project I developed a library that works pretty well for HTML documents, which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML documents, it seems that having such a facility in Tika would help them become more accurate as well.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)