[
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706496#comment-16706496
]
Shabanali Faghani edited comment on TIKA-2038 at 12/3/18 6:44 AM:
------------------------------------------------------------------
{quote}UTF-8 currently makes up over 92% of the web.
{quote}
You have a good understanding of the context. I've noticed that you pointed to
it in TIKA-2771, too. Hence, I decided to provide further information about it,
based on the _results_ table in _AGGREGATED-RESULTS.db_ in the _fine-grained_
folder of [^lang-wise-eval_results.zip].
{code:sql}
SELECT
language AS Language, count(1) AS Total_Docs,
round(100.0 * count(CASE WHEN httpcharset = 'UTF-8' THEN 1 END ) / count(1),
2) AS UTF_8,
round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-1' THEN 1 END ) /
count(1), 2) AS ISO_8859_1,
round(100.0 * count(CASE WHEN httpcharset = 'Windows-1256' THEN 1 END ) /
count(1), 2) AS Windows_1256,
round(100.0 * count(CASE WHEN httpcharset = 'Windows-1252' THEN 1 END ) /
count(1), 2) AS Windows_1252,
round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-15' THEN 1 END ) /
count(1), 2) AS ISO_8859_15,
round(100.0 * count(CASE WHEN httpcharset = 'GB2312' THEN 1 END ) / count(1),
2) AS GB2312,
round(100.0 * count(CASE WHEN httpcharset = 'EUC-KR' THEN 1 END ) / count(1),
2) AS EUC_KR,
round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-9' THEN 1 END ) /
count(1), 2) AS ISO_8859_9,
round(100.0 * count(CASE WHEN httpcharset = 'GBK' THEN 1 END ) / count(1), 2)
AS GBK,
round(100.0 * count(CASE WHEN httpcharset = 'GB18030' THEN 1 END ) /
count(1), 2) AS GB18030,
round(100.0 * count(CASE WHEN httpcharset = 'EUC-JP' THEN 1 END ) / count(1),
2) AS EUC_JP,
round(100.0 * count(CASE WHEN httpcharset = 'Shift_JIS' THEN 1 END ) /
count(1), 2) AS Shift_JIS,
round(100.0 * count(CASE WHEN httpcharset = 'ISO-2022-JP' THEN 1 END ) /
count(1), 2) AS ISO_2022_JP,
round(100.0 * count(CASE WHEN httpcharset = 'US-ASCII' THEN 1 END ) /
count(1), 2) AS US_ASCII,
round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-2' THEN 1 END ) /
count(1), 2) AS ISO_8859_2,
round(100.0 * count(CASE WHEN httpcharset = 'Windows-1251' THEN 1 END ) /
count(1), 2) AS Windows_1251,
round(100.0 * count(CASE WHEN httpcharset = 'KOI8-R' THEN 1 END ) / count(1),
2) AS KOI8_R
FROM RESULTS
GROUP BY language
UNION ALL
SELECT
'X-ALL', count(1),
round(100.0 * count(CASE WHEN httpcharset = 'UTF-8' THEN 1 END ) / count(1),
2) AS UTF_8,
round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-1' THEN 1 END ) /
count(1), 2) AS ISO_8859_1,
round(100.0 * count(CASE WHEN httpcharset = 'Windows-1256' THEN 1 END ) /
count(1), 2) AS Windows_1256,
round(100.0 * count(CASE WHEN httpcharset = 'Windows-1252' THEN 1 END ) /
count(1), 2) AS Windows_1252,
round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-15' THEN 1 END ) /
count(1), 2) AS ISO_8859_15,
round(100.0 * count(CASE WHEN httpcharset = 'GB2312' THEN 1 END ) / count(1),
2) AS GB2312,
round(100.0 * count(CASE WHEN httpcharset = 'EUC-KR' THEN 1 END ) / count(1),
2) AS EUC_KR,
round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-9' THEN 1 END ) /
count(1), 2) AS ISO_8859_9,
round(100.0 * count(CASE WHEN httpcharset = 'GBK' THEN 1 END ) / count(1), 2)
AS GBK,
round(100.0 * count(CASE WHEN httpcharset = 'GB18030' THEN 1 END ) /
count(1), 2) AS GB18030,
round(100.0 * count(CASE WHEN httpcharset = 'EUC-JP' THEN 1 END ) / count(1),
2) AS EUC_JP,
round(100.0 * count(CASE WHEN httpcharset = 'Shift_JIS' THEN 1 END ) /
count(1), 2) AS Shift_JIS,
round(100.0 * count(CASE WHEN httpcharset = 'ISO-2022-JP' THEN 1 END ) /
count(1), 2) AS ISO_2022_JP,
round(100.0 * count(CASE WHEN httpcharset = 'US-ASCII' THEN 1 END ) /
count(1), 2) AS US_ASCII,
round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-2' THEN 1 END ) /
count(1), 2) AS ISO_8859_2,
round(100.0 * count(CASE WHEN httpcharset = 'Windows-1251' THEN 1 END ) /
count(1), 2) AS Windows_1251,
round(100.0 * count(CASE WHEN httpcharset = 'KOI8-R' THEN 1 END ) / count(1),
2) AS KOI8_R
FROM RESULTS;
-- without a PIVOT operator, rows-to-columns queries like this tend to become verbose!
{code}
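Since the column list above is fully mechanical, one way to keep it maintainable is to generate it. A minimal sketch in Python against an in-memory SQLite table of the same shape as _results_ (the tiny sample rows here are made up for illustration, not taken from the attachment):

```python
import sqlite3

# Charsets whose share we want as pivot columns (same pattern as the query above).
CHARSETS = ["UTF-8", "ISO-8859-1", "Windows-1256", "Windows-1252"]

def pivot_query(charsets):
    # Build one percentage column per charset; 100.0 forces float division.
    cols = ",\n  ".join(
        "round(100.0 * count(CASE WHEN httpcharset = '{0}' THEN 1 END)"
        " / count(1), 2) AS \"{0}\"".format(c)
        for c in charsets
    )
    return ("SELECT language, count(1) AS total_docs,\n  " + cols +
            "\nFROM results GROUP BY language")

# Demo on a toy table with the same two columns as RESULTS.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (language TEXT, httpcharset TEXT)")
conn.executemany("INSERT INTO results VALUES (?, ?)",
                 [("Arabic", "UTF-8"), ("Arabic", "UTF-8"),
                  ("Arabic", "Windows-1256"), ("Arabic", "ISO-8859-1")])
for row in conn.execute(pivot_query(CHARSETS)):
    print(row)  # ('Arabic', 4, 50.0, 25.0, 25.0, 0.0)
```

The same string-building approach extends to the UNION ALL totals row.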
||Language||Total_Docs||UTF_8||ISO_8859_1||Windows_1256||Windows_1252||ISO_8859_15||GB2312||EUC_KR||ISO_8859_9||GBK||GB18030||EUC_JP||Shift_JIS||ISO_2022_JP||US_ASCII||ISO_8859_2||Windows_1251||KOI8_R||
||Arabic|1168|95.21|1.11|3|0.51|0.09|0.09|0|0|0|0|0|0|0|0|0|0|0|
||Chinese|3860|85.8|0.36|0|0|0|6.97|0|0|6.79|0.05|0.03|0|0|0|0|0|0|
||English|13582|96.43|3.38|0|0.09|0.07|0|0|0|0|0|0|0|0|0.01|0.01|0.01|0|
||French|8712|93.63|4.94|0|0.61|0.83|0|0|0|0|0|0|0|0|0|0|0|0|
||German|24884|91.24|7.1|0|0.23|1.39|0|0|0.01|0|0|0|0|0|0|0.02|0.01|0|
||Indian|6158|97.84|1.92|0|0|0.03|0|0.02|0.02|0|0|0|0.02|0|0|0|0.16|0|
||Italian|7371|94.44|4.95|0|0.34|0.26|0|0|0|0|0|0|0|0|0.01|0|0|0|
||Japanese|7736|89.88|0.28|0|0|0|0|0.03|0|0|0|3.94|5.82|0.04|0.01|0|0|0|
||Korean|735|78.64|0.82|0|0|0|0|20.54|0|0|0|0|0|0|0|0|0|0|
||Persian|4003|99.7|0.22|0.07|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
||Russian|28229|80.3|0.06|0|0|0|0|0|0|0|0|0|0|0|0|0|19.41|0.23|
||Spanish|6294|93.68|5.72|0|0.13|0.44|0|0|0|0|0|0|0|0|0.02|0.02|0|0|
||Turkish|1561|95.45|0.19|0|0|0|0|0|4.36|0|0|0|0|0|0|0|0|0|
||Vietnamese|1463|99.66|0.27|0|0.07|0|0|0|0|0|0|0|0|0|0|0|0|0|
||X-ALL|115756|90.19|3.1|0.03|0.14|0.41|0.23|0.13|0.06|0.23|0|0.26|0.39|0|0.01|0.01|4.75|0.06|
This table might be useful for deciding whether _stripping_ should be the
default once we make it configurable.
{quote}This probably wasn't the case even 5 years ago
{quote}
I disagree, if this implies that IUST wouldn't outperform the other algorithms
without widespread use of UTF-8: even if UTF-8's share were only 50%, IUST would
still come out ahead. After all, I developed it 6 years ago! :)
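For what it's worth, part of why a high UTF-8 share is easy to exploit is that valid UTF-8 has a rigid byte structure, so a strict decode is an almost-certain confirmation. A generic illustration, not code from IUST-HTMLCharDet (`looks_like_utf8` is a hypothetical helper):

```python
def looks_like_utf8(data: bytes) -> bool:
    """Strict UTF-8 validation: almost no non-UTF-8 text passes by accident."""
    try:
        data.decode("utf-8", errors="strict")
        return True
    except UnicodeDecodeError:
        return False

print(looks_like_utf8("سلام".encode("utf-8")))   # True
print(looks_like_utf8("café".encode("latin-1"))) # False: lone 0xE9 is malformed UTF-8
print(looks_like_utf8(b"pure ASCII"))            # True, though ambiguous: ASCII is a UTF-8 subset
```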
was (Author: faghani):
{quote}UTF-8 currently makes up over 92% of the web.
{quote}
You have a good understanding of the context. I've noticed that you pointed to
it in TIKA-2771, too. Hence, I decided to provide further information about it,
based on the _results_ table in _AGGREGATED-RESULTS.db_ in the _fine-grained_
folder of [^lang-wise-eval_results.zip].
{code:sql}
WITH X AS (SELECT
language,
count(1) AS total_docs
FROM RESULTS
GROUP BY language)
SELECT
R.language AS Language, count(1) AS Total_Docs,
round(100 * count(CASE WHEN httpcharset = 'UTF-8' THEN 1 END ) /
X.total_docs, 2) AS UTF_8,
round(100 * count(CASE WHEN httpcharset = 'ISO-8859-1' THEN 1 END ) /
X.total_docs, 2) AS ISO_8859_1,
round(100 * count(CASE WHEN httpcharset = 'Windows-1256' THEN 1 END ) /
X.total_docs, 2) AS Windows_1256,
round(100 * count(CASE WHEN httpcharset = 'Windows-1252' THEN 1 END ) /
X.total_docs, 2) AS Windows_1252,
round(100 * count(CASE WHEN httpcharset = 'ISO-8859-15' THEN 1 END ) /
X.total_docs, 2) AS ISO_8859_15,
round(100 * count(CASE WHEN httpcharset = 'GB2312' THEN 1 END ) /
X.total_docs, 2) AS GB2312,
round(100 * count(CASE WHEN httpcharset = 'EUC-KR' THEN 1 END ) /
X.total_docs, 2) AS EUC_KR,
round(100 * count(CASE WHEN httpcharset = 'ISO-8859-9' THEN 1 END ) /
X.total_docs, 2) AS ISO_8859_9,
round(100 * count(CASE WHEN httpcharset = 'GBK' THEN 1 END ) / X.total_docs,
2) AS GBK,
round(100 * count(CASE WHEN httpcharset = 'GB18030' THEN 1 END ) /
X.total_docs, 2) AS GB18030,
round(100 * count(CASE WHEN httpcharset = 'EUC-JP' THEN 1 END ) /
X.total_docs, 2) AS EUC_JP,
round(100 * count(CASE WHEN httpcharset = 'Shift_JIS' THEN 1 END ) /
X.total_docs, 2) AS Shift_JIS,
round(100 * count(CASE WHEN httpcharset = 'ISO-2022-JP' THEN 1 END ) /
X.total_docs, 2) AS ISO_2022_JP,
round(100 * count(CASE WHEN httpcharset = 'US-ASCII' THEN 1 END ) /
X.total_docs, 2) AS US_ASCII,
round(100 * count(CASE WHEN httpcharset = 'ISO-8859-2' THEN 1 END ) /
X.total_docs, 2) AS ISO_8859_2,
round(100 * count(CASE WHEN httpcharset = 'Windows-1251' THEN 1 END ) /
X.total_docs, 2) AS Windows_1251,
round(100 * count(CASE WHEN httpcharset = 'KOI8-R' THEN 1 END ) /
X.total_docs, 2) AS KOI8_R
FROM RESULTS R
INNER JOIN X ON X.language = R.language
GROUP BY R.language
UNION ALL
SELECT
'X-ALL', count(1),
round(100.0 * count(CASE WHEN httpcharset = 'UTF-8' THEN 1 END ) / count(1),
2) AS UTF_8,
round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-1' THEN 1 END ) /
count(1), 2) AS ISO_8859_1,
round(100.0 * count(CASE WHEN httpcharset = 'Windows-1256' THEN 1 END ) /
count(1), 2) AS Windows_1256,
round(100.0 * count(CASE WHEN httpcharset = 'Windows-1252' THEN 1 END ) /
count(1), 2) AS Windows_1252,
round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-15' THEN 1 END ) /
count(1), 2) AS ISO_8859_15,
round(100.0 * count(CASE WHEN httpcharset = 'GB2312' THEN 1 END ) / count(1),
2) AS GB2312,
round(100.0 * count(CASE WHEN httpcharset = 'EUC-KR' THEN 1 END ) / count(1),
2) AS EUC_KR,
round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-9' THEN 1 END ) /
count(1), 2) AS ISO_8859_9,
round(100.0 * count(CASE WHEN httpcharset = 'GBK' THEN 1 END ) / count(1), 2)
AS GBK,
round(100.0 * count(CASE WHEN httpcharset = 'GB18030' THEN 1 END ) /
count(1), 2) AS GB18030,
round(100.0 * count(CASE WHEN httpcharset = 'EUC-JP' THEN 1 END ) / count(1),
2) AS EUC_JP,
round(100.0 * count(CASE WHEN httpcharset = 'Shift_JIS' THEN 1 END ) /
count(1), 2) AS Shift_JIS,
round(100.0 * count(CASE WHEN httpcharset = 'ISO-2022-JP' THEN 1 END ) /
count(1), 2) AS ISO_2022_JP,
round(100.0 * count(CASE WHEN httpcharset = 'US-ASCII' THEN 1 END ) /
count(1), 2) AS US_ASCII,
round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-2' THEN 1 END ) /
count(1), 2) AS ISO_8859_2,
round(100.0 * count(CASE WHEN httpcharset = 'Windows-1251' THEN 1 END ) /
count(1), 2) AS Windows_1251,
round(100.0 * count(CASE WHEN httpcharset = 'KOI8-R' THEN 1 END ) / count(1),
2) AS KOI8_R
FROM RESULTS;
-- without a PIVOT operator, rows-to-columns queries like this tend to become verbose!
{code}
||Language||Total_Docs||UTF_8||ISO_8859_1||Windows_1256||Windows_1252||ISO_8859_15||GB2312||EUC_KR||ISO_8859_9||GBK||GB18030||EUC_JP||Shift_JIS||ISO_2022_JP||US_ASCII||ISO_8859_2||Windows_1251||KOI8_R||
||Arabic|1168|95|1|2|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
||Chinese|3860|85|0|0|0|0|6|0|0|6|0|0|0|0|0|0|0|0|
||English|13582|96|3|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
||French|8712|93|4|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
||German|24884|91|7|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|
||Indian|6158|97|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
||Italian|7371|94|4|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
||Japanese|7736|89|0|0|0|0|0|0|0|0|0|3|5|0|0|0|0|0|
||Korean|735|78|0|0|0|0|0|20|0|0|0|0|0|0|0|0|0|0|
||Persian|4003|99|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
||Russian|28229|80|0|0|0|0|0|0|0|0|0|0|0|0|0|0|19|0|
||Spanish|6294|93|5|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
||Turkish|1561|95|0|0|0|0|0|0|4|0|0|0|0|0|0|0|0|0|
||Vietnamese|1463|99|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
||X-ALL|115756|90.19|3.1|0.03|0.14|0.41|0.23|0.13|0.06|0.23|0|0.26|0.39|0|0.01|0.01|4.75|0.06|
This table might be useful for deciding whether _stripping_ should be the
default once we make it configurable.
{quote}This probably wasn't the case even 5 years ago
{quote}
I disagree, if this implies that IUST wouldn't outperform the other algorithms
without widespread use of UTF-8: even if UTF-8's share were only 50%, IUST would
still come out ahead. After all, I developed it 6 years ago! :)
> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
> Key: TIKA-2038
> URL: https://issues.apache.org/jira/browse/TIKA-2038
> Project: Tika
> Issue Type: Improvement
> Components: core, detector
> Reporter: Shabanali Faghani
> Priority: Minor
> Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx,
> iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip,
> lang-wise-eval_source_code.zip, proposedTLDSampling.csv,
> tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx,
> tld_text_html_plus_H_column.xlsx
>
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents
> as well as of other plain-text documents. But the accuracy of encoding
> detection tools, including icu4j, on HTML documents is meaningfully lower than
> on other text documents. Hence, in our project I developed a library that works
> pretty well for HTML documents, which is available here:
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch,
> Lucene, Solr, etc., and those projects deal heavily with HTML documents, it
> seems that having such a facility in Tika would help them become more accurate
> as well.
--
This message was sent by Atlassian JIRA
(v7.6.3#76005)