[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16706496#comment-16706496 ]
Shabanali Faghani commented on TIKA-2038:
-----------------------------------------

{quote}UTF-8 currently makes up over 92% of the web.{quote}
You have a good understanding of the context; I noticed that you pointed to this in TIKA-2771 as well. So I decided to provide some further information about it, based on the _results_ table in _AGGREGATED-RESULTS.db_ (in the _fine-grained_ folder of [^lang-wise-eval_results.zip]):
{code:sql}
WITH X AS (
  SELECT language, count(1) AS total_docs
  FROM RESULTS
  GROUP BY language
)
-- Per-language breakdown. Note: the integer arithmetic here (100 * ... / X.total_docs)
-- truncates these percentages to whole numbers; the X-ALL branch below uses 100.0
-- and therefore keeps two decimals.
SELECT R.language AS Language,
       count(1) AS Total_Docs,
       round(100 * count(CASE WHEN httpcharset = 'UTF-8'        THEN 1 END) / X.total_docs, 2) AS UTF_8,
       round(100 * count(CASE WHEN httpcharset = 'ISO-8859-1'   THEN 1 END) / X.total_docs, 2) AS ISO_8859_1,
       round(100 * count(CASE WHEN httpcharset = 'Windows-1256' THEN 1 END) / X.total_docs, 2) AS Windows_1256,
       round(100 * count(CASE WHEN httpcharset = 'Windows-1252' THEN 1 END) / X.total_docs, 2) AS Windows_1252,
       round(100 * count(CASE WHEN httpcharset = 'ISO-8859-15'  THEN 1 END) / X.total_docs, 2) AS ISO_8859_15,
       round(100 * count(CASE WHEN httpcharset = 'GB2312'       THEN 1 END) / X.total_docs, 2) AS GB2312,
       round(100 * count(CASE WHEN httpcharset = 'EUC-KR'       THEN 1 END) / X.total_docs, 2) AS EUC_KR,
       round(100 * count(CASE WHEN httpcharset = 'ISO-8859-9'   THEN 1 END) / X.total_docs, 2) AS ISO_8859_9,
       round(100 * count(CASE WHEN httpcharset = 'GBK'          THEN 1 END) / X.total_docs, 2) AS GBK,
       round(100 * count(CASE WHEN httpcharset = 'GB18030'      THEN 1 END) / X.total_docs, 2) AS GB18030,
       round(100 * count(CASE WHEN httpcharset = 'EUC-JP'       THEN 1 END) / X.total_docs, 2) AS EUC_JP,
       round(100 * count(CASE WHEN httpcharset = 'Shift_JIS'    THEN 1 END) / X.total_docs, 2) AS Shift_JIS,
       round(100 * count(CASE WHEN httpcharset = 'ISO-2022-JP'  THEN 1 END) / X.total_docs, 2) AS ISO_2022_JP,
       round(100 * count(CASE WHEN httpcharset = 'US-ASCII'     THEN 1 END) / X.total_docs, 2) AS US_ASCII,
       round(100 * count(CASE WHEN httpcharset = 'ISO-8859-2'   THEN 1 END) / X.total_docs, 2) AS ISO_8859_2,
       round(100 * count(CASE WHEN httpcharset = 'Windows-1251' THEN 1 END) / X.total_docs, 2) AS Windows_1251,
       round(100 * count(CASE WHEN httpcharset = 'KOI8-R'       THEN 1 END) / X.total_docs, 2) AS KOI8_R
FROM RESULTS R
INNER JOIN X ON X.language = R.language
GROUP BY R.language
UNION ALL
-- Overall row across all languages
SELECT 'X-ALL',
       count(1),
       round(100.0 * count(CASE WHEN httpcharset = 'UTF-8'        THEN 1 END) / count(1), 2) AS UTF_8,
       round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-1'   THEN 1 END) / count(1), 2) AS ISO_8859_1,
       round(100.0 * count(CASE WHEN httpcharset = 'Windows-1256' THEN 1 END) / count(1), 2) AS Windows_1256,
       round(100.0 * count(CASE WHEN httpcharset = 'Windows-1252' THEN 1 END) / count(1), 2) AS Windows_1252,
       round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-15'  THEN 1 END) / count(1), 2) AS ISO_8859_15,
       round(100.0 * count(CASE WHEN httpcharset = 'GB2312'       THEN 1 END) / count(1), 2) AS GB2312,
       round(100.0 * count(CASE WHEN httpcharset = 'EUC-KR'       THEN 1 END) / count(1), 2) AS EUC_KR,
       round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-9'   THEN 1 END) / count(1), 2) AS ISO_8859_9,
       round(100.0 * count(CASE WHEN httpcharset = 'GBK'          THEN 1 END) / count(1), 2) AS GBK,
       round(100.0 * count(CASE WHEN httpcharset = 'GB18030'      THEN 1 END) / count(1), 2) AS GB18030,
       round(100.0 * count(CASE WHEN httpcharset = 'EUC-JP'       THEN 1 END) / count(1), 2) AS EUC_JP,
       round(100.0 * count(CASE WHEN httpcharset = 'Shift_JIS'    THEN 1 END) / count(1), 2) AS Shift_JIS,
       round(100.0 * count(CASE WHEN httpcharset = 'ISO-2022-JP'  THEN 1 END) / count(1), 2) AS ISO_2022_JP,
       round(100.0 * count(CASE WHEN httpcharset = 'US-ASCII'     THEN 1 END) / count(1), 2) AS US_ASCII,
       round(100.0 * count(CASE WHEN httpcharset = 'ISO-8859-2'   THEN 1 END) / count(1), 2) AS ISO_8859_2,
       round(100.0 * count(CASE WHEN httpcharset = 'Windows-1251' THEN 1 END) / count(1), 2) AS Windows_1251,
       round(100.0 * count(CASE WHEN httpcharset = 'KOI8-R'       THEN 1 END) / count(1), 2) AS KOI8_R
FROM RESULTS;
-- Without a PIVOT operator, rows-to-columns queries tend to become verbose!
-- (A compact unpivoted alternative is sketched after the results table below.)
{code}
||Language||Total_Docs||UTF_8||ISO_8859_1||Windows_1256||Windows_1252||ISO_8859_15||GB2312||EUC_KR||ISO_8859_9||GBK||GB18030||EUC_JP||Shift_JIS||ISO_2022_JP||US_ASCII||ISO_8859_2||Windows_1251||KOI8_R||
|Arabic|1168|95|1|2|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|Chinese|3860|85|0|0|0|0|6|0|0|6|0|0|0|0|0|0|0|0|
|English|13582|96|3|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|French|8712|93|4|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|German|24884|91|7|0|0|1|0|0|0|0|0|0|0|0|0|0|0|0|
|Indian|6158|97|1|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|Italian|7371|94|4|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|Japanese|7736|89|0|0|0|0|0|0|0|0|0|3|5|0|0|0|0|0|
|Korean|735|78|0|0|0|0|0|20|0|0|0|0|0|0|0|0|0|0|
|Persian|4003|99|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|Russian|28229|80|0|0|0|0|0|0|0|0|0|0|0|0|0|0|19|0|
|Spanish|6294|93|5|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|Turkish|1561|95|0|0|0|0|0|0|4|0|0|0|0|0|0|0|0|0|
|Vietnamese|1463|99|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|0|
|X-ALL|115756|90.19|3.1|0.03|0.14|0.41|0.23|0.13|0.06|0.23|0|0.26|0.39|0|0.01|0.01|4.75|0.06|

This table might be useful when deciding whether _stripping_ should be the default once we make it configurable.
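On the verbosity point flagged at the end of the query: a minimal unpivoted sketch over the same _results_ table yields the same percentages as one row per (language, charset) pair instead of one column per charset. It assumes the database is opened with a SQLite build that supports window functions (3.25+); beyond that it relies only on the _language_ and _httpcharset_ columns already used above.
{code:sql}
-- Compact, unpivoted alternative: one row per (language, charset) pair.
-- Assumes SQLite 3.25+ for window functions; uses only the language and
-- httpcharset columns referenced by the pivot-style query above.
SELECT language,
       httpcharset,
       count(1) AS docs,
       round(100.0 * count(1) / sum(count(1)) OVER (PARTITION BY language), 2) AS pct
FROM RESULTS
GROUP BY language, httpcharset
ORDER BY language, docs DESC;
{code}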
{quote}This probably wasn't the case even 5 years ago{quote}
I don't agree, if this implies that IUST wouldn't outperform the other algorithms without the widespread use of UTF-8: even if UTF-8's share were only 50%, IUST would still outperform them. I developed it 6 years ago! :)

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, comparisons_20160804.xlsx, iust_encodings.zip, lang-wise-eval_results.zip, lang-wise-eval_runnable.zip, lang-wise-eval_source_code.zip, proposedTLDSampling.csv, tika_1_14-SNAPSHOT_encoding_detector.zip, tld_text_html.xlsx, tld_text_html_plus_H_column.xlsx
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents as well as other text documents. But the accuracy of encoding detection tools, including icu4j, on HTML documents is meaningfully lower than on other text documents. Hence, in our project I developed a library that works pretty well for HTML documents, which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML documents, it seems that having such a facility in Tika would help them become more accurate as well.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)