Tim Allison created TIKA-2940:
---------------------------------
Summary: Consider an ensemble charset detection method
Key: TIKA-2940
URL: https://issues.apache.org/jira/browse/TIKA-2940
Project: Tika
Issue Type: Improvement
Reporter: Tim Allison
I recently ran our four charset detectors against our text-based files.
The raw data is available here:
http://162.242.228.174/encoding_detection/charsets_combined_201909.sql.zip (in
SQL form) or
http://162.242.228.174/encoding_detection/charsets_combined_201909.csv.zip (as
a CSV).
I've posted a preliminary/draft report here:
https://github.com/tballison/share/blob/master/slides/Tika_charset_detector_study_201909.docx
Overall, we could see a ~1.4% improvement in "common tokens"[0] if we used
an ensemble approach _on our corpus_. For users with more homogeneous
documents, this improvement could be far greater (e.g. if their documents _all_
come from a content management system that applies an incorrect HTML meta
charset header).
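
To make the incorrect-header case concrete, here is a minimal Python sketch of what happens when a page's bytes are UTF-8 but its meta header declares windows-1252. The sample page and variable names are hypothetical and are not taken from the corpus or from Tika.

```python
# Hypothetical page: the bytes are really UTF-8, but the meta tag claims windows-1252.
html = '<meta charset="windows-1252"><p>café résumé</p>'
raw = html.encode("utf-8")

# A detector that simply trusts the header decodes with the declared charset.
declared = raw.decode("windows-1252")

# Decoding with the actual charset recovers the intended text.
actual = raw.decode("utf-8")

print(declared)  # accented letters come out as two-character mojibake
print(actual)    # accented letters are intact
```

If every document in a collection carries the same wrong header, every document is garbled in the same way, which is why a homogeneous collection could gain more from a second opinion than a mixed corpus does.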
I'm opening this issue for discussion and as encouragement for others to work
with the raw data and/or make recommendations on the preliminary report's
methodology.
[0] "common tokens" in tika-eval refers to the lists we developed of the 30k
most common words for each of the 118 languages covered in tika-eval. An
increase in the total number of "common tokens" can be a sign of improved
extraction.
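
As a rough illustration of how a "common tokens" count could help choose between candidate charsets, here is a small Python sketch. It is not the method used in the report and not tika-eval's implementation. The tiny word set, the regex tokenizer, and the function names `count_common_tokens` and `pick_charset` are all assumptions made for the example.

```python
import re

# Hypothetical stand-in for a common-token list; the real lists hold
# about 30k words per language.
COMMON_TOKENS = {"the", "café", "is", "open", "today"}

def count_common_tokens(text, common=COMMON_TOKENS):
    """Count tokens found in the common-word set (simplified tokenizer)."""
    tokens = re.findall(r"\w+", text.lower())
    return sum(1 for t in tokens if t in common)

def pick_charset(raw, candidates):
    """Decode with each candidate charset and keep the highest-scoring one."""
    best, best_score = None, -1
    for charset in candidates:
        try:
            text = raw.decode(charset)
        except (UnicodeDecodeError, LookupError):
            continue  # skip charsets that cannot decode these bytes
        score = count_common_tokens(text)
        if score > best_score:
            best, best_score = charset, score
    return best, best_score

raw = "The café is open today".encode("utf-8")
best, score = pick_charset(raw, ["windows-1252", "utf-8"])
print(best, score)
```

In this toy case the windows-1252 decoding garbles one word, so it scores one common token fewer than the UTF-8 decoding. A real ensemble would also need language-specific lists and a way to break ties.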