This is an automated email from the ASF dual-hosted git repository.
tallison pushed a change to branch TIKA-4662
in repository https://gitbox.apache.org/repos/asf/tika.git
from b08ba012cf Merge branch 'main' into TIKA-4662
add 3efb1c3945 TIKA-4662 -- clean up and rat
No new revisions were added by this update.
Summary of changes:
.../advanced/charsoup-supported-languages.adoc | 17 +
.../advanced/lang-detection/flores-AUTOMATIC.log | 15 +
.../advanced/lang-detection/flores-SHORT_TEXT.log | 15 +
.../advanced/lang-detection/flores-STANDARD.log | 15 +
.../advanced/lang-detection/flores200-dev-eval.md | 17 +
.../lang-detection/language-drop-decisions.md | 17 +
.../short-text-language-decisions.md | 17 +
.../advanced/lang-detection/supported-languages.md | 17 +
.../tika/langdetect/charsoup/confusables.txt | 15 +
.../src/test/python/filter_uppercase.py | 15 +
tika-ml/tika-ml-chardetect/pom.xml | 85 ++++-
.../ml/chardetect/ByteNgramFeatureExtractor.java | 121 -------
.../tika/ml/chardetect/CharsetConfusables.java | 309 -----------------
.../chardetect/tools/BuildCharsetTrainingData.java | 21 +-
.../chardetect/ByteNgramFeatureExtractorTest.java | 82 +++--
.../ml/chardetect/tools/TrainCharsetModel.java | 364 ---------------------
...2273-encoding-detector-outside-static-init.json | 2 +-
.../TIKA-2273-no-icu4j-encoding-detector.json | 2 +-
.../tika/server/core/LanguageResourceTest.java | 10 +-
.../src/test/resources/test-documents/english.txt | 2 +-
20 files changed, 294 insertions(+), 864 deletions(-)
delete mode 100644 tika-ml/tika-ml-chardetect/src/main/java/org/apache/tika/ml/chardetect/ByteNgramFeatureExtractor.java
delete mode 100644 tika-ml/tika-ml-chardetect/src/main/java/org/apache/tika/ml/chardetect/CharsetConfusables.java
delete mode 100644 tika-ml/tika-ml-chardetect/src/test/java/org/apache/tika/ml/chardetect/tools/TrainCharsetModel.java