GitHub user HansBrende opened a pull request:
https://github.com/apache/any23/pull/131
ANY23-418 improve TikaEncodingDetector
Improves TikaEncodingDetector by:
1. Not second-guessing UTF-8 if there is *any* indication that a stream is
UTF-8-encoded. We can't afford false positives from obscure, obsolete charsets
such as IBM500 (see
[TIKA-2771](https://issues.apache.org/jira/browse/TIKA-2771)).
2. Taking the entire stream into account rather than only a prefix. This
shouldn't be a significant memory issue: we already hold the entire stream in
memory to pass to each extractor, and extractors such as RDFa already parse the
entire content into a DOM before generating the triples. If we want to make
Any23 "streaming"-capable in the future to reduce memory requirements, we can
look into that; for now, we may as well use the in-memory content to make
charset detection more accurate.
3. Taking [TIKA-2771](https://issues.apache.org/jira/browse/TIKA-2771),
[TIKA-2038](https://issues.apache.org/jira/browse/TIKA-2038), and
[TIKA-539](https://issues.apache.org/jira/browse/TIKA-539) into account.
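To illustrate points 1 and 2, here is a minimal sketch of the general idea, not the code in this pull request. It assumes that "an indication of UTF-8" means the whole in-memory content decodes as strict UTF-8 and contains at least one non-ASCII byte. The class name `Utf8Check` and the method `looksLikeUtf8` are hypothetical and exist only for this example; it uses nothing beyond `java.nio.charset`.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Any23 or Tika.
final class Utf8Check {

    /** Returns true if the entire byte array decodes as strict UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            // Decodes the whole buffer, not just a prefix.
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /** Returns true if at least one byte is outside the ASCII range. */
    static boolean hasNonAscii(byte[] bytes) {
        for (byte b : bytes) {
            if (b < 0) { // high bit set
                return true;
            }
        }
        return false;
    }

    /**
     * Assumed heuristic: valid UTF-8 plus at least one multi-byte sequence
     * counts as a positive indication of UTF-8, so a statistical detector
     * would not be consulted for such content.
     */
    static boolean looksLikeUtf8(byte[] bytes) {
        return hasNonAscii(bytes) && isValidUtf8(bytes);
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(looksLikeUtf8(utf8));
        System.out.println(looksLikeUtf8(latin1));
    }
}
```

Checking the full content rather than a prefix matters here because a document can be pure ASCII for many kilobytes before its first multi-byte character appears.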
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/HansBrende/any23 ANY23-418
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/any23/pull/131.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #131
----
commit d64dac9dfe0752c45d3ff9fbca37bbe447e5c55b
Author: Hans <firedrake93@...>
Date: 2018-11-06T21:27:00Z
ANY23-418 improve TikaEncodingDetector
----