GitHub user HansBrende opened a pull request:

    https://github.com/apache/any23/pull/131

    ANY23-418 improve TikaEncodingDetector

    Improves TikaEncodingDetector by:

    1. Not second-guessing UTF-8 if there is *any* indication that a stream is
       UTF-8-encoded. We can't afford false positives from obscure, obsolete
       charsets such as IBM500 (see
       [TIKA-2771](https://issues.apache.org/jira/browse/TIKA-2771)).
    2. Taking the entire stream into account rather than only a prefix. This
       shouldn't be a significant memory issue: we already hold the entire
       stream in memory to pass to each extractor, and extractors such as RDFa
       already parse the entire content into a DOM before generating the
       triples. If we later want to make Any23 "streaming"-capable to reduce
       memory requirements, we can revisit this. Until then, we may as well
       use the full content to make charset detection more accurate.
    3. Taking [TIKA-2771](https://issues.apache.org/jira/browse/TIKA-2771),
       [TIKA-2038](https://issues.apache.org/jira/browse/TIKA-2038), and
       [TIKA-539](https://issues.apache.org/jira/browse/TIKA-539) into account.
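
The "prefer UTF-8" idea in point 1, combined with the whole-stream check in point 2, can be illustrated with a minimal sketch. This is not the code in this pull request and it does not use Tika's API: the class `Utf8FirstSketch`, the methods `isValidUtf8` and `chooseCharset`, and the `statisticalGuess` parameter are all hypothetical names chosen for illustration. It only assumes that the full content is already available as a byte array and that some other detector has produced a fallback guess.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical illustration only; not the TikaEncodingDetector implementation.
final class Utf8FirstSketch {

    /** Returns true if the ENTIRE byte array decodes as strict UTF-8. */
    static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Prefers UTF-8 whenever the whole content is valid UTF-8, and only then
     * falls back to a guess from some other (assumed) statistical detector.
     */
    static Charset chooseCharset(byte[] bytes, Charset statisticalGuess) {
        return isValidUtf8(bytes) ? StandardCharsets.UTF_8 : statisticalGuess;
    }

    public static void main(String[] args) {
        byte[] utf8 = "caf\u00e9".getBytes(StandardCharsets.UTF_8);
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        // Stand-in for whatever a statistical detector would have returned.
        Charset guess = StandardCharsets.ISO_8859_1;

        System.out.println(chooseCharset(utf8, guess));   // UTF-8
        System.out.println(chooseCharset(latin1, guess)); // ISO-8859-1
    }
}
```

Note that pure ASCII content is also valid UTF-8, so this sketch reports UTF-8 for it. A real detector would additionally have to weigh declared charsets, byte-order marks, and the issues described in the linked TIKA tickets.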

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HansBrende/any23 ANY23-418

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/any23/pull/131.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #131
    
----
commit d64dac9dfe0752c45d3ff9fbca37bbe447e5c55b
Author: Hans <firedrake93@...>
Date:   2018-11-06T21:27:00Z

    ANY23-418 improve TikaEncodingDetector

----

