[ 
https://issues.apache.org/jira/browse/ANY23-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677327#comment-16677327
 ] 

ASF GitHub Bot commented on ANY23-418:
--------------------------------------

GitHub user HansBrende opened a pull request:

    https://github.com/apache/any23/pull/131

    ANY23-418 improve TikaEncodingDetector

    Improves TikaEncodingDetector by:
    
    1. Not second-guessing UTF-8 if there is *any* indication that a stream
       is UTF-8-encoded. We can't afford false positives from obscure,
       obsolete charsets such as IBM500 (see
       [TIKA-2771](https://issues.apache.org/jira/browse/TIKA-2771)).
    2. Taking the entire stream into account rather than just a prefix. This
       shouldn't be a huge memory issue: we already hold the entire stream
       in memory to pass to each extractor, and extractors such as RDFa
       already parse the entire content into a DOM before generating the
       triples. If we want to make Any23 "streaming"-capable in the future
       to reduce memory requirements, we can look into that; for now, since
       the content is in memory anyway, we may as well use it to make
       charset detection more accurate.
    3. Taking [TIKA-2771](https://issues.apache.org/jira/browse/TIKA-2771),
       [TIKA-2038](https://issues.apache.org/jira/browse/TIKA-2038), and
       [TIKA-539](https://issues.apache.org/jira/browse/TIKA-539) into account.
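
    As a rough illustration of points 1 and 2, one possible "indication" of
    UTF-8 is that the whole in-memory byte array decodes as strict UTF-8.
    The sketch below is an assumption made for illustration, not the code in
    this pull request; the class name `Utf8Check` and method `isValidUtf8`
    are hypothetical.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

// Hypothetical helper, not part of Any23 or Tika.
public final class Utf8Check {

    private Utf8Check() {}

    /**
     * Returns true if the entire byte array (not just a prefix) is
     * well-formed UTF-8. Pure ASCII input also returns true.
     */
    public static boolean isValidUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // Report errors instead of silently substituting U+FFFD.
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

    A detector could treat a `true` result (on input containing non-ASCII
    bytes) as strong evidence for UTF-8, since text in legacy single-byte
    charsets rarely forms valid multi-byte UTF-8 sequences by accident.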

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HansBrende/any23 ANY23-418

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/any23/pull/131.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #131
    
----
commit d64dac9dfe0752c45d3ff9fbca37bbe447e5c55b
Author: Hans <firedrake93@...>
Date:   2018-11-06T21:27:00Z

    ANY23-418 improve TikaEncodingDetector

----


> Take another look at encoding detection
> ---------------------------------------
>
>                 Key: ANY23-418
>                 URL: https://issues.apache.org/jira/browse/ANY23-418
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>
> In order to address various shortcomings of Tika encoding detection, I've had 
> to modify the TikaEncodingDetector several times; cf. ANY23-385 and 
> ANY23-411. In the former, I placed a much greater weight on charsets 
> declared in HTML meta elements and XML declarations. In the latter, I placed a 
> much greater weight on charsets returned from HTTP Content-Type headers.
> However, after taking a look at TIKA-539, I'm thinking I should reduce this 
> added weight (at least for HTML meta elements), and perhaps ignore it 
> altogether unless it happens to match UTF-8, since it seems that incorrect 
> declarations usually declare something *other than* UTF-8 when the correct 
> charset is UTF-8.
> Something like 90% or more of all webpages use UTF-8 encoding, and all of our 
> encoding detection errors to date have involved *something other than 
> UTF-8* being detected when the correct encoding was actually UTF-8, not the 
> other way around.
> Therefore, what I propose is the following: 
> (1) In the absence of a Content-Type header, any declared hints that the 
> charset is UTF-8 should add to the weight for UTF-8, while any declared hints 
> that the charset is not UTF-8 should be ignored. 
> (2) In the presence of a Content-Type header, any other declared hints should 
> be ignored, unless they match UTF-8 and do not match the Content-Type header, 
> in which case all hints, including the Content-Type header, should be ignored.
>  
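
The two proposed rules can be sketched as a small decision function. This is a minimal illustration under stated assumptions, not the actual TikaEncodingDetector change: the class `DeclaredHintPolicy` and method `favoredCharset` are hypothetical names, and "adding weight" is simplified to returning the single charset that the hints favor (or nothing, meaning the detector relies on statistical detection alone).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch of the proposed hint-handling rules.
public final class DeclaredHintPolicy {

    private DeclaredHintPolicy() {}

    /**
     * @param contentType charset from the HTTP Content-Type header, or null if absent
     * @param declared    charsets declared in the document itself
     *                    (e.g. HTML meta elements, XML declaration)
     * @return the charset the hints should add weight to, or empty if all hints are ignored
     */
    public static Optional<Charset> favoredCharset(Charset contentType, List<Charset> declared) {
        boolean declaresUtf8 = declared.contains(StandardCharsets.UTF_8);

        if (contentType == null) {
            // Rule (1): without a Content-Type header, only UTF-8 hints count;
            // non-UTF-8 declarations are ignored.
            return declaresUtf8 ? Optional.of(StandardCharsets.UTF_8) : Optional.empty();
        }

        // Rule (2): with a Content-Type header, other hints are ignored,
        // unless one declares UTF-8 and the header says something else,
        // in which case every hint (header included) is ignored.
        if (declaresUtf8 && !contentType.equals(StandardCharsets.UTF_8)) {
            return Optional.empty();
        }
        return Optional.of(contentType);
    }
}
```

For example, a page served with `Content-Type: text/html; charset=ISO-8859-1` whose meta element declares UTF-8 would yield an empty result under this sketch, leaving the decision to content-based detection.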



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
