[jira] [Commented] (ANY23-418) Take another look at encoding detection

ASF GitHub Bot (JIRA) Tue, 06 Nov 2018 15:54:53 -0800


    [ 
https://issues.apache.org/jira/browse/ANY23-418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16677435#comment-16677435
 ]


ASF GitHub Bot commented on ANY23-418:
--------------------------------------

Github user HansBrende commented on the issue:

    https://github.com/apache/any23/pull/131
  
    @lewismc any thoughts about this?


> Take another look at encoding detection
> ---------------------------------------
>
>                 Key: ANY23-418
>                 URL: https://issues.apache.org/jira/browse/ANY23-418
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>
> To address various shortcomings of Tika encoding detection, I've had to 
> modify the TikaEncodingDetector several times; cf. ANY23-385 and ANY23-411. 
> In the former, I placed much greater weight on charsets declared in html 
> meta elements & xml declarations. In the latter, I placed much greater 
> weight on charsets returned from HTTP Content-Type headers.
> However, after taking a look at TIKA-539, I'm thinking I should reduce this 
> added weight (at least for html meta elements), and perhaps ignore such 
> declarations altogether unless they declare UTF-8: incorrect declarations 
> usually seem to declare something *other than* UTF-8 when the correct 
> charset is UTF-8.
> Something like 90% or more of all webpages use UTF-8 encoding, and all of 
> our encoding detection errors to date have involved *something other than 
> UTF-8* being detected when the correct encoding was actually UTF-8, not the 
> other way around.
> Therefore, what I propose is the following: 
> (1) In the absence of a Content-Type header, any declared hints that the 
> charset is UTF-8 should add to the weight for UTF-8, while any declared hints 
> that the charset is not UTF-8 should be ignored. 
> (2) In the presence of a Content-Type header, any other declared hints should 
> be ignored, unless they match UTF-8 and do not match the Content-Type header, 
> in which case all hints, including the Content-Type header, should be ignored.
> EDIT: The above two points are a simplification of what I've actually 
> implemented (specifically, I don't necessarily ignore non-UTF-8 hints). See 
> PR 131 for details.
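
For illustration only, the two simplified rules in the description can be sketched in Java roughly as follows. This is not the code from PR 131 (which, per the EDIT note, differs from the simplification), and it does not use any Tika or Any23 API: the class CharsetHintFilter and the method hintsToKeep are hypothetical names, and "adding weight" is reduced here to simply keeping a hint.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper; not part of Any23 or Tika. */
public class CharsetHintFilter {

    /**
     * Returns the charset hints that should still be taken into account.
     *
     * @param contentTypeCharset charset from the HTTP Content-Type header,
     *                           or null if the header is absent
     * @param declaredCharsets   charsets declared in the document itself
     *                           (html meta elements, xml declarations)
     */
    public static List<Charset> hintsToKeep(Charset contentTypeCharset,
                                            List<Charset> declaredCharsets) {
        Charset utf8 = StandardCharsets.UTF_8;

        if (contentTypeCharset == null) {
            // Rule (1): no Content-Type header. Keep only UTF-8 declarations;
            // declarations of any other charset are ignored.
            List<Charset> kept = new ArrayList<>();
            for (Charset declared : declaredCharsets) {
                if (utf8.equals(declared)) {
                    kept.add(declared);
                }
            }
            return kept;
        }

        // Rule (2): Content-Type header present. If the document declares
        // UTF-8 but the header says something else, the hints conflict,
        // so all of them (including the header) are ignored.
        boolean conflict = declaredCharsets.contains(utf8)
                && !utf8.equals(contentTypeCharset);
        if (conflict) {
            return Collections.emptyList();
        }

        // Otherwise only the header counts; other declared hints are ignored.
        return Collections.singletonList(contentTypeCharset);
    }
}
```

In this sketch, an empty result means the detector would fall back entirely on statistical detection of the byte stream.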



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
