[ 
https://issues.apache.org/jira/browse/ANY23-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16569596#comment-16569596
 ] 

ASF GitHub Bot commented on ANY23-385:
--------------------------------------

GitHub user HansBrende opened a pull request:

    https://github.com/apache/any23/pull/115

    ANY23-385 improve encoding detection

    1. Increase default sniff limit for text charset detection from 12000 bytes to 65536 bytes
    2. Include results of XML declaration encoding detection
    3. Include results of HTML meta charset encoding detection
    
    mvn clean test -> all tests passed
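
For readers who want a concrete picture of items 2 and 3, here is a minimal
sketch of reading a declared charset from the head of a document. It is not
the code in this pull request: the class name DeclaredCharsetSketch, the
regular expressions, and the choice to prefer the XML declaration over the
meta element are all assumptions made for illustration. Only the 65536-byte
window comes from item 1.

```java
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of Any23 or Tika.
final class DeclaredCharsetSketch {

    // Sniff window, taken from the new default named in the pull request.
    static final int SNIFF_LIMIT = 65536;

    // Matches the encoding pseudo-attribute of an XML declaration,
    // e.g. <?xml version="1.0" encoding="xyz"?>
    private static final Pattern XML_DECL = Pattern.compile(
            "<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([^\"']+)[\"']",
            Pattern.CASE_INSENSITIVE);

    // Matches both <meta charset="xyz"> and
    // <meta http-equiv="content-type" content="text/html; charset=xyz"/>
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?\\bcharset\\s*=\\s*[\"']?\\s*([^\\s\"'/>;]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns the charset name declared in the first SNIFF_LIMIT bytes, if any.
    static Optional<String> declaredCharset(byte[] document) {
        int length = Math.min(document.length, SNIFF_LIMIT);
        // ISO-8859-1 maps each byte to exactly one char, so this decoding never
        // fails and leaves the ASCII markup we search for intact.
        String head = new String(document, 0, length, StandardCharsets.ISO_8859_1);

        // Assumption: the XML declaration wins over a meta element.
        Matcher xml = XML_DECL.matcher(head);
        if (xml.find()) {
            return Optional.of(xml.group(1));
        }
        Matcher meta = META_CHARSET.matcher(head);
        if (meta.find()) {
            return Optional.of(meta.group(1));
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        String page = "<html><head><meta charset=\"UTF-8\"></head></html>";
        System.out.println(declaredCharset(page.getBytes(StandardCharsets.US_ASCII)));
    }
}
```

A real detector would presumably treat such a declaration as one hint among
several rather than as the final answer, since declarations can be wrong;
the sketch only shows how the hint can be extracted.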

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/HansBrende/any23 ANY23-385

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/any23/pull/115.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #115
    
----
commit 22b3047d55f5e5b8fcba9c912424c9ed45313163
Author: Hans <firedrake93@...>
Date:   2018-08-05T23:39:01Z

    ANY23-385 improve encoding detection

----


> Improve charset detection for (x)html documents
> -----------------------------------------------
>
>                 Key: ANY23-385
>                 URL: https://issues.apache.org/jira/browse/ANY23-385
>             Project: Apache Any23
>          Issue Type: Improvement
>          Components: encoding
>    Affects Versions: 2.3
>            Reporter: Hans Brende
>            Assignee: Hans Brende
>            Priority: Major
>             Fix For: 2.3
>
>
> When attempting to detect a document's encoding, our {{TikaEncodingDetector}} 
> does not take into account the following declarations, which may occur in 
> HTML/XHTML documents:
> HTML:
> {{<meta http-equiv="content-type" content="text/html; charset=xyz"/>}}
> HTML5: 
> {{<meta charset="xyz">}}
> XHTML:
> {{<?xml encoding='xyz'?>}}
> In addition, the {{TikaEncodingDetector}} only sniffs the first 12000 bytes 
> of the document. If the first non-ASCII UTF-8 character occurs later than 
> that, the detector may misidentify the encoding as ISO-8859-1 or 
> Windows-1252 instead of UTF-8, even if UTF-8 is specified in the meta 
> charset element of the page. 
> I have seen this problem occur with, e.g., the webpage 
> http://losangeles.eventful.com/events/september where the UTF-8 charset was 
> properly specified at the top of the page, but the first non-ASCII UTF-8 
> characters occurred far past the 12000-byte mark, in JSON-LD content towards 
> the bottom of the page. As a result, the {{TikaEncodingDetector}} 
> misidentified the encoding as ISO-8859-1, and certain JSON-LD strings came 
> out looking like gibberish.
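
The failure described above can be reproduced in miniature. The sketch below
is a simplification and does not call Tika: the class SniffLimitSketch, the
looksLikePlainAscii check, and the synthetic sample page are all hypothetical.
It builds a page that declares UTF-8 at the top but contains its first
non-ASCII character past the 12000-byte mark, shows that a 12000-byte window
sees nothing but ASCII, and shows what happens to the text when the bytes are
then decoded as ISO-8859-1.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demonstration; this is not how Tika scores encodings.
final class SniffLimitSketch {

    // True if the first `limit` bytes are all 7-bit ASCII, i.e. the window
    // holds no evidence that would separate UTF-8 from ISO-8859-1.
    static boolean looksLikePlainAscii(byte[] document, int limit) {
        int end = Math.min(document.length, limit);
        for (int i = 0; i < end; i++) {
            if (document[i] < 0) { // high bit set: non-ASCII byte
                return false;
            }
        }
        return true;
    }

    // Synthetic page: charset declared up front, first non-ASCII character
    // (U+00E9) only after roughly 20000 bytes of ASCII.
    static byte[] sampleDocument() {
        StringBuilder sb = new StringBuilder("<meta charset=\"UTF-8\">");
        while (sb.length() < 20000) {
            sb.append("plain ascii padding ");
        }
        sb.append("{\"name\": \"Caf\u00e9\"}");
        return sb.toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] doc = sampleDocument();

        // Old window: only ASCII is visible. New window: the UTF-8 bytes are seen.
        System.out.println(looksLikePlainAscii(doc, 12000)); // true
        System.out.println(looksLikePlainAscii(doc, 65536)); // false

        // Decoding with the wrong charset turns the two UTF-8 bytes of U+00E9
        // into the two characters U+00C3 U+00A9.
        String wrong = new String(doc, StandardCharsets.ISO_8859_1);
        String right = new String(doc, StandardCharsets.UTF_8);
        System.out.println(wrong.substring(wrong.length() - 9));
        System.out.println(right.substring(right.length() - 8));
    }
}
```

A larger window only moves the threshold: a page whose first non-ASCII
character sits beyond 65536 bytes would show the same effect, which is one
reason for also consulting the declared charset.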



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
