[jira] [Commented] (TIKA-2758) Possible error charset detection

Tim Allison (JIRA) Fri, 26 Oct 2018 08:28:33 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-2758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16665295#comment-16665295
 ]


Tim Allison commented on TIKA-2758:
-----------------------------------

I just attached grep_charsets.csv, which shows the results of grepping for 
charset= in our current regression corpus (reading the raw bytes as both UTF-8 
and UTF-16LE).  Because that was a raw grep against every file, not just the 
HTML files, there's a bunch of noise. 
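
For anyone who wants to reproduce that kind of scan, here is a rough sketch of 
the idea. It is not the script that produced grep_charsets.csv; the class name, 
the regex, and the lower-casing of labels are all assumptions made for 
illustration.

```java
import java.nio.charset.Charset;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Tika: decode raw bytes with a given
// charset and collect whatever follows "charset=".
class CharsetGrep {

    // Assumed pattern: "charset", optional whitespace around "=", an optional
    // quote, then a run of characters that commonly appear in charset labels.
    private static final Pattern CHARSET =
            Pattern.compile("(?i)charset\\s*=\\s*[\"']?([-_:.a-z0-9]+)");

    static List<String> grep(byte[] raw, Charset decodeAs) {
        // Malformed input is replaced rather than rejected, which is what a
        // raw grep over arbitrary (non-HTML) files needs.
        String text = new String(raw, decodeAs);
        List<String> labels = new ArrayList<>();
        Matcher m = CHARSET.matcher(text);
        while (m.find()) {
            labels.add(m.group(1).toLowerCase(Locale.ROOT));
        }
        return labels;
    }
}
```

Running grep once with StandardCharsets.UTF_8 and once with 
StandardCharsets.UTF_16LE over the same bytes mirrors the two passes described 
above. A UTF-16LE file decoded as UTF-8 has NUL characters between the letters, 
so the UTF-8 pass does not match it.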

However, it does look like there are quite a few 'utf8's.  

Given the initial point of TIKA-2592 (cc [~AndreasMeier]) was to handle 
'unicode' as 'utf-8', should we revert the whole check for 'unsupported by 
iana' and put in special handling only for 'unicode'?  

Perhaps we could also try to alias the charset string with {{CharsetAliases}}?
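
To make the two options concrete, here is a minimal sketch of "special-case 
'unicode', otherwise try to resolve the label through aliases". It uses the 
JDK's own java.nio.charset.Charset alias lookup as a stand-in, since how 
{{CharsetAliases}} would be wired in is still open. The class and method names 
are hypothetical and this is not the current Tika code.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;
import java.util.Locale;
import java.util.Optional;

// Hypothetical resolver for a charset label taken from a meta tag or header.
class CharsetLabelResolver {

    static Optional<Charset> resolve(String label) {
        if (label == null) {
            return Optional.empty();
        }
        String cleaned = label.trim().toLowerCase(Locale.ROOT);
        if (cleaned.isEmpty()) {
            return Optional.empty();
        }
        // Special handling from TIKA-2592: treat 'unicode' as UTF-8. This has
        // to run before the generic lookup, because the lookup may map the
        // label to something else.
        if (cleaned.equals("unicode")) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        try {
            // Charset.forName accepts canonical names and registered aliases,
            // so labels such as 'utf8' can resolve without an IANA check.
            return Optional.of(Charset.forName(cleaned));
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            // Unknown or malformed label: let the caller fall back to
            // detection instead of trusting the declared charset.
            return Optional.empty();
        }
    }
}
```

With this shape, resolve("utf8") yields UTF-8 through the alias lookup, and an 
unrecognized label comes back empty so the detector can take over.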

> Possible error charset detection
> --------------------------------
>
>                 Key: TIKA-2758
>                 URL: https://issues.apache.org/jira/browse/TIKA-2758
>             Project: Tika
>          Issue Type: Bug
>          Components: core
>    Affects Versions: 1.18
>            Reporter: Markus Jelsma
>            Priority: Major
>             Fix For: 1.20
>
>         Attachments: detroidnews.html, grep_charsets.csv, independent.html
>
>
> I started to upgrade our SAX parser Tika dependency from 1.17 to 1.19, ran 
> all 995 unit tests and observed three failures, two encoding issues and one 
> other weird thing. The tests use real HTML.
> Where we previously extracted text such as 'Spokane, Wash. [— The solar' we 
> now get 'Spokane, Wash. [â€" The solar' in one test. The other had 'could 
> take ["weeks, or' but we now get 'could take [â€œweeks, or' extracted. Our 
> tests pass with 1.17 but fail with 1.18 and 1.19.1.
> Attached are the two HTML files.
> Reading our tests again, I see an old note beside the independent test 
> complaining about the character encoding being incorrect. It seems that 
> somewhere before 1.17 it was faulty, just as it is now with 1.18 and higher.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
