Charset detection algorithm

Ken Krugler Sat, 06 Nov 2010 12:03:55 -0700

Hi all,

See https://issues.apache.org/jira/browse/TIKA-539 for a Tika issue I'm currently working on, which has to do with the charset detection algorithm.


There's the HTML5 proposal, where the priority is:

- charset from Content-Type response header
- charset from HTML <meta http-equiv content-type> element
- charset detected from page contents
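
To make that ordering concrete, here is a minimal sketch in Python of the priority list above. It is not Tika's code and not the HTML5 prescan algorithm itself: the function names, the regular expressions, the 1024-byte scan limit, and the `detect` callback (a stand-in for whatever content-based detector is used) are all assumptions made for illustration.

```python
import re
from typing import Callable, Optional

# Assumed pattern: pulls the charset parameter out of a Content-Type value.
_CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE)

# Assumed pattern: finds a <meta http-equiv="content-type" ...> tag in raw bytes.
_META_RE = re.compile(
    rb'<meta[^>]+http-equiv\s*=\s*["\']?content-type["\']?[^>]*>',
    re.IGNORECASE,
)


def charset_from_content_type(value: Optional[str]) -> Optional[str]:
    """Return the lowercased charset parameter of a Content-Type value, if any."""
    if not value:
        return None
    match = _CHARSET_RE.search(value)
    return match.group(1).lower() if match else None


def charset_from_meta(html: bytes, limit: int = 1024) -> Optional[str]:
    """Look for a meta http-equiv content-type tag in the first `limit` bytes."""
    match = _META_RE.search(html[:limit])
    if not match:
        return None
    tag = match.group(0).decode("ascii", errors="replace")
    return charset_from_content_type(tag)


def pick_charset(
    header: Optional[str],
    html: bytes,
    detect: Callable[[bytes], Optional[str]],
) -> Optional[str]:
    """Apply the three sources in priority order: header, meta tag, detection."""
    return (
        charset_from_content_type(header)
        or charset_from_meta(html)
        or detect(html)
    )
```

With this ordering, a response header that names a charset always wins, even when the meta tag or the page bytes disagree with it.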

Reinhard Schwab proposed a variation on the HTML5 approach, which makes sense to me; in my web crawling experience, too many servers lie to just blindly trust the response header contents.

I've got a slight modification to Reinhard's approach, as described in a comment on the above issue:

https://issues.apache.org/jira/browse/TIKA-539?focusedCommentId=12928832&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12928832


I'm interested in comments.

Thanks!

-- Ken

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g
