tika-dev  

Character encodings on the web

Jukka Zitting
Fri, 29 Jan 2010 04:16:49 -0800

Hi,

Interesting graph from Google about the relative usage of different
character encodings:

    http://googleblog.blogspot.com/2010/01/unicode-nearing-50-of-web.html

Notably, the Unicode entry lists only the UTF-8 encoding. Are the
other Unicode encodings (UTF-16, UTF-32) really that infrequent?

I think we can use this data as a guideline when optimizing the
encoding detection code in Tika.
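
As a rough illustration of what "guideline" could mean here, one option is to try candidate charsets in order of how common they are and accept the first one that decodes the bytes strictly. The sketch below uses only java.nio.charset from the JDK, not Tika's detector API. The class name, the method name and the candidate ordering are all assumptions made for illustration; the ordering is a placeholder and is not taken from the Google figures.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

/** Hypothetical helper, not part of Tika. */
public class PriorOrderedCharsetGuess {

    // Assumed ordering, most likely first. A real list would be derived
    // from measured usage data and would contain many more charsets.
    // ISO-8859-1 accepts every byte sequence, so it acts as the fallback.
    static final List<Charset> CANDIDATES = Arrays.asList(
            StandardCharsets.US_ASCII,
            StandardCharsets.UTF_8,
            StandardCharsets.ISO_8859_1);

    /** Returns the first candidate that decodes the input without errors. */
    static Optional<Charset> guess(byte[] bytes) {
        for (Charset candidate : CANDIDATES) {
            try {
                // REPORT makes the decoder throw instead of silently
                // substituting replacement characters.
                candidate.newDecoder()
                        .onMalformedInput(CodingErrorAction.REPORT)
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .decode(ByteBuffer.wrap(bytes));
                return Optional.of(candidate);
            } catch (CharacterCodingException e) {
                // Not valid in this charset; fall through to the next one.
            }
        }
        return Optional.empty();
    }
}
```

This only checks whether the bytes are valid in each charset; it does no statistical analysis of the content, so it cannot tell apart single-byte encodings that accept the same bytes. The point is just that the cheap, common cases get tested first.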

BR,

Jukka Zitting