Hi all,

This is my first post to the Nutch developer mailing list. A while ago, while working on a project, I developed a Java library to detect the charset encoding of crawled HTML web pages. Before developing the library I tested almost all available charset detection tools, including two Apache libraries, namely TikaEncodingDetector and Lucene-ICU4j, but none of them worked well for HTML documents.
Searching Google for the word "encoding" in the Nutch developer mailing list archive <https://www.google.com/#q=site:http:%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fnutch-dev%2F+encoding>, I found some posts related to this problem, so I decided to propose my tool here.

Library code on GitHub: https://github.com/shabanali-faghani/IUST-HTMLCharDet
Paper link: http://link.springer.com/chapter/10.1007/978-3-319-28940-3_17
Maven Central link: http://mvnrepository.org/artifact/ir.ac.iust/htmlchardet/1.0.0

I'm acquainted with the Nutch plugin policy <https://wiki.apache.org/nutch/PluginCentral>, and I know that some Nutch plugins are very useful in practice, such as LanguageIdentifier, a modified version of which we used in our project 4 years ago. I also know that EncodingDetectorPlugin <https://wiki.apache.org/nutch/EncodingDetectorPlugin> is on Nutch's TODO list and that it is a prerequisite for some new plugins, such as NewLanguageIdentifier <https://wiki.apache.org/nutch/NewLanguageIdentifier>, to be applicable in real life, as stated here <https://wiki.apache.org/nutch/LanguageIdentifier>.

In a nutshell, I can develop a layer on top of my library, according to the needs of Nutch described here <https://issues.apache.org/jira/browse/NUTCH-25> (such as considering the potential charset in the HTTP header), so that it can become Nutch's EncodingDetectorPlugin.

Please let me know your thoughts.
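
To make the idea of such a layer a bit more concrete, here is a minimal sketch of one possible policy: use the charset declared in the HTTP Content-Type header when it is present and supported, and otherwise fall back to content-based detection. All names here (HeaderAwareCharsetResolver, charsetFromContentType, resolve) are hypothetical, and the fallback detector is only a placeholder function; none of this is the actual API of IUST-HTMLCharDet or of Nutch.

```java
import java.nio.charset.Charset;
import java.util.Locale;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical wrapper: header charset first, content-based detection second.
class HeaderAwareCharsetResolver {

    // Placeholder for any content-based detector (assumption: bytes in, charset out).
    private final Function<byte[], Charset> contentDetector;

    HeaderAwareCharsetResolver(Function<byte[], Charset> contentDetector) {
        this.contentDetector = contentDetector;
    }

    // Extracts the charset parameter from a Content-Type value such as
    // "text/html; charset=UTF-8". Returns empty if absent, malformed or unsupported.
    static Optional<Charset> charsetFromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim().replace("\"", "");
                try {
                    if (!name.isEmpty() && Charset.isSupported(name)) {
                        return Optional.of(Charset.forName(name));
                    }
                } catch (IllegalArgumentException e) {
                    // Illegal charset name in the header: ignore it and keep looking.
                }
            }
        }
        return Optional.empty();
    }

    // Uses the header charset if there is one, otherwise asks the detector.
    Charset resolve(String contentTypeHeader, byte[] rawHtml) {
        return charsetFromContentType(contentTypeHeader)
                .orElseGet(() -> contentDetector.apply(rawHtml));
    }
}
```

Trusting the header first is only the simplest choice. Since servers often send a wrong or missing charset, a real plugin might instead treat the header value as a hint and weigh it against the detector's result.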

