A powerful Charset Encoding Detector plugin for Nutch

Shabanali Faghani Mon, 16 May 2016 13:38:29 -0700

Hi all,

This is my first post to the Nutch developer mailing list.
A while ago, while working on a project, I developed a Java library
to detect the charset encoding of crawled HTML web pages. Before
developing it, I tested almost all charset detector tools, including two
Apache libraries, namely TikaEncodingDetector and Lucene-ICU4j, but none
worked well for HTML documents.


Searching Google for the word "encoding" in the Nutch developer
mailing list archive
<https://www.google.com/#q=site:http:%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fnutch-dev%2F+encoding>,
I found some posts related to this problem, so I decided to propose my tool
here.

Library code on GitHub:
https://github.com/shabanali-faghani/IUST-HTMLCharDet
Paper link: http://link.springer.com/chapter/10.1007/978-3-319-28940-3_17
Maven Central link:
http://mvnrepository.org/artifact/ir.ac.iust/htmlchardet/1.0.0

I'm acquainted with the Nutch plugin policy
<https://wiki.apache.org/nutch/PluginCentral>, and I know that some Nutch
plugins, such as LanguageIdentifier (a modified version of which we used in
our project 4 years ago), are very useful in practice. I also know that
EncodingDetectorPlugin
<https://wiki.apache.org/nutch/EncodingDetectorPlugin> is on Nutch's TODO list
and that it is a prerequisite for some new plugins, such as
NewLanguageIdentifier <https://wiki.apache.org/nutch/NewLanguageIdentifier>, to
be applicable in real life, as stated here
<https://wiki.apache.org/nutch/LanguageIdentifier>.

In a nutshell, I can develop a layer on top of my library, according to the
needs of Nutch described here
<https://issues.apache.org/jira/browse/NUTCH-25> (such as considering the
potential charset in the HTTP header), so that it can become the
EncodingDetectorPlugin of Nutch.
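
As a rough illustration of what such a layer might look like, the sketch
below first checks the HTTP Content-Type header for a charset and only falls
back to content-based detection when the header gives none (or an unusable
one). The class name HeaderAwareCharsetResolver, its methods, and the
contentDetector parameter are hypothetical; they are not part of
IUST-HTMLCharDet or Nutch, and the header-first policy is only one possible
reading of the requirements in NUTCH-25.

```java
import java.nio.charset.Charset;
import java.util.Optional;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical layer: HTTP header charset first, content detection second. */
public class HeaderAwareCharsetResolver {

    // Matches the charset parameter of a Content-Type value,
    // e.g. "text/html; charset=UTF-8" or "text/html; charset=\"utf-8\"".
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in the header value, if present and supported by the JVM. */
    static Optional<Charset> fromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        if (!m.find()) {
            return Optional.empty();
        }
        try {
            return Optional.of(Charset.forName(m.group(1)));
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name: treat the header as absent.
            return Optional.empty();
        }
    }

    /**
     * Assumed policy: trust a usable header charset, otherwise delegate to a
     * content-based detector (a placeholder for any detection library).
     */
    static Charset resolve(String contentType, byte[] content,
                           Function<byte[], Charset> contentDetector) {
        return fromContentType(contentType)
                .orElseGet(() -> contentDetector.apply(content));
    }
}
```

A real plugin would pass its actual detector as contentDetector and might
weigh the header as a hint rather than trusting it outright.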

Please let me know your thoughts.
