Hi Shabanali,

thanks for your offer! And sorry for the late reply.

Currently, charset detection in Nutch is not pluggable. It is a task for
the parser because encoding is an integral part of the document format,
and detection works quite differently, e.g., for HTML compared to PDF.

HTML charset detection is currently addressed in two places:
1) as part of the plugin parse-html, see [1] and [2]
2) in the plugin parse-tika, where it is not part of Nutch but "delegated"
   to the HTML parser of Apache Tika, see [3]

Tika also covers many other document formats, including formats such as
plain text where charset detection is more difficult.  Tika is a parsing
library; the main difference for a web crawler is that there is extra
context from the web (see the sketch after this list):
 - HTTP headers, e.g. "Content-Type: text/html; charset=UTF-8"
 - the encoding specified in links (currently not used by Nutch):
    <a charset="ISO-8859-5" href="data/russian.txt">Russian</a>
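
For illustration, here is a minimal Java sketch of how such a clue could
be pulled out of the web context. The class and method names are made up
for this example and are not part of the Nutch API:

    /** Illustrative helper; hypothetical names, not Nutch API. */
    public class WebContextClues {

      /** Extract the charset from a Content-Type header value, or null. */
      public static String charsetFromContentType(String contentType) {
        if (contentType == null) return null;
        for (String param : contentType.split(";")) {
          String p = param.trim();
          // case-insensitive match on the "charset=" parameter
          if (p.regionMatches(true, 0, "charset=", 0, 8)) {
            return p.substring(8).trim().replace("\"", "");
          }
        }
        return null;
      }

      public static void main(String[] args) {
        // prints: ISO-8859-5
        System.out.println(
            charsetFromContentType("text/html; charset=ISO-8859-5"));
      }
    }

The same kind of lookup would apply to the charset attribute of a link on
the referring page, just taken from the parsed HTML instead of a header.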

Possibly Tika would be the better place to offer your work on improving
the encoding detection.

From experience I know that the character set detection of Nutch is not
always perfect: German documents are sometimes not correctly decoded.

I would generally agree with the research results in the paper you cited:
- interpret the data as ISO-8859-1 (just bytes, no multi-byte sequences)
  -> that's done in sniffCharacterEncoding(), see [1]
- strip HTML, including embedded CSS, scripts, etc.
  That is maybe a more reliable approach than increasing the size of the
  sniffed chunk (cf. [4]), but also more expensive computationally.
- combining two detector libraries (Mozilla and ICU) to get the best of
  both is, of course, a nice trick. But again: it may be too expensive
  in terms of extra computation time.
The sketch below shows how these ideas could fit together.
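
As a minimal sketch (not the Nutch implementation; the chunk size and
confidence threshold are illustrative, and ICU4J's built-in input filter
stands in for full HTML stripping):

    import java.nio.charset.StandardCharsets;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    import com.ibm.icu.text.CharsetDetector;
    import com.ibm.icu.text.CharsetMatch;

    public class HtmlCharsetSniffer {

      // matches charset=... in <meta> declarations
      private static final Pattern META_CHARSET =
          Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._-]+)",
              Pattern.CASE_INSENSITIVE);

      public static String detect(byte[] rawHtml) {
        // 1) Interpret a prefix as ISO-8859-1: every byte maps to a
        //    char, so decoding cannot fail while we look for a meta tag.
        int len = Math.min(rawHtml.length, 8192);
        String sniffed =
            new String(rawHtml, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(sniffed);
        if (m.find()) {
          return m.group(1);
        }
        // 2) No declaration found: fall back to statistical detection.
        //    The input filter makes ICU ignore <tags>, so that only the
        //    readable text drives the statistics ("strip HTML" idea).
        CharsetDetector detector = new CharsetDetector();
        detector.enableInputFilter(true);
        detector.setText(rawHtml);
        CharsetMatch match = detector.detect();
        // A low-confidence result could be cross-checked against a
        // second detector (e.g. the Mozilla-derived jchardet).
        return (match != null && match.getConfidence() >= 30)
            ? match.getName() : null;
      }
    }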


Ok, to come to an end: I'm sure you have a lot of good ideas for
improvements. And yes, help is always welcome. As Apache projects, Nutch
and Tika rely on community involvement in their development.

Instead of implementing a new charset detection plugin, a good first step
could be to test and evaluate the current state, to provide real-world
test documents, or even to extend the existing JUnit tests; a sketch of
such a test follows.
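
For example, a test along these lines could pin down the German umlaut
case mentioned above. It is a hypothetical test written directly against
ICU4J (JUnit 4), not part of the existing Nutch test suite:

    import static org.junit.Assert.assertEquals;

    import java.nio.charset.StandardCharsets;

    import com.ibm.icu.text.CharsetDetector;
    import org.junit.Test;

    public class TestCharsetDetection {

      @Test
      public void detectsUtf8EncodedGerman() {
        // Several multi-byte UTF-8 sequences (umlauts, sharp s) give
        // the UTF-8 recognizer a strong signal; assuming ISO-8859-1
        // here would garble exactly these characters.
        byte[] content =
            ("<html><body>Stra\u00dfe, M\u00fcller, K\u00f6ln, "
                + "Fu\u00dfg\u00e4nger\u00fcberg\u00e4nge</body></html>")
                .getBytes(StandardCharsets.UTF_8);
        CharsetDetector detector = new CharsetDetector();
        detector.setText(content);
        assertEquals("UTF-8", detector.detect().getName());
      }
    }

Real-world test documents with a known encoding could be fed through the
same harness instead of the inline string.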


Thanks,
Sebastian


[1] method sniffCharacterEncoding(byte[] content)
    https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java#L81
[2] https://nutch.apache.org/apidocs/apidocs-1.11/org/apache/nutch/util/EncodingDetector.html
    https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/EncodingDetector.java
[3] http://tika.apache.org/
    https://tika.apache.org/1.13/api/org/apache/tika/parser/html/HtmlEncodingDetector.html
    https://tika.apache.org/1.13/api/org/apache/tika/detect/EncodingDetector.html
[4] https://issues.apache.org/jira/browse/NUTCH-2042

On 05/16/2016 10:37 PM, Shabanali Faghani wrote:
> Hi all,
> 
> This is my first post in Nutch's developer mailing list.
> A while ago, when I was working on a project, I developed a Java library
> to detect the charset encoding of crawled HTML web pages. Before
> developing my library I tested almost all charset detector tools,
> including two Apache libraries, namely TikaEncodingDetector and
> Lucene-ICU4j, but none were good for HTML documents.
> 
> Searching through Google for the word "encoding" in Nutch's developer
> mailing list archive
> <https://www.google.com/#q=site:http:%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fnutch-dev%2F+encoding>,
> I found some posts related to this problem, so I decided to propose my
> tool here.
> 
> Library code on github: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Paper link: http://link.springer.com/chapter/10.1007/978-3-319-28940-3_17
> Maven Central link: http://mvnrepository.org/artifact/ir.ac.iust/htmlchardet/1.0.0
> 
> I'm acquainted with the Nutch plugin policy
> <https://wiki.apache.org/nutch/PluginCentral> and I know that some Nutch
> plugins, such as LanguageIdentifier (a modified version of which we used
> in our project 4 years ago), are very useful in practice. Also, I know
> that EncodingDetectorPlugin
> <https://wiki.apache.org/nutch/EncodingDetectorPlugin> is on the TODO
> list of Nutch and that it is a prerequisite for some new plugins, such
> as NewLanguageIdentifier
> <https://wiki.apache.org/nutch/NewLanguageIdentifier>, to be applicable
> in real life, as is stated here
> <https://wiki.apache.org/nutch/LanguageIdentifier>.
> 
> In a nutshell, I can develop a layer on top of my library according to
> the needs of Nutch, which are described here
> <https://issues.apache.org/jira/browse/NUTCH-25> (such as considering
> the potential charset in the HTTP header), so that it becomes the
> EncodingDetectorPlugin of Nutch.
> 
> Please let me know your thoughts.
