Hi Sebastian,

Thanks for your reply, and sorry for my late reply too :) ... due to the recent 4-day holidays in our country.
I have worked with great Apache projects like ActiveMQ, Camel, ZooKeeper, etc. in the past, and I also know Tika. We used Tika to parse PPT, DOC, XLS, and PDF documents in our project. Especially for the last case, i.e. PDF, I contributed a bug fix to Apache PDFBox. I have also used other libraries that Tika builds on, such as POI, Boilerpipe, TikaEncodingDetector, etc., separately from Tika. Hence, when suggesting my library I was in doubt about the correct place: Nutch or Tika? For a few reasons I felt that Nutch is the better place.

Anyway, I really thank you for your detailed response, and I'm eagerly waiting to see your test results based on real-world test documents, though my tests were done on real-world data too :) To extend the existing JUnit tests, I can grant your GitHub user access to my repo. Having your tests in my repo would help me improve it further.

By the way, since our project is fairly big (now ~1.2 billion pages), I have done performance tests on my library. We can discuss the trade-off between accuracy and performance, and possible ways to improve performance in the future, but for now I would just say that there is no need to worry about it.

Regards,
Shabanali

On Thu, May 19, 2016 at 2:01 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Shabanali,
>
> thanks for your offer! And sorry for the late reply.
>
> Currently, charset detection in Nutch is not pluggable.
> Because encoding is an integral part of document formats,
> it's a task for the parser: it's really tied to the
> document format, and works quite differently for, e.g.,
> HTML compared to PDF.
>
> HTML charset detection is currently addressed
> 1) as part of the plugin parse-html, see [1] and [2]
> 2) in the plugin parse-tika; here it is not part of Nutch but "delegated"
>    to the HTML parser of Apache Tika, see [3]
>
> Tika also covers many other document formats, including formats
> (plain-text) where charset detection is more difficult.
> Tika is a parsing library; the main difference for a web crawler
> is that there is extra context from the web:
> - HTTP headers
> - encoding specified in links (currently not used by Nutch):
>   <a charset="ISO-8859-5" href="data/russian.txt">Russian</a>
>
> Possibly, Tika may be the better address to offer your work on
> improving the encoding detection.
>
> From experience I know that the character set detection of Nutch is not
> always perfect: German documents are sometimes not correctly decoded.
>
> I would generally agree with the research results in your cited paper:
> - interpret data as ISO-8859-1 (just bytes, not multi-byte sequences);
>   that's done in sniffCharacterEncoding(), see [1]
> - strip HTML including embedded CSS, scripts, etc.;
>   that's maybe a more reliable approach than increasing the size of
>   the sniffed chunk (cf. [4]), but also more expensive regarding
>   computation
> - combining 2 detector libraries (Mozilla and ICU) to get the best of
>   both is, of course, a nice trick; but again, it may be too expensive
>   in terms of the extra computation time
>
> Ok, to come to an end: I'm sure you have a lot of good ideas for
> improvements. And yes, help is always welcome. As Apache projects,
> Nutch and Tika rely on the community to be involved in the development.
>
> Instead of implementing a new charset detection plugin, a good first
> step could possibly be to test and evaluate the current state and provide
> real-world test documents, or even extend the existing JUnit tests.
>
> Thanks,
> Sebastian
>
> [1] method sniffCharacterEncoding(byte[] content)
>     https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java#L81
> [2] https://nutch.apache.org/apidocs/apidocs-1.11/org/apache/nutch/util/EncodingDetector.html
>     https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/EncodingDetector.java
> [3] http://tika.apache.org/
>     https://tika.apache.org/1.13/api/org/apache/tika/parser/html/HtmlEncodingDetector.html
>     https://tika.apache.org/1.13/api/org/apache/tika/detect/EncodingDetector.html
> [4] https://issues.apache.org/jira/browse/NUTCH-2042
>
> On 05/16/2016 10:37 PM, Shabanali Faghani wrote:
> > Hi all,
> >
> > This is my first post in Nutch's developer mailing list.
> > A while ago, while working on a project, I developed a Java library to
> > detect the charset encoding of crawled HTML web pages. Before developing
> > my library I tested almost all charset detector tools, including two
> > Apache libraries, namely TikaEncodingDetector and Lucene-ICU4j, but
> > none were good for HTML documents.
> >
> > Searching Google for the word "encoding" in Nutch's developer mailing
> > list archive
> > <https://www.google.com/#q=site:http:%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fnutch-dev%2F+encoding>
> > I found some posts related to this problem, so I decided to propose my
> > tool here.
> >
> > Library code on GitHub: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> > Paper link: http://link.springer.com/chapter/10.1007/978-3-319-28940-3_17
> > Maven Central link: http://mvnrepository.org/artifact/ir.ac.iust/htmlchardet/1.0.0
> >
> > I'm acquainted with the Nutch plugin policy
> > <https://wiki.apache.org/nutch/PluginCentral> and I know that some
> > Nutch plugins, such as LanguageIdentifier (a modified version of which
> > we used in our project 4 years ago), are very useful in practice.
> > Also, I know that EncodingDetectorPlugin
> > <https://wiki.apache.org/nutch/EncodingDetectorPlugin> is on the TODO
> > list of Nutch, and that it is a prerequisite for some new plugins, such
> > as NewLanguageIdentifier <https://wiki.apache.org/nutch/NewLanguageIdentifier>,
> > to be applicable in real life, as is stated here
> > <https://wiki.apache.org/nutch/LanguageIdentifier>.
> >
> > In a nutshell, I can develop a layer on top of my library according to
> > the needs of Nutch, which are described here
> > <https://issues.apache.org/jira/browse/NUTCH-25> (such as considering
> > the potential charset in the HTTP header), so that it becomes the
> > EncodingDetectorPlugin of Nutch.
> >
> > Please let me know your thoughts.
> >
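[Editor's note: to illustrate the "charset in the HTTP header" point discussed in the thread, here is a minimal sketch of pulling a declared charset out of a Content-Type header value. The class and method names are my own for illustration, not Nutch's actual API.]

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the charset parameter from an
 *  HTTP Content-Type header value, e.g. "text/html; charset=utf-8". */
public class CharsetFromHeader {

    // Matches charset=UTF-8 or charset="ISO-8859-1", case-insensitively.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset name in upper case, or null if none. */
    public static String extract(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1).toUpperCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("text/html; charset=utf-8"));  // UTF-8
        System.out.println(extract("application/pdf"));           // null
    }
}
```

A header-declared charset like this should only be treated as a hint: servers frequently declare it wrongly, which is exactly why a content-based detector still needs the final word.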


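[Editor's note: the thread also mentions Nutch's sniffCharacterEncoding(), which interprets a prefix of the raw bytes as ISO-8859-1 (one byte per char, so decoding never fails) and scans it for a meta-declared charset. Below is a minimal sketch of that idea under assumed names; it is not Nutch's actual implementation.]

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of the sniffing idea: decode the first bytes as ISO-8859-1
 *  and look for a charset declared in an HTML meta tag. */
public class MetaCharsetSniffer {

    // Only inspect a small prefix of the document, as Nutch does.
    private static final int SNIFF_LENGTH = 2048;

    // Covers both <meta charset="..."> (HTML5) and the http-equiv form.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]+charset\\s*=\\s*[\"']?([a-zA-Z0-9_\\-]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the meta-declared charset in upper case, or null if none. */
    public static String sniff(byte[] content) {
        int len = Math.min(content.length, SNIFF_LENGTH);
        // ISO-8859-1 maps every byte to a char, so this decode cannot throw.
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1).toUpperCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"windows-1256\"></head></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(page));  // WINDOWS-1256
    }
}
```

As the thread notes, a declaration found this way can still be wrong or missing, which is where statistical detectors (Mozilla, ICU) and approaches like stripping markup before detection come in.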