Hi Sebastian,

Thanks for your reply, and sorry for my late reply too :) ... due to the recent 4-day holidays in our country.
I have worked with great Apache projects like ActiveMQ, Camel, ZooKeeper, etc. in the past, and I also know Tika. We used Tika to parse PPT, DOC, XLS, and PDF documents in our project. Especially for the last case, i.e. PDF, I contributed a bug fix to Apache PDFBox. I have also used other libraries that Tika builds on, such as POI, Boilerpipe, TikaEncodingDetector, etc., separately from Tika. Hence, when suggesting my library I was in doubt about the correct place: Nutch or Tika? For a few reasons I felt that Nutch is the better place.

Anyway, I really thank you for your detailed response, and I'm eagerly waiting to see your test results based on real-world test documents, though my tests were done on real-world data too :) To extend the existing JUnit tests, I can grant your GitHub user access to my repo. Having your tests in my repo would help me improve it further.

By the way, since our project is fairly big (now ~1.2 billion pages), I have done performance tests on my library. We can discuss the trade-off between accuracy and performance, and possible ways to improve performance in the future, but for now I would just say that there is no need to worry about it.

Regards,
Shabanali

On Thu, May 19, 2016 at 2:01 PM, Sebastian Nagel <[email protected]> wrote:

> Hi Shabanali,
>
> thanks for your offer! And sorry for the late reply.
>
> Currently, charset detection in Nutch is not pluggable.
> Because encoding is an integral part of document formats,
> it's a task for the parser: it's really tied to the
> document format, and works quite differently for, e.g.,
> HTML compared to PDF.
>
> HTML charset detection is currently addressed
> 1) as part of the plugin parse-html, see [1] and [2]
> 2) in the plugin parse-tika; here it is not part of Nutch but "delegated"
>    to the HTML parser of Apache Tika, see [3]
>
> Tika also covers many other document formats, including formats
> (plain-text) where charset detection is more difficult.
> Tika is a parsing library; the main difference for a web crawler
> is that there is extra context from the web:
> - HTTP headers
> - encoding specified in links (currently not used by Nutch):
>   <a charset="ISO-8859-5" href="data/russian.txt">Russian</a>
>
> Possibly, Tika may be the better address to offer your work on
> improving the encoding detection.
>
> From experience I know that the character set detection of Nutch is not
> always perfect: German documents are sometimes not correctly decoded.
>
> I would generally agree with the research results in your cited paper:
> - interpret data as ISO-8859-1 (just bytes, not multi-byte sequences);
>   that's done in sniffCharacterEncoding(), see [1]
> - strip HTML including embedded CSS, scripts, etc.;
>   that's maybe a more reliable approach than increasing the size of
>   the sniffed chunk (cf. [4]), but also more expensive regarding
>   computation
> - combining 2 detector libraries (Mozilla and ICU) to get the best of
>   both is, of course, a nice trick; but again, it may be too expensive
>   in terms of the extra computation time
>
> Ok, to come to an end: I'm sure you have a lot of good ideas for
> improvements. And yes, help is always welcome. As Apache projects,
> Nutch and Tika rely on the community to be involved in the development.
>
> Instead of implementing a new charset detection plugin, a good first
> step could possibly be to test and evaluate the current state and provide
> real-world test documents, or even extend the existing JUnit tests.
>
> Thanks,
> Sebastian
>
> [1] method sniffCharacterEncoding(byte[] content)
>     https://github.com/apache/nutch/blob/master/src/plugin/parse-html/src/java/org/apache/nutch/parse/html/HtmlParser.java#L81
> [2] https://nutch.apache.org/apidocs/apidocs-1.11/org/apache/nutch/util/EncodingDetector.html
>     https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/EncodingDetector.java
> [3] http://tika.apache.org/
>     https://tika.apache.org/1.13/api/org/apache/tika/parser/html/HtmlEncodingDetector.html
>     https://tika.apache.org/1.13/api/org/apache/tika/detect/EncodingDetector.html
> [4] https://issues.apache.org/jira/browse/NUTCH-2042
>
> On 05/16/2016 10:37 PM, Shabanali Faghani wrote:
> > Hi all,
> >
> > This is my first post in Nutch's developer mailing list.
> > A while ago, while working on a project, I developed a Java library to
> > detect the charset encoding of crawled HTML web pages. Before developing
> > my library I tested almost all charset detector tools, including two
> > Apache libraries, namely TikaEncodingDetector and Lucene-ICU4j, but
> > none were good for HTML documents.
> >
> > Searching Google for the word "encoding" in Nutch's developer mailing
> > list archive
> > <https://www.google.com/#q=site:http:%2F%2Fmail-archives.apache.org%2Fmod_mbox%2Fnutch-dev%2F+encoding>
> > I found some posts related to this problem, so I decided to propose my
> > tool here.
> >
> > Library code on GitHub: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> > Paper link: http://link.springer.com/chapter/10.1007/978-3-319-28940-3_17
> > Maven Central link: http://mvnrepository.org/artifact/ir.ac.iust/htmlchardet/1.0.0
> >
> > I'm acquainted with the Nutch plugin policy
> > <https://wiki.apache.org/nutch/PluginCentral> and I know that some
> > Nutch plugins, such as LanguageIdentifier (a modified version of which
> > we used in our project 4 years ago), are very useful in practice.
> > Also, I know that EncodingDetectorPlugin
> > <https://wiki.apache.org/nutch/EncodingDetectorPlugin> is on the TODO
> > list of Nutch, and that it is a prerequisite for some new plugins, such
> > as NewLanguageIdentifier <https://wiki.apache.org/nutch/NewLanguageIdentifier>,
> > to be applicable in real life, as is stated here
> > <https://wiki.apache.org/nutch/LanguageIdentifier>.
> >
> > In a nutshell, I can develop a layer on top of my library according to
> > the needs of Nutch, which are described here
> > <https://issues.apache.org/jira/browse/NUTCH-25> (such as considering
> > the potential charset in the HTTP header), so that it becomes the
> > EncodingDetectorPlugin of Nutch.
> >
> > Please let me know your thoughts.
> >
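[Editor's note: to illustrate the "charset in the HTTP header" point discussed in the thread, here is a minimal sketch of pulling a declared charset out of a Content-Type header value. The class and method names are my own for illustration, not Nutch's actual API.]

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: extracts the charset parameter from an
 *  HTTP Content-Type header value, e.g. "text/html; charset=utf-8". */
public class CharsetFromHeader {

    // Matches charset=UTF-8 or charset="ISO-8859-1", case-insensitively.
    private static final Pattern CHARSET =
        Pattern.compile("charset\\s*=\\s*\"?([^\";\\s]+)\"?", Pattern.CASE_INSENSITIVE);

    /** Returns the declared charset name in upper case, or null if none. */
    public static String extract(String contentType) {
        if (contentType == null) return null;
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1).toUpperCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        System.out.println(extract("text/html; charset=utf-8"));  // UTF-8
        System.out.println(extract("application/pdf"));           // null
    }
}
```

A header-declared charset like this should only be treated as a hint: servers frequently declare it wrongly, which is exactly why a content-based detector still needs the final word.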


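[Editor's note: the thread also mentions Nutch's sniffCharacterEncoding(), which interprets a prefix of the raw bytes as ISO-8859-1 (one byte per char, so decoding never fails) and scans it for a meta-declared charset. Below is a minimal sketch of that idea under assumed names; it is not Nutch's actual implementation.]

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch of the sniffing idea: decode the first bytes as ISO-8859-1
 *  and look for a charset declared in an HTML meta tag. */
public class MetaCharsetSniffer {

    // Only inspect a small prefix of the document, as Nutch does.
    private static final int SNIFF_LENGTH = 2048;

    // Covers both <meta charset="..."> (HTML5) and the http-equiv form.
    private static final Pattern META_CHARSET = Pattern.compile(
        "<meta[^>]+charset\\s*=\\s*[\"']?([a-zA-Z0-9_\\-]+)",
        Pattern.CASE_INSENSITIVE);

    /** Returns the meta-declared charset in upper case, or null if none. */
    public static String sniff(byte[] content) {
        int len = Math.min(content.length, SNIFF_LENGTH);
        // ISO-8859-1 maps every byte to a char, so this decode cannot throw.
        String head = new String(content, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1).toUpperCase(Locale.ROOT) : null;
    }

    public static void main(String[] args) {
        byte[] page = "<html><head><meta charset=\"windows-1256\"></head></html>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(sniff(page));  // WINDOWS-1256
    }
}
```

As the thread notes, a declaration found this way can still be wrong or missing, which is where statistical detectors (Mozilla, ICU) and approaches like stripping markup before detection come in.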