[jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

Kathrine Colyn (JIRA) Tue, 26 Jul 2016 02:05:36 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15393471#comment-15393471
 ]


Kathrine Colyn commented on TIKA-2038:
--------------------------------------

If you use Maven as your dependency manager, just add this dependency to your 
POM's <dependencies> section:
<dependency>
    <groupId>ir.ac.iust</groupId>
    <artifactId>htmlchardet</artifactId>
    <version>1.0.1</version>
</dependency>

Scala SBT

libraryDependencies += "ir.ac.iust" % "htmlchardet" % "1.0.1"

If you don't use a dependency manager and have a pure Java project, you can 
download htmlchardet-1.0.1.jar either from the wiki folder or from 
mvnrepository.org. In this case you also need the first four dependencies 
listed in the pom.xml file to get it to work.

If you trust the web server or the website that the pages are crawled from, 
you can use this tool as follows:
String charset = HTMLCharsetDetector.detect(htmlByteArray, true);

Otherwise, call the detect method with false for the lookInMeta argument, as 
follows:

String charset = HTMLCharsetDetector.detect(htmlByteArray, false);

There is also another detection method with the signature #detect(byte[] 
rawHtmlByteSequence), but I don't recommend using it. To see why, please 
refer to its Javadoc.
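
Once detect has returned a charset name, a typical next step is to decode the 
bytes with it. The following is only a minimal sketch using the standard JDK: 
the class DetectedCharsetDecoder and its methods are hypothetical helpers, not 
part of htmlchardet, and it assumes the returned string is a name that 
Charset.forName can resolve. The fallback to UTF-8 is my own choice here, not 
something the library prescribes.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

// Hypothetical helper (not part of htmlchardet): turns a detected charset
// name into decoded text using only the JDK.
final class DetectedCharsetDecoder {

    // Resolve the detected name to a Charset. Falls back to UTF-8 when the
    // name is null, syntactically illegal, or unsupported by this JVM.
    static Charset resolve(String detectedName) {
        if (detectedName == null) {
            return StandardCharsets.UTF_8;
        }
        try {
            return Charset.forName(detectedName.trim());
        } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
            return StandardCharsets.UTF_8;
        }
    }

    // Decode the raw HTML bytes with the resolved charset.
    static String decode(byte[] htmlByteArray, String detectedName) {
        return new String(htmlByteArray, resolve(detectedName));
    }
}
```

With the library on the classpath, the name passed as the second argument 
would be the string returned by HTMLCharsetDetector.detect(htmlByteArray, false), 
for example: String html = DetectedCharsetDecoder.decode(htmlByteArray, charset);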

Thanks !!

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>
> Currently, Tika uses icu4j to detect the charset encoding of HTML documents 
> as well as other text documents. But the accuracy of encoding detection 
> tools, including icu4j, is noticeably lower for HTML documents than for 
> other text documents. Hence, in our project I developed a library that 
> works pretty well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects work closely with HTML 
> documents, it seems that having such a facility in Tika would also help 
> them become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
