[ https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15390623#comment-15390623 ]

Shabanali Faghani edited comment on TIKA-2038 at 7/23/16 11:52 AM:
-------------------------------------------------------------------

My preference is option 1).
In this case I can work freely on my library and then just announce the version 
number of the latest release to you. Although it works fine now, there are 
some optimizations I plan to do in the future. For example, some 
considerations about UTF-16, SAX support for big HTML documents, and some 
performance-related issues are on my TODO list.

To get rid of the downside of this option, there may be a solution. For 
example, instead of depending on icu4j in my pom, I can either …
- copy/paste the few needed icu4j classes into my source code, or
- add a {{tika-xxx}} dependency (a current, fixed version that contains the 
icu4j classes) with {{provided}} scope in my pom, as sketched below. (I'm not 
sure whether {{provided}} can resolve the circular dependency or not.)
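
For illustration, such a {{provided}}-scoped dependency might look roughly like this in my pom; {{tika-xxx}} is the same placeholder as above, and the version is only an example:

{code:xml}
<dependency>
    <groupId>org.apache.tika</groupId>
    <!-- "tika-xxx" is the placeholder used above, not a real artifact name -->
    <artifactId>tika-xxx</artifactId>
    <!-- example only: any fixed release that already contains the icu4j classes -->
    <version>1.13</version>
    <!-- provided: on the compile classpath but not propagated transitively,
         which is why it might sidestep the circular dependency -->
    <scope>provided</scope>
</dependency>
{code}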

Please let me know what you think. Is this a reasonable solution?

Out of curiosity: I didn't trace the entire code of Tika, but it seems that you 
currently use icu4j 
[somewhere|https://github.com/apache/tika/blob/master/tika-core/src/main/java/org/apache/tika/utils/CharsetUtils.java#L90]
 in your code as an optional dependency! I've also previously seen a similar 
piece of code in Apache PDFBox, in which icu4j was used for text 
normalization.

Sure, but note that the charset in the HTTP header of a requested page was my 
criterion for assigning that page to a category. As a validation check I've 
visually inspected some of these documents and didn't see any problem. You know 
that although the charsets in HTTP headers are not thoroughly foolproof, they 
are the only reasonable criterion available. If you add my HTML test set to 
your testing corpus, please let me know its address, thanks.
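
Just to make that labelling criterion concrete, extracting the charset label from a Content-Type header can be sketched like this (illustrative only; this is not the exact code I used to build the test set):

{code:java}
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Sketch: pull the charset "label" out of an HTTP Content-Type header,
// e.g. "text/html; charset=windows-1251" -> "windows-1251".
public final class HeaderCharset {

    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)",
                            Pattern.CASE_INSENSITIVE);

    public static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null; // null when the page is unlabelled
    }
}
{code}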



> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>
> Currently, Tika uses icu4j for detecting the charset encoding of HTML documents 
> as well as of other plain text documents. But the accuracy of encoding 
> detection tools, including icu4j, is meaningfully lower for HTML documents 
> than for other text documents. Hence, in our project I developed a library 
> that works pretty well for HTML documents, which is available here: 
> https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML 
> documents, it seems that having such a facility in Tika would help them 
> become more accurate as well.


