[jira] [Commented] (TIKA-2038) A more accurate facility for detecting Charset Encoding of HTML documents

Shabanali Faghani (JIRA) Thu, 04 Aug 2016 02:20:32 -0700

    [ 
https://issues.apache.org/jira/browse/TIKA-2038?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15407465#comment-15407465
 ]


Shabanali Faghani commented on TIKA-2038:
-----------------------------------------

Oh, that’s a misunderstanding! The [*HTTP Header*| 
https://en.wikipedia.org/wiki/List_of_HTTP_header_fields] is something 
different from *Meta tags* (*metaheader* in your comments) and is available 
only in online mode, i.e. when one is fetching or crawling HTML pages. You can 
get the HTTP header of a page with a single line of code like 
[this|https://github.com/shabanali-faghani/IUST-HTMLCharDet/blob/master/src/test/java/languagewise/LangCrawlThread.java#L52].
 It is provided by the HTTP server and, as far as I’ve seen, in almost half of 
the cases its charset field has valid information… and I've used this 
information as ground truth in my tests (both encoding-wise and 
language-wise). Note that, as I’ve stated before, I turned off Meta tags (or 
Metaheader) detection for both tests.
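
As a minimal sketch of what reading the charset field involves, the helper 
below pulls the charset parameter out of a Content-Type header value. The 
class and method names are hypothetical (they are not part of Tika or 
IUST-HTMLCharDet), and the linked crawler line may well do this differently.

```java
import java.nio.charset.Charset;
import java.util.Optional;

// Hypothetical helper for illustration only.
final class HeaderCharset {
    private static final String KEY = "charset=";

    private HeaderCharset() {}

    /** Returns the charset label of a Content-Type value, if one is present. */
    static Optional<String> fromContentType(String contentType) {
        if (contentType == null) {
            return Optional.empty();
        }
        // Parameters follow the media type, separated by ';'.
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.regionMatches(true, 0, KEY, 0, KEY.length())) {
                String label = p.substring(KEY.length()).trim();
                // The value may be quoted, e.g. charset="utf-8".
                if (label.length() >= 2 && label.startsWith("\"") && label.endsWith("\"")) {
                    label = label.substring(1, label.length() - 1);
                }
                if (!label.isEmpty()) {
                    return Optional.of(label);
                }
            }
        }
        return Optional.empty();
    }

    /** True if the running JVM knows a charset by this label. */
    static boolean isUsable(String label) {
        try {
            return Charset.isSupported(label);
        } catch (IllegalArgumentException e) {
            // Thrown for null or syntactically illegal charset names.
            return false;
        }
    }
}
```

The input string would typically come from something like 
java.net.URLConnection#getContentType(), which returns the raw Content-Type 
value or null when the server sent none.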

But why did I use the charset in the HTTP header?
* for any test in any context you need at least one criterion/validity 
measure/ground truth/ruler/ …
* charsets available in HTTP headers, Meta tags and Visual Inspection are the 
only available validity measures in this context
* obviously, Visual Inspection is almost impossible for large 
collections
* to use the charsets in Meta tags you must first fetch or download a page and 
then look for a charset in its Meta elements/tags
* note that I didn’t have any test files at first, so I was forced to collect 
a corpus
* ... as you know, there are many HTML pages that have no charset in their 
Meta tags, so choosing the charset in the HTTP header is clearly more 
efficient: instead of downloading a lot of pages that lack the required 
charset information and then throwing them out, I just get a small 
descriptor for each of them.
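
One way to fetch only such a small descriptor is an HTTP HEAD request, which 
returns the response headers without the page body. The sketch below is an 
assumption about how this could be done with the JDK's HttpURLConnection; the 
class name, method names and timeouts are hypothetical, and it is not claimed 
to be what the crawler in the repository does.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper for illustration only.
final class HeadProbe {
    private HeadProbe() {}

    /** Builds a HEAD request; no network traffic happens until headers are read. */
    static HttpURLConnection prepareHead(String url) throws IOException {
        HttpURLConnection conn =
                (HttpURLConnection) URI.create(url).toURL().openConnection();
        conn.setRequestMethod("HEAD");   // ask for headers only, no body
        conn.setConnectTimeout(5000);    // assumed timeouts, in milliseconds
        conn.setReadTimeout(5000);
        return conn;
    }

    /** Returns the raw Content-Type header, or null if the server sent none. */
    static String fetchContentType(String url) throws IOException {
        HttpURLConnection conn = prepareHead(url);
        try {
            return conn.getContentType(); // implicitly sends the request
        } finally {
            conn.disconnect();
        }
    }
}
```

A page whose Content-Type carries no charset can then be skipped without 
ever downloading its body.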

It should also be noted that:
* in many cases, if charset information is available in the HTTP header it 
also exists in the Meta tags. For example, in my first corpus 2,614 of the 
2,675 docs have a charset in their Meta tags, i.e. all except 61 docs.
* all of these 2,675 docs had a charset in their HTTP header (when I crawled 
them)
* the name of each sub-directory in my corpus is the charset name that was 
seen in the HTTP header of all the docs inside it
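
With that layout, the ground-truth label of a document is simply the name of 
its parent directory. A minimal sketch, assuming a path such as 
corpus/Windows-1256/doc1.html (the file name and the helper class are 
hypothetical):

```java
import java.nio.file.Path;

// Hypothetical helper for illustration only.
final class CorpusLabel {
    private CorpusLabel() {}

    /** Returns the name of the directory that directly contains the document. */
    static String groundTruthFor(Path doc) {
        Path parent = doc.getParent();
        if (parent == null || parent.getFileName() == null) {
            throw new IllegalArgumentException("no charset directory for " + doc);
        }
        return parent.getFileName().toString();
    }
}
```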

Moreover, the charset in the HTTP header and the charset in the Meta tags are 
not necessarily the same. For example ...
sub-directory: Windows-1256
total docs: 645
#docs that have a charset in Meta tags: 615
#docs that have no charset in Meta tags: 30
More details:
||Charset in Meta tags||Count||
|windows-1256|596|
|windows-1252|8|
|utf-8|2|
|iso-8859-1|4|
|Windows-1256|2|
|WINDOWS-1256|1|
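
Charset labels are case-insensitive, so the three spellings of windows-1256 
in the table denote the same charset. The sketch below merges the rows above 
by lower-cased label; the class and method names are hypothetical, and the 
counts are copied from the table.

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper for illustration only.
final class MetaCharsetTally {
    private MetaCharsetTally() {}

    /** The rows of the table above, in the order they are listed. */
    static Map<String, Integer> windows1256Rows() {
        Map<String, Integer> rows = new LinkedHashMap<>();
        rows.put("windows-1256", 596);
        rows.put("windows-1252", 8);
        rows.put("utf-8", 2);
        rows.put("iso-8859-1", 4);
        rows.put("Windows-1256", 2);
        rows.put("WINDOWS-1256", 1);
        return rows;
    }

    /** Sums the counts of labels that differ only in letter case. */
    static Map<String, Integer> mergeByCase(Map<String, Integer> raw) {
        Map<String, Integer> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Integer> e : raw.entrySet()) {
            merged.merge(e.getKey().toLowerCase(Locale.ROOT), e.getValue(), Integer::sum);
        }
        return merged;
    }
}
```

Merging the rows gives 596 + 2 + 1 = 599 docs whose Meta charset agrees with 
the Windows-1256 label of the sub-directory, and 8 + 2 + 4 = 14 docs that 
declare a different charset.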

Another question: why did I turn Meta tags detection off?
I think the answer to this question is very easy, but if necessary I 
will explain it further.

> A more accurate facility for detecting Charset Encoding of HTML documents
> -------------------------------------------------------------------------
>
>                 Key: TIKA-2038
>                 URL: https://issues.apache.org/jira/browse/TIKA-2038
>             Project: Tika
>          Issue Type: Improvement
>          Components: core, detector
>            Reporter: Shabanali Faghani
>            Priority: Minor
>         Attachments: comparisons_20160803b.xlsx, iust_encodings.zip, 
> tika_1_14-SNAPSHOT_encoding_detector.zip
>
>
> Currently, Tika uses icu4j for detecting charset encoding of HTML documents 
> as well as the other naturally text documents. But the accuracy of encoding 
> detector tools, including icu4j, in dealing with HTML documents is 
> meaningfully lower than with other text documents. Hence, in our 
> project I developed a library that works pretty well for HTML documents, 
> which is available here: https://github.com/shabanali-faghani/IUST-HTMLCharDet
> Since Tika is widely used with and within other Apache projects such as 
> Nutch, Lucene, Solr, etc., and these projects deal heavily with HTML 
> documents, it seems that having such a facility in Tika will also help 
> them become more accurate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
