Re: [compress] FW: Tika content detection and crawled "remote" content

Stefan Bodewig Wed, 05 Jul 2017 06:36:08 -0700

This looks great, well done Tika!

Thank you for sharing, Tim


Cheers

      Stefan

On 2017-07-05, Allison, Timothy B. wrote:

> Fellow file-philes on [compress],

> Sebastian Nagel has added file type id via Apache Tika to Common Crawl.  
> While Tika is not 100% accurate, this means that we have far better clarity 
> on mime type than relying on the http header+file suffix.  So, for testing 
> purposes, you (or we over on Tika) can much more easily gather a small test 
> corpus of files by mime type.

> Many, many thanks to Sebastian and Common Crawl!

>           Cheers,

>                       Tim

> -----Original Message-----
> From: Sebastian Nagel [mailto:[email protected]]
> Sent: Tuesday, July 4, 2017 6:18 AM
> To: [email protected]
> Subject: Tika content detection and crawled "remote" content

> Hi,

> I recently plugged Tika's content detection into Common Crawl's crawler 
> (a modified Nutch) with the goal of getting clean and correct MIME types: the 
> HTTP Content-Type may contain garbage and isn't always correct [1].

> For the June 2017 crawl I've prepared a comparison of the content types sent by 
> the server in the HTTP header and those detected by Tika 1.15 [2].  It shows 
> that the content types from Tika are definitely clean (1,400 different content 
> types vs. more than 6,000 content type "strings" from HTTP headers).

> A look at the "confusions", where Content-Type and Tika differ, shows a mixed 
> picture: some pairs are plausible, e.g., where Tika changes the type to a more 
> precise subtype or detects a MIME type at all:

>      count  Tika-1.15                HTTP-Content-Type
> 1001968023  application/xhtml+xml    text/html
>    2298146  application/rss+xml      text/xml
>     617435  application/rss+xml      application/xml
>     613525  text/html                unk
>     361525  application/xhtml+xml    unk
>     297707  application/rdf+xml      application/xml


> However, there are a few dubious decisions, esp. the group of web server-side 
> scripting languages (ASP, JSP, PHP, ColdFusion, etc.):

>   count  Tika-1.15         HTTP-Content-Type
> 2047739  text/x-php        text/html
>  681629  text/asp          text/html
>  193095  text/x-coldfusion text/html
>  172318  text/aspdotnet    text/html
>  139033  text/x-jsp        text/html
>   38415  text/x-cgi        text/html
>   32092  text/x-php        text/xml
>   18021  text/x-perl       text/html

> Of course, some misconfigured servers may deliver the script files 
> unmodified, but in general I wouldn't expect this to happen for millions of 
> pages.  I've checked some of the affected URLs:

> - HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
>     https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
>     http://www.privi.com/product-details.asp?cno=C10910011
>     http://mental-ray.de/Root_alt/Default.asp
>     http://ekyrs.org/support/index.php?action=profile
>     http://cwmorse.eu5.org/lineal/mostrar.php?contador=200

> - (overlong) comment block at start of HTML which "masks" the HTML declaration
>     http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
>     http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
>     https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
>     https://de.e-stories.org/categories.php?&lan=nl&art=p

> - HTML with some scripting fragments ("<?php?>") present:
>     http://www.eco-ani-yao.org/shien/

> - others are clearly HTML (this looks more like a bug; at least, there is no 
> simple explanation)
>     http://www.proedinc.com/customer/content.aspx?redid=9
>     http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
>     http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
>     http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79


> Obviously, certain file suffixes (.php, .aspx) should get less weight than 
> the Content-Type sent by the responding server.
> Now my question: where's the best place to fix this: in the crawler [3] or in 
> Tika?
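
One way to picture the "less weight" idea is a small post-detection rule: if
the detector reports a server-side scripting type but the server declared a
markup type, fall back to the declared type.  The sketch below is Python for
brevity.  resolve_mime, SCRIPT_TYPES and MARKUP_TYPES are hypothetical names,
the type sets are simply copied from the tables in the quoted mail, and this
is not how Nutch's MimeUtil or Tika currently behave.

```python
# Hypothetical sets, taken from the confusion tables in the quoted mail.
SCRIPT_TYPES = {
    "text/x-php", "text/asp", "text/x-coldfusion", "text/aspdotnet",
    "text/x-jsp", "text/x-cgi", "text/x-perl",
}
MARKUP_TYPES = {
    "text/html", "application/xhtml+xml", "text/xml", "application/xml",
}


def resolve_mime(detected, http_content_type):
    """Prefer the declared type when detection yields a script type.

    Assumption: a script type paired with a declared markup type is more
    likely a suffix-driven misdetection than an unprocessed script file.
    """
    # Drop parameters such as "; charset=utf-8" and normalise case.
    declared = (http_content_type or "").split(";", 1)[0].strip().lower()
    if detected in SCRIPT_TYPES and declared in MARKUP_TYPES:
        return declared
    # Otherwise keep the detector's answer (e.g. a more precise subtype).
    return detected


print(resolve_mime("text/x-php", "text/html; charset=UTF-8"))
print(resolve_mime("application/rss+xml", "text/xml"))
```

With such a rule the plausible refinements (text/xml to application/rss+xml)
are kept, while the text/x-php vs. text/html pairs collapse back to
text/html.  Whether the rule belongs in the crawler or in the detector is
exactly the open question.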

> If anyone is interested in using the detected MIME types or anything else 
> from Common Crawl - I'm happy to help!  The URL index [4] now contains a new 
> field "mime-detected" which makes it easy to search or grep for confusion 
> pairs.
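
As a rough illustration of searching the index for confusion pairs, the
following Python sketch counts (detected, declared) pairs from index lines.
Only the "mime-detected" field is named in the mail; the "mime" field for the
HTTP Content-Type and the "urlkey timestamp JSON" line layout are assumptions
about the index format, and confusion_pairs and the sample lines are made up
for the example.

```python
import json
from collections import Counter


def confusion_pairs(cdx_lines):
    """Count (detected, declared) MIME pairs that differ."""
    counts = Counter()
    for line in cdx_lines:
        # Assumed layout: "<urlkey> <timestamp> <json>"; JSON starts at "{".
        start = line.find("{")
        if start < 0:
            continue
        try:
            record = json.loads(line[start:])
        except json.JSONDecodeError:
            continue  # skip malformed lines instead of aborting the scan
        declared = record.get("mime")            # assumed field name
        detected = record.get("mime-detected")   # field named in the mail
        if declared and detected and declared != detected:
            counts[(detected, declared)] += 1
    return counts


# Made-up sample lines, not real index data.
sample = [
    'org,example)/a.php 20170623000000 {"url": "http://example.org/a.php", '
    '"mime": "text/html", "mime-detected": "text/x-php"}',
    'org,example)/b 20170623000001 {"url": "http://example.org/b", '
    '"mime": "text/html", "mime-detected": "text/html"}',
    'org,example)/c.php 20170623000002 {"url": "http://example.org/c.php", '
    '"mime": "text/html", "mime-detected": "text/x-php"}',
]

for (detected, declared), n in confusion_pairs(sample).most_common():
    print(n, detected, declared)
```

The output has the same column order as the tables in the quoted mail: count,
detected type, declared type.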


> Thanks and best,
> Sebastian


> [1] https://github.com/commoncrawl/nutch/issues/3
> [2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
>     https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
> [3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
> [4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/


> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
