This looks great, well done Tika!

Thank you for sharing, Tim

Cheers

Stefan

On 2017-07-05, Allison, Timothy B. wrote:
> Fellow file-philes on [compress],
>
> Sebastian Nagel has added file type id via Apache Tika to Common Crawl.
> While Tika is not 100% accurate, this means that we have far better clarity
> on mime type than relying on the http header+file suffix. So, for testing
> purposes, you (or we over on Tika) can much more easily gather a small test
> corpus of files by mime type.
>
> Many, many thanks to Sebastian and Common Crawl!
>
> Cheers,
>
> Tim
>
> -----Original Message-----
> From: Sebastian Nagel [mailto:wastl.na...@googlemail.com]
> Sent: Tuesday, July 4, 2017 6:18 AM
> To: u...@tika.apache.org
> Subject: Tika content detection and crawled "remote" content
>
> Hi,
>
> recently I've plugged in Tika's content detection into Common Crawl's crawler
> (modified Nutch) with the target to get clean and correct MIME type - the
> HTTP Content-Type may contain garbage and isn't always correct [1].
>
> For the June 2017 crawl I've prepared a comparison of content types sent by
> the server in the HTTP header and as detected by Tika 1.15 [2]. It shows
> that content types by Tika are definitely clean
> (1,400 different content types vs. more than 6,000 content type "strings"
> from HTTP headers).
>
> A look on the "confusions" where Content-Type and Tika differ, shows a mixed
> picture: some pairs are plausible, e.g., if Tika changes the type to a more
> precise subtype or detects the MIME at all:
>
>               Tika-1.15               HTTP-Content-Type
>   1001968023  application/xhtml+xml   text/html
>      2298146  application/rss+xml     text/xml
>       617435  application/rss+xml     application/xml
>       613525  text/html               unk
>       361525  application/xhtml+xml   unk
>       297707  application/rdf+xml     application/xml
>
> However, there are a few dubious decisions, esp. the group of web server-side
> scripting languages (ASP, JSP, PHP, ColdFusion, etc.):
>
>               Tika-1.15               HTTP-Content-Type
>      2047739  text/x-php              text/html
>       681629  text/asp                text/html
>       193095  text/x-coldfusion       text/html
>       172318  text/aspdotnet          text/html
>       139033  text/x-jsp              text/html
>        38415  text/x-cgi              text/html
>        32092  text/x-php              text/xml
>        18021  text/x-perl             text/html
>
> Of course, due to misconfigurations some servers may deliver the script files
> unmodified but in general I wouldn't expect that this happens for millions of
> pages. I've checked some of the affected URLs:
>
> - HTML fragment (no declaration of <!DOCTYPE...> or <html> opening tag)
>   https://www.projectmanagement.com/profile/profile_contributions.cfm?profileID=46773580&popup=&c_b=0&c_mb=0&c_q=0&c_a=2&c_r=1&c_bc=1&c_wc=0&c_we=0&c_ar=0&c_ack=0&c_v=0&c_d=0&c_ra=2&c_p=0
>   http://www.privi.com/product-details.asp?cno=C10910011
>   http://mental-ray.de/Root_alt/Default.asp
>   http://ekyrs.org/support/index.php?action=profile
>   http://cwmorse.eu5.org/lineal/mostrar.php?contador=200
>
> - (overlong) comment block at start of HTML which "masks" the HTML declaration
>   http://www.mannheim-virtuell.de/index.php?branchenID=2&rubrikID=24
>   http://www.exoduschurch.org/bbs/view.php?id=sunday_school&page=1&sn1=&divpage=1&sn=off&ss=on&sc=on&select_arrange=headnum&desc=asc&no=6
>   https://www.preventiongenetics.com/About/Resources/disease/MarfansSyndrome.php
>   https://de.e-stories.org/categories.php?&lan=nl&art=p
>
> - HTML with some scripting fragments ("<?php?>") present:
>   http://www.eco-ani-yao.org/shien/
>
> - others are clearly HTML (looks more like a bug, at least, there is no
>   simple explanation)
>   http://www.proedinc.com/customer/content.aspx?redid=9
>   http://cball.dyndns.org/wbb2/board.php?boardid=8&sid=bf3b7971faa23413fa1164be0c068f79
>   http://eusoma.org/Engx/Info/ContactUs.aspx?cont=contact
>   http://cball.dyndns.org/wbb2/map.php?sid=bf3b7971faa23413fa1164be0c068f79
>
> Obviously certain file suffixes (.php, .aspx) should get less weight compared
> to Content-Type sent from the responding server.
>
> Now my question: where's the best place to fix this: in the crawler [3] or in
> Tika?
>
> If anyone is interested in using the detected MIME types or anything else
> from Common Crawl - I'm happy to help! The URL index [4] contains now a new
> field "mime-detected" which makes it easy to search or grep for confusion
> pairs.
>
> Thanks and best,
>
> Sebastian
>
> [1] https://github.com/commoncrawl/nutch/issues/3
> [2] s3://commoncrawl-dev/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
>     https://commoncrawl-dev.s3.amazonaws.com/tika-content-type-detection/content-type-diff-tika-1.15-cc-main-2017-26.txt.xz
> [3] https://github.com/apache/nutch/blob/master/src/java/org/apache/nutch/util/MimeUtil.java#L152
> [4] http://commoncrawl.org/2015/04/announcing-the-common-crawl-index/
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@commons.apache.org
> For additional commands, e-mail: dev-h...@commons.apache.org
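
For anyone who wants to gather such confusion pairs themselves, here is a minimal sketch of how the tallies in Sebastian's tables could be reproduced from URL index lines. It is not the script he used. It assumes each index line has the layout "<urlkey> <timestamp> <JSON record>" and that the HTTP header type is stored in a field named "mime" next to the "mime-detected" field mentioned in the mail; the sample lines and the function names are made up for illustration.

```python
import json
from collections import Counter

# Server-side script types reported by Tika 1.15 in the table quoted above.
SCRIPT_TYPES = {
    "text/x-php", "text/asp", "text/x-coldfusion", "text/aspdotnet",
    "text/x-jsp", "text/x-cgi", "text/x-perl",
}

# Hypothetical index lines; real ones would come from the Common Crawl URL index.
SAMPLE = [
    'org,example)/a.php 20170623000000 {"url": "http://example.org/a.php", "mime": "text/html", "mime-detected": "text/x-php"}',
    'org,example)/b.php 20170623000001 {"url": "http://example.org/b.php", "mime": "text/html", "mime-detected": "text/x-php"}',
    'org,example)/feed 20170623000002 {"url": "http://example.org/feed", "mime": "text/xml", "mime-detected": "application/rss+xml"}',
    'org,example)/ 20170623000003 {"url": "http://example.org/", "mime": "text/html", "mime-detected": "text/html"}',
]


def parse_index_line(line):
    """Return the JSON record of one index line, or None if the line is malformed.

    Assumed layout: "<urlkey> <timestamp> <json>" (an assumption, see lead-in).
    """
    parts = line.strip().split(" ", 2)
    if len(parts) < 3:
        return None
    try:
        return json.loads(parts[2])
    except json.JSONDecodeError:
        return None


def confusion_pairs(lines):
    """Count (detected type, HTTP type) pairs where the two types differ."""
    pairs = Counter()
    for line in lines:
        record = parse_index_line(line)
        if record is None:
            continue
        http_type = record.get("mime", "unk")  # "mime" field name is assumed
        detected = record.get("mime-detected")
        if detected is None or detected == http_type:
            continue
        pairs[(detected, http_type)] += 1
    return pairs


def dubious(pairs):
    """Keep pairs where Tika saw a script type but the server announced something else."""
    return Counter({
        pair: count for pair, count in pairs.items()
        if pair[0] in SCRIPT_TYPES and pair[1] not in SCRIPT_TYPES
    })


if __name__ == "__main__":
    for (detected, http_type), count in confusion_pairs(SAMPLE).most_common():
        print(f"{count:>10} {detected} {http_type}")
```

The `dubious` filter only singles out the script-language group for manual review; it does not decide which of the two types is right, which is the open question in the mail.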