Soooooo...

Executive Summary: Funnel all files through a webserver if you want page weighting (OPIC?) and anchor text to be indexed/used.

I just ran some experiments after unsuccessfully trying to invertlinks on the segments built from file:///enwiki/. (I'd originally crawled with ignore-internal-links turned on.)

It seems that for an intranet crawl, using file:/// as the hierarchy rather than http://localhost/somelinktosamefiles/ results in no OPIC scoring and no anchor text. You also have to disable <db.ignore.internal.links> by setting it to false, presumably because every link in a file:/// crawl counts as internal.
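For anyone wanting to reproduce this, the override described above would go in conf/nutch-site.xml. A sketch, assuming the property name matches the one shipped in nutch-default.xml:

```xml
<!-- conf/nutch-site.xml: override the default so that links between
     pages on the same host (which is all of them in a file:/// crawl)
     still make it into the linkdb for invertlinks / anchor text. -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
```

Note that with this set to false on a normal web crawl you may pick up a lot of navigation-boilerplate anchors, so it's worth scoping it to the intranet/file crawl configuration.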

My initial thought was that crawling the file system should be faster than pulling the same files through http://localhost/. Examining the Hadoop log, the times for the same set of 889 pages are 5 and 7 seconds respectively. Same machine, with http://localhost/ having the potential advantage that the earlier file:/// run had already warmed the file cache.

Any explanations from anyone? Comments?

Cheers,
 Winton




