Soooooo...

Executive Summary: Funnel all files through a webserver if you want page weighting (OPIC?) and anchor text to be indexed/used.

I just ran some experiments after unsuccessfully trying to invertlinks on the segments built from file:///enwiki/. (I'd originally crawled with ignore-internal-links turned on.)

It seems that for an intranet crawl, using file:/// as the hierarchy rather than http://localhost/somelinktosamefiles/ results in no OPIC scoring and no anchor text. You also have to disable <db.ignore.internal.links> by setting it to false, presumably because every link in a file:/// crawl counts as internal.
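For anyone wanting to reproduce this, the override described above would go in conf/nutch-site.xml. A sketch, assuming the property name matches the one shipped in nutch-default.xml:

```xml
<!-- conf/nutch-site.xml: override the default so that links between
     pages on the same host (which is all of them in a file:/// crawl)
     still make it into the linkdb for invertlinks / anchor text. -->
<property>
  <name>db.ignore.internal.links</name>
  <value>false</value>
</property>
```

Note that with this set to false on a normal web crawl you may pick up a lot of navigation-boilerplate anchors, so it's worth scoping it to the intranet/file crawl configuration.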

My initial thought was that crawling the file system should be faster than pulling the same files through http://localhost/. Examining the Hadoop log, the times for the same set of 889 pages are 5 and 7 seconds respectively. Same machine, with http://localhost/ having the potential advantage that the earlier file:/// run had already warmed the file cache.

Any explanations from anyone? Comments?

Cheers,
 Winton




