Hi All,

I have a list of around 100k {URL, segment} pairs that I am certain exist in the crawlset.

I want to extract only the page content for those particular URLs.

Looked at:

nutch readseg -get <segment dir> "<url>" -nofetch -nogenerate -noparse -noparsedata -noparsetext

This takes roughly 5 seconds per URL against a content dir of around 250 MB.

It takes the same amount of time whether the segment sits in /dev/shm (RAM) or on disk, so it looks like per-invocation process overhead.
I'm guessing this is JVM startup cost.

At 100k URLs this extrapolates out to a run time of around 139 hours (100,000 x 5 s = 500,000 s), i.e. nearly six days.

Is there a faster way to do random-access retrieval of page content?
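
To frame what I'm after: ideally I'd pay JVM startup once and then do all 100k lookups inside the same process, along the lines of the untested sketch below. It assumes the 0.9 segment layout (each part-NNNNN under <segment>/content is a Hadoop MapFile keyed by the URL as Text, with org.apache.nutch.protocol.Content values) and Hadoop-0.12-era APIs such as FileSystem.listPaths; the class name BatchContentReader is just something I made up.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.protocol.Content;
  import org.apache.nutch.util.NutchConfiguration;

  public class BatchContentReader {
    // args[0] = segment dir, args[1] = file with one URL per line
    public static void main(String[] args) throws IOException {
      Configuration conf = NutchConfiguration.create();
      FileSystem fs = FileSystem.get(conf);
      // Each part-NNNNN under <segment>/content should be a MapFile.
      Path[] parts = fs.listPaths(new Path(args[0], Content.DIR_NAME));
      MapFile.Reader[] readers = new MapFile.Reader[parts.length];
      for (int i = 0; i < parts.length; i++) {
        readers[i] = new MapFile.Reader(fs, parts[i].toString(), conf);
      }
      BufferedReader urls = new BufferedReader(new FileReader(args[1]));
      String url;
      while ((url = urls.readLine()) != null) {
        Text key = new Text(url);   // assuming Text keys in 0.9 segments
        Content value = new Content();
        // Try each partition; get() binary-searches the MapFile index.
        for (int i = 0; i < readers.length; i++) {
          if (readers[i].get(key, value) != null) {
            System.out.println(url + "\t" + value.getContent().length + " bytes");
            break;
          }
        }
      }
      urls.close();
      for (int i = 0; i < readers.length; i++) {
        readers[i].close();
      }
    }
  }

Since MapFile.Reader.get() only does a couple of seeks per key, each lookup should cost milliseconds rather than a JVM launch. If even that proved too slow, the other option I can see is sorting the 100k URLs by key and doing a single sequential scan over the underlying data files.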

(Nutch 0.9)

