Hi All,

I have a list of around 100k {URL, segment} pairs that I am certain exist in the crawlset.

I want to extract only the page content for those particular URLs.

Looked at:

nutch readseg -get <segment dir> "<url>" -nofetch -nogenerate -noparse -noparsedata -noparsetext

This takes roughly 5 seconds per URL against a content dir of around 250 MB.

It takes the same amount of time whether the segment sits in /dev/shm (RAM) or on disk, so it looks like per-invocation process overhead.
I'm guessing this is JVM startup cost.

At 100k URLs this extrapolates out to a run time of around 139 hours (100,000 x 5 s = 500,000 s), i.e. nearly six days.

Is there a faster way to do random-access retrieval of page content?
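
To frame what I'm after: ideally I'd pay JVM startup once and then do all 100k lookups inside the same process, along the lines of the untested sketch below. It assumes the 0.9 segment layout (each part-NNNNN under <segment>/content is a Hadoop MapFile keyed by the URL as Text, with org.apache.nutch.protocol.Content values) and Hadoop-0.12-era APIs such as FileSystem.listPaths; the class name BatchContentReader is just something I made up.

  import java.io.BufferedReader;
  import java.io.FileReader;
  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;
  import org.apache.hadoop.io.MapFile;
  import org.apache.hadoop.io.Text;
  import org.apache.nutch.protocol.Content;
  import org.apache.nutch.util.NutchConfiguration;

  public class BatchContentReader {
    // args[0] = segment dir, args[1] = file with one URL per line
    public static void main(String[] args) throws IOException {
      Configuration conf = NutchConfiguration.create();
      FileSystem fs = FileSystem.get(conf);
      // Each part-NNNNN under <segment>/content should be a MapFile.
      Path[] parts = fs.listPaths(new Path(args[0], Content.DIR_NAME));
      MapFile.Reader[] readers = new MapFile.Reader[parts.length];
      for (int i = 0; i < parts.length; i++) {
        readers[i] = new MapFile.Reader(fs, parts[i].toString(), conf);
      }
      BufferedReader urls = new BufferedReader(new FileReader(args[1]));
      String url;
      while ((url = urls.readLine()) != null) {
        Text key = new Text(url);   // assuming Text keys in 0.9 segments
        Content value = new Content();
        // Try each partition; get() binary-searches the MapFile index.
        for (int i = 0; i < readers.length; i++) {
          if (readers[i].get(key, value) != null) {
            System.out.println(url + "\t" + value.getContent().length + " bytes");
            break;
          }
        }
      }
      urls.close();
      for (int i = 0; i < readers.length; i++) {
        readers[i].close();
      }
    }
  }

Since MapFile.Reader.get() only does a couple of seeks per key, each lookup should cost milliseconds rather than a JVM launch. If even that proved too slow, the other option I can see is sorting the 100k URLs by key and doing a single sequential scan over the underlying data files.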

(Nutch 0.9)

