On Tue, 2005-09-20 at 19:07 -0700, Doug Cutting wrote:
> What version of Nutch are you using?
> 
> The version of NDFS in the mapred branch is much improved.  The crawling 
> code in that branch has also been re-written to be MapReduce-based, and 
> will automatically manage multi-machine fetching, db updates, indexing, etc.

I haven't looked at it, and I wasn't concerned until I saw the word
"automatically": will we still be able to crawl without indexing?

Second, will it still be possible to dump the output (i.e. segread
-dump) to a flat file in large chunks?


We use Nutch as a crawler only; after taking a dump of the data we
remove the segment from the filesystem. This means we only have a couple
of hundred GB of data around at any given time.

We do our crawling, db updates, etc. in one environment, then
post-process the retrieved HTML in large chunks (segments of about 200k
pages) in another environment.
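The cycle above can be sketched roughly as follows. This is only an
illustration against the 0.7-era command line: the crawl/db and
crawl/segments paths, the -topN value, and the exact generate/fetch/
updatedb arguments are assumptions, not something from the original
message.

```shell
# Hypothetical sketch of the crawl-dump-delete cycle (0.7-era tools);
# paths and exact arguments are assumptions and will vary per setup.
DB=crawl/db
SEGDIR=crawl/segments

bin/nutch generate $DB $SEGDIR -topN 200000   # select ~200k pages for a segment
SEG=$SEGDIR/$(ls -t $SEGDIR | head -1)        # newest segment directory
bin/nutch fetch $SEG                          # fetch only -- no indexing step
bin/nutch updatedb $DB $SEG                   # fold fetch results into the db
bin/nutch segread -dump $SEG > dump.txt       # flat-file dump for post-processing
rm -rf $SEG                                   # discard the segment afterwards
```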

-- 
Rod Taylor <[EMAIL PROTECTED]>
