On Tue, 2005-09-20 at 19:07 -0700, Doug Cutting wrote:
> What version of Nutch are you using?
>
> The version of NDFS in the mapred branch is much improved. The crawling
> code in that branch has also been re-written to be MapReduce-based, and
> will automatically manage multi-machine fetching, db updates, indexing, etc.
I hadn't looked at it and wasn't concerned until I saw the "automatically" -- will we still be able to crawl without indexing?

Secondly, will it still be possible to dump the output (i.e. segread -dump) to a flat file in large chunks?

We use Nutch as a crawler only: after taking a dump of the data we remove the segment from the filesystem, so we only have a couple hundred GB of data around at any given time. We do our crawling, db updates, etc. in one environment, then post-process the retrieved HTML in large chunks (segments of about 200k pages) in another environment.

-- 
Rod Taylor <[EMAIL PROTECTED]>
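For context, the crawl-only workflow described above can be sketched with the pre-mapred (0.7-era) Nutch command-line tools. This is a rough outline, not an exact script: the segment path and the `urls` seed file are placeholders, and the surrounding loop/cleanup logic is assumed rather than taken from the message.

```
# Seed the web db and run one fetch cycle -- no indexing step.
bin/nutch inject db -urlfile urls
bin/nutch generate db segments
bin/nutch fetch segments/<segment>
bin/nutch updatedb db segments/<segment>

# Dump the fetched segment to a flat file for external post-processing,
# then delete the segment to keep on-disk data bounded.
bin/nutch segread -dump segments/<segment> > dump.txt
rm -r segments/<segment>
```

The point of the question in the message is whether this decoupling -- fetch and db update in one environment, HTML post-processing elsewhere -- survives the move to the MapReduce-based crawl code, which bundles fetching, db updates, and indexing together.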
