Dear all,

I have been looking for a Java implementation of Google's MapReduce design and was very glad to find Nutch. However, I don't want to use it for web crawling: I want to experiment with Nutch's MapReduce as a method for (distributed) searching through some existing, very large datasets that I have stored in an NFS filesystem.
I've just had a quick try at getting started, armed with the Wiki (http://wiki.apache.org/nutch/FAQ#head-48f8d8319c3c85953118721f42336613abf7f6b6), Tom White's blog (http://weblogs.java.net/blog/tomwhite/archive/2005/09/mapreduce.html) and the source code (I checked out the 0.7 branch from http://svn.apache.org/repos/asf/lucene/nutch/branches/branch-0.7/). I tried running the Grep application (using Cygwin under WinXP):

    cd nutch/bin
    ./nutch org.apache.nutch.mapReduce.demo.Grep c:\in c:\out 'A|C'

I get messages saying it is parsing nutch-default.xml and nutch-site.xml (in fact I get each of these messages twice), then a java.net.ConnectException whose stack trace ends at mapReduce.JobClient.getFs(JobClient.java:195). I can't figure out what it is trying to connect to; perhaps it is looking for an NDFS instance? For now I just want to run everything on my local machine.

I realise that some of the material on the web is out of date (for example, it refers to a mapred package rather than the mapReduce package), which is fine because I understand this code is still under development. However, if someone could give me some pointers for running MapReduce in "standalone" mode (i.e. without using NDFS or doing web crawls) I'd be extremely grateful.
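My current guess is that I need to override a couple of properties in conf/nutch-site.xml to point everything at the local machine, along the lines of the sketch below. I'm not at all sure these are the right property names for the mapReduce code (fs.default.name and mapred.job.tracker are guesses on my part, not names I've confirmed in this branch's nutch-default.xml), so please correct me if they're wrong:

    <?xml version="1.0"?>
    <!-- conf/nutch-site.xml: my guess at forcing "standalone" mode.
         The two property names below are guesses; I haven't confirmed
         them against nutch-default.xml in this branch. -->
    <nutch-conf>
      <property>
        <name>fs.default.name</name>
        <value>local</value>  <!-- use the local filesystem rather than NDFS -->
      </property>
      <property>
        <name>mapred.job.tracker</name>
        <value>local</value>  <!-- run the job in-process, with no JobTracker daemon -->
      </property>
    </nutch-conf>

Thanks in advance,
Jon Blower
University of Reading, UK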