You can start up a crawler by just creating a job. You can basically just copy and tweak some code from the main method in org.apache.nutch.crawl.Crawl.
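
As a rough sketch of the shape such a driver could take inside another application: the Stage and ProgressListener types below are made-up names for illustration, not Nutch classes, and the stage bodies are empty placeholders where the code lifted from Crawl's main method would go. The listener is only there so the host application can show which step is running.

```java
import java.util.Arrays;
import java.util.List;

public class CrawlDriverSketch {

    /** Hypothetical stand-in for one crawl step (generate, fetch, parse, index). */
    public interface Stage {
        String name();
        void run() throws Exception;
    }

    /** Hypothetical callback the host application implements to display progress. */
    public interface ProgressListener {
        void onProgress(String stageName, int stageIndex, int stageCount, boolean finished);
    }

    /** Wraps a name and a body into a Stage. */
    public static Stage stage(final String name, final Runnable body) {
        return new Stage() {
            public String name() { return name; }
            public void run() { body.run(); }
        };
    }

    /** Runs the stages in order, notifying the listener before and after each one. */
    public static void runPipeline(List<Stage> stages, ProgressListener listener)
            throws Exception {
        int count = stages.size();
        for (int i = 0; i < count; i++) {
            Stage s = stages.get(i);
            listener.onProgress(s.name(), i + 1, count, false);
            s.run(); // assumption: the real Nutch/Hadoop calls would go here
            listener.onProgress(s.name(), i + 1, count, true);
        }
    }

    public static void main(String[] args) throws Exception {
        // Placeholder stages; each body is empty and would be replaced by real work.
        List<Stage> stages = Arrays.asList(
                stage("generate", () -> { }),
                stage("fetch", () -> { }),
                stage("parse", () -> { }),
                stage("index", () -> { }));

        runPipeline(stages, (name, index, count, finished) ->
                System.out.println((finished ? "finished " : "starting ")
                        + name + " (" + index + "/" + count + ")"));
    }
}
```

This only reports progress at stage boundaries; seeing how far along the fetcher is within a stage would still need hooks inside the fetcher itself.
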
In my application we work at a lower level: we first create the crawl_generate dir, then start a fetch, then parse the fetched results, and then index the parsed results on our own. I think with a little hacking you can make use of a lot of the Nutch/Hadoop code in any framework. The only place I had problems was getting the Nutch progress integrated into our application, so that an admin can see from within our application where the fetcher is. To do that I added a few hacks to the fetcher and parser, but I think there may be a better way.

On Wednesday 05 July 2006 8:58 am, karl wettin wrote:
> I have never looked at how Nutch works, nor have I used it. My questions
> might just be RTFM-related.
>
> Lately people have asked me to help them out with simple domain-specific
> web-indexing services. The requirements are, as usual when I'm involved,
> to run on very limited resources. What I did is to combine my very
> simple and minimalistic servlet engine <http://sf.net/project/servlet>
> with Lucene and NekoHTML, extracting only the content "frame" from
> the static design of the site.
>
> This made me think of two things:
>
> It would be nice to use the features of Nutch instead of my own hacky
> stuff. How bound is Nutch to the J2EE-container? Would it be a big job
> to make it run on an alternative GUI? Or is the container used for
> more than GUI? I.e. do all services (crawler, etc.) run within the
> container? Do they have to?
>
> It would be nice to automatically detect the content "frame" by
> analyzing the DOM tree of the pages on a site. Is there such a feature
> in Nutch, contributed to, or publicly available in some other project?
> I'd be more than happy to discuss, write and contribute it back if I end
> up making one.
