Re: [Nutch-general] Alternatives

Jason Calabrese Thu, 06 Jul 2006 08:17:14 -0700

You can start up a crawler by just creating a job.  You can basically just 
copy/tweak some code from the main method in org.apache.nutch.crawl.Crawl.

In my application we are working at a lower level and first create the 
crawl_generate dir, then start a fetch, then we parse the fetched results, 
and then index the parsed results on our own.
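
The ordering described above can be sketched roughly as follows. This is 
illustrative only: CrawlStep and CrawlPipeline are hypothetical names, not 
Nutch classes, and the step bodies are placeholders for wherever the real 
generate, fetch, parse and index calls would go.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical runner: executes named steps in the order they were added
// and records which ones finished. Not part of Nutch.
class CrawlPipeline {
    private final Map<String, CrawlStep> steps = new LinkedHashMap<>();
    private final List<String> completed = new ArrayList<>();

    CrawlPipeline add(String name, CrawlStep step) {
        steps.put(name, step);
        return this;
    }

    // An exception from any step propagates, so later steps are skipped.
    void run(String segmentDir) {
        for (Map.Entry<String, CrawlStep> entry : steps.entrySet()) {
            entry.getValue().run(segmentDir);
            completed.add(entry.getKey());
        }
    }

    // Returns a copy so callers cannot modify the internal list.
    List<String> completed() {
        return new ArrayList<>(completed);
    }

    public static void main(String[] args) {
        CrawlPipeline pipeline = new CrawlPipeline()
            .add("generate", dir -> { /* placeholder: create crawl_generate under dir */ })
            .add("fetch", dir -> { /* placeholder: fetch the generated list */ })
            .add("parse", dir -> { /* placeholder: parse the fetched results */ })
            .add("index", dir -> { /* placeholder: index the parsed results */ });
        pipeline.run("segments/example"); // assumed path, for illustration only
        System.out.println(pipeline.completed());
    }
}

// Hypothetical stand-in for one crawl step; this is not a Nutch interface.
interface CrawlStep {
    void run(String segmentDir);
}
```

Keeping the steps in a LinkedHashMap preserves insertion order, so the 
completed list also shows how far a run got before a failure.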

I think with a little hacking you can make use of a lot of the nutch/hadoop 
code in any framework.  

The only place I had problems was getting the nutch progress integrated into 
our application so that an admin can see where the fetcher is within our 
application.  In order to do that I added a few hacks to the fetcher and 
parser, but I think there may be a better way.
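
One possible shape for that kind of progress hook is a small shared object 
that the fetcher and parser update and the admin page reads. This is only a 
sketch under that assumption: CrawlProgress and its methods are hypothetical, 
nothing here is Nutch API, and where the calls would be placed inside the 
fetcher is left open.

```java
import java.util.concurrent.atomic.AtomicLong;
import java.util.concurrent.atomic.AtomicReference;

// Hypothetical progress holder shared between worker threads and an admin view.
class CrawlProgress {
    private final AtomicReference<String> stage = new AtomicReference<>("idle");
    private final AtomicLong done = new AtomicLong();
    private final AtomicLong total = new AtomicLong();

    // Called once when a stage (for example "fetch") begins.
    void startStage(String name, long expectedItems) {
        stage.set(name);
        total.set(expectedItems);
        done.set(0);
    }

    // Called by a worker thread after each item it finishes.
    void itemDone() {
        done.incrementAndGet();
    }

    // The three fields are read separately, so a summary taken while a stage
    // is switching may briefly mix old and new values.
    String summary() {
        return stage.get() + ": " + done.get() + "/" + total.get();
    }

    public static void main(String[] args) {
        CrawlProgress progress = new CrawlProgress();
        progress.startStage("fetch", 3);
        progress.itemDone();
        progress.itemDone();
        System.out.println(progress.summary()); // prints "fetch: 2/3"
    }
}
```

The atomic counters let several fetcher threads report without extra locking, 
at the cost of the summary being only approximately consistent.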

On Wednesday 05 July 2006 8:58 am, karl wettin wrote:
> I have never looked at how Nutch works, nor have I used it. My questions
> might just be RTFM-related.
>
> Lately people have asked me to help them out with simple domain-specific
> web-indexing services. The requirements are, as usual when I'm involved,
> to run on very limited resources. What I did is to combine my very
> simple and minimalistic servlet engine <http://sf.net/project/servlet>
> with Lucene and NekoHTML, extracting only the content "frame" from
> the static design of the site.
>
> This made me think of two things:
>
> It would be nice to use the features of Nutch instead of my own hacky
> stuff. How bound is Nutch to the J2EE-container? Would it be a big job
> to make it run on an alternative GUI? Or is the container used for
> more than GUI? I.e. do all services (crawler, etc.) run within the
> container? Do they have to?
>
> It would be nice to automatically detect the content "frame" by
> analyzing the DOM tree of the pages on a site. Is there such a feature
> in Nutch, contributed to, or publicly available in some other project?
> I'd be more than happy to discuss, write and contribute it back if I end
> up making one.
