Hi all, I'm new to Nutch and turned to it to obtain a setup along the following lines:
We want a remote machine, running Nutch, that we can incrementally feed URLs to, and whose index and raw crawled content we can access. It seems to me that Nutch is what we need, but I am struggling to get some concepts clear:

1/ Access to the results of Nutch running on a remote machine: a workstation needs to access the results of the Nutch server, possibly at high throughput (fact extraction). What is the right way for a remote machine to access all the Nutch functionality on the server: search, get content, feed new URLs, recrawl, etc.? Is that provided by the webapp part of Nutch? By NutchBean? Or is that limited to local access?

2/ Recrawling: everywhere I read, this seems to be a difficult issue. What setup should I aim for? Is a recrawl best run from a script on the server? Can I easily limit the recrawl to the newly added URLs? We envisage an incremental system that gets more and more URLs as it is used and processes the content of those pages. At this point I don't even know whether that is a feasible goal.

3/ Extending Nutch: one thing we want is to store some extra fields in the index. So far this looks like a matter of writing a plugin along the lines of http://wiki.apache.org/nutch/WritingPluginExample-0.9. Is that the correct direction?

To make the questions more concrete, I have put some rough sketches of what I have in mind below my signature.

It would be great if someone could push me in the right direction; I am spending a lot of time going through (old) documentation that sometimes seems conflicting.

Thanks!

Piet
--
PvR
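PS: to make question 1 concrete, below is roughly how I imagine a client using NutchBean, pieced together from the 0.9 javadocs. The query string is made up, I am not sure getContent() is the intended way to reach the raw page, and I have no idea yet whether any of this can run on a remote workstation rather than on the server itself:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

public class SearchSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);   // finds the index via searcher.dir

    Query query = Query.parse("fact extraction", conf);
    Hits hits = bean.search(query, 10);     // top 10 hits

    for (int i = 0; i < hits.getLength(); i++) {
      Hit hit = hits.getHit(i);
      HitDetails details = bean.getDetails(hit);
      System.out.println(details.getValue("url"));

      // raw content of the crawled page; is this the intended way?
      byte[] content = bean.getContent(details);
    }
  }
}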
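For question 2, this is the kind of cycle I imagine a cron job running on the server, written here as a small Java driver around the standard bin/nutch sub-commands. The directory layout (crawl/crawldb, crawl/segments, urls) is copied from the tutorial, and indexing into a fresh crawl/indexes each cycle is surely too naive (I realise merging/dedup would be needed); whether the generate step can be limited to the newly injected URLs is exactly what I am asking:

import java.io.File;
import java.io.IOException;
import java.util.Arrays;

public class RecrawlCycle {

  static void nutch(String... args) throws IOException, InterruptedException {
    String[] cmd = new String[args.length + 1];
    cmd[0] = "bin/nutch";
    System.arraycopy(args, 0, cmd, 1, args.length);
    Process p = new ProcessBuilder(cmd).inheritIO().start();
    if (p.waitFor() != 0)
      throw new IOException("nutch step failed: " + Arrays.toString(cmd));
  }

  public static void main(String[] args) throws Exception {
    // 1/ feed the newly added URLs into the crawl db
    nutch("inject", "crawl/crawldb", "urls");

    // 2/ generate a fetch list; as I understand it, only URLs that are
    //    new or due for refetch (db.default.fetch.interval) end up here
    nutch("generate", "crawl/crawldb", "crawl/segments");

    // generate creates a timestamped segment dir; pick the newest one
    File[] segments = new File("crawl/segments").listFiles();
    Arrays.sort(segments);
    String segment = segments[segments.length - 1].getPath();

    // 3/ fetch the segment, then merge the results back into the db
    nutch("fetch", segment);
    nutch("updatedb", "crawl/crawldb", segment);

    // 4/ rebuild the link db and index the new segment
    nutch("invertlinks", "crawl/linkdb", segment);
    nutch("index", "crawl/indexes", "crawl/crawldb", "crawl/linkdb", segment);
  }
}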
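And for question 3, the kind of plugin I understand the wiki example to describe: an IndexingFilter that adds an extra field to each Lucene document. The field name and value below are placeholders for our real extracted data, and I have left out the plugin.xml / build.xml wiring from the wiki page:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class ExtraFieldIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    // stored so we can read it back, untokenized so it is matched as-is
    doc.add(new Field("sourceTag", "our-extracted-value",
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}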