Hi all, I'm new to Nutch and turned to it to obtain a setup along the following lines:
We want a remote machine, running Nutch, that we can incrementally feed URLs to, and whose index and raw crawled content we can access. It seems to me that Nutch is what we need, but I am struggling to get some concepts clear:

1/ Access to the results of Nutch running on a remote machine: a workstation needs to access the results of the Nutch server, possibly at high throughput (fact extraction). What is the right way for a remote machine to access all the Nutch functionality on the server: search, get content, feed new URLs, recrawl, etc.? Is that provided by the webapp part of Nutch? By NutchBean? Or is that limited to local access?

2/ Recrawling: everywhere I read, this seems to be a difficult issue. What setup should I aim for? Is a recrawl best run from a script on the server? Can I easily limit the recrawl to the newly added URLs? We envisage an incremental system that gets more and more URLs as it is used and processes the content of those pages. At this point I don't even know whether that is a feasible goal.

3/ Extending Nutch: one thing we want is to store some extra fields in the index. So far this looks like a matter of writing a plugin along the lines of http://wiki.apache.org/nutch/WritingPluginExample-0.9. Is that the correct direction?

To make the questions more concrete, I have put some rough sketches of what I have in mind below my signature.

It would be great if someone could push me in the right direction; I am spending a lot of time going through (old) documentation that sometimes seems conflicting.

Thanks!

Piet
--
PvR
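PS: to make question 1 concrete, below is roughly how I imagine a client using NutchBean, pieced together from the 0.9 javadocs. The query string is made up, I am not sure getContent() is the intended way to reach the raw page, and I have no idea yet whether any of this can run on a remote workstation rather than on the server itself:

import org.apache.hadoop.conf.Configuration;
import org.apache.nutch.searcher.Hit;
import org.apache.nutch.searcher.HitDetails;
import org.apache.nutch.searcher.Hits;
import org.apache.nutch.searcher.NutchBean;
import org.apache.nutch.searcher.Query;
import org.apache.nutch.util.NutchConfiguration;

public class SearchSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = NutchConfiguration.create();
    NutchBean bean = new NutchBean(conf);   // finds the index via searcher.dir

    Query query = Query.parse("fact extraction", conf);
    Hits hits = bean.search(query, 10);     // top 10 hits

    for (int i = 0; i < hits.getLength(); i++) {
      Hit hit = hits.getHit(i);
      HitDetails details = bean.getDetails(hit);
      System.out.println(details.getValue("url"));

      // raw content of the crawled page; is this the intended way?
      byte[] content = bean.getContent(details);
    }
  }
}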
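For question 2, this is the kind of cycle I imagine a cron job running on the server, written here as a small Java driver around the standard bin/nutch sub-commands. The directory layout (crawl/crawldb, crawl/segments, urls) is copied from the tutorial, and indexing into a fresh crawl/indexes each cycle is surely too naive (I realise merging/dedup would be needed); whether the generate step can be limited to the newly injected URLs is exactly what I am asking:

import java.io.File;
import java.io.IOException;
import java.util.Arrays;

public class RecrawlCycle {

  static void nutch(String... args) throws IOException, InterruptedException {
    String[] cmd = new String[args.length + 1];
    cmd[0] = "bin/nutch";
    System.arraycopy(args, 0, cmd, 1, args.length);
    Process p = new ProcessBuilder(cmd).inheritIO().start();
    if (p.waitFor() != 0)
      throw new IOException("nutch step failed: " + Arrays.toString(cmd));
  }

  public static void main(String[] args) throws Exception {
    // 1/ feed the newly added URLs into the crawl db
    nutch("inject", "crawl/crawldb", "urls");

    // 2/ generate a fetch list; as I understand it, only URLs that are
    //    new or due for refetch (db.default.fetch.interval) end up here
    nutch("generate", "crawl/crawldb", "crawl/segments");

    // generate creates a timestamped segment dir; pick the newest one
    File[] segments = new File("crawl/segments").listFiles();
    Arrays.sort(segments);
    String segment = segments[segments.length - 1].getPath();

    // 3/ fetch the segment, then merge the results back into the db
    nutch("fetch", segment);
    nutch("updatedb", "crawl/crawldb", segment);

    // 4/ rebuild the link db and index the new segment
    nutch("invertlinks", "crawl/linkdb", segment);
    nutch("index", "crawl/indexes", "crawl/crawldb", "crawl/linkdb", segment);
  }
}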
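And for question 3, the kind of plugin I understand the wiki example to describe: an IndexingFilter that adds an extra field to each Lucene document. The field name and value below are placeholders for our real extracted data, and I have left out the plugin.xml / build.xml wiring from the wiki page:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.crawl.Inlinks;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.parse.Parse;

public class ExtraFieldIndexingFilter implements IndexingFilter {

  private Configuration conf;

  public Document filter(Document doc, Parse parse, Text url,
                         CrawlDatum datum, Inlinks inlinks)
      throws IndexingException {
    // stored so we can read it back, untokenized so it is matched as-is
    doc.add(new Field("sourceTag", "our-extracted-value",
                      Field.Store.YES, Field.Index.UN_TOKENIZED));
    return doc;
  }

  public void setConf(Configuration conf) { this.conf = conf; }

  public Configuration getConf() { return conf; }
}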