I'm well aware of the two possibilities you're proposing, but I don't think either would fit with the existing software at the company I work for. I guess I'll have to crawl through Nutch's guts to find what I'm looking for and export it. Once I've managed that, I'll try to write the tutorial that I'm missing today.
Nutch is intended to handle large collections.  The simplest way to get hold
of a large collection is to crawl the web.

But Nutch is not just a web search engine.  It also provides distributed
creation of indexes and distributed search, which is what motivated my
comment about it being the networked version of Lucene.

So, while I agree with your statement that Nutch was "especially designed to
deal with web documents", I would strongly disagree that this is a
limitation.  For one thing, if you actually have gobs of documents, you
probably will have to store them in a networked form somehow.  That
networked form is probably pretty easy to make accessible via HTTP and that
makes a web-oriented search engine like Nutch just what you need.

Another way to say this is that if you need a general-purpose
networked/distributed search engine and you have a web-oriented distributed
search engine, you can either adapt the search engine to not be
web-oriented, or you can adapt your collection to be web-oriented.
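
To make the second option concrete, here is a minimal sketch in Python of
exposing a directory of documents over HTTP so that a web-oriented crawler
could fetch them.  It uses only the Python standard library and is not part
of Nutch; the helper name serve_collection and its parameters are
hypothetical, chosen for illustration, and a real deployment would more
likely sit behind a proper web server.

```python
import functools
import threading
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer


def serve_collection(directory, host="127.0.0.1", port=0):
    """Expose the files under `directory` over HTTP (hypothetical helper).

    Passing port=0 lets the operating system pick a free port.
    The server runs in a background thread and is returned so the
    caller can read its address and shut it down.
    """
    # Bind the stock file-serving handler to the chosen directory.
    handler = functools.partial(SimpleHTTPRequestHandler, directory=directory)
    server = ThreadingHTTPServer((host, port), handler)
    # Serve requests in the background so the caller is not blocked.
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    return server
```

Calling serve_collection("/path/to/docs", port=8000) would make each file
reachable at a URL such as http://127.0.0.1:8000/<file name>.  The handler
also returns an HTML listing with links for each directory, which is the
kind of page a crawler can follow; which seed URL to hand to the crawler
is left as an assumption here.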


On 7/18/07 8:32 AM, "Samuel LEMOINE" <[EMAIL PROTECTED]> wrote:

You describe Nutch as being "the networked version of Lucene", but from
what I've seen it's more specific than that, especially designed to deal
with web documents... am I wrong in assuming this?


