Nutch is intended to handle large collections. The simplest way to get hold of large collections is to search the web.

But Nutch is not just a web search engine. It also provides distributed index creation and distributed search, which is what motivated my comment about it being the networked version of Lucene.

So, while I agree with your statement that Nutch was "especially designed to deal with web documents", I would strongly disagree that this is a limitation. For one thing, if you actually have gobs of documents, you will probably have to store them in a networked form somehow. That networked form is probably pretty easy to make accessible via HTTP, and that makes a web-oriented search engine like Nutch just what you need.

Another way to say this is that if you need a general-purpose networked/distributed search engine and you have a web-oriented distributed search engine, you can either adapt the search engine to not be web-oriented, or you can adapt your collection to be web-oriented.

On 7/18/07 8:32 AM, "Samuel LEMOINE" <[EMAIL PROTECTED]> wrote:

You quote Nutch as being "the networked version of Lucene", but from what I've seen it's more specific than that, especially designed to deal with web documents... am I wrong in assuming this?

I'm well aware of the two possibilities you're proposing, but I don't
think either would fit with the existing software of the company I work
for. I guess I'll have to dig through Nutch's guts to find what I'm
looking for, and extract it. Once I have managed this, I'll try to
write the tutorial that I am missing today.
