All, I'm thinking about using some parts of Nutch in my projects. We have been using Lucene for a while with great success, but I now want to change the way we store our content (mostly XML documents).
In the past we stored the content in MySQL, and we now use simple gzipped XML files. Removing the dependency on MySQL has been nice, but dealing with millions of small files has created obvious problems. It looks like NDFS and MapFiles/ArrayFiles could be part of a good solution for us. I'm also interested in using the MapReduce framework and possibly the Fetcher in our applications. Is anyone else using just parts of Nutch? Is the API expected to stay fairly stable? Have there been thoughts or discussions about breaking parts of Nutch out into more general toolkits?
--Jason
