Hi Andrzej,

Great summary. My general feeling on this is similar to my prior comments on similar threads from Otis and from Dennis. My personal pet projects for Nutch2:

* refactored Nutch core data structures, modeled as POJOs
* a refactored Nutch architecture where crawling/indexing/parsing/scoring/etc. are insulated from the underlying messaging substrate (e.g., crawl over JMS, EJB, Hadoop, RMI, etc.; crawl using Heritrix; parse using Tika or some other framework; etc.)
* simpler Nutch deployment mechanisms (separate the Nutch deployment package from the source code package); think about using Maven2

+1 to all of those and other ideas for how to improve the project's focus.

Cheers,
Chris

On 5/14/09 6:45 AM, "Andrzej Bialecki" <a...@getopt.org> wrote:

> Hi all,
>
> I'd like to revive this thread and gather additional feedback so that we
> end up with concrete conclusions. Much of what I write below others have
> said before; I'm trying to express it here as it looks from my point of view.
>
> Target audience
> ===============
> I think that the Nutch project is experiencing an identity crisis at the
> moment: we are not sure who our target audience is, and we cannot satisfy
> everyone. I think there are the following groups of Nutch users:
>
> 1. Large-scale Internet crawl & search: there are actually only a few
> such users, because it takes considerable resources to manage operations
> on that scale. Scalability, manageability and ranking/spam prevention
> are the chief concerns here.
>
> 2. Medium-scale vertical search: I suspect that many Nutch users fall
> into this category. Modularity, flexibility in implementing custom
> processing, and the ability to modify workflows and to use only some
> Nutch components seem to be the chief concerns here. Scalability too,
> but only up to a volume of ~100-200 million documents.
>
> 3. Small- to medium-scale enterprise search: there's a sizeable number
> of Nutch users that fall into this category, for historical reasons.
> Link-based ranking and resource discovery are not that important here,
> but integration with Windows networking, Microsoft formats and databases,
> as well as real-time indexing and easy index maintenance, are crucial.
> This class of users often has to heavily customize Nutch to get any
> sensible result. Also, this is where Solr really shines, so there is
> little benefit in using Nutch here. I predict that Nutch will have fewer
> and fewer users of this type.
>
> 4. Single desktop to small intranet search: as above, but the accent is
> on ease of use out of the box, and an often-requested feature is a GUI
> frontend. Currently, IMHO, Nutch is too complex and requires too much
> command-line operation for casual users to make this use case attractive.
>
> What is the target audience that we as a community want to support? By
> this I mean not only moral support, but also active participation in the
> development process. From where we are at the moment, we could go in any
> of the above directions.
>
> Core competence
> ===============
> This is a simple but important point. We currently maintain several
> major subsystems in Nutch that are implemented by other projects, and
> often in a better way. The plugin framework (and dependency injection)
> and content parsing are two areas that we should delegate to third-party
> libraries, such as Tika and OSGi or some other simple IoC container;
> there are probably other components that we don't have to do ourselves.
> Another thing that I'd love to delegate is distributed search and index
> maintenance, either through Solr or Katta or something else.
>
> The question then is: what is the core competence of this project?
> I see the following major areas that are unique to Nutch:
>
> * crawling - this includes crawl scheduling (and re-crawl scheduling),
> discovery and classification of new resources, strategies for crawling
> specific sets of URLs (hosts and domains) under bandwidth and netiquette
> constraints, etc.
>
> * web graph analysis - this includes link-based ranking, mirror
> detection (and URL "aliasing"), but also link spam detection and more
> complex control over the crawling frontier.
>
> Anything more? I'm not sure - perhaps I would add template detection and
> pagelet-level crawling (i.e. sensible re-crawling of portal-type sites).
>
> Nutch 1.0 already made some steps in this direction, with the new link
> analysis package and the pluggable FetchSchedule and Signature. A lot
> remains to be done here, and we are still spending a lot of resources on
> dealing with issues outside this core competence.
>
> -------
>
> So, what do we need to do next?
>
> * We need to decide where we should commit our resources, as a community
> of users, contributors and committers, so that the project is most
> useful to our target audience. At this point there are few active
> committers, so I don't think we can cover more than one direction at a
> time ... ;)
>
> * We need to re-architect Nutch to focus on our core competence, and
> delegate what we can to other projects.
>
> Feel free to comment on the above, make suggestions or corrections. I'd
> like to wrap it up in a concise mission statement that will help us set
> the goals for the next couple of months.
>
> --
> Best regards,
> Andrzej Bialecki <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  || |   Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com

++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Chris Mattmann, Ph.D.
Senior Computer Scientist
NASA Jet Propulsion Laboratory
Pasadena, CA 91109 USA
Office: 171-266B, Mailstop: 171-246
Email: chris.mattm...@jpl.nasa.gov
WWW: http://sunset.usc.edu/~mattmann/
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
Adjunct Assistant Professor, Computer Science Department
University of Southern California, Los Angeles, CA 90089 USA
++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
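[Editor's illustration] Chris's first two wishlist items (core data as plain POJOs, and crawl logic insulated from the messaging substrate) could be sketched roughly as below. All names here are hypothetical, not an actual Nutch2 API; the point is only that the crawl code sees POJOs and one narrow interface, so a Hadoop-, JMS- or RMI-backed substrate can be swapped in without touching it.

```java
// Sketch of a POJO data model plus a substrate-neutral crawl interface.
// Hypothetical names; not an actual Nutch2 API.
import java.util.ArrayList;
import java.util.List;

public class CrawlSketch {

    /** A crawl record as a plain POJO: no Writable, no substrate types. */
    static class CrawlDatum {
        final String url;
        final long fetchTime;
        CrawlDatum(String url, long fetchTime) {
            this.url = url;
            this.fetchTime = fetchTime;
        }
    }

    /** The messaging substrate (Hadoop, JMS, RMI, ...) hides behind this. */
    interface CrawlSubstrate {
        void submit(CrawlDatum datum);   // hand a record to the substrate
        List<CrawlDatum> completed();    // collect finished records
    }

    /** Trivial in-process substrate, standing in for Hadoop/JMS/etc. */
    static class LocalSubstrate implements CrawlSubstrate {
        private final List<CrawlDatum> done = new ArrayList<>();
        public void submit(CrawlDatum datum) { done.add(datum); }
        public List<CrawlDatum> completed() { return done; }
    }

    /** Crawl logic that only ever sees POJOs and the substrate interface. */
    static List<String> crawl(List<String> seeds, CrawlSubstrate substrate) {
        for (String url : seeds) {
            substrate.submit(new CrawlDatum(url, System.currentTimeMillis()));
        }
        List<String> fetched = new ArrayList<>();
        for (CrawlDatum d : substrate.completed()) {
            fetched.add(d.url);
        }
        return fetched;
    }

    public static void main(String[] args) {
        List<String> seeds = List.of("http://example.org/", "http://example.com/");
        System.out.println(crawl(seeds, new LocalSubstrate()));
    }
}
```

A Hadoop deployment would then provide a `CrawlSubstrate` backed by MapReduce jobs, while a small intranet install could keep the in-process one; the `crawl` logic stays identical in both cases.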