Hi,

> I'm not sure what is the status of the nutchbase - it's missed a lot of
> fixes and changes in trunk since it's been last touched ...
>
yes, maybe we should start the 2.0 branch from 1.1 instead
Dogacan - what do you think?

BTW I see there is now a 2.0 label under JIRA, thanks to whoever added it

> Also, the goal of the crawler-commons project is to provide APIs and
> implementations of stuff that is needed for every open source crawler
> project, like: robots handling, url filtering and url normalization, URL
> state management, perhaps deduplication. We should coordinate our
> efforts, and share code freely so that other projects (bixo, heritrix,
> droids) may contribute to this shared pool of functionality, much like
> Tika does for the common need of parsing complex formats.
>
> definitely +1 - we may still keep a thin abstract layer to allow other
> indexing/search backends, but the current mess of indexing/query filters
> and competing indexing frameworks (lucene, fields, solr) should go away.
> We should go directly from DOM to a NutchDocument, and stop there.
>

I think that separating the parsing filters from the indexing filters can
have its merits e.g. combining the metadata generated by 2 or more
different parsing filters into a single field in the NutchDocument, keeping
only a subset of the available information etc...

> >
> > I suppose that http://wiki.apache.org/nutch/Nutch2Architecture needs an
> > update?
>

Have created a new page to serve as a support for discussion:
http://wiki.apache.org/nutch/Nutch2Roadmap

julien
--
DigitalPebble Ltd
http://www.digitalpebble.com