Hi all,

I'd like to discuss the best way forward for merging the nutchbase code with trunk.

First some important facts:

* nutchbase is almost totally API-incompatible with Nutch 1.x. While the main ideas remain the same, and most of the tools remain as well, their implementation is very different from (and let me say, much cleaner than) that of Nutch 1.x. E.g. while nutchbase uses URLFilters, URLNormalizers, IndexingFilters, etc., their method signatures have changed. To give you some idea of how deep these changes go, let me say that CrawlDatum is gone now.

* for the last month or so, Julien, Dogacan, Enis and I have been working on bringing nutchbase (and Gora) as up to date with trunk as possible, and I expect this to continue for another month or so. In effect, you could say we have been merging trunk into nutchbase... The original reason was that we wanted to bring nutchbase into a working state first and only then start merging. Another important reason is the one mentioned above: we didn't know how to prepare a meaningful patch for trunk that wouldn't replace 90+% of the code in trunk.

So, I would like to propose an alternative strategy: we keep merging from trunk to nutchbase, with proper JIRA tracking (I created a 'nutchbase' tag in JIRA), and once nutchbase offers roughly the same functionality as the code in trunk, we simply replace trunk with nutchbase.

The current status of nutchbase is that the basic tools needed to implement a crawling workflow have been ported and work correctly, and we are able to run a few unit tests on an SQL backend.

Regarding backwards-compatibility with Nutch 1.x: most config files are unchanged, and we should probably offer some data migration tools - I'm not sure whether it makes sense to create a segment converter, but we can certainly create a CrawlDb converter.

What do you think? Any comments / suggestions / ideas?

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
