Nutchbase merge strategy

Andrzej Bialecki Wed, 21 Jul 2010 11:27:21 -0700

Hi all,

I'd like to discuss the best way forward for merging the nutchbase code with trunk.


First some important facts:

* nutchbase is almost totally API incompatible with Nutch 1.x. While the main ideas remain the same, and most of the tools remain as well, their implementation is very different (and let me say, much cleaner) than that of Nutch 1.x. E.g. while nutchbase uses URLFilters and URLNormalizers, and IndexingFilters, etc, their method signatures have changed. To give you some idea how deep these changes go, let me say that CrawlDatum is gone now.

* for the last month or so, and I foresee for another month or so, Julien, Dogacan, myself and Enis have been working on bringing nutchbase (and Gora) as much up-to-date with trunk as possible - in fact, you could say we have been merging trunk to nutchbase... The original reason for this was that we first wanted to bring nutchbase into a working state and then start merging, but also another important reason was the one mentioned above - we didn't know how to prepare a meaningful patch for trunk that wouldn't replace 90+ % of the code in trunk...

So, I would like to propose an alternative strategy: we will keep merging from trunk to nutchbase, with proper JIRA tracking (I created a 'nutchbase' tag in JIRA), and once we reach a state where nutchbase offers roughly the same functionality as the code in trunk, we simply replace trunk with nutchbase.

The current status of nutchbase is that the basic tools to implement a crawling workflow have been ported and work correctly, and we are able to execute a few unit tests on an SQL backend.

Regarding backwards-compatibility with Nutch 1.x: most config files are unchanged, and we should probably offer some data migration tools - I'm not sure whether it makes sense to create a segment converter, but we can certainly create a CrawlDb converter.


What do you think? Any comments / suggestions / ideas?

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
