Hey, sorry for the late answer, everyone.

On Wed, Jul 21, 2010 at 21:26, Andrzej Bialecki <[email protected]> wrote:
> Hi all,
>
> I'd like to discuss what is the best way forward to merging the nutchbase
> code with trunk.
>
> First some important facts:
>
> * nutchbase is almost totally API incompatible with Nutch 1.x. While the
> main ideas remain the same, and most of the tools remain as well, their
> implementation is very different (and let me say, much cleaner) than that of
> Nutch 1.x. E.g. while nutchbase uses URLFilters and URLNormalizers, and
> IndexingFilters, etc, their method signatures have changed. To give you some
> idea how deep these changes go, let me say that CrawlDatum is gone now.
>
> * for the last month or so, and I foresee for another month or so, Julien,
> Dogacan, myself and Enis have been working on bringing nutchbase (and Gora)
> as much up-to-date with trunk as possible - in fact, you could say we have
> been merging trunk to nutchbase... The original reason for this was that we
> first wanted to bring nutchbase into a working state and then start merging,
> but also another important reason was the one mentioned above - we didn't
> know how to prepare a meaningful patch for trunk that wouldn't replace 90+ %
> of the code in trunk...
>
> So, I would like to propose an alternative strategy: we will keep merging
> from trunk to nutchbase, with proper JIRA tracking (I created a 'nutchbase'
> tag in JIRA), and once we reach a state when nutchbase offers roughly the
> same functionality as the code in trunk then we simply switch nutchbase with
> trunk.
>
> Current status of nutchbase is that the basic tools to implement a crawling
> workflow have been ported and work correctly, and we are able to execute a
> few unit tests on an SQL backend.
>
> Regarding backwards-compatibility with Nutch 1.x: most config files are
> unchanged, and we should probably offer some data migration tools - I'm not
> sure whether it makes sense to create a segment converter, but we can
> certainly create a CrawlDb converter.
>
> What do you think? Any comments / suggestions / ideas?
>
>

I am ok with this approach. One thing that may be problematic is that this
flattens SVN history a lot so history will both be more difficult to read AND
code (that committers commit) will not be properly attributed.

Btw, I recently tried a git merge between trunk and nutchbase. IIRC, there
were only 10-15 file conflicts. So I think producing a patch *may* be
possible.

> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||\/| Information Retrieval, Semantic Web
> ___|||__|| \| || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>

--
Doğacan Güney

