Re: Nutchbase merge strategy

Doğacan Güney Thu, 22 Jul 2010 04:55:20 -0700

Hey,

Sorry for the late answer everyone.


On Wed, Jul 21, 2010 at 21:26, Andrzej Bialecki <[email protected]> wrote:

> Hi all,
>
> I'd like to discuss what is the best way forward to merging the nutchbase
> code with trunk.
>
> First some important facts:
>
> * nutchbase is almost totally API incompatible with Nutch 1.x. While the
> main ideas remain the same, and most of the tools remain as well, their
> implementation is very different (and let me say, much cleaner) than that of
> Nutch 1.x. E.g. while nutchbase uses URLFilters and URLNormalizers, and
> IndexingFilters, etc, their method signatures have changed. To give you some
> idea how deep these changes go, let me say that CrawlDatum is gone now.
>
> * for the last month or so, and I foresee for another month or so, Julien,
> Dogacan, myself and Enis have been working on bringing nutchbase (and Gora)
> as much up-to-date with trunk as possible - in fact, you could say we have
> been merging trunk to nutchbase... The original reason for this was that we
> first wanted to bring nutchbase into a working state and then start merging,
> but also another important reason was the one mentioned above - we didn't
> know how to prepare a meaningful patch for trunk that wouldn't replace 90+ %
> of the code in trunk...
>
> So, I would like to propose an alternative strategy: we will keep merging
> from trunk to nutchbase, with proper JIRA tracking (I created a 'nutchbase'
> tag in JIRA), and once we reach a state when nutchbase offers roughly the
> same functionality as the code in trunk then we simply switch nutchbase with
> trunk.
>
> Current status of nutchbase is that the basic tools to implement a crawling
> workflow have been ported and work correctly, and we are able to execute a
> few unit tests on an SQL backend.
>
> Regarding backwards-compatibility with Nutch 1.x: most config files are
> unchanged, and we should probably offer some data migration tools - I'm not
> sure whether it makes sense to create a segment converter, but we can
> certainly create a CrawlDb converter.
>
> What do you think? Any comments / suggestions / ideas?
>
>

I am OK with this approach. One potential problem is that it flattens the
SVN history a lot, so the history will be harder to read AND the code that
committers commit will not be properly attributed.

Btw, I recently tried a git merge between trunk and nutchbase. IIRC, there
were only 10-15 file conflicts. So I think producing a patch
*may* be possible.


>  --
> Best regards,
> Andrzej Bialecki     <><
>  ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
>
>


-- 
Doğacan Güney
