Hi all,
I'd like to discuss the best way forward for merging the nutchbase code
with trunk.
First some important facts:
* nutchbase is almost completely API-incompatible with Nutch 1.x. While
the main ideas remain the same, and most of the tools remain as well,
their implementation is very different from (and let me say, much
cleaner than) that of Nutch 1.x. For example, nutchbase still uses
URLFilters, URLNormalizers, IndexingFilters, etc., but their method
signatures have changed. To give you some idea of how deep these changes
go, let me say that CrawlDatum is gone now.
* For the last month or so, and I expect for another month or so,
Julien, Dogacan, Enis and I have been working on bringing nutchbase
(and Gora) as up to date with trunk as possible - in fact, you could
say we have been merging trunk into nutchbase... The original reason
for this was that we first wanted to bring nutchbase into a working
state and then start merging, but another important reason was the
one mentioned above - we didn't know how to prepare a meaningful patch
for trunk that wouldn't replace 90+% of the code in trunk...
So, I would like to propose an alternative strategy: we keep merging
from trunk to nutchbase, with proper JIRA tracking (I created a
'nutchbase' tag in JIRA), and once nutchbase offers roughly the same
functionality as the code in trunk, we simply replace trunk with
nutchbase.
The current status of nutchbase is that the basic tools to implement a
crawling workflow have been ported and work correctly, and we are able
to execute a few unit tests on an SQL backend.
Regarding backwards-compatibility with Nutch 1.x: most config files are
unchanged, and we should probably offer some data migration tools - I'm
not sure whether it makes sense to create a segment converter, but we
can certainly create a CrawlDb converter.
What do you think? Any comments / suggestions / ideas?
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com