Jérôme Charron wrote:
I haven't taken a look at the mapred branch yet.
It's going to be a nice surprise to discover it in the trunk... ;-)
I will make some effort to document things more before I merge to trunk,
so that folks know what they're getting. Many things have changed
(e.g., segment format). Several things have not yet been fully worked
out and/or implemented (e.g., segment merging). But the basics are all
working (intranet & whole-web crawling, indexing & search), both in
standalone and distributed configurations. My focus has been stress
testing the distributed infrastructure (NDFS & MapReduce). We've
discovered and fixed a number of bugs in this over recent weeks, so it
is getting ever more stable. I'm hoping that others can help fill in
the gaps in tools.
Once the merge is done I'd like to make a few other changes.
These are:
1. Remove most static references to NutchConf outside of main()
routines. The MapReduce-based versions of the command line tools have
no such references. The biggest change here will be to plugins.
Plugin APIs should probably all be modified to use a factory, and the
factory should be constructed from a NutchConf, e.g., something like:
public static PluginXFactory PluginXFactory.getFactory(NutchConf);
public PluginX PluginXFactory.getPlugin(...);
This should permit folks to more easily configure things programmatically
(think JMX) and to run multiple configurations in a single JVM.
2. FetchListEntry has been mostly replaced with a new, simpler
data structure called a CrawlDatum. FetchListEntry is used in the
IndexingFilter API to pass the url, fetch date and incoming anchors.
Currently, in the mapred branch, the indexer creates a dummy
FetchListEntry to pass to plugins. But instead the IndexingFilter API
should probably be altered to pass the CrawlDatum, anchors and url.
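To make the two proposals concrete, here is a rough sketch of how they might fit together. Everything in it is hypothetical: Conf and Datum are stand-ins for NutchConf and CrawlDatum, not the real classes, and the "index.anchor.field" property and the filter() signature are invented purely for illustration. The point is only that the factory is built from a configuration instance, with no static lookup, and that the filter receives the url, datum and anchors directly.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch only; none of these types are the real Nutch classes.
final class ConfSketch {

    /** Stand-in for NutchConf, used as an instance rather than a static singleton. */
    static class Conf {
        private final Map<String, String> props = new HashMap<>();

        Conf set(String key, String value) {
            props.put(key, value);
            return this;
        }

        String get(String key, String defaultValue) {
            return props.getOrDefault(key, defaultValue);
        }
    }

    /** Stand-in for CrawlDatum; it carries only a fetch time here. */
    static class Datum {
        final long fetchTime;

        Datum(long fetchTime) {
            this.fetchTime = fetchTime;
        }
    }

    /** Item 2: the filter is handed the url, datum and anchors directly. */
    interface IndexingFilter {
        Map<String, String> filter(Map<String, String> doc, String url,
                                   Datum datum, List<String> anchors);
    }

    /** Item 1: a factory constructed from a Conf instance. */
    static class IndexingFilterFactory {
        private final Conf conf;

        private IndexingFilterFactory(Conf conf) {
            this.conf = conf;
        }

        static IndexingFilterFactory getFactory(Conf conf) {
            return new IndexingFilterFactory(conf);
        }

        IndexingFilter getPlugin() {
            // "index.anchor.field" is an invented property name.
            final String anchorField = conf.get("index.anchor.field", "anchor");
            return (doc, url, datum, anchors) -> {
                doc.put("url", url);
                doc.put("fetchTime", Long.toString(datum.fetchTime));
                doc.put(anchorField, String.join(" ", anchors));
                return doc;
            };
        }
    }
}
```

Since each factory holds its own Conf, two differently configured filters can coexist in one JVM, which is the multiple-configurations case mentioned under item 1.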
I have avoided making these changes since they would make it difficult
to merge improvements to plugins into the mapred branch. But, once we
have moved mapred to trunk, we should make them soon. Incompatible API
changes are best made early, so that folks have more time to work with them.
Does this all sound reasonable?
Doug