Jérôme Charron wrote:
I haven't taken a look at the mapred branch yet.
It is going to be a good surprise to discover it in the trunk... ;-)

I will make some effort to document things more before I merge to trunk, so that folks know what they're getting. Many things have changed (e.g., segment format). Several things have not yet been fully worked out and/or implemented (e.g., segment merging). But the basics are all working (intranet and whole-web crawling, indexing and search), both in standalone and distributed configurations. My focus has been stress testing the distributed infrastructure (NDFS & MapReduce). We've discovered and fixed a number of bugs in this over recent weeks, so it is getting ever more stable. I'm hoping that others can help fill in the gaps in tools.

Once the merge is done I'd like to make a few other changes.

These are:

1. Remove most static references to NutchConf outside of main() routines. The MapReduce-based versions of the command line tools have no such references. The biggest change here will be to plugins. Plugin APIs should probably all be modified to use a factory, and the factory should be constructed from a NutchConf, e.g., something like:
  public static PluginXFactory PluginXFactory.getFactory(NutchConf);
  public PluginX PluginXFactory.getPlugin(...);
This should permit folks to more easily configure things programmatically (think JMX) and to run multiple configurations in a single JVM.
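
To make the factory idea in point 1 a bit more concrete, here is a rough sketch, not a proposal for the final API. SketchConf is a hypothetical stand-in for NutchConf, the PluginX interface and its describe() method are invented, and "pluginx.label" is a made-up property name. The only point is that each factory is bound to the configuration instance it was built from, so nothing is read from static state.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for NutchConf: an ordinary, non-static bag of properties.
class SketchConf {
  private final Map<String, String> props = new HashMap<>();

  void set(String name, String value) {
    props.put(name, value);
  }

  String get(String name, String defaultValue) {
    return props.getOrDefault(name, defaultValue);
  }
}

// Hypothetical plugin interface ("PluginX" in the signatures above).
interface PluginX {
  String describe();
}

// A factory bound to one configuration instance instead of a global, static one.
class PluginXFactory {
  private final SketchConf conf;

  private PluginXFactory(SketchConf conf) {
    this.conf = conf;
  }

  // Every call builds a factory around the configuration it is handed.
  static PluginXFactory getFactory(SketchConf conf) {
    return new PluginXFactory(conf);
  }

  // "pluginx.label" is an invented property, used only for illustration.
  PluginX getPlugin(String name) {
    final String label = conf.get("pluginx.label", "default");
    return () -> name + "@" + label;
  }
}
```

With something of this shape, two differently configured factories can live side by side in one JVM, and a plugin obtained from one never sees the other's settings.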

2. FetchListEntry has been mostly replaced with a new, simpler data structure called a CrawlDatum. FetchListEntry is still used in the IndexingFilter API to pass the url, fetch date and incoming anchors. Currently, in the mapred branch, the indexer creates a dummy FetchListEntry to pass to plugins. Instead, the IndexingFilter API should probably be altered to pass the CrawlDatum, anchors and url directly.
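
As a rough illustration of point 2, the altered interface might take a shape like the one below. This is only a sketch under loud assumptions: SketchCrawlDatum and SketchIndexingFilter are hypothetical stand-ins, not the real CrawlDatum or IndexingFilter, a plain Map stands in for the index document, and the field names are invented. It just shows a filter receiving the url, the datum and the anchors directly, with no dummy FetchListEntry in between.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for CrawlDatum, carrying only a fetch time.
class SketchCrawlDatum {
  private final long fetchTime;

  SketchCrawlDatum(long fetchTime) {
    this.fetchTime = fetchTime;
  }

  long getFetchTime() {
    return fetchTime;
  }
}

// Hypothetical filter interface: url, datum and anchors are passed directly.
interface SketchIndexingFilter {
  Map<String, String> filter(Map<String, String> doc, String url,
                             SketchCrawlDatum datum, List<String> anchors);
}

// Example filter that copies the three inputs into (invented) document fields.
class BasicFieldsFilter implements SketchIndexingFilter {
  @Override
  public Map<String, String> filter(Map<String, String> doc, String url,
                                    SketchCrawlDatum datum, List<String> anchors) {
    doc.put("url", url);
    doc.put("fetchTime", Long.toString(datum.getFetchTime()));
    doc.put("anchors", String.join(" ", anchors));
    return doc;
  }
}
```

An indexer written against an interface like this would hand each plugin the values it already has, rather than packing them into a FetchListEntry first.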

I have avoided making these changes since they would make it difficult to merge improvements to plugins into the mapred branch. But, once we have moved mapred to trunk, we should make them soon. Incompatible API changes are best made early, so that folks have more time to work with them.

Does this all sound reasonable?

Doug
