nutch-dev  

Nutch is resilient to automated testing

Rick Moynihan
Mon, 04 Aug 2008 10:39:49 -0700

Hi all,

A colleague I have been working with has developed a plugin to index content with Nutch. Although it does the job admirably, the complexity and design of Nutch have proven resistant to writing automated tests for this component.

I'm desperately trying to write some JUnit unit/integration tests for this component, but Nutch doesn't make this simple, and I fear that this, amongst other things, is a barrier to Nutch adoption.

What I want to do is:

- Set up a Jetty server within the test, serving the content I want to index (easy enough with CrawlDBTestUtil).
- Configure a crawl (i.e. fetch, index, merge, dedup etc...) and override the configuration with my plugin and settings.
- Store the index (preferably in memory, but on disk is OK).
- Assert that particular searches return the expected items etc...


At first I thought this would be a simple matter of using CrawlDBTestUtil to establish the server side, then using org.apache.nutch.crawl.Crawl to perform all the relevant steps resulting in an index of the content, which I can then run assertions on via NutchBean.

Ideally I'd like to create just one Configuration object, override the settings as I wish, and then pass this object into Crawl and NutchBean appropriately.
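
As a rough illustration of the shape described above, the sketch below shows one configuration object being built once and handed to both the crawl side and the search side. Everything in it is a hypothetical stand-in, not Nutch's API: Config, Doc, Crawler and Searcher are invented names, the "crawl" is just an in-memory list, and the key searcher.max.hits is illustrative only (plugin.includes is used here simply as a familiar-looking key, with a made-up plugin name).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins only; none of these classes exist in Nutch.
public class SharedConfigSketch {

    // Stand-in for a Hadoop-style Configuration: string keys, string values.
    static class Config {
        private final Map<String, String> values = new HashMap<>();
        void set(String key, String value) { values.put(key, value); }
        String get(String key, String fallback) {
            return values.getOrDefault(key, fallback);
        }
    }

    // One indexed document, remembering which plugin setting indexed it.
    static class Doc {
        final String text;
        final String indexedBy;
        Doc(String text, String indexedBy) {
            this.text = text;
            this.indexedBy = indexedBy;
        }
    }

    // An instantiable crawl step that receives its configuration explicitly.
    static class Crawler {
        private final Config conf;
        Crawler(Config conf) { this.conf = conf; }

        // "Crawls" the given page texts into an in-memory index.
        List<Doc> crawl(List<String> pages) {
            String plugin = conf.get("plugin.includes", "default");
            List<Doc> index = new ArrayList<>();
            for (String page : pages) {
                index.add(new Doc(page, plugin));
            }
            return index;
        }
    }

    // A search step that reads from the same configuration object.
    static class Searcher {
        private final Config conf;
        private final List<Doc> index;
        Searcher(Config conf, List<Doc> index) {
            this.conf = conf;
            this.index = index;
        }

        // Returns documents whose text contains the term, capped by config.
        List<Doc> search(String term) {
            int maxHits = Integer.parseInt(conf.get("searcher.max.hits", "10"));
            List<Doc> hits = new ArrayList<>();
            for (Doc doc : index) {
                if (hits.size() < maxHits && doc.text.contains(term)) {
                    hits.add(doc);
                }
            }
            return hits;
        }
    }

    public static void main(String[] args) {
        // One configuration, overridden once, shared by both steps.
        Config conf = new Config();
        conf.set("plugin.includes", "my-index-plugin"); // made-up plugin name
        conf.set("searcher.max.hits", "5");             // illustrative key

        List<Doc> index = new Crawler(conf).crawl(
                Arrays.asList("hello nutch", "goodbye lucene"));
        List<Doc> hits = new Searcher(conf, index).search("nutch");

        System.out.println(hits.size() + " hit(s), indexed by " + hits.get(0).indexedBy);
    }
}
```

The point of the sketch is only the constructor-injected configuration: a test can override settings in one place and be sure the crawl and the search both see them.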

Sadly, however, org.apache.nutch.crawl.Crawl isn't really a reusable class, as it really only has a static main method which performs all the operations in one batch. This design makes the class hard to reuse within my test, and leaves me with the following options:

- Call the main method and pass it an ugly array of Strings to do what I require. This is made uglier still by assumptions underlying the design of this component (configuration files on the classpath etc...). It also allows little or no reuse of configuration with other parts of the code (e.g. NutchBean).

- Copy/Paste/Modify Crawl into my test. The code in Crawl recently changed to account for Hadoop 0.17, so I don't really want to do this only to find that the API changes again. Plus, I believe that tests should be simple to read. Explicitly performing 30 steps in order to test a component isn't a good idea, as the reader can no longer see the forest for the trees.
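
To make the first option above concrete, here is a minimal sketch of the argument array a test would have to assemble. The flag names (-dir, -depth, -topN) are assumed from Crawl's command-line usage message and should be checked against the Nutch version in use; CrawlArgs and build are hypothetical helper names, and the actual call to Crawl.main is left as a comment so the snippet compiles without Nutch on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: assembles the String[] that Crawl.main would be given.
public class CrawlArgs {

    // Flag names are an assumption taken from Crawl's usage message;
    // verify them against your Nutch version before relying on them.
    static String[] build(String urlDir, String crawlDir, int depth, int topN) {
        List<String> args = new ArrayList<>();
        args.add(urlDir);                      // directory holding seed URLs
        args.add("-dir");
        args.add(crawlDir);                    // where crawl output is written
        args.add("-depth");
        args.add(Integer.toString(depth));     // number of fetch rounds
        args.add("-topN");
        args.add(Integer.toString(topN));      // pages fetched per round
        return args.toArray(new String[0]);
    }

    public static void main(String[] unused) throws Exception {
        String[] args = build("test-urls", "test-crawl", 2, 50);
        System.out.println(String.join(" ", args));

        // In a real test this would be followed by something like:
        // org.apache.nutch.crawl.Crawl.main(args);
        // Note that no Configuration object is passed in here, so any
        // overrides have to come from files on the classpath instead.
    }
}
```

Every setting is flattened into strings, and nothing in the array can be shared with the object later used for searching, which is the reuse problem described in that option.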

CrawlDBTestUtil is a step in the right direction, but more work is needed. Is it possible to get this marked as a bug/feature-request and fixed in time for 1.0?

Thanks again for your help.

R.


