Forgive me if this is a bit of a n00b question. I've been tasked with taking someone else's code and replacing all the DieselPoint code with Lucene/Nutch.

What they do with DieselPoint is crawl specific parts of the web, split the returned pages into "chunks" using a proprietary process, and then index the chunks themselves. Actually, I think they do it in a rather naive way: it appears that DieselPoint crawls and indexes, then this code goes through the index and creates chunk files (possibly several from any given initial page), and then DieselPoint is set loose to crawl and index those chunk files. The app then uses *that* index in proprietary searches.

I'm trying to learn my way around Nutch, and I'm wondering whether there is a way to get rid of the separate chunking stage by doing the chunking directly in the initial crawl, possibly by writing a plugin. Unfortunately I'm under NDA, so I can't give away too much of what the chunking process does, but I hope I've given enough information about what I'm trying to do. Is this possible?
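
To make the idea concrete without revealing anything covered by the NDA, here is a rough sketch in plain Java of the shape I have in mind. The blank-line splitter, the class name ChunkSketch, and the "#chunkN" id scheme are all made up for illustration; the real chunking is different, and this is not Nutch plugin code. The point is only that each fetched page would yield several chunk records, and each record would be indexed as its own document during the initial crawl.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: the real chunking logic is proprietary.
class ChunkSketch {

    // One indexable unit derived from a fetched page.
    static final class Chunk {
        final String id;   // made-up id scheme: page URL plus a chunk number
        final String text; // the chunk's content

        Chunk(String id, String text) {
            this.id = id;
            this.text = text;
        }
    }

    // Stand-in splitter (an assumption): break the page text on blank lines
    // and emit one Chunk per non-empty piece.
    static List<Chunk> split(String pageUrl, String pageText) {
        List<Chunk> chunks = new ArrayList<>();
        for (String piece : pageText.split("\\n\\s*\\n")) {
            String trimmed = piece.trim();
            if (trimmed.isEmpty()) {
                continue; // skip empty pieces rather than indexing them
            }
            chunks.add(new Chunk(pageUrl + "#chunk" + chunks.size(), trimmed));
        }
        return chunks;
    }

    public static void main(String[] args) {
        String page = "First section.\n\nSecond section.\n\n\nThird section.";
        // In the setup I'm hoping for, each of these would be handed to the
        // indexer as a separate document instead of being written to a file
        // and crawled a second time.
        for (Chunk c : split("http://example.com/page", page)) {
            System.out.println(c.id + " -> " + c.text);
        }
    }
}
```

What I don't know is where in Nutch something like split() would belong, or whether a single fetched page is allowed to turn into more than one indexed document.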
-- http://www.linkedin.com/in/paultomblin
