I have put up a wiki page for discussions about a new Nutch:
http://wiki.apache.org/nutch/Nutch2Architecture
I would like to discuss this some more and perhaps come up with some
basic tools to prove out concepts for a new architecture if nobody objects.
Dennis
Dennis Kubes wrote:
I have been thinking about a next generation Nutch for a while now, had
some talks with some of the other committers, and have gotten around to
putting some thoughts / requirements down on paper. I wanted to run
these by the community and get feedback. This message will be a bit
long so please bear with me.
First, let me say that I think the purpose of Nutch is to be a
web search engine. When I say that I mean to specifically exclude
enterprise search. By web search I am talking about general or vertical
search engines in the 1M-20B document range. I am excluding things such
as database centric search and possibly even local filesystem search.
IMO Solr is a very capable enterprise search product and could handle
local filesystem search (if it doesn't already) and Nutch shouldn't try
to overlap functionality. I think it should be able to interact, maybe
share indexes yes, but not overlap purpose. I think that Nutch should
be designed to handle large datasets, meaning it has the ability to
scale to billions, perhaps 10s of billions of pages. Hadoop already
gives us this capability for processing but Nutch would need to improve
on the search server and shard management side of things to be able to
scale to the billion page level. So the next generation of Nutch I
think should focus on web scale search.
After working with Hadoop and MapReduce for the last couple of years I
find it interesting just how similar development of MapReduce programs
seems to be to the Linux/Unix philosophy of small programs chained
together to accomplish big things. So going forward I see this as a
healthy overall general architecture. Nutch would have many small tools
that would be linked through data structures. We already do this to
some extent in the current version of Nutch, an example of which would
be the tools that generate and act on CrawlDatum objects (CrawlDb,
UpdateDb, etc.). I would like to keep that idea of tools and data
structures, with the tools chained together, perhaps only by shell or
management scripts, in different pipelines acting on the data
structures. When I say data structure I don't mean binary map or
sequence files. These may be a standard way to store these objects but
Hadoop allows any input / output format, whether that be HBase, a
relational database, or a local filesystem. I think we should be open to
having those data structures stored however is best for the user through
different Hadoop formats. So a general overall architecture of tools
and data structures and pipelines of these tools.
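
As a rough illustration of small tools linked only through a shared data structure, here is a minimal sketch. It is not Nutch code: it uses Python for brevity, and `CrawlRecord`, `RecordStore`, `InMemoryStore`, `inject`, `generate`, and `update` are all hypothetical names. The store interface stands in for whatever Hadoop input / output format a user would pick.

```python
from dataclasses import dataclass
from typing import Iterable, List, Protocol


@dataclass(frozen=True)
class CrawlRecord:
    """Hypothetical stand-in for a shared structure such as CrawlDatum."""
    url: str
    status: str = "unfetched"


class RecordStore(Protocol):
    """Storage contract; a backend could be files, HBase, a database, etc."""
    def read(self) -> List[CrawlRecord]: ...
    def write(self, records: Iterable[CrawlRecord]) -> None: ...


class InMemoryStore:
    """Toy backend used only for this sketch."""
    def __init__(self) -> None:
        self._records: List[CrawlRecord] = []

    def read(self) -> List[CrawlRecord]:
        return list(self._records)

    def write(self, records: Iterable[CrawlRecord]) -> None:
        self._records = list(records)


def inject(store: RecordStore, urls: Iterable[str]) -> None:
    """Tool 1: add seed URLs to the store as unfetched records."""
    store.write(store.read() + [CrawlRecord(u) for u in urls])


def generate(store: RecordStore, limit: int) -> List[CrawlRecord]:
    """Tool 2: select up to `limit` unfetched records for the next fetch."""
    return [r for r in store.read() if r.status == "unfetched"][:limit]


def update(store: RecordStore, fetched_urls: Iterable[str]) -> None:
    """Tool 3: mark the given URLs as fetched."""
    fetched = set(fetched_urls)
    store.write(
        CrawlRecord(r.url, "fetched") if r.url in fetched else r
        for r in store.read()
    )


# A "pipeline" is just the tools run in order against the same store.
store = InMemoryStore()
inject(store, ["http://a.example/", "http://b.example/"])
batch = generate(store, limit=1)
update(store, [r.url for r in batch])
```

The tools never call each other; swapping `InMemoryStore` for another backend with the same `read` / `write` methods would leave them unchanged.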
I currently see five or six distinct phases to a web search engine. They
are: Acquire, Parse, Analyze, Index, Search, and Shard Management. OK,
shard management might not be so much a phase as a functionality.
Acquire is simply the acquisition of the document be it PDF, HTML, or
images. This would usually be the crawler phase. Parse is parsing that
content into useful and standard data structures. I do believe that
parsing should be separate and distinct from crawling. If you crawl 50%
of 5M pages and the crawler dies, you should still be able to use that
50% you crawled. Analyze is what we do with the content once it is
parsed into a standard structure we can use. This could be anything
from a better link analysis to natural language processing, language
identification, and machine learning. The analysis phase should
probably have an ever expanding set of tools for different purposes.
These tools would create specialized data structures of their own.
Eventually through all the analysis we come up with a score for a given
piece of content. That could be a document or a field. Indexing is the
process of taking the analysis scores and content and creating the
indexes for searching. Searching is concerned with the searching of the
indexes. This should be doable from command line, web based, or other
ways. Shard management is concerned with the deployment and management
of large numbers of indexes.
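
The point about keeping Parse separate from Acquire can be sketched as follows. This is only an illustration under assumptions: `acquire`, `parse`, `flaky_fetch`, and `CrawlerDied` are made-up names, a plain dict stands in for persisted raw content, and whitespace tokenizing stands in for real parsing.

```python
class CrawlerDied(Exception):
    """Hypothetical failure that aborts a crawl partway through."""


def acquire(urls, fetch, raw_store):
    """Acquire phase: store raw content per URL as each fetch completes."""
    for url in urls:
        raw_store[url] = fetch(url)


def parse(raw_store):
    """Parse phase: works on whatever the acquire phase managed to store."""
    return {url: raw.decode("utf-8").split() for url, raw in raw_store.items()}


def flaky_fetch(fail_after):
    """Build a fake fetcher that dies after `fail_after` successful fetches."""
    state = {"count": 0}

    def fetch(url):
        if state["count"] >= fail_after:
            raise CrawlerDied(url)
        state["count"] += 1
        return f"content of {url}".encode("utf-8")

    return fetch


urls = [f"http://example.com/{i}" for i in range(4)]
raw_store = {}
try:
    acquire(urls, flaky_fetch(fail_after=2), raw_store)
except CrawlerDied:
    pass  # the crawl died halfway; what was stored is still usable

parsed = parse(raw_store)  # parses the two pages that were fetched
```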
I think the next generation of Nutch should allow the changing of
different tools in any of these areas. What this means is the ability
to have different components such as web crawlers (as long as the end
data structure is the same), for example Fetcher, Fetcher2, Grub,
Heritrix, or even specialized crawlers. And different components for
different analysis types. I don't see a lot of cross-cutting concerns
here. And where there are, url normalization for example, I think it can
be handled better through dependency injection.
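
To make the idea of interchangeable crawlers with an injected URL normalizer more concrete, here is a small sketch. The names `basic_normalizer`, `SimpleCrawler`, and `DedupingCrawler` are hypothetical, and the normalization rules are only an example, not what Nutch does.

```python
from typing import Callable, Dict, Iterable
from urllib.parse import urlsplit, urlunsplit

Normalizer = Callable[[str], str]
Fetch = Callable[[str], bytes]


def basic_normalizer(url: str) -> str:
    """Lowercase scheme and host, default the path to '/', drop the fragment."""
    parts = urlsplit(url)
    return urlunsplit(
        (parts.scheme.lower(), parts.netloc.lower(), parts.path or "/", parts.query, "")
    )


class SimpleCrawler:
    """Fetches every URL it is given; keys the output by normalized URL."""

    def __init__(self, normalize: Normalizer, fetch: Fetch) -> None:
        self.normalize = normalize
        self.fetch = fetch

    def crawl(self, urls: Iterable[str]) -> Dict[str, bytes]:
        return {self.normalize(u): self.fetch(u) for u in urls}


class DedupingCrawler:
    """Alternative crawler: skips URLs whose normalized form was already fetched."""

    def __init__(self, normalize: Normalizer, fetch: Fetch) -> None:
        self.normalize = normalize
        self.fetch = fetch

    def crawl(self, urls: Iterable[str]) -> Dict[str, bytes]:
        results: Dict[str, bytes] = {}
        for url in urls:
            key = self.normalize(url)
            if key not in results:
                results[key] = self.fetch(key)
        return results
```

Both crawlers produce the same end data structure (normalized URL to raw content), so a downstream tool would not need to know which one ran, and the normalizer is handed in rather than looked up through a plugin registry.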
Which brings me to three. I think it is time to get rid of the plugin
framework. I want to keep the functionality of the various plugins but
I think a dependency injection framework, such as Spring, creating the
components needed for logic inside of tools is a much cleaner way to
proceed. This would allow much better unit and mock testing of tool and
logic functionality. It would allow Nutch to run on a non "nutchified"
Hadoop cluster, meaning just a plain old Hadoop cluster. We could have
core jars and contrib jars and a contrib directory which is pulled from
by shell scripts when submitting jobs to Hadoop. With the
multiple-resources functionality in Hadoop it would be a simple matter
of creating the correct command lines for the job to run.
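
The testing benefit of injection can be sketched like this. It is an assumption-laden toy, not Spring and not Nutch: `LinkFilterTool` and its `accept` collaborator are hypothetical, and Python's standard `unittest.mock` plays the role a mock library would play in Java.

```python
from unittest.mock import Mock


class LinkFilterTool:
    """Hypothetical tool whose filtering logic is supplied from outside."""

    def __init__(self, url_filter) -> None:
        # The collaborator is injected, so tests can pass in a mock.
        self.url_filter = url_filter

    def run(self, urls):
        return [u for u in urls if self.url_filter.accept(u)]


# In a unit test the real filter is replaced by a mock with scripted behavior.
mock_filter = Mock()
mock_filter.accept.side_effect = lambda url: url.endswith(".html")

tool = LinkFilterTool(mock_filter)
kept = tool.run(["http://example.com/a.html", "http://example.com/b.pdf"])
```

Because the tool never constructs or looks up its own filter, the test needs no plugin repository and no cluster.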
And that brings me to separation of data and presentation. Currently
the Nutch website is one monolithic JSP application with plugins. I
think the next generation should segment that out into XML / JSON feeds
and a separate front end that uses those feeds. Again this would make
it much easier to create web applications using Nutch.
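
A minimal sketch of that split, assuming a made-up feed layout (the `query`, `total`, and `hits` fields and both function names are hypothetical): the search side only emits JSON, and any front end renders it.

```python
import json


def build_search_feed(query, hits):
    """Back end: serialize (url, title, score) hits as a JSON feed."""
    return json.dumps({
        "query": query,
        "total": len(hits),
        "hits": [{"url": u, "title": t, "score": s} for u, t, s in hits],
    })


def render_plain_text(feed_json):
    """One possible front end: turn the feed into plain-text results."""
    feed = json.loads(feed_json)
    lines = [f"{feed['total']} result(s) for {feed['query']!r}"]
    for rank, hit in enumerate(feed["hits"], start=1):
        lines.append(f"{rank}. {hit['title']} <{hit['url']}>")
    return "\n".join(lines)


feed = build_search_feed("nutch", [("http://example.com/", "Example", 1.5)])
page = render_plain_text(feed)
```

A web UI, a command-line client, or another service could consume the same feed without touching the search code.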
And of course I think that shard management, a la Hadoop master and
slave style, is a big requirement as well. I also think a full test
suite with mock objects and local and MiniMR and MiniDFS cluster testing
is important as is better documentation and tutorials (maybe even a book
:)). So up to this point I have created MapReduce jobs that use Spring
for dependency injection and it is simple and works well. The above is
the direction I would like to head down but I would also like to see
what everyone else is thinking.
Dennis