Re: Next Generation Nutch

Dennis Kubes Mon, 14 Apr 2008 08:38:04 -0700


Otis Gospodnetic wrote:

I suppose the first thing to do would be describe the requirements for this 
shard management.  I imagine you have very specific functionality in mind from 
your Wikia Search experience.  Mind putting your ideas on the Wiki?  I think it 
would be very good to share this with [EMAIL PROTECTED] early on, so we can 
come up with something general that fits both Nutch and Solr.  It might turn 
out that this calls for a separate Lucene project, but we'll see that once the 
real discussion starts.

I completely agree. This would be better as a shared project. I will put my current thoughts down on the Nutch wiki, unless there is already a discussion going somewhere?


Dennis

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [email protected]
Sent: Sunday, April 13, 2008 5:44:32 PM
Subject: Re: Next Generation Nutch



Otis Gospodnetic wrote:
Hello,

A few quick comments.  I don't know how much you track Solr, but the mention of 
shards makes me think of SOLR-303 and DistributedSearch page on Solr Wiki.  You'll 
want to check those out.  In short, Solr has the notion of shards and distributed 
search, kind of like Nutch with its RPC framework and searchers.  *That* is one 
big duplication of work, IMHO.  As far as the indexing+searching+shards go, I 
think one direction worth looking at carefully would be the gentle Nutch->Solr 
relationship -- using Solr to do indexing and searching.  Shard management doesn't 
exist in either project yet, but I think it would be ideal to come up with a 
common management mechanism, if possible.
In thinking about a new Nutch I always thought that shard management is absolutely necessary, but it never felt right in terms of where it belongs. If we are saying that Nutch is small tools strung together to produce different types of search indexes, shard management isn't really a tool. It is more of something that is needed afterward. And yes, both Nutch and Solr, as well as other people using Lucene indexes, need some type of distributed index management system, and I don't want to duplicate this work. Perhaps this is a good proposal for a separate Lucene sub-project. Hey, we could even call it shard ;)
Dennis
I think this addresses your "... but Nutch would need to improve on the search server and shard management side of things to be able to scale to the billion page level. So the next generation of Nutch I think should focus on web scale search." statement.
I know of a well-known, large corporation evaluating Solr (and its dist. search 
in particular) to handle 1-2B docs and 100 QPS.

I don't fully follow the part about getting rid of plugins, spring, etc., so I 
can't comment.

Regarding the webapp - perhaps Solr and SolrJ could be used here.  Solr itself 
is a webapp, and it contains various ResponseWriters that can output XML, JSON, 
pure Ruby, Python, even binary responses (in JIRA).

Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch

----- Original Message ----
From: Dennis Kubes <[EMAIL PROTECTED]>
To: [email protected]
Sent: Friday, April 11, 2008 5:59:41 PM
Subject: Next Generation Nutch
I have been thinking about a next generation Nutch for a while now, had some talks with some of the other committers, and have gotten around to putting some thoughts / requirements down on paper. I wanted to run these by the community and get feedback. This message will be a bit long so please bear with me.
First let me define that I think that the purpose of Nutch is to be a web search engine. When I say that I mean to specifically exclude enterprise search. By web search I am talking about general or vertical search engines in the 1M-20B document range. I am excluding things such as database centric search and possibly even local filesystem search. IMO Solr is a very capable enterprise search product and could handle local filesystem search (if it doesn't already) and Nutch shouldn't try to overlap functionality. I think it should be able to interact, maybe share indexes yes, but not overlap purpose. I think that Nutch should be designed to handle large datasets, meaning it has the ability to scale to billions, perhaps 10s of billions of pages. Hadoop already gives us this capability for processing but Nutch would need to improve on the search server and shard management side of things to be able to scale to the billion page level. So the next generation of Nutch I think should focus on web scale search.
After working with Hadoop and MapReduce for the last couple of years I find it interesting just how similar development of MapReduce programs seems to be to the linux/unix philosophy of small programs chained together to accomplish big things. So going forward I see this as a healthy overall general architecture. Nutch would have many small tools that would be linked through data structures. We already do this to some extent in the current version of Nutch, an example of which would be the tools that generate and act on CrawlDatum objects (CrawlDb, UpdateDb, etc.). I would like to keep that idea of tools and data structures, with the tools chained together, perhaps only by shell or management scripts, in different pipelines acting on the data structures. When I say data structure I don't mean binary map or sequence files. These may be a standard way to store these objects but Hadoop allows any input / output formats, whether that be to HBase, a relational database, or a local filesystem. I think we should be open to having those data structures stored however is best for the user through different Hadoop formats. So a general overall architecture of tools and data structures and pipelines of these tools.
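To make the tools-and-data-structures idea concrete, here is a minimal Python sketch. It is not Nutch code and does not use Hadoop: CrawlRecord, fetch, parse, and run_pipeline are hypothetical names chosen for illustration, and the fetch step fabricates content instead of touching the network. It only shows small tools that agree on a record type being chained like programs in a shell pipeline.

```python
from dataclasses import dataclass, replace
from functools import reduce
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class CrawlRecord:
    """Hypothetical stand-in for a shared record such as CrawlDatum."""
    url: str
    status: str = "unfetched"
    content: str = ""


# A tool is anything that maps a stream of records to a stream of records.
Tool = Callable[[Iterable[CrawlRecord]], Iterable[CrawlRecord]]


def fetch(records: Iterable[CrawlRecord]) -> Iterable[CrawlRecord]:
    # Pretend fetch: no network access, just marks the record and fakes content.
    for r in records:
        yield replace(r, status="fetched", content=f"<html>{r.url}</html>")


def parse(records: Iterable[CrawlRecord]) -> Iterable[CrawlRecord]:
    # Only fetched records are parsed; the fake markup is stripped.
    for r in records:
        if r.status == "fetched":
            text = r.content.replace("<html>", "").replace("</html>", "")
            yield replace(r, status="parsed", content=text)


def run_pipeline(records: Iterable[CrawlRecord], tools: List[Tool]) -> List[CrawlRecord]:
    # Feed the output of each tool into the next one, in order.
    return list(reduce(lambda data, tool: tool(data), tools, records))


seeds = [CrawlRecord("http://example.com/a"), CrawlRecord("http://example.com/b")]
result = run_pipeline(seeds, [fetch, parse])
```

Because each tool depends only on the record type, a different fetcher or an extra analysis step could be placed in the list without changing the others.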
I currently see five or six distinct phases to a web search engine. They are: Acquire, Parse, Analyze, Index, Search, and Shard Management. OK, shard management might not be so much a phase as a functionality. Acquire is simply the acquisition of the document, be it PDF, HTML, or images. This would usually be the crawler phase. Parse is parsing that content into useful and standard data structures. I do believe that parsing should be separate and distinct from crawling. If you crawl 50% of 5M pages and the crawler dies, you should still be able to use that 50% you crawled. Analyze is what we do with the content once it is parsed into a standard structure we can use. This could be anything from a better link analysis to natural language processing, language identification, and machine learning. The analysis phase should probably have an ever expanding set of tools for different purposes. These tools would create specialized data structures of their own. Eventually through all the analysis we come up with a score for a given piece of content. That could be a document or a field. Indexing is the process of taking the analysis scores and content and creating the indexes for searching. Searching is concerned with the searching of the indexes. This should be doable from command line, web based, or other ways. Shard management is concerned with the deployment and management of large numbers of indexes.
I think the next generation of Nutch should allow the changing of different tools in any of these areas. What this means is the ability to have different components such as web crawlers (as long as the end data structure is the same), for example Fetcher, Fetcher2, Grub, Heritrix, or even specialized crawlers. And different components for different analysis types. I don't see a lot of cross-cutting concerns here. And where there are, url normalization for example, I think they can be handled better through dependency injection.
Which brings me to three. I think it is time to get rid of the plugin framework. I want to keep the functionality of the various plugins but I think a dependency injection framework, such as Spring, creating the components needed for logic inside of tools is a much cleaner way to proceed. This would allow much better unit and mock testing of tool and logic functionality. It would allow Nutch to run on a non "nutchified" Hadoop cluster, meaning just a plain old Hadoop cluster. We could have core jars and contrib jars and a contrib directory which is pulled from by shell scripts when submitting jobs to Hadoop. With the multiple-resources functionality in Hadoop it would be a simple matter of creating the correct command lines for the job to run.
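The testing argument can be illustrated with plain constructor injection. The following Python sketch is an assumption-laden illustration, not Spring and not Nutch: UrlNormalizer, LowercaseHostNormalizer, InjectorTool, and RecordingNormalizer are all hypothetical names. The point is only that a tool which is handed its collaborators, rather than looking them up through a plugin registry, can be given a test double without any framework running.

```python
from typing import Iterable, List, Protocol


class UrlNormalizer(Protocol):
    """Hypothetical interface for a cross-cutting concern."""
    def normalize(self, url: str) -> str: ...


class LowercaseHostNormalizer:
    """Lowercases scheme and host, leaves the path untouched."""
    def normalize(self, url: str) -> str:
        scheme, sep, rest = url.partition("://")
        host, slash, path = rest.partition("/")
        return f"{scheme.lower()}{sep}{host.lower()}{slash}{path}"


class InjectorTool:
    """Hypothetical tool that receives its normalizer from the outside."""
    def __init__(self, normalizer: UrlNormalizer) -> None:
        self.normalizer = normalizer

    def run(self, urls: Iterable[str]) -> List[str]:
        # Normalize, drop duplicates, and return a stable ordering.
        return sorted({self.normalizer.normalize(u) for u in urls})


class RecordingNormalizer:
    """Test double: records what it was asked to normalize."""
    def __init__(self) -> None:
        self.seen: List[str] = []

    def normalize(self, url: str) -> str:
        self.seen.append(url)
        return url


# Production-style wiring: a container would normally do this step.
tool = InjectorTool(LowercaseHostNormalizer())
normalized = tool.run(["HTTP://Example.COM/Page", "http://example.com/Page"])

# Test-style wiring: same tool class, different injected component.
mock = RecordingNormalizer()
mock_result = InjectorTool(mock).run(["http://example.com/x"])
```

A container such as Spring would take over the wiring lines at the bottom; the tool and normalizer classes would not need to know about it.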
And that brings me to separation of data and presentation. Currently the Nutch website is one monolithic JSP application with plugins. I think the next generation should segment that out into XML / JSON feeds and a separate front end that uses those feeds. Again this would make it much easier to create web applications using Nutch.
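As a rough sketch of that split, the Python below separates a data side that emits a JSON feed from a presentation side that reads only the feed. Nothing here is a Nutch or Solr API: search_feed, render_html, and the field names query, total, hits, url, and title are hypothetical.

```python
import json
from html import escape
from typing import Dict, List


def search_feed(query: str, hits: List[Dict[str, str]]) -> str:
    """Data side: serialize search hits as a JSON feed."""
    return json.dumps({"query": query, "total": len(hits), "hits": hits}, sort_keys=True)


def render_html(feed_json: str) -> str:
    """Presentation side: builds markup from the feed alone."""
    feed = json.loads(feed_json)
    items = "".join(
        f'<li><a href="{escape(h["url"])}">{escape(h["title"])}</a></li>'
        for h in feed["hits"]
    )
    return f"<ul>{items}</ul>"


hits = [{"url": "http://example.com/", "title": "Example <home>"}]
feed = search_feed("example", hits)
page = render_html(feed)
```

Any other front end, such as a command line client or a different web framework, could consume the same feed without touching the search side.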
And of course I think that shard management, a la Hadoop master and slave style, is a big requirement as well. I also think a full test suite with mock objects and local and MiniMR and MiniDFS cluster testing is important, as is better documentation and tutorials (maybe even a book :)). So up to this point I have created MapReduce jobs that use Spring for dependency injection and it is simple and works well. The above is the direction I would like to head down but I would also like to see what everyone else is thinking.
Dennis
