Re: Next Generation Nutch

ogjunk-nutch Fri, 18 Apr 2008 13:44:44 -0700

Thanks for the explanation, Chris.
As for a separate crawler, there is Droids.  I don't know its exact state, but 
I did see it had a couple of GSoC entries.


Otis
--
Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch


----- Original Message ----
From: Chris Mattmann <[EMAIL PROTECTED]>
To: [email protected]
Sent: Saturday, April 12, 2008 12:29:15 AM
Subject: Re: Next Generation Nutch

Hi Otis,

Thanks for your comments. My responses inline below:

> 
> Hm, I have to say I'm not sure if I agree 100% with part 1.  I think it would
> be great to have such flexibility, but I wonder if trying to achieve it would
> be over-engineering.  Do people really need that?  I don't know, maybe!  If
> they do, then ignore my comment. :)

Well, in my experience, this is exactly what has paid off for us in the
past: having the flexibility to architect a system that isn't tied to the
underlying technology. We once had a situation at JPL where a software
product was using CORBA as its underlying middleware framework. This
(previously free) CORBA solution turned into a 30K/year licensed solution,
at the direction of the vendor, within a one-week timeframe. Because we had
architected and engineered our software system to be independent of the
underlying middleware substrate, we were able to switch over to a free,
Java RMI-based solution over a single weekend.

Of course, this is typically bound to certain classes of underlying
substrates, and middleware solutions (e.g., it would be difficult to switch
out certain middlewares with vastly different architectural styles, say, if
we were trying to switch from CORBA to a P2P based solution like JXTA), but
all I'm saying is that it would be great if we didn't have to dictate to a
potential Nutch 2.0 user that to use our scalable, open source search engine
solution, you have to change from a JMS house to a Hadoop house. It would be
nice to say that we've architected Nutch 2.0 to be independent of the
underlying middleware provider. Of course, we can provide a default
implementation based on the existing Hadoop substrate, but we should provide
interfaces, data components, and architectural guidelines to be able to
change to say, a Nutch solution over XML-RPC, or Web-Services, or JMS,
without breaking the core architecture. Right now, I'm convinced that can't
be done, or in other words, it's too hard to tease the Hadoop notions out of
Nutch as it exists today.
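
As a rough sketch of what "interfaces independent of the middleware provider"
could look like (every name below is hypothetical; none of these types exist
in Nutch or Hadoop, and the in-process backend merely stands in for a Hadoop,
JMS, or XML-RPC one):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical names throughout: not part of Nutch or Hadoop.

/** What the core would program against instead of a specific middleware API. */
interface ComputeSubstrate {
    /** Apply a unit of work to every input record and collect the results. */
    <I, O> List<O> run(List<I> inputs, Function<I, O> work);
}

/** Trivial in-process backend, standing in for Hadoop, JMS, XML-RPC, etc. */
final class LocalSubstrate implements ComputeSubstrate {
    @Override
    public <I, O> List<O> run(List<I> inputs, Function<I, O> work) {
        List<O> out = new ArrayList<>();
        for (I input : inputs) {
            out.add(work.apply(input));
        }
        return out;
    }
}

/** Core logic that only sees the interface, so the backend can be swapped. */
final class UrlNormalizerTool {
    private final ComputeSubstrate substrate;

    UrlNormalizerTool(ComputeSubstrate substrate) {
        this.substrate = substrate;
    }

    List<String> normalize(List<String> urls) {
        // Toy normalization (trim + lowercase), for illustration only.
        return substrate.run(urls, u -> u.trim().toLowerCase());
    }
}
```

Swapping the backend would then mean passing a different ComputeSubstrate
to the tool's constructor, with no change to the tool itself.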

> 
> I'm curious about 2. - could you please explain a little what you mean by "too
> tied to the underlying
> orchestration process and infrastructure."?

What I mean by this is that the Fetcher/Fetcher2 dictates the orchestration
process for crawling: there is no separate, independent Nutch crawler.
Fetcher2 itself is a MapRunnable job (i.e., a term from the Hadoop
vocabulary). In my mind, the crawler process needs to be a separate
subsystem in Nutch, independent of the underlying middleware substrate (kind
of like I'm suggesting above). As an example: how would we take the existing
Nutch Fetcher2, and run it over JMS? Or XML-RPC? Or RMI?
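
To make the question concrete, here is a minimal sketch of a crawler that
knows nothing about orchestration, plus one possible driver. All names are
hypothetical (this is not Fetcher2 or any existing Nutch class), the crawler
is a stub so the example runs without network access, and a plain Java queue
stands in for something like a JMS destination:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.BlockingQueue;

// Hypothetical names throughout: not part of Nutch or Hadoop.

/** Plain result object with no middleware types involved. */
final class FetchResult {
    final String url;
    final String content;

    FetchResult(String url, String content) {
        this.url = url;
        this.content = content;
    }
}

/** The crawler subsystem: fetches one URL, knows nothing about orchestration. */
interface Crawler {
    FetchResult fetch(String url);
}

/** Stub implementation so the sketch runs offline. */
final class StubCrawler implements Crawler {
    @Override
    public FetchResult fetch(String url) {
        return new FetchResult(url, "<html>stub for " + url + "</html>");
    }
}

/** One possible driver: drains URLs from a queue and hands them to the crawler. */
final class QueueDriver {
    static List<FetchResult> drain(BlockingQueue<String> urls, Crawler crawler) {
        List<FetchResult> results = new ArrayList<>();
        String url;
        // poll() returns null once the queue is empty, which ends the loop.
        while ((url = urls.poll()) != null) {
            results.add(crawler.fetch(url));
        }
        return results;
    }
}
```

Under this assumption, a Hadoop-based driver would wrap the same Crawler
inside a MapRunnable, while an RMI or XML-RPC driver would call fetch() from
its own request handler; the crawler code would stay the same in each case.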

So, I guess that's all I'm saying -- the Nutch 2.0 architecture should be
clearly insulated from the underlying middleware technology. That's my main
concern moving forward.

Hope that helps to explain my point of view. :) If not, let me know and I
would be happy to chat more about it. Thanks!

Cheers,
Chris


> 
> Thanks,
> Otis
> --
> Sematext -- http://sematext.com/ -- Lucene - Solr - Nutch
> 
> ----- Original Message ----
> From: Chris Mattmann <[EMAIL PROTECTED]>
> To: [email protected]
> Sent: Friday, April 11, 2008 9:10:30 PM
> Subject: Re: Next Generation Nutch
> 
> Hi Dennis,
> 
> Thanks for putting this together. I think that it's also important to add to
> this list the ability to cleanly separate out the following major
> components:
> 
> 1. The underlying distributed computing infrastructure (e.g., why does it
> have to be Hadoop? Shouldn't Nutch be able to run on top of JMS, or XML-RPC,
> or what about even grid computing technologies, and web services? Hadoop can
> certainly be _the_ core implementation of the underlying substrate, but the
> ability to change this out should be a lot easier than it currently is. Read
> on below to see what I mean.)
> 
> 2. The crawler. Right now I think it's much too tied to the underlying
> orchestration process and infrastructure.
> 
> 3. The data structures. You do mention this below, but I would add to it
> that the data structures for Nutch should be simple POJOs and not have any
> tie to the underlying infrastructure (e.g., no need for Writable methods,
> etc.)
> 
> I think that with these types of guiding principles above, along with what
> you mention below, there is the potential here to generate a really
> flexible, reusable architecture, that, when folks come along and mention,
> "Oh I've written Crawler XXX, how do I integrate it into Nutch", we don't
> have to come back and say that the entire system has to be changed; or even
> worse, that it cannot be done at all.
> 
> My 2 cents,
>  Chris
>  
> 
> 
> On 4/11/08 2:59 PM, "Dennis Kubes" <[EMAIL PROTECTED]> wrote:
> 
>> I have been thinking about a next generation Nutch for a while now, had
>> some talks with some of the other committers, and have gotten around to
>> putting some thoughts / requirements down on paper.  I wanted to run
>> these by the community and get feedback.  This message will be a bit
>> long so please bear with me.
>> 
>> First let me define that I think that the purpose of Nutch is to be a
>> web search engine.  When I say that I mean to specifically exclude
>> enterprise search.  By web search I am talking about general or vertical
>> search engines in the 1M-20B document range.  I am excluding things such
>> as database centric search and possibly even local filesystem search.
>> IMO Solr is a very capable enterprise search product and could handle
>> local filesystem search (if it doesn't already) and Nutch shouldn't try
>> to overlap functionality.  I think it should be able to interact, maybe
>> share indexes yes, but not overlap purpose.  I think that Nutch should
>> be designed to handle large datasets, meaning it has the ability to
>> scale to billions, perhaps 10s of billions of pages.  Hadoop already
>> gives us this capability for processing but Nutch would need to improve
>> on the search server and shard management side  of things to be able to
>> scale to the billion page level.  So the next generation of Nutch I
>> think should focus on web scale search.
>> 
>> After working with Hadoop and MapReduce for the last couple of years I
>> find it interesting just how similar development of MapReduce programs
>> seems to be to the linux/unix philosophy of small programs chained
>> together to accomplish big things.  So going forward I see this as a
>> healthy overall general architecture.  Nutch would have many small tools
>> that would be linked through data structures.  We already do this to
>> some extent in the current version of Nutch, an example of which would
>> be the tools that generate and act on CrawlDatum objects (CrawlDb,
>> UpdateDb, etc.).  I would like to keep that idea of tools and data
>> structures, with the tools chained together perhaps only by shell or
>> management scripts, in different pipelines acting on the data
>> structures.  When I say data structure I don't mean binary map or
>> sequence files.  These may be a standard way to store these objects but
>> Hadoop allows any input / output formats whether that be to HBase, a
>> relational database, or a local filesystem.  I think we should be open to
>> have those data structures stored however is best for the user through
>> different hadoop formats.  So a general overall architecture of tools
>> and data structures and pipelines of these tools.
>> 
>> I currently see five or six distinct phases to a web search engine.
>> They are: Acquire, Parse, Analyze, Index, Search, and Shard Management.
>>   Ok shard management might not be so much a phase as a functionality.
>> Acquire is simply the acquisition of the document be it PDF, HTML, or
>> images.  This would usually be the crawler phase.  Parse is parsing that
>> content into useful and standard data structures.  I do believe that
>> parsing should be separate and distinct from crawling.  If you crawl 50%
>> of 5M pages and the crawler dies, you should still be able to use that
>> 50% you crawled.  Analyze is what we do with the content once it is
>> parsed into a standard structure we can use.  This could be anything
>> from a better link analysis to natural language processing, language
>> identification, and machine learning.  The analysis phase should
>> probably have an ever expanding set of tools for different purposes.
>> These tools would create specialized data structures of their own.
>> Eventually through all the analysis we come up with a score for a given
>> piece of content.  That could be a document or a field.  Indexing is the
>> process of taking the analysis scores and content and creating the
>> indexes for searching.  Searching is concerned with the searching of the
>> indexes.  This should be doable from command line, web based, or other
>> ways.  Shard management is concerned with the deployment and management
>> of large numbers of indexes.
>> 
>> I think the next generation of nutch should allow the changing of
>> different tools in any of these areas.  What this means is the ability
>> to have different components such as web crawlers (as long as the end
>> data structure is the same), for example Fetcher, Fetcher2, Grub,
>> Heritrix, or even specialized crawlers.  And different components for
>> different analysis types.  I don't see a lot of cross-cutting concerns
>> here.  And where there are, URL normalization for example, I think it can
>> be handled better through dependency injection.
>> 
>> Which brings me to three.  I think it is time to get rid of the plugin
>> framework.  I want to keep the functionality of the various plugins but
>> I think a dependency injection framework, such as spring, creating the
>> components needed for logic inside of tools is a much cleaner way to
>> proceed.  This would allow much better unit and mock testing of tool and
>> logic functionality.  It would allow Nutch to run on a non "nutchified"
>> Hadoop cluster, meaning just a plain old hadoop cluster.  We could have
>> core jars and contrib jars and a contrib directory which is pulled from
>> by shell scripts when submitting jobs to Hadoop.  With the
>> multiple-resources functionality in Hadoop it would be a simple matter
>> of creating the correct command lines for the job to run.
>> 
>> And that brings me to separation of data and presentation.  Currently
>> the Nutch website is one monolithic jsp application with plugins.  I
>> think the next generation should segment that out into xml / json feeds
>> and a separate front end that uses those feeds.  Again this would make
>> it much easier to create web applications using nutch.
>> 
>> And of course I think that shard management, a la Hadoop master and
>> slave style, is a big requirement as well.  I also think a full test
>> suite with mock objects and local and MiniMR and MiniDFS cluster testing
>> is important as is better documentation and tutorials (maybe even a book
>> :)).  So up to this point I have created MapReduce jobs that use spring
>> for dependency injection and it is simple and works well.  The above is
>> the direction I would like to head down but I would also like to see
>> what everyone else is thinking.
>> 
>> Dennis
>> 
>> 
>> 
>> 
>> 
>> 
>> 
>> 
> 
> ______________________________________________
> Chris Mattmann, Ph.D.
> [EMAIL PROTECTED]
> Cognizant Development Engineer
> Early Detection Research Network Project
> _________________________________________________
> Jet Propulsion Laboratory            Pasadena, CA
> Office: 171-266B                     Mailstop:  171-246
> _______________________________________________________
> 
> Disclaimer:  The opinions presented within are my own and do not reflect
> those of either NASA, JPL, or the California Institute of Technology.
> 
> 
> 
> 
> 

______________________________________________
Chris Mattmann, Ph.D.
[EMAIL PROTECTED]
Cognizant Development Engineer
Early Detection Research Network Project
_________________________________________________
Jet Propulsion Laboratory            Pasadena, CA
Office: 171-266B                     Mailstop:  171-246
Phone:  818-354-8810
_______________________________________________________

Disclaimer:  The opinions presented within are my own and do not reflect
those of either NASA, JPL, or the California Institute of Technology.
