Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Nutch Wiki" for change 
notification.

The following page has been changed by DennisKubes:
http://wiki.apache.org/nutch/Nutch2Architecture

The comment on the change is:
Rough Draft of Nutch 2 Architecture

------------------------------------------------------------------------------
  = Nutch 2.0 Architecture =
  
- The purpose of this page is to discuss ideas for the architecture of the next 
generation of Nutch.  This page will be fleshed out more with different 
sections shortly but for now feel free to write down any ideas or desires you 
may have for Nutch2.
+ === Overview ===
+   1.Reuse of existing Nutch codebase
+     1.While some things will change, this architecture is more a refactor than a complete re-write.  Much of the existing codebase, including plugin functionality, should be reused.
+   2.Dependency Injection
+     1.Remove the plugin framework and use a DI framework, Spring for example, to create mapper and reducer classes that are auto-injected with dependencies.  This will require modifications to the Hadoop codebase.
+     2.Have mock objects that make it easy to test jobs.
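As a rough sketch, the injection idea above could be plain constructor injection: dependencies are handed to the mapper rather than looked up through the plugin framework, so a mock can be swapped in for tests. This is hypothetical illustration, not existing Nutch code; `UrlFilter` and `GenerateMapper` are made-up names.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical dependency that a DI framework such as Spring would wire in.
interface UrlFilter {
    boolean accept(String url);
}

// A mapper-like class whose dependencies arrive through the constructor,
// which makes it trivial to test with mock objects.
class GenerateMapper {
    private final UrlFilter filter;

    GenerateMapper(UrlFilter filter) {
        this.filter = filter;
    }

    List<String> map(List<String> urls) {
        return urls.stream().filter(filter::accept).collect(Collectors.toList());
    }
}

public class DiSketch {
    public static void main(String[] args) {
        // A mock filter injected for testing: only http urls pass.
        UrlFilter mock = url -> url.startsWith("http://");
        GenerateMapper mapper = new GenerateMapper(mock);
        System.out.println(mapper.map(Arrays.asList("http://a.com", "ftp://b.com")));
    }
}
```

The same mapper runs unchanged in production with a real filter and in tests with a mock, which is the testability point made above.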
  
+ === Data Structures ===
+   1.CrawlBase
+     1.Url → CrawlHistory
+     2.CrawlHistory is a list of CrawlDatum objects ordered by reverse date
+     3.CrawlDatum has Metadata
+   2.CrawlList
+     1.Url → CrawlHistory
+     2.Kept separate from CrawlBase to allow multiple concurrent crawls.
+   3.FetchedContent
+     1.Url → BytesWritable, FetchStatus
+     2.FetchStatus would hold the status of the fetch: error codes and any other fetch information.  This would then be translated by another tool back into the CrawlBase.  FetchStatus has Metadata.
+   4.ParsedContent
+     1.Url → MapWritable
+     2.MapWritable would map Text → Writable or Writable[] and would allow parsing of all the different types of elements within the content (hrefs, headers, etc.).
+   5.Processing
+     1.Processing would take the ParsedContent and translate that into 
multiple specific data parts. These data parts aren't used by any part of the 
system except Scoring.
+     2.Processing would cover specific functions including updating the CrawlBase, performing analysis on ParsedContent, and integrating data from other sources.
+     3.Some processors would translate content into formats needed by scorers.
+     4.Processors are not constrained by specific data structures, to allow flexibility in analysis, updating, blocking or removal, and other forms of data processing.  The only requirement is that scoring programs must be able to access processing output data structures in a one-to-one relationship.
+   6.Scoring
+     1.Url → Field
+     2.Url → Float
+     3.Field is a name, value(s), and score, being Text, Text, and Float 
respectively.
+     4.These fields become the fields that are indexed, with the scores becoming field boosts.
+     5.Scoring takes the specific data parts from analysis and outputs the 
above formats.
+     6.Field needs Lucene semantics.
+   7.Indexing
+     1.Indexing indexes Fields for a document according to the field values and boosts.  Indexing does not determine either field values or boost values.
+     2.Indexing aggregates document boosts to create a final document score.
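The core Url → CrawlHistory structure above can be sketched in plain Java (a simplified stand-in; the real classes would implement Hadoop's Writable interface and carry full metadata, and the names here mirror the outline rather than any existing code):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Simplified CrawlDatum: a fetch attempt with a status and metadata.
class CrawlDatum {
    final long fetchTime;                       // epoch millis of the attempt
    final int status;                           // crawl/fetch status code
    final Map<String, String> metadata = new HashMap<>();

    CrawlDatum(long fetchTime, int status) {
        this.fetchTime = fetchTime;
        this.status = status;
    }
}

// CrawlHistory: CrawlDatum objects kept in reverse date order,
// so the most recent attempt is always first.
class CrawlHistory {
    private final List<CrawlDatum> datums = new ArrayList<>();

    void add(CrawlDatum d) {
        datums.add(d);
        datums.sort(Comparator.comparingLong((CrawlDatum c) -> c.fetchTime).reversed());
    }

    CrawlDatum latest() {
        return datums.isEmpty() ? null : datums.get(0);
    }
}

// CrawlBase: Url -> CrawlHistory.
class CrawlBase {
    final Map<String, CrawlHistory> base = new HashMap<>();

    void record(String url, CrawlDatum d) {
        base.computeIfAbsent(url, k -> new CrawlHistory()).add(d);
    }
}

public class CrawlBaseSketch {
    public static void main(String[] args) {
        CrawlBase base = new CrawlBase();
        base.record("http://example.com/", new CrawlDatum(1000L, 1));
        base.record("http://example.com/", new CrawlDatum(2000L, 2));
        // The most recent datum (fetchTime 2000) comes first.
        System.out.println(base.base.get("http://example.com/").latest().status);
    }
}
```

CrawlList would share the same Url → CrawlHistory shape, kept as a separate structure so several crawls can run concurrently.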
+ 
+ === Tools ===
+   1.Injector
+     1.Injects data sources into the CrawlBase, creating new CrawlBase and CrawlHistory objects.
+     2.This could also be used to update the status or change the state of 
Urls in the CrawlBase manually.
+   2.Generator
+     1.Creates CrawlLists from the CrawlBase
+     2.Filters could be created to run on only a subset of the CrawlBase Urls.
+   3.Fetcher
+     1.Fetches CrawlLists objects and creates FetchedContent objects.
+   4.UpdateCrawl
+     1.Updates the CrawlBase Urls with the FetchStatus objects of the FetchedContent.
+     2.This does not add new Urls to the database, only updates current Urls.
+   5.Parser
+     1.Creates ParsedContent objects from FetchedContent sources.
+     2.Runs multiple different parsers based on different conditions.
+   6.Processors
+     1.New Url Processor
+       1.A processor which updates the CrawlBase with new urls parsed from ParsedContent sources.
+     2.Html Processor
+       1.Does specific processing on Html sources from ParsedContent.
+     3.Link Processor
+       1.Creates a specific database of Url → Inlinks and Outlinks.
+     4.BlackList Processor
+       1.Processor which removes urls and their content from being indexed if 
they are on a blacklist.
+     5.Other Processors
+       1.There should be many other tools here that perform specific functions such as language identification, handling redirected urls and scoring, etc.
+   7.Scorers
+     1.Html Scorer
+       1.Scores Html analysis.
+     2.Link Scorer
+       1.Creates a page-rank-type score from the link analysis.
+     3.Other Scorers
+   8.Indexer
+     1.Creates Lucene indexes from multiple Scoring objects.
+   9.Query tools
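The Field structure used by the Scorers and Indexer above might look like the following (a simplified sketch: the names are illustrative, Lucene itself is not used, and summing field boosts into a document score is an assumption of this example, since the page does not say how boosts are aggregated):

```java
import java.util.Arrays;
import java.util.List;

// Field as described under Scoring: a name, value(s), and a score
// that becomes the field boost at index time.
class Field {
    final String name;
    final List<String> values;
    final float score;

    Field(String name, float score, String... values) {
        this.name = name;
        this.score = score;
        this.values = Arrays.asList(values);
    }
}

public class ScoringSketch {
    // Indexing does not compute boosts; it only aggregates the boosts
    // supplied by the scorers into a final document score.  A plain sum
    // is used here purely for illustration.
    static float documentScore(List<Field> fields) {
        float total = 0f;
        for (Field f : fields) {
            total += f.score;
        }
        return total;
    }

    public static void main(String[] args) {
        List<Field> doc = Arrays.asList(
            new Field("title", 2.0f, "Nutch 2.0 Architecture"),
            new Field("anchor", 1.5f, "nutch", "crawler"));
        System.out.println(documentScore(doc));
    }
}
```

This keeps the division of labor described above: scorers decide field values and boosts, and the indexer only applies and aggregates them.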
+ 
+ === Supporting Infrastructure ===
+   1.Shard management
+     1.Perhaps a separate project to be shared by Lucene, Solr, and Nutch.
+     2.Nutch shards need other content besides indexes: summaries and links, for example.
+   2.Cluster management
+   3.Automated job streams for nutch processes
+   4.Build and command line scripts
+     1.Allow packaging of all core and third-party contrib jars to run on a standard Hadoop cluster.
+     2.People should be able to create an extension, drop in a jar, and have it just run, with no need to deploy the jar manually to all slaves.
+   5.Full unit testing suite, documentation and tutorials
+     1.Maybe a book would be good.  We definitely need documentation to lead a 
person from installation to extension.
+ 
