Hi Chris,

On Sat, Jul 3, 2010 at 18:35, Mattmann, Chris A (388J) <[email protected]> wrote:
> Guys,
>
> This sounds awesome. Even I could understand it, which is saying something! :)
>
> My only question: why introduce a new data structure called “Markers” when
> all that seems to be is a Metadata object. Let’s use
> o.a.tika.metadata.Metadata to represent that? My only comment then would be,
> aren’t we still doing something you mentioned you wanted to get rid of
> below, where you said: “For example, during parsing we don't have access to
> a URL's fetch status. So we copy fetch status into content metadata.” Aren’t
> we just doing the same thing with Markers?

Actually, markers used to be stored in the metadata object in WebPage
(metadata is a map from string to bytes). It just seemed clearer to me to put
them into their own field. We can discuss whether moving them back into
metadata makes more sense. One thing: we can't use Tika's Metadata object,
because the WebPage object is generated from an Avro schema.

As for your last comment: markers are only used to identify where we are in a
crawl cycle, together with the individual crawl ids. So when parse gets a URL
during MapReduce, it can easily check whether that URL has been fetched in
*that* crawl cycle (since there is no point in parsing it if it hasn't been
fetched). Markers are not used to pass any important information around; they
are just a simple tracking system. Did this make it any clearer?

> Cheers,
> Chris
>
> On 7/3/10 3:01 AM, "Doğacan Güney" <[email protected]> wrote:
>
> Hello everyone,
>
> I am attaching the first draft of a complete nutchbase design document.
> There are parts missing and parts not yet explained clearly, but I would
> like to get everyone's opinion on what they think so far. Please let me
> know which parts are unclear, which parts make no sense, etc., and I will
> improve the draft.
> -----------------------------------------------------------------------------------------------------------------------------------------
>
> Nutchbase
> =========
>
> 1) Rationale
>
> * All your data in a central location (at least, nutch gives you the
>   illusion of centralized storage).
> * No more segment/crawldb/linkdb merges.
> * No more "missing" data in a job. There are a lot of places where we copy
>   data from one structure to another just so that it is available in a
>   later job. For example, during parsing we don't have access to a URL's
>   fetch status, so we copy fetch status into content metadata. This will no
>   longer be necessary after nutchbase. When writing a job or a new plugin,
>   the programmer only needs to specify which fields she wants to read, and
>   they will be available to the plugin/job.
> * A much simpler data model. If you want to update a small part of a single
>   record now, you have to write a MapReduce job that reads the relevant
>   directory, changes the single record, removes the old directory and
>   renames the new one. With nutchbase, you can just update that record.
>
> 2) Design
>
> As mentioned above, all data for a URL is stored in a WebPage object. This
> object is accessed by a key that is the reversed form of the URL. For
> example:
>
>   http://bar.foo.com:8983/to/index.html?a=b
>   becomes
>   com.foo.bar:http:8983/to/index.html?a=b
>
> If URLs are stored lexicographically, this means that URLs from the same
> domain and host are stored close together. This will hopefully make it
> easier to develop statistics tools for hosts and domains.
>
> Writing a MapReduce job that uses Gora for storage does not take much
> effort. There is a new class called StorageUtils that has a number of
> static methods to make setting up mappers/reducers/etc. easier.
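The key reversal described above can be sketched roughly as follows. This is an illustration only, not the actual nutchbase code (the real implementation may treat protocols, ports and malformed hosts differently); the class and method names here are made up for the example.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class UrlKeys {
    /**
     * Reverses the host part of a URL so that keys from the same domain
     * sort next to each other when stored lexicographically, e.g.
     * http://bar.foo.com:8983/to/index.html?a=b
     * -> com.foo.bar:http:8983/to/index.html?a=b
     */
    public static String reverseUrl(String urlString) {
        try {
            URL url = new URL(urlString);
            String[] parts = url.getHost().split("\\.");
            StringBuilder sb = new StringBuilder();
            // Emit host labels in reverse order: bar.foo.com -> com.foo.bar
            for (int i = parts.length - 1; i >= 0; i--) {
                sb.append(parts[i]);
                if (i > 0) sb.append('.');
            }
            sb.append(':').append(url.getProtocol());
            if (url.getPort() != -1) {
                sb.append(':').append(url.getPort());
            }
            sb.append(url.getFile()); // path + query, e.g. /to/index.html?a=b
            return sb.toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException(e);
        }
    }
}
```

With keys in this form, a lexicographic scan visits all of `com.foo.*` together, which is what makes per-host and per-domain statistics cheap.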
> Here is an example (from GeneratorJob.java):
>
>   Job job = new NutchJob(getConf(), "generate: " + crawlId);
>   StorageUtils.initMapperJob(job, FIELDS, SelectorEntry.class,
>       WebPage.class, GeneratorMapper.class, URLPartitioner.class);
>   StorageUtils.initReducerJob(job, GeneratorReducer.class);
>
> An important argument is the second argument to #initMapperJob. This
> specifies all the fields that this job will be reading. If plugins will run
> during a job (for example, during ParserJob several plugins will be
> active), then before the job is run, those plugins must be initialized and
> FieldPluggable#getFields must be called for all of them to figure out which
> fields they want to read. During the map or reduce phase, modifying the
> WebPage object is as simple as using the built-in setters. All changes will
> be persisted.
>
> Even though some of these objects are still there, most of the CrawlDatum,
> Content, ParseData and similar objects are removed (or are slated to be
> removed). In most cases, plugins will simply take (String key, WebPage
> page) as arguments and modify the WebPage object in place.
>
> 3) Jobs
>
> Nutchbase uses the concept of a marker to identify what has been processed
> and what will be processed from now on. The WebPage object contains a
> map<String, String> called markers. For example, when GeneratorJob
> generates a URL, it puts a unique string and a unique crawl id into this
> marker map. FetcherJob then only fetches a given URL if it contains this
> marker. At the end of the crawl cycle, DbUpdaterJob clears all markers
> (except markers placed by IndexerJob).
>
> a) InjectorJob: This phase consists of two different jobs. The first job
> reads from a text file (as before), then puts a special inject marker. The
> second job goes over all URLs; if it finds a WebPage with an inject marker
> but nothing else (so it is a new URL), then this URL is injected (and the
> marker is deleted). Otherwise, the marker is just deleted (since the URL is
> already injected).
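The marker mechanism from section 3 can be sketched as follows. The WebPage class here is a hypothetical, stripped-down stand-in for the Avro-generated one (only the marker map is modeled), and the marker key name is an assumption made for the example, not the real constant:

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for the Avro-generated WebPage: just the marker map.
class WebPage {
    final Map<String, String> markers = new HashMap<>();
}

public class MarkerDemo {
    // Hypothetical marker key; the real name lives in the nutchbase code.
    static final String GENERATE_MARK = "_gnmrk_";

    // GeneratorJob tags a generated page with the current crawl id...
    static void markGenerated(WebPage page, String crawlId) {
        page.markers.put(GENERATE_MARK, crawlId);
    }

    // ...and FetcherJob fetches a page only if it carries the generator
    // marker for that crawl cycle. Unmarked pages are skipped, which is how
    // "work on all URLs" still ignores anything the generator never emitted.
    static boolean shouldFetch(WebPage page, String crawlId) {
        return crawlId.equals(page.markers.get(GENERATE_MARK));
    }
}
```

The same pattern repeats down the pipeline: parse marks pages for DbUpdaterJob, DbUpdaterJob marks them for IndexerJob, and clearing the map resets the cycle.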
> b) GeneratorJob: GeneratorJob is similar to what it was before, but it is
> now a single job. During the map phase:
>
>   if FetchSchedule indicates the given URL is to be fetched:
>       calculate the generator score using scoring filters
>       output <SelectorEntry<URL, score>, WebPage>
>
> SelectorEntry is sorted according to the given score, so the highest
> scoring entries will be processed first during reduce.
>
> URLPartitioner then partitions according to the choices specified in the
> config files (by host, by IP or by domain).
>
> The reduce phase counts URLs according to topN and the host/domain limits,
> and marks all URLs as long as the limits have not yet been reached.
>
> GeneratorJob marks all generated URLs with a unique crawl id. Fetcher,
> parser and indexer jobs can then use this crawl id to process only URLs
> that are in that particular crawl cycle. Alternatively, they can also work
> on all URLs (though, again, FetcherJob will only work on a URL which has
> been marked by GeneratorJob; so even if FetcherJob is instructed to work on
> "all" URLs, it will skip those without generator markers).
>
> GeneratorJob will also print the crawl id to the console.
>
> c) FetcherJob: During the map phase, FetcherJob outputs <Integer,
> FetcherEntry<key, WebPage>> pairs. The first member of each pair (the
> integer) is a random integer between 0 and 65536. After map, these pairs
> are partitioned into reduces according to the URLs' hosts. Reduce works
> exactly like the old Fetcher's map. FetcherJob can also continue
> interrupted fetches now (by giving the "-continue" switch on the command
> line).
>
> The random integer may seem pointless, but it is actually quite important
> for performance. Let's say that instead of a random integer we were to just
> output <key, WebPage> pairs from map. Again, let's say you have 1000 URLs
> from host a.com and 500 URLs from b.com. Let's also say that we have 10
> fetcher threads, so the maximum queue size is 500.
> In this case, since all URLs from host a.com will be processed by reduce
> before all URLs from b.com, during the reduce phase only one thread will
> fetch URLs from a.com while every other thread is spin-waiting. However,
> with randomization, URLs from a.com and b.com will be processed in a random
> order, and thus bandwidth utilization will be higher.
>
> d) ParserJob: ParserJob is straightforward. It is map-only (i.e., it has 0
> reducers). It simply parses all URLs using the active parse plugins.
>
> e) DbUpdaterJob: This is a combination of the updatedb and invertlinks
> jobs. If a URL was successfully parsed (which means it will contain a parse
> marker), DbUpdaterJob will put its own marker. Note: it may make more sense
> to put a marker even if a URL was not successfully parsed. DbUpdaterJob
> also cleans all other markers.
>
> f) IndexerJob: Goes over all URLs with a db update marker (again, you can
> specify all URLs with update markers, or a crawl id) and indexes them.
>
> 4) What's missing
>
> Most of the core functionality and plugins have been ported. However, some
> tools and features are still missing: arc segment tools, PageRank scoring,
> the field indexing API, etc.
>
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Chris Mattmann, Ph.D.
> Senior Computer Scientist
> NASA Jet Propulsion Laboratory Pasadena, CA 91109 USA
> Office: 171-266B, Mailstop: 171-246
> Email: [email protected]
> WWW: http://sunset.usc.edu/~mattmann/
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
> Adjunct Assistant Professor, Computer Science Department
> University of Southern California, Los Angeles, CA 90089 USA
> ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++

--
Doğacan Güney

