Hi Alex,

On Sat, Jul 3, 2010 at 14:45, Alex McLintock <[email protected]> wrote:
> Doğacan
>
> 2010/7/3 Doğacan Güney <[email protected]>:
> > I am attaching first draft of a complete nutchbase design document. There
> > are parts missing and parts not yet explained clearly but I would like to
> > get everyone's opinion on what they think so far.
>
> Thanks. I read your design and found it quite clear - at least to this
> non committer. :-)
> I would suggest that we should take this opportunity to do a full
> design document, including design which has not changed from v1 to v2.
> So more please!
>
> I have made the odd comment which was then explained later on. I guess
> that meant I was a bit confused :-)

Thanks for the excellent comments. I will try to explain as best as I can.

> > Please let me know which parts are unclear, which parts make no sense,
> > etc., and I will improve the draft.
>
> The main thing I missed was any kind of overview of the data flow. I'd
> like to see a description of how a url/webpage goes through our system
> from being unknown, injected or discovered, to being queued for
> fetching in the generator, to fetched, parsed, fetch-scheduled, scored,
> and generated again.
> Plus of course, indexed by sending to solr and being seen by an end
> user application.
>
> At each stage I'd like to see where the data is stored (text file,
> hbase, solr) and especially how this differs from the previous setup
> (text file, nutch crawldb, solr).
>
> I know that some of this may sound like a tutorial, but it is worth
> doing now rather than putting it off until later.

One of the things nutchbase attempts to do is hide all the complexity of
managing individual segments and crawl/link/whatever dbs from the user. Nutch
now delegates all storage handling to Gora (http://github.com/enis/gora).
Gora gives you a key-value store (in this case, keys are reversed URLs and
values are WebPage objects), and you do all your work through these objects.
So storage will not be an issue for you.
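As a rough illustration of the key scheme described later in the document
(this is only a sketch, not the actual Nutch/Gora implementation; the class
and method names here are made up), the reversal could look like:

```java
import java.net.URL;

public class ReverseUrl {
    // Turn http://bar.foo.com:8983/to/index.html?a=b into
    // com.foo.bar:http:8983/to/index.html?a=b
    public static String reverseUrl(String urlString) throws Exception {
        URL url = new URL(urlString);

        // Reverse the host components: bar.foo.com -> com.foo.bar
        String[] parts = url.getHost().split("\\.");
        StringBuilder key = new StringBuilder();
        for (int i = parts.length - 1; i >= 0; i--) {
            key.append(parts[i]);
            if (i > 0) key.append('.');
        }

        key.append(':').append(url.getProtocol());
        if (url.getPort() != -1) key.append(':').append(url.getPort());
        key.append(url.getFile()); // path plus query string
        return key.toString();
    }

    public static void main(String[] args) throws Exception {
        // Prints com.foo.bar:http:8983/to/index.html?a=b
        System.out.println(reverseUrl("http://bar.foo.com:8983/to/index.html?a=b"));
    }
}
```

Reversing the host components is what makes keys from the same domain sort
next to each other.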
Right now, Gora (and thus nutch) supports storing your data in hbase and sql
(with cassandra and other backends coming soon). So with nutch and gora, you
will start up your hbase/sql/cassandra/etc server(s), and nutch will figure
out what to store and where.

> > -----------------------------------------------------------------------
> > Nutchbase
> > =========
> > 1) Rationale
> > * All your data in a central location (at least, nutch gives you the
> >   illusion of centralized storage)
>
> But hbase is distributed across your hadoop cluster, right? This is
> the "illusion" you meant.

Yes. Also, cassandra will be distributed too. Maybe in the future someone
will write an HDFS-backed backend for Gora; then your data will actually live
in separate files, but it will still look like one centralized storage to
you.

> > 2) Design
> > As mentioned above, all data for a URL is stored in a WebPage object.
> > This object is accessed by a key that is the reverse form of the URL. For
> > example, http://bar.foo.com:8983/to/index.html?a=b becomes
> > com.foo.bar:http:8983/to/index.html?a=b
>
> This was clear and is the main point to convey :-) I would in fact
> like loads more info on the WebPage object.

WebPage contains all the data we have for a URL. Think Content + Parse Text +
Parse Data + Crawl Datum + Outlinks + Inlinks...

> > If URLs are stored lexicographically, this means that URLs from the same
> > domain and host are stored closer together. This will hopefully make
> > developing statistics tools for hosts and domains easier.
>
> I am unconvinced by this. Yes we want host urls together so that we
> can easily do polite fetching from individual hosts. But would it make
> statistics tools easier? Maybe i don't know enough about hbase to be
> sure.

This is not about polite fetching. Let's say you want to count the number of
fetched URLs from host foo.com.
All you would have to do is execute a scan (in hbase lingo; in Gora these are
called queries) between the start of foo.com and the end of it. Since all
URLs within a host are stored together, you do not have to go over the entire
table to compute these statistics. Makes sense?

> > Writing a MapReduce job that uses Gora for storage does not take much
> > effort.
>
> This was confusing me. I thought that using Gora meant that we were
> losing the benefits of hdfs. So if we run a map reduce job over
> machines which are also HBase nodes, does their input come from the
> hbase data stored on those nodes, to reduce internal network traffic?

IIRC, we do that in Gora already. But even if we don't (which means we forgot
to do it), using Gora means that you deal with straightforward Java objects
and Gora figures out what to store and where. As I said, your data can also
be in sql, cassandra, etc. I guess part of the confusion is that the project
was called nutchbase (implying that it is about tying nutch to hbase). But
that was just a stupid name I made up :). It is actually about abstracting
the storage code away from nutch. Hope this makes it clear.

> > specifies all the fields that this job will be reading. If plugins will
> > run during a job (for example, during ParserJob, several plugins will be
> > active), then before the job is run, those plugins must be initialized
> > and FieldPluggable#getFields must be called for all plugins to figure out
> > which fields they want to read.
>
> I'm not sure I understand how plugins are going to change in nutch
> version 2. Will everything need to be rewritten, as the API won't look
> anything like the same? They will need to manipulate WebObject and not
> ParseData or CrawlDatum?

Yes, that's the point.
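The scan idea above can be made concrete with a toy sketch (assumptions: an
in-memory TreeMap stands in for the HBase/Gora table, keys follow the
reversed-URL scheme from the document, and `countHost` is a made-up helper,
not a Gora API):

```java
import java.util.TreeMap;

public class HostScan {
    // Count entries for one host by scanning only the key range that
    // starts with the reversed host name, not the whole table.
    public static int countHost(TreeMap<String, String> table, String reversedHost) {
        // subMap is the in-memory analogue of an HBase scan / Gora query
        // over the half-open range [reversedHost, reversedHost + '\uffff').
        return table.subMap(reversedHost, reversedHost + "\uffff").size();
    }

    public static void main(String[] args) {
        TreeMap<String, String> table = new TreeMap<>();
        table.put("com.bar:http/", "page");
        table.put("com.foo:http/", "page");
        table.put("com.foo:http/about", "page");
        table.put("com.qux:http/", "page");

        // Prints 2: only the two com.foo (i.e. foo.com) keys are visited.
        System.out.println(countHost(table, "com.foo"));
    }
}
```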
Consider these two methods for scoring filters:

  public void passScoreBeforeParsing(Text url, CrawlDatum datum, Content content)
      throws ScoringFilterException;

  public void passScoreAfterParsing(Text url, Content content, Parse parse)
      throws ScoringFilterException;

Their only purpose is, as their names imply, to pass data to and from the
parse phase. This is because during parse we do not read the crawl_generate
dir, and thus we do not have access to the CrawlDatum object (hence you read
something from the crawl datum, pass it into Content's metadata, read it
there again, pass it into the parse's metadata, etc.). Instead of all that, a
plugin now simply specifies what it wants to read. Thus most plugins just
need two arguments: the reversed URL and the WebPage object.

> > Even though some of these objects are still there, most of the CrawlDatum,
> > Content, ParseData or similar objects are removed (or are slated to be
> > removed). In most cases, plugins will simply take (String key, WebPage
> > page) as arguments and modify the WebPage object in-place.
>
> This sort of makes sense, but worries me. It means more work for
> people to change their plugins. Can everyone think about how to make
> this as easy as possible?

> > 3) Jobs
> > Nutchbase uses the concept of a marker to identify what has been
> > processed and what will be processed from now on. The WebPage object
> > contains a map<String, String> called markers. For example, when
> > GeneratorJob generates a URL, it puts a unique string and a unique crawl
> > id into this marker map. Then FetcherJob only fetches a given URL if it
> > contains this marker. At the end of the crawl cycle, DbUpdaterJob clears
> > all markers (except markers placed by IndexerJob).
>
> OK, I think I understood that but we need more detail. Can we have
> multiple crawls going on at once? Can we have parsing and other
> processing going on while there are crawls?
> I am assuming that almost every job has a job id and so multiple jobs
> can be run simultaneously.

Yes. Furthermore, at the end of GeneratorJob, it gives you a crawl id (which
is just a string). Every URL that has been generated in that cycle is
uniquely identified by that crawl id. So you can run two generates (and thus
have two different crawl ids), fetch them together, parse them separately,
etc...

> > a) InjectorJob:
>
> simples
>
> > b) GeneratorJob: GeneratorJob is similar to what it was before, but it is
> > now a single job. During map phase:
> >   if FetchSchedule indicates the given URL is to be fetched, then
> >     calculate the generator score using scoring filters and
> >     output <SelectorEntry<URL, score>, WebPage>
>
> I was a bit worried about this because it meant fetchschedule can't
> know anything about the score. but maybe that is because I am putting
> too much work onto the fetchschedule.

If a fetch schedule wants to know about the score, all it has to do is ask
for the "score" field :) Calculating the generator score with scoring
filters comes after the fetch schedule step because, well, it is already like
that in the current nutch design :)

> > Then URLPartitioner partitions according to choices specified in config
> > files (by host, by ip or by domain).
>
> eh? don't understand.

Again, this is just like the current nutch design. We want all URLs in a
single host (or domain, or IP, depending on your config settings) to end up
in the same reducer. Partitioners ensure that.

> > Reduce phase counts URLs according to topN and host/domain limits and
> > marks all URLs if limits are not yet reached.
>
> ??

Yeah, this part is not clear at all :) You can tell generate to generate at
most 10 URLs per host. The reduce phase counts the number of URLs that have
been generated from that host so far, and once the limit is exceeded, no more
URLs from that host are generated.

> > GeneratorJob marks all generated URLs with a unique crawl id.
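As a toy sketch of this marker/crawl-id mechanism (the marker key
"generator" and the WebPage stand-in class are illustrative assumptions, not
the real Nutch names):

```java
import java.util.HashMap;
import java.util.Map;

public class MarkerDemo {
    // Minimal stand-in for the real WebPage and its map<String, String>
    // of markers.
    public static class WebPage {
        public Map<String, String> markers = new HashMap<>();
    }

    // GeneratorJob puts the crawl id into the marker map.
    public static void generate(WebPage page, String crawlId) {
        page.markers.put("generator", crawlId);
    }

    // FetcherJob fetches only marked pages; crawlId == null means
    // "any crawl id, as long as the generator marker is present".
    public static boolean shouldFetch(WebPage page, String crawlId) {
        String mark = page.markers.get("generator");
        if (mark == null) return false;
        return crawlId == null || crawlId.equals(mark);
    }

    public static void main(String[] args) {
        WebPage a = new WebPage();
        WebPage c = new WebPage();        // never generated
        generate(a, "X");                 // generated in crawl "X"

        System.out.println(shouldFetch(a, "X"));  // true: matching crawl id
        System.out.println(shouldFetch(a, "Y"));  // false: different crawl
        System.out.println(shouldFetch(a, null)); // true: "fetch all marked"
        System.out.println(shouldFetch(c, null)); // false: never marked
    }
}
```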
> > Fetcher, parser and indexer jobs can then use this crawl id to process
> > only URLs that are in that particular crawl cycle. Alternatively, they
> > can also work on all URLs (though, again, FetcherJob will only work on a
> > URL which has been marked by GeneratorJob. So even if FetcherJob is
> > instructed to work on "all" URLs, it will skip those without generator
> > markers).
>
> OK, I would like a bit more explanation of this :-)

I tried to give an explanation of crawl ids above. Did that make it clear?
Here is an example: you have 3 URLs: http://a.com/, http://b.com/,
http://c.com/. You run two generates consecutively (but let's say you change
URL filters between the generates). In the first run, http://a.com/ is
generated and marked with a crawl id, say, "X". Then, in the second run,
http://b.com/ is generated and marked with crawl id "Y". The third URL is not
generated in either of these runs. Now you can tell the fetcher to fetch URLs
that have crawl id "X", or "Y", or simply say "fetch all URLs that have been
marked; I don't care about crawl ids".

> > GeneratorJob will also print the crawl id to console.
> > c) FetcherJob: During map phase, FetcherJob outputs <Integer,
> > FetcherEntry<key, WebPage>> pairs. The first member of the pair (integer)
> > is a random integer between 0 and 65536. After map, these pairs are
> > partitioned into reduces according to the URL's host. Reduce works
> > exactly like the old Fetcher's map. FetcherJob can also continue
> > interrupted fetches now (by giving the "-continue" switch on the command
> > line).
>
> How do we stop fetches to interrupt them?

Ctrl+c or killing the job in the JobTracker's UI. It may also be that the
fetcher fails somehow. If it does, you do not lose your entire fetch.

> > The random integer may seem pointless but it is actually quite important
> > for performance. Let's say instead of a random integer we were to just
> > output <key, WebPage> pairs from map.
> > Again, let's say you have 1000 URLs from host a.com, and 500 URLs from
> > b.com. Let's also say that we have 10 fetcher threads so maximum queue
> > size is 500. In this case, since all URLs from host a.com will be
> > processed by reduce before all URLs from b.com, during the reduce phase
> > only one thread will fetch URLs from a.com while every other thread will
> > be spin-waiting. However, with randomization, URLs from a.com and b.com
> > will be processed in a random order, thus bandwidth utilization will be
> > higher.
>
> I'd like to see a lot more detail about this optimization. I think you
> are right, but I don't yet fully understand where the data resides in
> hadoop, and what order things get done in.

But the point is you do not need to know that :) Data is stored as
<reversed URL, WebPage> pairs and you get them in key-lexicographic order
during MapReduce. That last thing I said may not be completely true :) For
now, all Gora backends indeed return data to you in sorted order, but other
backends may not do that in the future.

> > d) ParserJob: ParserJob is straightforward. It is only a map (i.e., it
> > has 0 reducers). It simply parses all URLs with the active parse plugins.
>
> Can you explain that last sentence some more? How do we decide which
> urls need to be parsed? are they not labelled with a jobid, or crawl
> id?

ParserJob will only attempt to parse a URL if it has been marked by the
fetcher. Other than that, the same logic applies (so you can say "parse all
URLs with crawl id X, or Y", or "parse all URLs that were marked by the
fetcher").

> > e) DbUpdaterJob: This is a combination of the updatedb and invertlinks
> > jobs. If a URL is successfully parsed (which means it will contain a
> > parse marker),
>
> parse marker?
>
> > DbUpdaterJob will put its own marker. Note: It may make more sense to put
> > a marker even if a URL is not successfully parsed. DbUpdaterJob also
> > cleans all other markers.
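The queue-starvation argument from the FetcherJob discussion above can be
simulated in a few lines (a sketch under stated assumptions:
`Collections.shuffle` stands in for emitting a random integer as the map
key, and the host counts are the 1000/500 example from the text):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Random;

public class FetchOrderDemo {
    // True if both hosts show up within the first `window` records.
    public static boolean bothHostsEarly(List<String> keys, int window) {
        boolean a = false, b = false;
        for (String k : keys.subList(0, window)) {
            if (k.startsWith("com.a")) a = true;
            if (k.startsWith("com.b")) b = true;
        }
        return a && b;
    }

    public static void main(String[] args) {
        // Map output in key order: all 1000 a.com URLs, then 500 b.com URLs.
        List<String> keys = new ArrayList<>();
        for (int i = 0; i < 1000; i++) keys.add("com.a:http/page" + i);
        for (int i = 0; i < 500; i++) keys.add("com.b:http/page" + i);

        // In sorted order, the first 1000 records are all a.com, so per-host
        // politeness leaves every thread but one idle.
        System.out.println(bothHostsEarly(keys, 500)); // false

        // Random map keys (simulated by a shuffle) interleave the hosts, so
        // both fetch queues fill early and bandwidth utilization is higher.
        Collections.shuffle(keys, new Random(0));
        System.out.println(bothHostsEarly(keys, 500)); // true (w.h.p.)
    }
}
```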
> > f) IndexerJob: Goes over all URLs with a db update marker (again, you
> > can specify all URLs with update markers, or a crawl id), and indexes
> > them.
>
> db update marker?

Yes, each job will put its own marker on the URL. This indicates that the
next step can process it.

> > 4) What's missing
> > Most of the core functionality and plugins have been ported.
>
> Oh cool. That worried me earlier.

> > However, some tools and features are still missing: arc segment tools,
> > PageRank scoring, field indexing API, etc...
> >
> > --
> > Doğacan Güney

> THANKS.

-- 
Doğacan Güney

