Doğacan

2010/7/3 Doğacan Güney <[email protected]>:
> I am attaching first draft of a complete nutchbase design document. There
> are parts missing and parts not yet explained clearly but I would like to
> get everyone's opinion on what they think so far.

Thanks. I read your design and found it quite clear - at least to this
non-committer. :-)
I would suggest we take this opportunity to write a full design
document, including the parts of the design that have not changed from
v1 to v2. So more, please!

I have made the odd comment below about something that was then
explained later on. I guess that means I was a bit confused. :-)

> Please let me
> know which parts are unclear, which parts make no sense etc, and I will
> improve the draft.


The main thing I missed was any kind of overview of the data flow. I'd
like to see a description of how a URL/webpage moves through our system:
from being unknown, to injected or discovered, to queued for fetching
by the generator, to fetched, parsed, fetch-scheduled, scored, and
generated again.
Plus, of course, indexed by sending it to Solr, where it is finally
seen by an end-user application.

At each stage I'd like to see where the data is stored (text file,
HBase, Solr) and especially how this differs from the previous setup
(text file, Nutch crawldb, Solr).

I know that some of this may sound like a tutorial, but it is worth
doing now rather than putting it off until later.
 
-----------------------------------------------------------------------------------------------------------------------------------------
> Nutchbase
> =========
> 1) Rationale
> * All your data in a central location (at least, nutch gives you the
> illusion of a centralized storage)

But HBase is distributed across your Hadoop cluster, right? Is that
the "illusion" you meant?


> 2) Design
> As mentioned above, all data for a URL is stored in a WebPage object. This
> object is accessed by a key that is the reverse form of a URL. For example,
>     http://bar.foo.com:8983/to/index.html?a=b becomes
> com.foo.bar:http:8983/to/index.html?a=b

This was clear and is the main point to convey :-) I would in fact
like loads more info on the WebPage object.
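To check my understanding of the key scheme, here is a quick sketch of the reversal in plain Java. This is my own hypothetical helper, not the actual Nutchbase utility, which may differ in detail:

```java
import java.net.URL;

public class ReverseUrl {
  // Hypothetical helper illustrating the key scheme from the draft.
  public static String reverseUrl(String urlString) throws Exception {
    URL url = new URL(urlString);
    String[] hostParts = url.getHost().split("\\.");
    StringBuilder key = new StringBuilder();
    // host components in reverse order: bar.foo.com -> com.foo.bar
    for (int i = hostParts.length - 1; i >= 0; i--) {
      key.append(hostParts[i]);
      if (i > 0) key.append('.');
    }
    key.append(':').append(url.getProtocol());
    if (url.getPort() != -1) {
      key.append(':').append(url.getPort());
    }
    key.append(url.getFile()); // path plus query string
    return key.toString();
  }
}
```

With the draft's example, `reverseUrl("http://bar.foo.com:8983/to/index.html?a=b")` yields `com.foo.bar:http:8983/to/index.html?a=b`. Is that the intended behavior for URLs with no explicit port as well?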

> If URLs are stored lexicographically, this means that URLs from same domain
> and host are stored closer together. This will hopefully make developing
> statistics tools easier for hosts and domains.


I am unconvinced by this. Yes, we want a host's URLs together so that
we can easily do polite fetching from individual hosts. But would it
make statistics tools easier? Maybe I don't know enough about HBase to
be sure.
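Having said that, the argument I *think* you are making is: with lexicographically sorted reversed keys, all pages of a domain occupy one contiguous key range, so a stats tool can do a bounded range scan instead of reading the whole table. A toy illustration, with a sorted map standing in for the HBase table (keys and names are made up):

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class PrefixScan {
  // With reversed-URL row keys sorted lexicographically, every page of a
  // domain falls into one contiguous key range, so a per-domain stats
  // tool can scan a bounded range instead of the whole table.
  public static SortedMap<String, Integer> domainRows(TreeMap<String, Integer> rows, String prefix) {
    // every key starting with prefix sorts between prefix and prefix + '\uffff'
    return rows.subMap(prefix, prefix + '\uffff');
  }
}
```

If that is the intent, it would be worth stating in the document, since HBase range scans are exactly what this key design buys us.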

> Writing a MapReduce job that uses Gora for storage does not take much
> effort.


This was confusing me. I thought that using Gora meant we were losing
the benefits of HDFS. So if we run a MapReduce job over machines which
are also HBase nodes, does their input come from the HBase data stored
locally on those nodes, reducing internal network traffic?


> specifies all the fields that this job will be reading. If plugins will run
> during a job (for example, during ParserJob, several plugins will be
> active), then before job is run, those plugins must be initialized and
> FieldPluggable#getFields must be called for all plugins to figure out which
> fields they want to read.

I'm not sure I understand how plugins are going to change in Nutch version 2.
Will everything need to be rewritten, since the API won't look anything
like the old one?
Will plugins manipulate a WebPage and not ParseData or CrawlDatum?
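For my own understanding, here is how I imagine the field-collection step before a job runs. The interface and method names are guesses based on the draft's mention of FieldPluggable#getFields:

```java
import java.util.Collection;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class FieldCollection {
  // My guess at the shape of the draft's FieldPluggable interface.
  public interface FieldPluggable {
    Collection<String> getFields();
  }

  // Before the job runs, union the fields every active plugin wants to
  // read, so the storage query only has to load those columns.
  public static Set<String> collectFields(List<FieldPluggable> plugins) {
    Set<String> fields = new HashSet<>();
    for (FieldPluggable plugin : plugins) {
      fields.addAll(plugin.getFields());
    }
    return fields;
  }
}
```

Is that roughly right - the union of all plugins' declared fields becomes the column projection for the job's Gora query?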

> Even though some of these objects are still there, most of the CrawlDatum,
> Content, ParseData or similar objects are removed (or are slated to be
> removed). In most cases, plugins will simply take (String key, WebPage page)
> as arguments and modify WebPage object in-place.

This sort of makes sense, but it worries me. It means more work for
people to migrate their plugins. Can everyone think about how to make
this as easy as possible?

> 3) Jobs
> Nutchbase uses the concept of a marker to identify what has been processed
> and what will be processed from now on. WebPage object contains a
> map<String, String> called markers. For example, when GeneratorJob generates
> a URL, it puts a unique string and a unique crawl id to this marker map.
> Then FetcherJob only fetches a given URL if it contains this marker. At the
> end of the crawl cycle, DbUpdaterJob clears all markers (except markers
> placed by IndexerJob).

OK, I think I understood that, but we need more detail. Can we have
multiple crawls going on at once? Can parsing and other processing go
on while crawls are running?

I am assuming that almost every job has a job id, so multiple jobs
can run simultaneously.
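Here is the marker mechanism as I understand it, sketched with hypothetical names (the real marker keys and WebPage API will surely differ):

```java
import java.util.HashMap;
import java.util.Map;

public class MarkerSketch {
  // Hypothetical marker key; the real constant in Nutchbase will differ.
  public static final String GENERATOR_MARK = "_gnmrk_";

  public static class WebPage {
    public Map<String, String> markers = new HashMap<>();
  }

  // GeneratorJob side: stamp the page with the unique crawl id.
  public static void markGenerated(WebPage page, String crawlId) {
    page.markers.put(GENERATOR_MARK, crawlId);
  }

  // FetcherJob side: fetch only pages carrying the generator marker, and
  // only for the requested crawl id (null means "all" - but even "all"
  // still skips unmarked pages, as the draft says).
  public static boolean shouldFetch(WebPage page, String crawlId) {
    String mark = page.markers.get(GENERATOR_MARK);
    if (mark == null) return false;
    return crawlId == null || crawlId.equals(mark);
  }
}
```

If that reading is correct, then two concurrent crawls just use two different crawl ids and never see each other's pages - which would answer my question above, so please confirm or correct it in the document.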

> a) InjectorJob:

simples

> b) GeneratorJob: GeneratorJob is similar to what it was before, but it is
> now a single job. During map phase:
>     if FetchSchedule indicates given URL is to be fetched then,
>         Calculate generator score using scoring filters
>         Output <SelectorEntry<URL, score>, WebPage>

I was a bit worried about this, because it means the FetchSchedule
can't know anything about the score. But maybe that is because I am
putting too much work onto the FetchSchedule.
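For reference, my reading of the map-phase output: the SelectorEntry key sorts by descending score, so the reduce phase sees the best-scored due URLs first and can cut off at topN cheaply. A toy stand-in, with my own class names rather than the real ones:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class GeneratorMapSketch {
  // Hypothetical stand-in for the draft's SelectorEntry<URL, score> key.
  // compareTo orders by descending score, which is what would make a
  // topN cutoff in the reduce phase cheap.
  public static class SelectorEntry implements Comparable<SelectorEntry> {
    public final String url;
    public final float score;
    public SelectorEntry(String url, float score) {
      this.url = url;
      this.score = score;
    }
    public int compareTo(SelectorEntry other) {
      return Float.compare(other.score, this.score); // higher score first
    }
  }

  // Simulates the MapReduce shuffle: entries emitted by map arrive at
  // reduce sorted by the SelectorEntry key, i.e. best-scored URLs first.
  public static List<SelectorEntry> shuffleOrder(List<SelectorEntry> emitted) {
    List<SelectorEntry> sorted = new ArrayList<>(emitted);
    Collections.sort(sorted);
    return sorted;
  }
}
```

Is descending-score ordering on the key actually how SelectorEntry is meant to work?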



> Then URLPartitioner partitions according to choices specified in config
> files (by host, by ip or by domain).

Eh? I don't understand this. Partitioned how, and what effect does
choosing host vs. IP vs. domain have on the generated fetch lists?

> Reduce phase counts URL according to topN and host/domain limits and marks
> all URLs if limits are not yet reached.

I don't follow this either. Can you expand on how topN and the
host/domain limits interact?

> GeneratorJob marks all generated URLs with a unique crawl id. Fetcher,
> parser and indexer jobs can then use this crawl id to process only URLs that
> are in that particular crawl cycle. Alternatively, they can also work on all
> URLs (though, again, FetcherJob will only work on a URL which has been
> marked by GeneratorJob. So even if FetcherJob is instructed to work on "all"
> URLs, it will skip those without generator markers).

OK, I would like a bit more explanation of this :-)


> GeneratorJob will also print crawl id to console.
> c) FetcherJob: During map phase, FetcherJob outputs <Integer,
> FetcherEntry<key, WebPage>> pairs. The first member of the pair (integer) is
> a random integer between 0 and 65536. After map, these pairs are partitioned
> into reduces according to URL's hosts. Reduce works exactly like old
> Fetcher's map. FetcherJob can also continue interrupted fetches now (by
> giving "-continue" switch on command line).

How do we stop a running fetch in order to interrupt it?

> The random integer may seem pointless but it is actually quite important for
> performance. Let's say instead of a random integer we were to just output
> <key, WebPage> pairs from map. Again, let's say you have 1000 URLs from host
> a.com, and 500 URLs from b.com. Let's also say that we have 10 fetcher
> threads so maximum queue size is 500. In this case, since all URLs from host
> a.com will be processed by reduce before all URLs from b.com, during reduce
> phase, only one thread will fetch URLs from a.com while every other thread
> will be spin-waiting. However, with randomization, URLs from a.com and b.com
> will be processed in a random order thus bandwidth utilization will be
> higher.

I'd like to see a lot more detail about this optimization. I think you
are right, but I don't yet fully understand where the data resides in
Hadoop, and what order things get done in.
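To convince myself, I reduced the argument to a toy: the map output key is a random integer, the shuffle sorts on that key, so URLs from different hosts arrive at the reducer interleaved rather than host-by-host, which keeps all per-host fetch queues busy. A sketch simulating the shuffle with a local sort - not real MapReduce code:

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Random;

public class ShuffleSketch {
  // Simulates the effect of keying map output by a random int in
  // [0, 65536): sorting on that key interleaves URLs from different
  // hosts, instead of delivering all of one host before the next.
  public static List<String> reduceOrder(List<String> urls, long seed) {
    Random rnd = new Random(seed); // seeded only to keep the sketch deterministic
    List<Map.Entry<Integer, String>> pairs = new ArrayList<>();
    for (String url : urls) {
      pairs.add(new AbstractMap.SimpleEntry<>(rnd.nextInt(65536), url));
    }
    pairs.sort(Map.Entry.comparingByKey()); // the "shuffle"
    List<String> order = new ArrayList<>();
    for (Map.Entry<Integer, String> e : pairs) {
      order.add(e.getValue());
    }
    return order;
  }
}
```

If that captures it, the design doc should also say how this interacts with the host-based partitioning after map - presumably randomization only shuffles the order *within* each reducer's partition?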


> d) ParserJob: ParserJob is straightforward. It is only a map (i.e., has 0
> reducers). It simply parses all URLs with the active parse plugins.

Can you explain that last sentence some more? How do we decide which
URLs need to be parsed? Are they not labelled with a job id or crawl
id?

> e) DbUpdaterJob: This is a combination of updatedb and invertlinks jobs. If
> a URL is successfully parsed (which means, it will contain a parse marker),

What is a parse marker? Another entry in the markers map?

> DbUpdaterJob will put its own marker. Note: It may make more sense to put a
> marker even if a URL is not successfully parsed. DbUpdaterJob also cleans
> all other markers.

> f) IndexerJob: Goes over all URLs with a db update marker (again, you can
> specify ALL URLs with update markers, or a crawl id), and indexes them.

What is a db update marker? Is it the marker DbUpdaterJob puts on
successfully parsed URLs, as described in (e)?

> 4) What's missing
> Most of the core functionality and plugins have been ported.

Oh cool. That worried me earlier.

> However, some
> tools and features are still missing: arc segment tools, PageRank scoring,
> field indexing API, etc....
> --
> Doğacan Güney
>
>

THANKS.
