Andrzej Bialecki wrote:
Howie Wang wrote:
I definitely don't expect people to write it just because it happens to be useful to me :-) Call me crazy, but I'm thinking of implementing this when I get some free time (whenever that will be). It seems that I would just need to implement IWebDBWriter and IWebDBReader, and then add a command-line option to the tools (something like -mysql) to specify the type of db to instantiate. It would affect about 15 files, but the tool changes would be simple -- a few if statements here and there. Does that sound right?

Howie
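
For illustration, a minimal sketch of the dispatch described above, against the 0.7-era interfaces. Only IWebDBReader/IWebDBWriter and WebDBReader come from Nutch 0.7; the factory class, the openReader() helper, and MySQLDBReader are hypothetical names, and the WebDBReader constructor signature is recalled from the 0.7 tree and may differ:

import java.io.File;
import org.apache.nutch.db.IWebDBReader;
import org.apache.nutch.db.WebDBReader;
import org.apache.nutch.fs.NutchFileSystem;

public class WebDBFactory {

  /** Pick a reader implementation based on a command-line flag. */
  public static IWebDBReader openReader(NutchFileSystem nfs, File dbDir,
                                        boolean useMysql) throws Exception {
    if (useMysql) {
      return new MySQLDBReader(dbDir);   // hypothetical JDBC-backed reader
    }
    return new WebDBReader(nfs, dbDir);  // the stock file-based reader
  }
}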

You are talking about the codebase from branch 0.7. This branch is not under active development. The current codebase is very different - it uses the MapReduce framework to process data in a distributed fashion.

So, there is no single interface for writing the CrawlDb. There is one class for reading the CrawlDb, but the data in the DB is usually used not standalone but as one of many inputs to a map-reduce job.
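
For example, here is a rough sketch (using the old org.apache.hadoop.mapred API; the exact helper methods varied across Hadoop versions) of how CrawlDb data is typically consumed: the db's "current" directory is added as one of several inputs to a job, alongside e.g. segment data. The <Text, CrawlDatum> key/value types and the "current" subdirectory name match the Nutch layout; the class and method here are illustrative only:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.nutch.crawl.CrawlDatum;

public class CrawlDbJobSketch {

  /** Configure a job that reads the CrawlDb alongside another input. */
  public static JobConf createJob(Path crawlDb, Path otherInput, Path out) {
    JobConf job = new JobConf(CrawlDbJobSketch.class);
    // CrawlDb entries live under crawldb/current as <Text url, CrawlDatum>
    FileInputFormat.addInputPath(job, new Path(crawlDb, "current"));
    FileInputFormat.addInputPath(job, otherInput);  // e.g. segment data
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    FileOutputFormat.setOutputPath(job, out);
    return job;
  }
}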

To summarize - I think it would be very difficult to do this with the current codebase.

My URLs are at most on the order of 1,000,000 per site;
perhaps I can run some tests and go ahead with the idea.

Based on 0.9, it seems the simplest way to achieve this is as follows.
For any MapReduce job associated with the CrawlDb, I add operations like these:
Read the relational DB to generate a temporary CrawlDb to use as the job's CrawlDb input path;
Read the job-generated CrawlDb to update the relational DB (see the sketch below).
Is that right?
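
A hedged sketch of the export half of that bridge, in the old Hadoop API: dump URLs from the relational DB into a temporary SequenceFile of <Text, CrawlDatum> that a job can take as input. The table and column names (urls, url) and the bridge class itself are hypothetical; note that a real CrawlDb keeps sorted MapFiles under "current", so this shortcut only works as generic job input, not as a drop-in CrawlDb:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class RdbmsCrawlDbBridge {

  /** Dump URLs from the relational DB into a temporary CrawlDb-style input. */
  public static void exportToCrawlDb(Connection conn, Configuration conf,
                                     Path tmpCrawlDb) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(tmpCrawlDb, "part-00000"),
        Text.class, CrawlDatum.class);
    try {
      Statement st = conn.createStatement();
      ResultSet rs = st.executeQuery("SELECT url FROM urls"); // hypothetical schema
      CrawlDatum datum = new CrawlDatum();
      datum.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
      while (rs.next()) {
        writer.append(new Text(rs.getString(1)), datum);
      }
      rs.close();
      st.close();
    } finally {
      writer.close();
    }
  }
}

The reverse pass would read the job's output with SequenceFile.Reader and issue UPDATE statements back into the relational DB.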
