Andrzej Bialecki wrote:
Howie Wang wrote:
I definitely don't expect people to write it just because it happens to be useful to me :-) Call me crazy, but I'm thinking of implementing this when I get some free time (whenever that will be). It seems that I would just need to implement IWebDBWriter and IWebDBReader, and then add a command-line option to the tools (something like -mysql) to specify the type of db to instantiate. It would affect about 15 files, but the tool changes would be simple -- a few if statements here and there. Does that sound right?

Howie
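
For illustration, a minimal sketch of the dispatch described above, against the 0.7-era interfaces. Only IWebDBReader/IWebDBWriter and WebDBReader come from Nutch 0.7; the factory class, the openReader() helper, and MySQLDBReader are hypothetical names, and the WebDBReader constructor signature is recalled from the 0.7 tree and may differ:

import java.io.File;
import org.apache.nutch.db.IWebDBReader;
import org.apache.nutch.db.WebDBReader;
import org.apache.nutch.fs.NutchFileSystem;

public class WebDBFactory {

  /** Pick a reader implementation based on a command-line flag. */
  public static IWebDBReader openReader(NutchFileSystem nfs, File dbDir,
                                        boolean useMysql) throws Exception {
    if (useMysql) {
      return new MySQLDBReader(dbDir);   // hypothetical JDBC-backed reader
    }
    return new WebDBReader(nfs, dbDir);  // the stock file-based reader
  }
}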

You are talking about the codebase from branch 0.7. This branch is not under active development. The current codebase is very different - it uses the MapReduce framework to process data in a distributed fashion.

So, there is no single interface for writing the CrawlDb. There is one class for reading the CrawlDb, but the data in the DB is usually used not standalone but as one of many inputs to a map-reduce job.
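
For example, here is a rough sketch (using the old org.apache.hadoop.mapred API; the exact helper methods varied across Hadoop versions) of how CrawlDb data is typically consumed: the db's "current" directory is added as one of several inputs to a job, alongside e.g. segment data. The <Text, CrawlDatum> key/value types and the "current" subdirectory name match the Nutch layout; the class and method here are illustrative only:

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SequenceFileInputFormat;
import org.apache.nutch.crawl.CrawlDatum;

public class CrawlDbJobSketch {

  /** Configure a job that reads the CrawlDb alongside another input. */
  public static JobConf createJob(Path crawlDb, Path otherInput, Path out) {
    JobConf job = new JobConf(CrawlDbJobSketch.class);
    // CrawlDb entries live under crawldb/current as <Text url, CrawlDatum>
    FileInputFormat.addInputPath(job, new Path(crawlDb, "current"));
    FileInputFormat.addInputPath(job, otherInput);  // e.g. segment data
    job.setInputFormat(SequenceFileInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(CrawlDatum.class);
    FileOutputFormat.setOutputPath(job, out);
    return job;
  }
}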

To summarize - I think it would be very difficult to do this with the current codebase.

My URLs are at most on the order of 1,000,000 per site;
perhaps I can run some tests and go ahead with the idea.

Based on 0.9, it seems the simplest way to achieve this is as follows.
For any MapReduce job associated with the CrawlDb, I add operations like these:
Read the relational DB to generate a temporary CrawlDb to use as the job's CrawlDb input path;
Read the job-generated CrawlDb to update the relational DB (see the sketch below).
Is that right?
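
A hedged sketch of the export half of that bridge, in the old Hadoop API: dump URLs from the relational DB into a temporary SequenceFile of <Text, CrawlDatum> that a job can take as input. The table and column names (urls, url) and the bridge class itself are hypothetical; note that a real CrawlDb keeps sorted MapFiles under "current", so this shortcut only works as generic job input, not as a drop-in CrawlDb:

import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.Statement;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;

public class RdbmsCrawlDbBridge {

  /** Dump URLs from the relational DB into a temporary CrawlDb-style input. */
  public static void exportToCrawlDb(Connection conn, Configuration conf,
                                     Path tmpCrawlDb) throws Exception {
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(tmpCrawlDb, "part-00000"),
        Text.class, CrawlDatum.class);
    try {
      Statement st = conn.createStatement();
      ResultSet rs = st.executeQuery("SELECT url FROM urls"); // hypothetical schema
      CrawlDatum datum = new CrawlDatum();
      datum.setStatus(CrawlDatum.STATUS_DB_UNFETCHED);
      while (rs.next()) {
        writer.append(new Text(rs.getString(1)), datum);
      }
      rs.close();
      st.close();
    } finally {
      writer.close();
    }
  }
}

The reverse pass would read the job's output with SequenceFile.Reader and issue UPDATE statements back into the relational DB.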
