wangxu wrote:
> Andrzej Bialecki wrote:
>> Howie Wang wrote:
>>> I definitely don't expect people to write it just because it happens
>>> to be useful to me :-) Call me crazy, but I'm thinking of
>>> implementing this when I get some free time (whenever that will be).
>>> It seems that I would just need to implement IWebDBWriter and
>>> IWebDBReader, and then add a command line option to the tools
>>> (something like -mysql) to specify the type of db to instantiate. It
>>> would affect about 15 files, but the tools changes would be simple
>>> -- a few if statements here and there. Does that sound right?
>>> Howie
>>
>> You are talking about the codebase from branch 0.7. This branch is not
>> under active development. The current codebase is very different - it
>> uses the MapReduce framework to process data in a distributed fashion.
>>
>> So, there is no single interface for writing the CrawlDb. There is one
>> class for reading the CrawlDb, but usually the data in the DB is used
>> not standalone, but as one of many inputs to a map-reduce job.
>>
>> To summarize - I think it would be very difficult to do this with the
>> current codebase.
>>
> My URLs number at most about 1,000,000 per site;
> perhaps I can do some tests and go on with the idea.
>
> Based on 0.9, it seems the simplest way to achieve this is as follows.
> For any MapReduce job associated with the CrawlDb, I add these operations:
> 1. Read the relational DB to generate a temporary CrawlDb to use as the
>    job's CrawlDb input path;
> 2. Read the job-generated CrawlDb to update the relational DB.
> Is that right?

Yes, it should be possible. You just need to keep track of the page
statuses and metadata that are normally kept in the CrawlDb. Also, if you
want to update the relational DB in a map-reduce job, you need to be
careful about opening new connections to the DB: it is best to set up the
connection in the Mapper/Reducer configure() methods.

-- 
Best regards,
Andrzej Bialecki
Information Retrieval, Semantic Web
Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
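
To illustrate the advice about DB connections, here is a minimal sketch of the "one connection per task" pattern. It deliberately avoids Hadoop and JDBC so that it runs on its own: ConnectionPerTaskSketch, FakeDbConnection, DbUpdatingReducer and runTask are hypothetical names made up for this example, and the configure()/reduce()/close() methods only imitate the task lifecycle of a Mapper/Reducer. In a real job the connection would be a JDBC connection opened in configure() and released in close().

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch: no Hadoop or JDBC dependency. All names are hypothetical.
class ConnectionPerTaskSketch {

    /** Stand-in for a relational DB connection; it only records what was written. */
    static class FakeDbConnection {
        final List<String> written = new ArrayList<>();
        boolean closed = false;

        void upsert(String url, String status) {
            if (closed) {
                throw new IllegalStateException("connection is closed");
            }
            written.add(url + "=" + status);
        }

        void close() {
            closed = true;
        }
    }

    /** Imitates the configure()/reduce()/close() lifecycle of a reducer task. */
    static class DbUpdatingReducer {
        FakeDbConnection conn;
        int connectionsOpened = 0;

        // Called once per task, before any record: open the connection here.
        void configure() {
            conn = new FakeDbConnection();
            connectionsOpened++;
        }

        // Called once per record: reuse the connection opened in configure().
        void reduce(String url, String status) {
            conn.upsert(url, status);
        }

        // Called once per task, after the last record: release the connection.
        void close() {
            if (conn != null) {
                conn.close();
            }
        }
    }

    /** Runs one simulated task over the given (url, status) pairs. */
    static DbUpdatingReducer runTask(String[][] records) {
        DbUpdatingReducer reducer = new DbUpdatingReducer();
        reducer.configure();
        try {
            for (String[] record : records) {
                reducer.reduce(record[0], record[1]);
            }
        } finally {
            reducer.close(); // always release the connection, even on failure
        }
        return reducer;
    }

    public static void main(String[] args) {
        String[][] records = {
            {"http://example.com/a", "fetched"},
            {"http://example.com/b", "unfetched"},
            {"http://example.com/c", "gone"},
        };
        DbUpdatingReducer reducer = runTask(records);
        System.out.println("connections opened: " + reducer.connectionsOpened);
        System.out.println("rows written: " + reducer.conn.written.size());
    }
}
```

The point of the sketch is that the number of connections depends on the number of tasks, not on the number of records. Opening a connection inside reduce() (or map()) would instead create one per record, which at around a million URLs per site would likely overwhelm the database.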