It is a bit convoluted, at best. I found that the links and their metadata are stored in the crawldb directory, while the raw HTTP content of the pages is stored in the individual segments.
The crawldb and the segments are MapFiles or SequenceFiles, I think, so you could use a MapFile.Reader or SequenceFile.Reader to read them and dump them out in whatever format you like. However, I haven't figured out how to associate the crawldb links with their contents: while looping through the crawldb entries, I want to look up the raw HTTP content for each link, but I don't know how to do that yet. (There is a rough sketch of the reader side in the P.S. below.)

That said, it is possible to dump both into a MySQL database, keyed on the link/URL. But that means writing to MySQL twice for each URL, which is not good for performance. That's why I am sticking with my own crawler for now; it works very well for me. Take a look at www.coolposting.com, which searches across multiple forums. The crawler behind it is one I wrote based on the Nutch architecture, storing each URL's content in MySQL.

If I open source my crawler, I will need to add some licensing terms to the code before releasing it on www.jiansnet.com. Either way, I will make the crawler available soon (open source, or closed source but free to download).

Cheers,
Jian

On Nov 27, 2007 2:20 PM, Cool Coder <[EMAIL PROTECTED]> wrote:
> Hello,
>                 I am just wondering how can I read crawldb and get content of
> each stored URL. I am not sure whether this can be possible or not.
>
> - BR
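P.S. Here is the rough, untested sketch I mentioned of reading the crawldb and a segment with the plain Hadoop readers. Class names are from memory (Nutch 0.9-era), and the part file names and the segment timestamp are just example placeholders, so don't expect to copy this verbatim:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Content;

public class CrawlDbDumper {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // crawldb entries: URL -> CrawlDatum (path and part name are examples)
    MapFile.Reader db =
        new MapFile.Reader(fs, "crawl/crawldb/current/part-00000", conf);

    // raw fetched content of one segment: URL -> Content (segment name is an example)
    MapFile.Reader seg =
        new MapFile.Reader(fs, "crawl/segments/20071127000000/content/part-00000", conf);

    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    Content content = new Content();

    // loop over the crawldb and, for each URL, try to look up its raw HTTP
    // content in the segment's content MapFile (both are keyed on the URL)
    while (db.next(url, datum)) {
      if (seg.get(url, content) != null) {
        byte[] raw = content.getContent();
        System.out.println(url + " -> " + raw.length + " bytes, status=" + datum.getStatus());
      } else {
        System.out.println(url + " -> not in this segment");
      }
    }

    seg.close();
    db.close();
  }
}

The catch, as far as I can tell, is that a URL's content can live in any one of the segments, so to do the association properly you would have to open the content MapFile of every segment (or merge the segments first) and try each one until the get() returns something.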
