It is a bit convoluted, at best. I found that the links and their metadata are stored in the crawldb directory, while the raw HTTP content of the pages is stored in the individual segments.
The crawldb and the segments are MapFiles or SequenceFiles, I think, so you could use a MapFile.Reader or SequenceFile.Reader to read them and dump them out in whatever format you like. However, I haven't figured out how to associate the crawldb links with their contents: while looping through the crawldb entries, I want to look up the raw HTTP content for each link, but I don't know how to do that yet. (There is a rough sketch of the reader side in the P.S. below.)

That said, it is possible to dump both into a MySQL database, keyed on the link/URL. But that means writing to MySQL twice for each URL, which is not good for performance. That's why I am sticking with my own crawler for now; it works very well for me. Take a look at www.coolposting.com, which searches across multiple forums. The crawler behind it is one I wrote based on the Nutch architecture, storing each URL's content in MySQL.

If I open source my crawler, I will need to add some licensing terms to the code before releasing it on www.jiansnet.com. Either way, I will make the crawler available soon (open source, or closed source but free to download).

Cheers,
Jian

On Nov 27, 2007 2:20 PM, Cool Coder <[EMAIL PROTECTED]> wrote:
> Hello,
>                 I am just wondering how can I read crawldb and get content of
> each stored URL. I am not sure whether this can be possible or not.
>
> - BR
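P.S. Here is the rough, untested sketch I mentioned of reading the crawldb and a segment with the plain Hadoop readers. Class names are from memory (Nutch 0.9-era), and the part file names and the segment timestamp are just example placeholders, so don't expect to copy this verbatim:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.io.MapFile;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.protocol.Content;

public class CrawlDbDumper {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // crawldb entries: URL -> CrawlDatum (path and part name are examples)
    MapFile.Reader db =
        new MapFile.Reader(fs, "crawl/crawldb/current/part-00000", conf);

    // raw fetched content of one segment: URL -> Content (segment name is an example)
    MapFile.Reader seg =
        new MapFile.Reader(fs, "crawl/segments/20071127000000/content/part-00000", conf);

    Text url = new Text();
    CrawlDatum datum = new CrawlDatum();
    Content content = new Content();

    // loop over the crawldb and, for each URL, try to look up its raw HTTP
    // content in the segment's content MapFile (both are keyed on the URL)
    while (db.next(url, datum)) {
      if (seg.get(url, content) != null) {
        byte[] raw = content.getContent();
        System.out.println(url + " -> " + raw.length + " bytes, status=" + datum.getStatus());
      } else {
        System.out.println(url + " -> not in this segment");
      }
    }

    seg.close();
    db.close();
  }
}

The catch, as far as I can tell, is that a URL's content can live in any one of the segments, so to do the association properly you would have to open the content MapFile of every segment (or merge the segments first) and try each one until the get() returns something.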
