Re: extracting urls into text files

Enis Soztutar Fri, 16 Mar 2007 05:16:22 -0800

cha wrote:

Hi Sagar,


Thanks for the reply.

Actually, I am trying to dig through the code in the same class, but I am not
able to figure out where the URLs are read from.

When you dump the database, the file contains:

http://blog.cha.com/    Version: 4
Status: 2 (DB_fetched)
Fetch time: Fri Apr 13 15:58:28 IST 2007
Modified time: Thu Jan 01 05:30:00 IST 1970
Retries since fetch: 0
Retry interval: 30.0 days
Score: 0.062367838
Signature: 2b4e94ff83b8a4aa6ed061f607683d2e
Metadata: null

I figured out the rest of the fields, but I am not sure how the URL itself is
read.

I just want the plain URLs in the text file. Is it also possible to write the
URLs in some XML format? If yes, then how?

Awaiting,

Chandresh

Hi, the crawldb is actually a map file, which has URLs as keys (Text class) and CrawlDatum objects as values. You can write a generic map file reader that extracts the keys and dumps them to a file.
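
If writing a map file reader is more than you need, a simpler workaround (not the map file approach itself) is to filter the text dump quoted above. The sketch below assumes every record starts with a header line of the form "<url><whitespace>Version: <n>", as in the sample; the function names and file paths are made up for illustration. It is written in Python only to keep it short.

```python
import re
import xml.etree.ElementTree as ET

# Assumed header shape, taken from the sample dump: "<url>  Version: <n>".
HEADER = re.compile(r"^(\S+)\s+Version:\s*\d+\s*$")


def extract_urls(dump_lines):
    """Yield the URL from each record header line of a crawldb text dump."""
    for line in dump_lines:
        match = HEADER.match(line)
        if match:
            yield match.group(1)


def urls_to_xml(urls):
    """Wrap URLs in a minimal, made-up <urls><url>...</url></urls> layout."""
    root = ET.Element("urls")
    for url in urls:
        ET.SubElement(root, "url").text = url
    return ET.tostring(root, encoding="unicode")


def dump_to_text_file(dump_path, out_path):
    """Write one URL per line; both paths are hypothetical placeholders."""
    with open(dump_path, encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for url in extract_urls(src):
            dst.write(url + "\n")


# The record from the quoted message, used as sample input.
SAMPLE = """http://blog.cha.com/    Version: 4
Status: 2 (DB_fetched)
Fetch time: Fri Apr 13 15:58:28 IST 2007
Retries since fetch: 0
Metadata: null
"""
```

Calling dump_to_text_file("crawldb_dump.txt", "urls.txt") (both names are placeholders) would give one URL per line, and urls_to_xml(extract_urls(...)) covers the XML variant. If your dump uses a different header layout, the HEADER pattern is the only part that should need adjusting.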
