Re: [Nutch-dev] How to modify crawldb values

Doğacan Güney Tue, 23 Jan 2007 07:06:56 -0800

Hi,

Armel T. Nene wrote:
> Hi guys,
>
>  
>
> I want to extend Nutch to use real-time indexing on local file system. I
> have been through the source code to find out ways to modify values stored
> in CrawlDB. The idea is simple:
>
>  
>
> I have an external program (or a script) which checks for changes in a
> directory (url injected in the crawldb). When there are new changes
> recorded, the program will update the status in the crawldb and generate a
> new fetch list for the fetcher to fetch. I do not want to make great changes
> to the nutch source code as I want the program to be compatible with future
> releases. Now, I know the crawldatum is saved in the crawldb with the url. I
> am not too sure but I think the url is the key to retrieve the crawldatum.
> For my program to work successfully, I need to know the following:
>
>  
>
> *         How to read data from the crawldb; what data structure does it use
> and how to referenced to it?
>


Crawldb is essentially a list of <url, CrawlDatum> pairs and is stored 
as a MapFile. So you can look up the entry for a url with MapFile.Reader.get.
> *         How to write back to the crawldb; updating information back to the
> crawldb or probably creating a new with changed and unchanged values.
>   
The current FS implementation is write-once, so you can't modify the 
crawldb in place. But you can read it entry by entry (possibly with 
MapFile.Reader.next) and then write a new one with MapFile.Writer.
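
To make the read-then-rewrite idea concrete, here is a rough sketch in plain 
Java. It does not call the Hadoop or Nutch APIs: a TreeMap stands in for the 
MapFile, and Datum is a hypothetical stand-in for CrawlDatum with only two 
fields. The class name, the changedUrls set and the choice to mark changed 
entries as unfetched and due now are all assumptions for illustration. With 
the real classes, the loop would presumably be driven by MapFile.Reader.next 
and each entry appended with MapFile.Writer.

```java
import java.util.Map;
import java.util.Set;
import java.util.SortedMap;
import java.util.TreeMap;

class CrawlDbRewrite {
    // Hypothetical stand-in for CrawlDatum; only status and fetch time are modelled.
    static final class Datum {
        final int status;
        final long fetchTime;

        Datum(int status, long fetchTime) {
            this.status = status;
            this.fetchTime = fetchTime;
        }
    }

    static final int DB_UNFETCHED = 1; // illustrative value, an assumption
    static final int DB_FETCHED = 2;   // as in "Status: 2 (DB_fetched)" in the dump

    // Walk the old db in key order and build a new one. The old db is never
    // modified, mirroring the write-once constraint.
    static SortedMap<String, Datum> rewrite(SortedMap<String, Datum> oldDb,
                                            Set<String> changedUrls,
                                            long now) {
        SortedMap<String, Datum> newDb = new TreeMap<>();
        for (Map.Entry<String, Datum> e : oldDb.entrySet()) {
            Datum d = e.getValue();
            if (changedUrls.contains(e.getKey())) {
                // Changed on disk: mark as unfetched and due immediately.
                d = new Datum(DB_UNFETCHED, now);
            }
            // Unchanged entries are copied through as they are.
            newDb.put(e.getKey(), d);
        }
        return newDb;
    }

    public static void main(String[] args) {
        SortedMap<String, Datum> oldDb = new TreeMap<>();
        oldDb.put("file:///docs/a.txt", new Datum(DB_FETCHED, 2000L));
        oldDb.put("file:///docs/b.txt", new Datum(DB_FETCHED, 3000L));

        SortedMap<String, Datum> newDb =
                rewrite(oldDb, Set.of("file:///docs/b.txt"), 1000L);
        for (Map.Entry<String, Datum> e : newDb.entrySet()) {
            System.out.println(e.getKey() + " status=" + e.getValue().status
                    + " fetchTime=" + e.getValue().fetchTime);
        }
    }
}
```

Reading sequentially returns the entries in sorted key order, which is also 
the order MapFile.Writer expects when appending, so a straight copy loop 
like this keeps the new file valid.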

>  
>
> This is an extract from the crawldb:
>
>  
>
> http://some-url.com/    Version: 4
> Status: 2 (DB_fetched)
> Fetch time: Thu Feb 22 12:44:05 GMT 2007
> Modified time: Thu Jan 01 01:00:00 GMT 1970
> Retries since fetch: 0
> Retry interval: 30.0 days
> Score: 1.0323955
> Signature: f4c14c46074b66aad8829b8aa84cd636
> Metadata: null
>
>  
>
> How can get this information with an external program and modify/ update it.
> Once I know how to implement that part, I can call nutch in the usual way of
> generate - fetch - updatedb - updatelinkdb -index -etc.. so generate will
> have the new value that I want re-indexed. This will stop the fetcher from
> fetching a long list of urls (changed or unchanged but need fetching because
> of their next_fetch_time is due). The program gets its update from the
> underlying OS to know notify about any changes to files and folders being
> monitored. Once the program is working with sufficient tests, I will be
> willing to share the source code; it's written in java and doesn't need any
> script to launch nutch.
>
>  
>
> I will be looking forward to your kind support.
>
>  
>
> Armel
>
>  
>
> -------------------------------------------------
>
> Armel T. Nene
>
> iDNA Solutions
>
> Tel: +44 (207) 257 6124
>
> Mobile: +44 (788) 695 0483 
>
>  <http://blog.idna-solutions.com/> http://blog.idna-solutions.com
>
>  
>
>
>   


