Hi Andrzej,

That is exactly what I am trying to implement! I guess the
patch is not included in the new Nutch 0.7, right? At
least, I didn't find
"src/java/org/apache/nutch/db/FetchSchedule.java"
in the SVN source code.

I will try to embed the patch code myself and test
it.
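
For reference, the content check discussed in the quoted text below could be sketched roughly as follows. This is only an illustration and not the NUTCH-61 patch: the class and method names (ContentChangeCheck, md5Hex, isUnchanged) are hypothetical, and it assumes the MD5 value of the previous fetch is available as a hex string from the extra fetchlist field.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper, not part of Nutch: decides whether a freshly
// fetched page has the same content as the previous fetch.
public class ContentChangeCheck {

    /** Returns the lowercase hex MD5 digest of the given page bytes. */
    public static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(content)) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                sb.append(String.format("%02x", b & 0xff));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    /**
     * True when a previous hash exists and matches the new content,
     * in which case parsing and indexing could be skipped.
     */
    public static boolean isUnchanged(String previousMd5, byte[] newContent) {
        return previousMd5 != null
                && previousMd5.equalsIgnoreCase(md5Hex(newContent));
    }

    public static void main(String[] args) {
        byte[] oldPage = "<html>same</html>".getBytes(StandardCharsets.UTF_8);
        byte[] newPage = "<html>same</html>".getBytes(StandardCharsets.UTF_8);
        String stored = md5Hex(oldPage); // assumed to come from the fetchlist
        if (isUnchanged(stored, newPage)) {
            System.out.println("unchanged: skip parse and index");
        } else {
            System.out.println("changed: parse and index");
        }
    }
}
```

A page that has never been fetched has no stored hash, so isUnchanged returns false and the page is processed normally.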

Thanks,

Michael Ji


--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Michael Ji wrote:
> > Hi Jon:
> > 
> > You have an interesting approach. 
> > 
> > We are making a similar effort to avoid unnecessary
> > indexing and data duplication for pages whose content
> > has not changed since the last successful fetch.
> > 
> > I am thinking of adding an extra data field to the
> > "fetchlist" data structure, containing the MD5 hash
> > of the content from the previous fetch.
> > 
> > If the current fetch returns the same content, I
> > will skip the parsing and indexing steps.
> 
> Please see the patches in
> http://issues.apache.org/jira/browse/NUTCH-61 .
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
>   ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||\/|  Information Retrieval, Semantic Web
> ___|||__||  \|  ||  |  Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
> 
> 



                