Hi Andrzej,

That is exactly what I am trying to implement! I guess the patch is not included in the new Nutch 0.7, right? At least, I did not find "src/java/org/apache/nutch/db/FetchSchedule.java" in the SVN source code.
I will try to apply the patch myself and test it.

Thanks,
Michael Ji

--- Andrzej Bialecki <[EMAIL PROTECTED]> wrote:

> Michael Ji wrote:
> > Hi Jon:
> >
> > You have an interesting approach.
> >
> > We are working on a similar effort to avoid unnecessary
> > indexing and data duplication for pages whose content has
> > not changed since the last successful fetch.
> >
> > I am thinking of adding an extra data field to the
> > "fetchlist" data structure, which would contain the MD5
> > hash of the content from the previous fetch.
> >
> > If the current fetch gets the same content, I will
> > skip the parsing and indexing steps.
>
> Please see the patches in
> http://issues.apache.org/jira/browse/NUTCH-61 .
>
> --
> Best regards,
> Andrzej Bialecki <><
> Information Retrieval, Semantic Web
> Embedded Unix, System Integration
> http://www.sigram.com  Contact: info at sigram dot com
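
For reference, the idea in the quoted message (keep the MD5 hash of the content from the previous fetch, and skip parsing and indexing when the new fetch hashes to the same value) could be sketched roughly as below. This is only a minimal illustration under my own assumptions: the class ContentSignatureCheck and its methods are hypothetical names, they are not part of Nutch or of the NUTCH-61 patch, and the real fetchlist structure is not modelled here.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper for illustration only; not Nutch code and not the NUTCH-61 patch.
class ContentSignatureCheck {

    // Returns the MD5 digest of the raw page content as a lowercase hex string.
    static String md5Hex(byte[] content) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(content);
            StringBuilder sb = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide MD5.
            throw new IllegalStateException(e);
        }
    }

    // True when the freshly fetched content hashes to the value stored
    // from the previous fetch. A null previous hash (page never fetched
    // before) is treated as "changed".
    static boolean isUnchanged(String previousMd5, byte[] fetchedContent) {
        return previousMd5 != null
                && previousMd5.equalsIgnoreCase(md5Hex(fetchedContent));
    }
}
```

A fetcher would then call isUnchanged(storedHash, newContent) after each fetch and go on to parsing and indexing only when it returns false, saving md5Hex(newContent) for the next round.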
