Michael Ji wrote:
"
FetchListEntry value = new FetchListEntry();
Page page = (Page)value.getPage().clone();
"

It seems value is an empty FetchListEntry instance. Will
that cause the clone of getPage() to fail because it is null?

Please try to replace this logic with the following:

                FetchListEntry value = new FetchListEntry();
                while (topN > 0 && reader.next(key, value)) {
                  Page page = value.getPage();
                  if (page != null) {
                    Page p = new Page();
                    p.set(page);
                    page = p;
                  }
                  if (forceRefetch) {
                    Page p = value.getPage();
                    // reset fetchTime and MD5, so that the content will
                    // always be new and unique.
                    p.setNextFetchTime(0L);
                    p.setMD5(MD5Hash.digest(p.getURL().toString()));
                  }
                  tables.append(value);
                  topN--;
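
The null-guarded copy used in the fragment above can be tried in isolation. What follows is only a minimal sketch: the Page class here is a hypothetical stand-in with two fields, not the real Nutch class, and copyOrNull is a helper name invented for illustration. It assumes, as the fragment does, that Page has a no-argument constructor and a set(Page) method that copies fields.

```java
public class CopyPageSketch {

    // Hypothetical stand-in for Nutch's Page, reduced to two fields.
    static final class Page {
        String url;
        long nextFetchTime;

        Page() {}

        Page(String url, long nextFetchTime) {
            this.url = url;
            this.nextFetchTime = nextFetchTime;
        }

        // Copies every field from the other page into this one.
        void set(Page other) {
            this.url = other.url;
            this.nextFetchTime = other.nextFetchTime;
        }
    }

    // Returns an independent copy, or null when there is nothing to copy.
    // Calling source.clone() directly would throw a NullPointerException
    // when source is null, which is the failure asked about above.
    static Page copyOrNull(Page source) {
        if (source == null) {
            return null;
        }
        Page copy = new Page();
        copy.set(source);
        return copy;
    }

    public static void main(String[] args) {
        Page original = new Page("http://example.com/", 12345L);
        Page copy = copyOrNull(original);
        copy.nextFetchTime = 0L;                      // modify the copy only
        System.out.println(original.nextFetchTime);   // prints 12345
        System.out.println(copyOrNull(null));         // prints null
    }
}
```

The point of the guard is that an entry without a page is passed through as null instead of failing in the middle of the loop.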


This patchset still needs a lot of thought and work. Even the part that avoids re-fetching unmodified content needs more thought: it is easy to end up in a state where Nutch cannot be forced to re-fetch a page, because on every attempt the page is reported as unmodified, yet you need the actual data again because, e.g., you lost that segment's data.
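
To make the pitfall concrete, here is a rough sketch of a decision rule in which an explicit force flag is checked before the "unmodified" shortcut. None of this is taken from the patch or from Nutch itself: the class, the enum, the decide method and the digest strings are all hypothetical names chosen for illustration.

```java
public class RefetchDecisionSketch {

    enum Action { SKIP_UNMODIFIED, FETCH_AND_STORE }

    // Hypothetical rule: a forced refetch always wins, so a page whose
    // content digest has not changed can still be fetched again, e.g.
    // after the segment holding its data was lost.
    static Action decide(String storedDigest, String fetchedDigest,
                         boolean forceRefetch) {
        if (forceRefetch) {
            return Action.FETCH_AND_STORE;
        }
        if (storedDigest != null && storedDigest.equals(fetchedDigest)) {
            return Action.SKIP_UNMODIFIED;
        }
        return Action.FETCH_AND_STORE;
    }

    public static void main(String[] args) {
        // Same digest, no force: the content would be skipped.
        System.out.println(decide("abc", "abc", false)); // prints SKIP_UNMODIFIED
        // Same digest, forced: the content is fetched and stored anyway.
        System.out.println(decide("abc", "abc", true));  // prints FETCH_AND_STORE
    }
}
```

If the unmodified check ran first, the second call could never return FETCH_AND_STORE, which is exactly the stuck state described above.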

--
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||\/|  Information Retrieval, Semantic Web
___|||__||  \|  ||  |  Embedded Unix, System Integration
http://www.sigram.com  Contact: info at sigram dot com
