Michael Ji wrote:
"
FetchListEntry value = new FetchListEntry();
Page page = (Page)value.getPage().clone();
"
It seems value is an empty FetchListEntry instance. Will
that cause the clone of getPage() to fail because it is null?
Please try to replace this logic with the following:
    FetchListEntry value = new FetchListEntry();
    while (topN > 0 && reader.next(key, value)) {
      Page page = value.getPage();
      if (page != null) {
        // copy the page instead of cloning it
        Page p = new Page();
        p.set(page);
        page = p;
      }
      if (forceRefetch) {
        Page p = value.getPage();
        // reset fetchTime and MD5, so that the content will
        // always be new and unique.
        p.setNextFetchTime(0L);
        p.setMD5(MD5Hash.digest(p.getURL().toString()));
      }
      tables.append(value);
      topN--;
    }
This patchset still needs a lot of thought and work. Even the part that
avoids re-fetching unmodified content needs more thought: it's easy to
end up in a state where Nutch cannot be forced to re-fetch a page,
because every time you try, the page is reported as unmodified, even
though you need to re-fetch the actual data because, e.g., you lost
that segment's data...
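
To illustrate the concern, here is a minimal sketch of a decision rule
in which an explicit force flag always overrides the "unmodified" check.
This is not code from the patch: RefetchSketch, PageRecord and
shouldStore are hypothetical names made up for this example, and the
hash comparison is only an assumed stand-in for whatever "unmodified"
test is actually used.

```java
import java.util.Objects;

public class RefetchSketch {

    /** Hypothetical stand-in for a page record; NOT the Nutch Page class. */
    static final class PageRecord {
        final String url;
        final String lastContentHash; // hash of the content stored at the last fetch

        PageRecord(String url, String lastContentHash) {
            this.url = url;
            this.lastContentHash = lastContentHash;
        }
    }

    /**
     * Decide whether freshly fetched content should be stored.
     * The force flag is checked first, so an "unmodified" result can
     * never block a re-fetch that was explicitly requested.
     */
    static boolean shouldStore(PageRecord page, String freshHash, boolean forceRefetch) {
        if (forceRefetch) {
            return true;
        }
        // Without the force flag, store only content that has changed.
        return !Objects.equals(page.lastContentHash, freshHash);
    }

    public static void main(String[] args) {
        PageRecord page = new PageRecord("http://example.com/", "abc");
        System.out.println(shouldStore(page, "abc", false)); // unmodified, not forced
        System.out.println(shouldStore(page, "abc", true));  // unmodified, but forced
        System.out.println(shouldStore(page, "def", false)); // modified
    }
}
```

The point of the ordering is that losing segment data is recoverable:
running with the force flag stores the content again even when the
remote page has not changed.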
--
Best regards,
Andrzej Bialecki <><
___. ___ ___ ___ _ _ __________________________________
[__ || __|__/|__||\/| Information Retrieval, Semantic Web
___|||__|| \| || | Embedded Unix, System Integration
http://www.sigram.com Contact: info at sigram dot com