So I rewrote the beginning of the index method in IndexUtil:
public NutchDocument index(String key, WebPage page) {
  NutchDocument doc = new NutchDocument();
  LOG.info("key: " + key);
  doc.add("id", key);
  doc.add("digest", StringUtil.toHexString(page.getSignature().array()));
  //doc.add("batchId", page.getBatchId().toString());
  if (null == page.getBatchId()) {
    LOG.info("batchId is null ");
  } else {
    doc.add("batchId", page.getBatchId().toString());
  }
  try {
    LOG.info("page is:" + page);
  } catch (Exception e) {
    LOG.info("error:" + e);
  }
and here is an example of what I am getting:
key: com.nba.www:http/
batchId is null
page is:org.apache.nutch.storage.WebPage@ba6fc739 {
"baseUrl":"null"
"status":"0"
"fetchTime":"1366846228423"
"prevFetchTime":"0"
"fetchInterval":"0"
"retriesSinceFetch":"0"
"modifiedTime":"0"
"prevModifiedTime":"0"
"protocolStatus":"null"
"content":"null"
"contentType":"text/html"
"prevSignature":"null"
"signature":"java.nio.HeapByteBuffer[pos=0 lim=16 cap=16]"
"title":"NBA.com"
"text":"NBA.com Skip to ...., part of the Turner Sports &
Entertainment Digital Network."
"parseStatus":"org.apache.nutch.storage.ParseStatus@7821 {
"majorCode":"1"
"minorCode":"0"
"args":"[]"
}"
"score":"1.0"
"reprUrl":"null"
"headers":"{Content-Encoding=gzip, Connection=close,
Content-Type=text/html;charset=UTF-8, Content-Length=19526,
Cache-Control=max-age=31, Date=Wed, 03 Apr 2013 23:30:28 GMT,
Expires=Wed, 03 Apr 2013 23:30:59 GMT, Server=nginx,
X-UA-Device=desktop, Vary=User-Agent, X-UA-Profile=desktop}"
"outlinks":"{}"
"inlinks":"{}"
"markers":"{dist=0, _injmrk_=y, _idxmrk_=1365031584-1270211116,
_updmrk_=1365031584-1270211116}"
"metadata":"{}"
"batchId":"null"
}
The fields baseUrl, protocolStatus, reprUrl, and batchId are null, and
outlinks is empty. I am still familiarizing myself with the code, so I
can't say for sure, and I apologize if this is a stupid question, but
this doesn't seem right to me. Am I right to assume that the mentioned
fields, or at least most of them, should have values?
Also, the example I am showing here is not a one-off: these fields
have the same values for all, emphasis on ALL, of the few thousand URLs
that I have fetched and am using to test the code.
The text field was a lot longer; I removed the extra text since it was
irrelevant here. Everything else I copied directly from the log file.
thanks,
On 04/03/2013 02:32 PM, Lewis John Mcgibbney wrote:
Hi Kaveh,
On Wed, Apr 3, 2013 at 1:30 PM, <[email protected]> wrote:
Hi
So I am not sure if this is what Binoy is talking about, but here it is:
the original exception comes from
src/java/org/apache/nutch/indexer/IndexUtil.java line 66
public NutchDocument index(String key, WebPage page) {
  NutchDocument doc = new NutchDocument();
  doc.add("id", key);
  doc.add("digest",
      StringUtil.toHexString(page.getSignature().array()));
==>> doc.add("batchId", page.getBatchId().toString());
page.getBatchId() returns null for every URL. My guess is that
updatedb removes the batchId from the rows in the webpage table, since
generate and fetch work fine with batchId, but after updatedb (which,
by the way, does not accept batchId as a parameter, which means that it
goes over the entire webpage table every time you run it, but that is a
different issue) solrindex can't find the batchIds.
I've reopened NUTCH-1532 and attached a trivial patch which should now
protect against the NPE people have been getting.
Can you please check it out and get back to us?
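
For readers following along, here is a minimal sketch of the kind of null
guard being discussed. It is not the NUTCH-1532 patch itself, and it does
not use the Nutch classes: PageStub, toHex and the Map-based document are
hypothetical stand-ins for WebPage, StringUtil.toHexString and
NutchDocument, chosen so the example runs on its own. The idea is simply
to add a field only when its source value is non-null.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

public class NullSafeIndexSketch {

    /** Hypothetical stand-in for the two WebPage fields used below. */
    static final class PageStub {
        final ByteBuffer signature;  // may be null
        final CharSequence batchId;  // may be null, as in the log above

        PageStub(ByteBuffer signature, CharSequence batchId) {
            this.signature = signature;
            this.batchId = batchId;
        }
    }

    /** Hex-encodes a byte array; stand-in for StringUtil.toHexString. */
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder(bytes.length * 2);
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    /** Builds a document, skipping any field whose source value is null. */
    static Map<String, String> index(String key, PageStub page) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", key);
        if (page.signature != null) {
            doc.put("digest", toHex(page.signature.array()));
        }
        if (page.batchId != null) {
            doc.put("batchId", page.batchId.toString());
        }
        return doc;
    }

    public static void main(String[] args) {
        // A page with a signature but no batchId, like the rows in the log.
        PageStub page = new PageStub(
            ByteBuffer.wrap(new byte[] {0x0f, (byte) 0xa0}), null);
        System.out.println(index("com.nba.www:http/", page));
    }
}
```

With a guard like this the document is still built when batchId is
missing; it does not explain why batchId is null after updatedb, which is
the separate question raised above.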
Thank you Kaveh
--
Kaveh Minooie