so I rewrote the beginning of the index method in IndexUtil:

public NutchDocument index(String key, WebPage page) {
    NutchDocument doc = new NutchDocument();
    LOG.info("key: " + key);
    doc.add("id", key);
    doc.add("digest", StringUtil.toHexString(page.getSignature().array()));
    // original line, which throws the NPE when batchId is null:
    //doc.add("batchId", page.getBatchId().toString());

    // only add batchId when the page actually carries one
    if (page.getBatchId() == null) {
        LOG.info("batchId is null ");
    } else {
        doc.add("batchId", page.getBatchId().toString());
    }
    // dump the whole page to the log for debugging
    try {
        LOG.info("page is:" + page);
    } catch (Exception e) {
        LOG.info("error:" + e);
    }


and here is an example of what I am getting:

key: com.nba.www:http/
batchId is null
page is:org.apache.nutch.storage.WebPage@ba6fc739 {
  "baseUrl":"null"
  "status":"0"
  "fetchTime":"1366846228423"
  "prevFetchTime":"0"
  "fetchInterval":"0"
  "retriesSinceFetch":"0"
  "modifiedTime":"0"
  "prevModifiedTime":"0"
  "protocolStatus":"null"
  "content":"null"
  "contentType":"text/html"
  "prevSignature":"null"
  "signature":"java.nio.HeapByteBuffer[pos=0 lim=16 cap=16]"
  "title":"NBA.com"
"text":"NBA.com Skip to ...., part of the Turner Sports & Entertainment Digital Network."
  "parseStatus":"org.apache.nutch.storage.ParseStatus@7821 {
  "majorCode":"1"
  "minorCode":"0"
  "args":"[]"
}"
  "score":"1.0"
  "reprUrl":"null"
"headers":"{Content-Encoding=gzip, Connection=close, Content-Type=text/html;charset=UTF-8, Content-Length=19526, Cache-Control=max-age=31, Date=Wed, 03 Apr 2013 23:30:28 GMT, Expires=Wed, 03 Apr 2013 23:30:59 GMT, Server=nginx, X-UA-Device=desktop, Vary=User-Agent, X-UA-Profile=desktop}"
  "outlinks":"{}"
  "inlinks":"{}"
"markers":"{dist=0, _injmrk_=y, _idxmrk_=1365031584-1270211116, _updmrk_=1365031584-1270211116}"
  "metadata":"{}"
  "batchId":"null"
}

The fields baseUrl, protocolStatus, reprUrl, and batchId are null, and outlinks is empty. I am still familiarizing myself with the code, so I can't say for sure (and I apologize if these are stupid questions), but this doesn't seem right to me. Am I right to assume that these fields, or at least most of them, should have values?

Also, the example I am showing here is not a one-off: these fields have the same values for all, emphasis on ALL, of the few thousand URLs that I have fetched and am using to test the code.

The text field was a lot longer; I removed the extra text since it is irrelevant here. Everything else I copied directly from the log file.
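
In case it helps anyone reproduce this outside of Nutch, the null check I added boils down to roughly the following. This is only a self-contained sketch: Page and the Map are hypothetical stand-ins for WebPage and NutchDocument, not the real Nutch classes, and the batchId value in main is just an illustrative string. It is also not the patch attached to NUTCH-1532, only the same idea.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class BatchIdGuardSketch {

    /** Hypothetical stand-in for WebPage; holds only the field under discussion. */
    static final class Page {
        private final CharSequence batchId;

        Page(CharSequence batchId) {
            this.batchId = batchId;
        }

        CharSequence getBatchId() {
            return batchId;
        }
    }

    /**
     * Builds a field map (stand-in for NutchDocument) and adds batchId
     * only when the page actually has one, so a null never reaches toString().
     */
    static Map<String, String> index(String key, Page page) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", key);
        CharSequence batchId = page.getBatchId();
        if (batchId != null) {
            doc.put("batchId", batchId.toString());
        }
        return doc;
    }

    public static void main(String[] args) {
        // Page without a batchId: the field is skipped instead of throwing.
        System.out.println(index("com.nba.www:http/", new Page(null)));
        // Page with an (illustrative) batchId: the field is added as usual.
        System.out.println(index("com.nba.www:http/", new Page("1365031584-1270211116")));
    }
}
```

With a null batchId the document simply ends up without that field. That avoids the exception, but it does not explain why the field is null in the first place.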

thanks,



On 04/03/2013 02:32 PM, Lewis John Mcgibbney wrote:
Hi Kaveh,

On Wed, Apr 3, 2013 at 1:30 PM, <[email protected]> wrote:

    Hi

    so I am not sure if binoy is talking about this but here it is:

    the original exception comes from
    src/java/org/apache/nutch/indexer/IndexUtil.java  line 66

      public NutchDocument index(String key, WebPage page) {
         NutchDocument doc = new NutchDocument();
         doc.add("id", key);
         doc.add("digest", StringUtil.toHexString(page.getSignature().array()));
    ==>>    doc.add("batchId", page.getBatchId().toString());

    page.getBatchId() returns null for every URL. My guess is that
    updatedb removes the batchId from the rows in the webpage table,
    since generate and fetch work fine with batchId. But after updatedb
    (which, by the way, does not accept batchId as one of its parameters,
    which means that it goes over the entire webpage table every time
    you run it, but that is a different issue), solrindex can't find the
    batchIds.

I've reopened NUTCH-1532 and attached a trivial patch which should now
protect against the NPE people have been getting.
Can you please check it out and get back to us?
Thank you Kaveh

--
Kaveh Minooie
