Re: Nutch2.x Null Pointer Exception in IndexerJob.Java for a fresh crawl with One Seed

kaveh minooie Wed, 03 Apr 2013 18:55:46 -0700

So I rewrote the beginning of the index method in IndexUtil:

public NutchDocument index(String key, WebPage page) {
    NutchDocument doc = new NutchDocument();
    LOG.info("key: " + key);
    doc.add("id", key);
    doc.add("digest", StringUtil.toHexString(page.getSignature().array()));
    //doc.add("batchId", page.getBatchId().toString());

    if (null == page.getBatchId()) {
        LOG.info("batchId is null ");
    } else {
        doc.add("batchId", page.getBatchId().toString());
    }
    try {
        LOG.info("page is:" + page);
    } catch (Exception e) {
        LOG.info("error:" + e);
    }


and here is an example of what I am getting:

key: com.nba.www:http/
batchId is null
page is:org.apache.nutch.storage.WebPage@ba6fc739 {
  "baseUrl":"null"
  "status":"0"
  "fetchTime":"1366846228423"
  "prevFetchTime":"0"
  "fetchInterval":"0"
  "retriesSinceFetch":"0"
  "modifiedTime":"0"
  "prevModifiedTime":"0"
  "protocolStatus":"null"
  "content":"null"
  "contentType":"text/html"
  "prevSignature":"null"
  "signature":"java.nio.HeapByteBuffer[pos=0 lim=16 cap=16]"
  "title":"NBA.com"

"text":"NBA.com Skip to ...., part of the Turner Sports &Entertainment Digital Network."

  "parseStatus":"org.apache.nutch.storage.ParseStatus@7821 {
  "majorCode":"1"
  "minorCode":"0"
  "args":"[]"
}"
  "score":"1.0"
  "reprUrl":"null"

"headers":"{Content-Encoding=gzip, Connection=close,Content-Type=text/html;charset=UTF-8, Content-Length=19526,Cache-Control=max-age=31, Date=Wed, 03 Apr 2013 23:30:28 GMT,Expires=Wed, 03 Apr 2013 23:30:59 GMT, Server=nginx,X-UA-Device=desktop, Vary=User-Agent, X-UA-Profile=desktop}"

  "outlinks":"{}"
  "inlinks":"{}"

"markers":"{dist=0, _injmrk_=y, _idxmrk_=1365031584-1270211116,_updmrk_=1365031584-1270211116}"

  "metadata":"{}"
  "batchId":"null"
}

The fields baseUrl, protocolStatus, reprUrl, and batchId are null, and outlinks is empty. I am still in the process of familiarizing myself with the code, so I can't say for sure, and I apologize for asking stupid questions while we are at it, but this doesn't seem right to me. Am I right to assume that the mentioned fields, or at least most of them, should have values?

Also, the example that I am showing here is not a one-off: these fields have the same values for all, emphasis on ALL, of the few thousand URLs that I have fetched and am using to test the code.

The text field was a lot longer; I removed the extra text since it was irrelevant here. Everything else I copied directly from the log file.
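
For anyone who wants to try the guard outside of Nutch, here is a minimal, self-contained sketch of the same null check on batchId. It is only an illustration under stated assumptions: PageStub, toHex, and the Map used as the document are hypothetical stand-ins that I made up for this example. They are not the Nutch WebPage, StringUtil, or NutchDocument classes, and this is not the patch attached to NUTCH-1532.

```java
import java.nio.ByteBuffer;
import java.util.LinkedHashMap;
import java.util.Map;

public class NullSafeIndexSketch {

    /** Hypothetical stand-in for the two WebPage fields used here (not the Nutch class). */
    static final class PageStub {
        final ByteBuffer signature;
        final CharSequence batchId;

        PageStub(ByteBuffer signature, CharSequence batchId) {
            this.signature = signature;
            this.batchId = batchId;
        }
    }

    /** Hex-encodes the remaining bytes of a buffer without moving its position. */
    static String toHex(ByteBuffer buf) {
        ByteBuffer view = buf.duplicate();
        StringBuilder sb = new StringBuilder();
        while (view.hasRemaining()) {
            sb.append(String.format("%02x", view.get() & 0xff));
        }
        return sb.toString();
    }

    /** Builds a document (a plain Map in this sketch) and skips any field that is null. */
    static Map<String, String> index(String key, PageStub page) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("id", key);
        if (page.signature != null) {
            doc.put("digest", toHex(page.signature));
        }
        // The guard from the rewritten method: only add batchId when it is present.
        if (page.batchId != null) {
            doc.put("batchId", page.batchId.toString());
        }
        return doc;
    }

    public static void main(String[] args) {
        // A page with a signature but no batchId, like the rows in the log above.
        PageStub page = new PageStub(ByteBuffer.wrap(new byte[] {0x0a, (byte) 0xff}), null);
        System.out.println(index("com.nba.www:http/", page));
    }
}
```

With a null batchId the field is simply left out of the document instead of throwing a NullPointerException. Whether an indexed document without a batchId is acceptable downstream is a separate question that this sketch does not answer.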


thanks,



On 04/03/2013 02:32 PM, Lewis John Mcgibbney wrote:

Hi Kaveh,

On Wed, Apr 3, 2013 at 1:30 PM, <[email protected]> wrote:

    Hi

    so I am not sure if binoy is talking about this but here it is:

    the original exception comes from
    src/java/org/apache/nutch/indexer/IndexUtil.java, line 66

    public NutchDocument index(String key, WebPage page) {
        NutchDocument doc = new NutchDocument();
        doc.add("id", key);
        doc.add("digest", StringUtil.toHexString(page.getSignature().array()));
    ==>>    doc.add("batchId", page.getBatchId().toString());

    page.getBatchId() returns null for every URL. My guess is that
    updatedb removes the batchId from the rows in the webpage table, since
    generate and fetch work fine with batchId, but after updatedb
    (which, by the way, does not accept batchId as one of its parameters,
    which means that it goes over the entire webpage table every time
    you run it, but that is a different issue) solrindex can't find the
    batchIds.

I've reopened NUTCH-1532 and attached a trivial patch which should now
protect against the NPE people have been getting.
Can you please check it out and get back to us?
Thank you Kaveh


--
Kaveh Minooie
