Re: error using generate in 2.x

kaveh minooie Mon, 01 Apr 2013 14:45:52 -0700

Hi

First of all, I am posting this to both the user and dev lists, since this is becoming a dev issue more than anything else and it seems to me that it needs to be moved to that list. Let me know if I am wrong, because I don't want to generate any more spam than we are already getting on those lists.

The patch NUTCH-1551 didn't solve my issue. I am still getting the exact same error when I try to run generate (this was run in local mode):

2013-04-01 11:43:27,710 INFO  store.HBaseStore - Keyclass and nameclass match but mismatching table names mappingfile schema is 'webpage' vs actual schema 't1_webpage' , assuming they are the same.
2013-04-01 11:43:27,718 INFO  mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-04-01 11:43:27,838 WARN  mapred.FileOutputCommitter - Output path is null in cleanup
2013-04-01 11:43:27,839 WARN  mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
        at org.apache.gora.hbase.store.HBaseStore.put(HBaseStore.java:235)
        at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:60)
        at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:588)
        at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
        at org.apache.nutch.crawl.GeneratorReducer.reduce(GeneratorReducer.java:79)
        at org.apache.nutch.crawl.GeneratorReducer.reduce(GeneratorReducer.java:40)
        at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
        at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
        at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
2013-04-01 11:43:28,763 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=[t1]generate:1364841802-1763249246, jobid=job_local_0001
        at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
        at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:193)
        at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:219)
        at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:264)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:272)

Now, I did a little bit of tracing, and I am no longer sure whether it is a nutch issue or a gora issue, because:

the original error (NPE) comes from here (in gora/blob/trunk/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java:235):



          case MAP:
            if(o instanceof StatefulMap) {
              StatefulHashMap<Utf8, ?> map = (StatefulHashMap<Utf8, ?>) o;
              for (Entry<Utf8, State> e : map.states().entrySet()) {
                Utf8 mapKey = e.getKey();
                switch (e.getValue()) {
                  case DIRTY:
--->                byte[] qual = Bytes.toBytes(mapKey.toString());
                    byte[] val = toBytes(map.get(mapKey), field.schema().getValueType());
                    put.add(hcol.getFamily(), qual, val);
                    hasPuts = true;
                    break;
                  case DELETED:
                    qual = Bytes.toBytes(mapKey.toString());
                    hasDeletes = true;
                    delete.deleteColumn(hcol.getFamily(), qual);
                    break;
                }
              }
            } else {

Now, the variable most likely to be null seems to be 'mapKey', which is probably the result of a malformed URL (though I can't say that for sure).
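To illustrate the suspicion above, here is a minimal sketch of how a single null key in a map reproduces this kind of NPE. It uses a plain java.util.HashMap in place of Gora's StatefulHashMap, and the class and method names (NullMapKeyDemo, unguardedQualifiers, guardedQualifiers) are made up for the example. The null check in the second method is only one possible defensive measure, not something the Gora code is known to do.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for the DIRTY branch quoted above; this is not Gora code.
class NullMapKeyDemo {

    // Same pattern as the quoted loop: toString() is called on every key unchecked.
    static List<String> unguardedQualifiers(Map<CharSequence, String> entries) {
        List<String> quals = new ArrayList<>();
        for (Map.Entry<CharSequence, String> e : entries.entrySet()) {
            CharSequence mapKey = e.getKey();
            quals.add(mapKey.toString()); // throws NullPointerException if mapKey is null
        }
        return quals;
    }

    // Assumed defensive variant: entries with a null key are skipped.
    static List<String> guardedQualifiers(Map<CharSequence, String> entries) {
        List<String> quals = new ArrayList<>();
        for (Map.Entry<CharSequence, String> e : entries.entrySet()) {
            CharSequence mapKey = e.getKey();
            if (mapKey == null) {
                continue; // nothing usable as a column qualifier
            }
            quals.add(mapKey.toString());
        }
        return quals;
    }

    public static void main(String[] args) {
        Map<CharSequence, String> entries = new HashMap<>();
        entries.put("http://example.com/", "some value");
        entries.put(null, "value stored under a null key"); // HashMap permits one null key

        try {
            unguardedQualifiers(entries);
        } catch (NullPointerException ex) {
            System.out.println("unguarded loop failed with an NPE");
        }
        System.out.println("guarded loop kept: " + guardedQualifiers(entries));
    }
}
```

Skipping the entry keeps the job alive, but it also silently drops whatever was stored under the null key, so logging the skipped record would be a sensible addition.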


Now, the put function is being called from here.

This is from gora 0.2.1:

gora/blob/0.2.1/gora-core/src/main/java/org/apache/gora/mapreduce/GoraRecordWriter.java:


  @Override
  public void write(K key, T value) throws IOException, InterruptedException {
    store.put(key, (Persistent) value);

    counter.increment();
    if (counter.isModulo()) {
      LOG.info("Flushing the datastore after " + counter.getRecordsNumber() + " records");
      store.flush();
    }
  }
}


the same function in gora trunk is like this:

  public void write(K key, T value) throws IOException, InterruptedException {
    try{
      store.put(key, (Persistent) value);

      counter.increment();
      if (counter.isModulo()) {
        LOG.info("Flushing the datastore after " + counter.getRecordsNumber() + " records");
        store.flush();
      }
    }catch(Exception e){
      LOG.info("Exception at GoraRecordWriter.class while writing to datastore." + e.getMessage());
    }
  }

It seems to me that this would allow the code to recover from this kind of error. Now, I get gora through ivy, and I don't know how (or if) I can have ivy fetch the trunk, but regardless, I still think the question remains whether it is a nutch issue or a gora issue.
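The difference between the two write() versions can be sketched in isolation. The example below is an assumption-laden illustration, not Gora code: Store, RejectingStore, writePropagating and writeSwallowing are hypothetical names, and the store simply throws an NPE for a null value to stand in for the failure in HBaseStore.put.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the two write() styles; none of these types exist in Gora.
class WriterRecoveryDemo {

    interface Store {
        void put(String key, String value);
    }

    // Stand-in store that fails with an NPE on a null value.
    static class RejectingStore implements Store {
        final List<String> written = new ArrayList<>();

        @Override
        public void put(String key, String value) {
            if (value == null) {
                throw new NullPointerException("null value for key " + key);
            }
            written.add(key);
        }
    }

    // Like the 0.2.1 version: any exception from put() reaches the caller and fails the job.
    static void writePropagating(Store store, String key, String value) {
        store.put(key, value);
    }

    // Like the trunk version: the exception is logged and the record is dropped.
    static boolean writeSwallowing(Store store, String key, String value) {
        try {
            store.put(key, value);
            return true;
        } catch (Exception e) {
            System.out.println("Exception while writing to datastore: " + e.getMessage());
            return false;
        }
    }

    public static void main(String[] args) {
        RejectingStore store = new RejectingStore();
        writeSwallowing(store, "a", "ok");
        writeSwallowing(store, "b", null); // logged, then processing continues
        writeSwallowing(store, "c", "ok");
        System.out.println("records written: " + store.written);
    }
}
```

The swallowing style lets the remaining records through, at the cost of hiding the bad record behind a log line, so it masks the underlying null rather than explaining where it comes from.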



sorry for the long email.


On 03/30/2013 04:03 PM, Lewis John Mcgibbney wrote:

I think we may also need to add the BATCH_ID to one Job's HashSet

private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
static {
...
   FIELDS.add(WebPage.Field.BATCH_ID);
}


On Sat, Mar 30, 2013 at 3:55 PM, Lewis John Mcgibbney <[email protected]> wrote:

Hi,
I've tried to sort this out locally this morning...
I can almost replicate this behaviour with gora-cassandra and it looks
most likely that the patch(es) applied in
* NUTCH-1533 - NUTCH-1532 Implement getPrevModifiedTime(),
setPrevModifiedTime(), getBatchId() and setBatchId() accessors in
o.a.n.storage.WebPage, and
* NUTCH-1532 - Replace 'segment' mapping field with batchId,
respectively, are not backwards compatible, because some URLs within the
web database do not contain values for the batchId.
Of course this is a major problem.
I opened NUTCH-1551 [0] and submitted a patch to make WebTableReader
backwards compatible with the above patches. Please try out the patch if
you can and comment so I can commit.

We have a couple options here.
1) Revert both of the above until we can get a fix
2) Get a fix just now and commit it.
What do you guys want to do?

I have a question about whether or not we can dynamically add fields to
existing database entries by injecting them.
Say, for example, you inject URLs without the batchId field in your mapping
file, then add the field and inject some more URLs... will the field be
added to your database? If so, then why are we getting the NPE?
There must be some other location in the Nutch code where an attempt is
being made to obtain the batchId for some given key... it cannot be
obtained and we receive the NPE.

[0] https://issues.apache.org/jira/browse/NUTCH-1551


On Fri, Mar 29, 2013 at 5:05 PM, kaveh minooie <[email protected]> wrote:

I use git and I fetch from github (https://github.com/apache/nutch.git).
Currently I am on this commit:

commit 4bb01d6b908dc230c8be89d398b03a86581ec42b
Author: lufeng <[email protected]>
Date:   Thu Mar 28 13:09:09 2013 +0000

     NUTCH-1547 BasicIndexingFilter - Problem to index full title

     git-svn-id: https://svn.apache.org/repos/asf/nutch/branches/2.x@1462079 13f79535-47bb-0310-9956-ffa450edef68


before I was on this commit :


commit f02dcf62566583551426c08bd388080e5b2bc93e

  f02dcf6 NUTCH-XX remove unused db.max.inlinks from nutch-default.xml



On 03/29/2013 04:35 PM, [email protected] wrote:

Yes, with hbase. Here is the error

13/03/29 16:33:29 INFO zookeeper.ZooKeeper: Session: 0x13d7770d67d005f closed
13/03/29 16:33:29 ERROR crawl.WebTableReader: WebTableReader: java.lang.NullPointerException
          at org.apache.gora.hbase.store.HBaseStore.addFields(HBaseStore.java:398)
          at org.apache.gora.hbase.store.HBaseStore.execute(HBaseStore.java:360)
          at org.apache.nutch.crawl.WebTableReader.read(WebTableReader.java:234)
          at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:476)
          at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
          at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)
          at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
          at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
          at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
          at java.lang.reflect.Method.invoke(Method.java:597)
          at org.apache.hadoop.util.RunJar.main(RunJar.java:156)


If I revert to previous release it works fine.

Thanks.
Alex.





-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
To: user <[email protected]>
Sent: Fri, Mar 29, 2013 4:30 pm
Subject: Re: error using generate in 2.x


Hi Alex,
With HBase also?
There 'was' a bug in gora-cassandra module for this command + params
however I thought it had been addressed and therefore resolved it.
Lewis


On Fri, Mar 29, 2013 at 4:00 PM, <[email protected]> wrote:

  Hi,


It seems that trunk has a few bugs. I found out that readdb -url urlname
also gives errors.

Thanks.
Alex.







-----Original Message-----
From: kaveh minooie <[email protected]>
To: user <[email protected]>
Sent: Fri, Mar 29, 2013 1:53 pm
Subject: Re: error using generate in 2.x


Hi lewis

The mapping file that I am using is the one that comes with nutch, and I
haven't touched it. This message in the log is caused by using -crawlId
on the command line. For example, this log was the result of this
command:

bin/nutch generate -topN 1000 -crawlId t1

which causes nutch (or, I guess, technically gora) to use the table name
't1_webpage'. Though I have to say that I don't understand the rationale
behind the code generating a warning like this (I know it is not actually
a warning, just that the way the message is phrased makes it look like
one) for something that should be a routine operation for someone like
me, who is crawling (I mean hoping to, because it is not working right
now) thousands of websites and maintains multiple crawldbs (or their
equivalent in gora, the webpage table) for different groups of websites.


That being said, it has nothing to do with the problem that I am having.
The error is the same when I omit the -crawlId parameter (forcing it to
use the default name, webpage), and more importantly, it is new. I
haven't had this problem before; it just started happening 2 days ago
when I pulled the latest commits to the 2.x branch.


On 03/29/2013 09:50 AM, Lewis John Mcgibbney wrote:

Hi Kaveh,
Firstly, as logged below, Gora attempts to associate your HBase table
configuration with the specified tables (from within
gora-hbase-mapping.xml). However, it seems that your case satisfies the
condition "if (!tableName.equals(tableNameFromMapping))", meaning that
the table name is not equal to the value of the table name attribute, or
that this value is null.
This is allowed, but I am interested to find out what the mapping file
looks like... the entire file is not required, just the
<class name="value" snippet if this is possible.
I am not using the gora-hbase module and haven't ever seen anyone come
across this problem before.
Thanks
Lewis

On Thursday, March 28, 2013, kaveh minooie <[email protected]> wrote:

  2013-03-28 11:06:25,158 INFO  store.HBaseStore - Keyclass and nameclass match but mismatching table names  mappingfile schema is 'webpage' vs actual schema 't1_webpage' , assuming they are the same.

--
Kaveh Minooie

--
Kaveh Minooie




--
*Lewis*


--
Kaveh Minooie
