Hi,
First of all, I am posting this to both the user and dev lists, since this is
becoming a dev issue more than anything else and it seems to me that it
needs to be moved to the dev list. Let me know if I am wrong, because I
don't want to generate any more spam than we are already getting on those
lists.
The patch in NUTCH-1551 didn't solve my issue. I am still getting exactly the
same error when I try to run generate (this was run in local mode):
2013-04-01 11:43:27,710 INFO store.HBaseStore - Keyclass and nameclass match but mismatching table names mappingfile schema is 'webpage' vs actual schema 't1_webpage' , assuming they are the same.
2013-04-01 11:43:27,718 INFO mapreduce.GoraRecordWriter - gora.buffer.write.limit = 10000
2013-04-01 11:43:27,838 WARN mapred.FileOutputCommitter - Output path is null in cleanup
2013-04-01 11:43:27,839 WARN mapred.LocalJobRunner - job_local_0001
java.lang.NullPointerException
at org.apache.gora.hbase.store.HBaseStore.put(HBaseStore.java:235)
at org.apache.gora.mapreduce.GoraRecordWriter.write(GoraRecordWriter.java:60)
at org.apache.hadoop.mapred.ReduceTask$NewTrackingRecordWriter.write(ReduceTask.java:588)
at org.apache.hadoop.mapreduce.TaskInputOutputContext.write(TaskInputOutputContext.java:80)
at org.apache.nutch.crawl.GeneratorReducer.reduce(GeneratorReducer.java:79)
at org.apache.nutch.crawl.GeneratorReducer.reduce(GeneratorReducer.java:40)
at org.apache.hadoop.mapreduce.Reducer.run(Reducer.java:176)
at org.apache.hadoop.mapred.ReduceTask.runNewReducer(ReduceTask.java:650)
at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:418)
at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:260)
2013-04-01 11:43:28,763 ERROR crawl.GeneratorJob - GeneratorJob: java.lang.RuntimeException: job failed: name=[t1]generate: 1364841802-1763249246, jobid=job_local_0001
at org.apache.nutch.util.NutchJob.waitForCompletion(NutchJob.java:54)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:193)
at org.apache.nutch.crawl.GeneratorJob.generate(GeneratorJob.java:219)
at org.apache.nutch.crawl.GeneratorJob.run(GeneratorJob.java:264)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.GeneratorJob.main(GeneratorJob.java:272)
Now, I did a little bit of tracing, and I am no longer sure whether this is a
Nutch issue or a Gora issue, because the original error (the NPE) comes from
here (in
gora/blob/trunk/gora-hbase/src/main/java/org/apache/gora/hbase/store/HBaseStore.java:235):
case MAP:
  if(o instanceof StatefulMap) {
    StatefulHashMap<Utf8, ?> map = (StatefulHashMap<Utf8, ?>) o;
    for (Entry<Utf8, State> e : map.states().entrySet()) {
      Utf8 mapKey = e.getKey();
      switch (e.getValue()) {
        case DIRTY:
--->      byte[] qual = Bytes.toBytes(mapKey.toString());
          byte[] val = toBytes(map.get(mapKey), field.schema().getValueType());
          put.add(hcol.getFamily(), qual, val);
          hasPuts = true;
          break;
        case DELETED:
          qual = Bytes.toBytes(mapKey.toString());
          hasDeletes = true;
          delete.deleteColumn(hcol.getFamily(), qual);
          break;
      }
    }
  } else {
Now, the variable most likely to be null seems to be 'mapKey', probably as a
result of a malformed URL (though I can't say that for sure).
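To illustrate the suspicion above, here is a minimal, self-contained sketch (not Gora code) of how a null key in the state map would blow up at the marked line, and what a guard would look like. A plain HashMap<String, State> stands in for StatefulHashMap<Utf8, ?>, and the names NullKeyGuard, State and dirtyQualifiers are made up for this example. Whether a null key can actually reach HBaseStore.put() is still an assumption.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class NullKeyGuard {
  // Hypothetical stand-in for Gora's per-entry state; not the real enum.
  enum State { CLEAN, DIRTY, DELETED }

  // Returns the qualifier bytes for every DIRTY entry, skipping null keys
  // instead of dereferencing them.
  static List<byte[]> dirtyQualifiers(Map<String, State> states) {
    List<byte[]> quals = new ArrayList<>();
    for (Map.Entry<String, State> e : states.entrySet()) {
      String mapKey = e.getKey();
      if (mapKey == null) {
        continue; // an unguarded mapKey.toString() would throw an NPE here
      }
      if (e.getValue() == State.DIRTY) {
        quals.add(mapKey.getBytes(StandardCharsets.UTF_8));
      }
    }
    return quals;
  }

  public static void main(String[] args) {
    Map<String, State> states = new HashMap<>();
    states.put("http://example.com/", State.DIRTY);
    states.put(null, State.DIRTY); // HashMap permits a single null key
    System.out.println(dirtyQualifiers(states).size()); // prints 1
  }
}
```

Skipping the entry hides the symptom only. If the null key really comes from a malformed URL, the better fix is probably wherever that key gets created.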
The put function is called from here. This is from Gora 0.2.1:
gora/blob/0.2.1/gora-core/src/main/java/org/apache/gora/mapreduce/GoraRecordWriter.java:
  @Override
  public void write(K key, T value) throws IOException, InterruptedException {
    store.put(key, (Persistent) value);
    counter.increment();
    if (counter.isModulo()) {
      LOG.info("Flushing the datastore after " + counter.getRecordsNumber() + " records");
      store.flush();
    }
  }
}
The same function in Gora trunk looks like this:
  public void write(K key, T value) throws IOException, InterruptedException {
    try{
      store.put(key, (Persistent) value);
      counter.increment();
      if (counter.isModulo()) {
        LOG.info("Flushing the datastore after " + counter.getRecordsNumber() + " records");
        store.flush();
      }
    }catch(Exception e){
      LOG.info("Exception at GoraRecordWriter.class while writing to datastore." + e.getMessage());
    }
  }
It seems to me that this would allow the code to recover from this kind of
error. I get Gora through Ivy, and I don't know how, or whether, I can have
Ivy fetch the trunk. Regardless, I still think the question remains: is this
a Nutch issue or a Gora issue?
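To make the difference between the two write() variants quoted above concrete, here is a small self-contained sketch. It does not use Gora: the Store interface, StrictStore and both write methods are hypothetical stand-ins, and writeSwallowing returns a boolean purely so the effect can be observed (the trunk method returns void).

```java
import java.util.ArrayList;
import java.util.List;

class WriteErrorHandling {
  // Hypothetical stand-in for a data store; not Gora's DataStore interface.
  interface Store {
    void put(String key, String value);
  }

  // Store that fails on null values, mimicking a put() that throws an NPE.
  static class StrictStore implements Store {
    final List<String> written = new ArrayList<>();

    @Override
    public void put(String key, String value) {
      if (value == null) {
        throw new NullPointerException("null value for key " + key);
      }
      written.add(key);
    }
  }

  // 0.2.1-style write: any exception from put() propagates and fails the job.
  static void writePropagating(Store store, String key, String value) {
    store.put(key, value);
  }

  // Trunk-style write: the exception is caught and logged, and the record is
  // dropped. Returns false when that happens.
  static boolean writeSwallowing(Store store, String key, String value) {
    try {
      store.put(key, value);
      return true;
    } catch (Exception e) {
      System.err.println("Exception while writing to datastore: " + e.getMessage());
      return false;
    }
  }

  public static void main(String[] args) {
    StrictStore store = new StrictStore();
    writeSwallowing(store, "good", "v");
    writeSwallowing(store, "bad", null); // logged, job continues, record lost
    writeSwallowing(store, "good2", "v");
    System.out.println(store.written); // prints [good, good2]
  }
}
```

With the swallowing variant the job would finish, but every record that hits the NPE is silently missing from the table, so it masks the underlying problem rather than fixing it.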
Sorry for the long email.
On 03/30/2013 04:03 PM, Lewis John Mcgibbney wrote:
I think we may also need to add the BATCH_ID to one Job's HashSet:
private static final Collection<WebPage.Field> FIELDS = new HashSet<WebPage.Field>();
static {
  ...
  FIELDS.add(WebPage.Field.BATCH_ID);
}
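A rough sketch of why a missing entry in such a field set could matter, assuming the store only returns the fields a job asks for. The Field enum, the load method and the FieldSelection class below are invented for the illustration and are not the real WebPage.Field or any Gora API.

```java
import java.util.Collection;
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;

class FieldSelection {
  // Hypothetical subset of fields; not the real WebPage.Field enum.
  enum Field { STATUS, FETCH_TIME, BATCH_ID }

  // Simulates a store that holds all fields but returns only the requested ones.
  static Map<Field, String> load(Map<Field, String> stored, Collection<Field> requested) {
    Map<Field, String> row = new EnumMap<>(Field.class);
    for (Field f : requested) {
      if (stored.containsKey(f)) {
        row.put(f, stored.get(f));
      }
    }
    return row;
  }

  public static void main(String[] args) {
    Map<Field, String> stored = new EnumMap<>(Field.class);
    stored.put(Field.STATUS, "unfetched");
    stored.put(Field.BATCH_ID, "batch-1");

    Collection<Field> withoutBatchId = EnumSet.of(Field.STATUS, Field.FETCH_TIME);
    Collection<Field> withBatchId = EnumSet.of(Field.STATUS, Field.BATCH_ID);

    // The value exists in the store, but is null for the job that did not request it.
    System.out.println(load(stored, withoutBatchId).get(Field.BATCH_ID)); // prints null
    System.out.println(load(stored, withBatchId).get(Field.BATCH_ID));    // prints batch-1
  }
}
```

Under that assumption, any code that dereferences the batch id without a null check would fail for jobs whose field set omits BATCH_ID.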
On Sat, Mar 30, 2013 at 3:55 PM, Lewis John Mcgibbney <
[email protected]> wrote:
Hi,
I've tried to sort this out locally this morning...
I can almost replicate this behaviour with gora-cassandra and it looks
most likely that the patch(es) applied in
* NUTCH-1533 - NUTCH-1532 Implement getPrevModifiedTime(),
setPrevModifiedTime(), getBatchId() and setBatchId() accessors in
o.a.n.storage.WebPage, and
* NUTCH-1532 - Replace 'segment' mapping field with batchId,
respectively, are not backwards compatible, because some URLs within the web
database do not contain values for the batchId.
Of course this is a major problem.
I opened NUTCH-1551 [0] and submitted a patch to make WebTableReader
backwards compatible with the above patches. Please try out the patch if
you can and comment so I can commit.
We have a couple options here.
1) Revert both of the above until we can get a fix
2) Get a fix just now and commit it.
What do you guys want to do?
I have a question about whether or not we can dynamically add fields to
existing database entries by injecting them.
Say, for example, you inject URLs without the batchId field in your mapping
file, then add the field and inject some more URLs... will the field be
added to your database? If so, then why are we getting the NPE?
There must be some other location in the Nutch code where an asserted
attempt is being made to obtain the batchId for some given key... it
cannot be obtained and we receive the NPE.
[0] https://issues.apache.org/jira/browse/NUTCH-1551
On Fri, Mar 29, 2013 at 5:05 PM, kaveh minooie <[email protected]> wrote:
I use git and I fetch from GitHub
(https://github.com/apache/nutch.git).
Currently I am on this commit:
commit 4bb01d6b908dc230c8be89d398b03a86581ec42b
Author: lufeng <[email protected]>
Date: Thu Mar 28 13:09:09 2013 +0000
NUTCH-1547 BasicIndexingFilter - Problem to index full title
git-svn-id: https://svn.apache.org/repos/asf/nutch/branches/2.x@1462079 13f79535-47bb-0310-9956-ffa450edef68
Before, I was on this commit:
commit f02dcf62566583551426c08bd388080e5b2bc93e
f02dcf6 NUTCH-XX remove unused db.max.inlinks from nutch-default.xml
On 03/29/2013 04:35 PM, [email protected] wrote:
Yes, with hbase. Here is the error
13/03/29 16:33:29 INFO zookeeper.ZooKeeper: Session: 0x13d7770d67d005f closed
13/03/29 16:33:29 ERROR crawl.WebTableReader: WebTableReader: java.lang.NullPointerException
at org.apache.gora.hbase.store.HBaseStore.addFields(HBaseStore.java:398)
at org.apache.gora.hbase.store.HBaseStore.execute(HBaseStore.java:360)
at org.apache.nutch.crawl.WebTableReader.read(WebTableReader.java:234)
at org.apache.nutch.crawl.WebTableReader.run(WebTableReader.java:476)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.WebTableReader.main(WebTableReader.java:412)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:156)
If I revert to previous release it works fine.
Thanks.
Alex.
-----Original Message-----
From: Lewis John Mcgibbney <[email protected]>
To: user <[email protected]>
Sent: Fri, Mar 29, 2013 4:30 pm
Subject: Re: error using generate in 2.x
Hi Alex,
With HBase also?
There 'was' a bug in gora-cassandra module for this command + params
however I thought it had been addressed and therefore resolved it.
Lewis
On Fri, Mar 29, 2013 at 4:00 PM, <[email protected]> wrote:
Hi,
It seems that trunk has a few bugs. I found out that readdb -url urlname
also gives errors.
Thanks.
Alex.
-----Original Message-----
From: kaveh minooie <[email protected]>
To: user <[email protected]>
Sent: Fri, Mar 29, 2013 1:53 pm
Subject: Re: error using generate in 2.x
Hi Lewis,
The mapping file that I am using is the one that comes with Nutch, and I
haven't touched it. This message in the log is caused by using
-crawlId on the command line. For example, this log was the result of
this command:
bin/nutch generate -topN 1000 -crawlId t1
which causes Nutch (or, I guess, technically Gora) to use the table
name 't1_webpage'. Though I have to say that I don't understand the
rationale behind the code generating a warning like this (I know
it is not actually a warning, just that the way the message is
phrased makes it look like one) for something that should be a
routine operation for someone like me, who is crawling (I mean hoping
to, because it is not working right now) thousands of websites and maintains
multiple crawldbs (or their equivalent in Gora, webpage tables) for
different groups of websites.
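For reference, the naming behaviour described above can be sketched as follows. The rule (crawl id, underscore, table name from the mapping file) is inferred only from the 'webpage' vs 't1_webpage' log line, and TableNames and resolve are hypothetical names, not Nutch or Gora API.

```java
class TableNames {
  // Assumed rule, inferred from the log line: a non-empty crawl id is
  // prefixed to the mapping-file table name with an underscore.
  static String resolve(String crawlId, String mappingTable) {
    if (crawlId == null || crawlId.isEmpty()) {
      return mappingTable;
    }
    return crawlId + "_" + mappingTable;
  }

  public static void main(String[] args) {
    System.out.println(resolve("t1", "webpage")); // prints t1_webpage
    System.out.println(resolve("", "webpage"));   // prints webpage
  }
}
```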
That being said, it has nothing to do with the problem that I am
having. It is the same when I omit the -crawlId parameter (forcing it
to use the default name 'webpage'), and more importantly, it is new. I
haven't had this problem before; it just started happening 2 days ago
when I pulled the latest commits to the 2.x branch.
On 03/29/2013 09:50 AM, Lewis John Mcgibbney wrote:
Hi Kaveh,
Firstly, as logged below, Gora attempts to associate your HBase table
configuration with the specified tables (from within gora-hbase-mapping.xml).
However, it seems that your case satisfies the condition
"if (!tableName.equals(tableNameFromMapping))", meaning that the table name is
not equal to the value of the table name attribute, or that this value is
null.
This is allowed, but I am interested to find out what the mapping file
looks like... the entire file is not required, just the <class name="value"
snippet if this is possible.
I am not using the gora-hbase module and haven't ever seen anyone come
across this problem before.
Thanks
Lewis
On Thursday, March 28, 2013, kaveh minooie <[email protected]> wrote:
2013-03-28 11:06:25,158 INFO store.HBaseStore - Keyclass and nameclass match but mismatching table names mappingfile schema is 'webpage' vs actual schema 't1_webpage' , assuming they are the same.
--
Kaveh Minooie
--
Kaveh Minooie
--
*Lewis*
--
Kaveh Minooie