RE: Nutch 2 and Cassandra

2011-08-02 Thread Tom Davidson
I found the problem. I am using Cloudera CDH3 and it has a hue plugins jar with 
an older thrift library in it. I removed the jar from my classpath and all is 
good. Thanks for your help.
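
For anyone hitting a similar conflict, one way to spot a jar that bundles its own copy of a library is to scan the classpath entries for a package prefix. The following is only a sketch: the function name `jars_containing`, the default prefix `org/apache/thrift/`, and the choice of the `HADOOP_CLASSPATH` variable are assumptions made for illustration, not something taken from CDH3 or Nutch.

```python
import glob
import os
import zipfile


def jars_containing(classpath, prefix="org/apache/thrift/"):
    """Return the jars on `classpath` that hold at least one entry under `prefix`."""
    hits = []
    for item in classpath.split(os.pathsep):
        if not item:
            continue
        if item.endswith("*"):
            # Java-style wildcard entry: every jar in that folder.
            candidates = sorted(glob.glob(item + ".jar"))
        elif os.path.isdir(item):
            # Convenience of this sketch (not Java semantics): scan the folder for jars.
            candidates = sorted(glob.glob(os.path.join(item, "*.jar")))
        else:
            candidates = [item]
        for jar in candidates:
            if not zipfile.is_zipfile(jar):
                continue  # skip missing files and anything that is not an archive
            with zipfile.ZipFile(jar) as archive:
                if any(name.startswith(prefix) for name in archive.namelist()):
                    hits.append(jar)
    return hits


if __name__ == "__main__":
    # HADOOP_CLASSPATH is only one place a stray jar can come from.
    for jar in jars_containing(os.environ.get("HADOOP_CLASSPATH", "")):
        print(jar)
```

If more than one jar is reported for the same package, the versions may differ, and which one wins depends on classpath order.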


Re: Nutch 2 and Cassandra

2011-08-02 Thread lewis john mcgibbney
Hi

I've been watching progress on this thread with interest and think that this
would be a great addition to the wiki under the following page [1].

I am happy to write it up. However, is there anything else we need to be
aware of in addition to the material you have provided, for example some
latent info that has been assumed or not explained?

Thank you

[1] http://wiki.apache.org/nutch/ErrorMessagesInNutch2


RE: Nutch 2 and Cassandra

2011-08-02 Thread Tom Davidson
I did run into a couple more problems running Nutch 2 with CDH3. See 
https://issues.apache.org/jira/browse/NUTCH-937. I added a comment on the 
thread explaining my additional problem. I worked around the problem by 
unjarring the nutch-2-dev.job and setting the HADOOP_CLASSPATH (see below) 
environment variable. Not an ideal solution, but it works.

In order to run Nutch 2 on CDH3 I added the following to nutch-site.xml and 
rebuilt the nutch-2-dev.job:

<property>
  <name>mapreduce.job.jar.unpack.pattern</name>
  <value>(?:classes/|lib/|plugins/).*</value>
</property>

<property>
  <name>plugin.folders</name>
  <value>${job.local.dir}/../jars/plugins</value>
</property>

And I had to set this environment variable to my expanded plugins folder:

export HADOOP_OPTS="-Djob.local.dir=/MY HOME/nutch/plugins"
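
To illustrate what the unpack pattern from the nutch-site.xml property selects, here is a small sketch. It assumes the pattern is applied as a full match against each entry name inside the job file; that is my reading of how Hadoop uses it and it has not been checked against CDH3 here. The entry names in the example are hypothetical.

```python
import re

# Pattern copied from the mapreduce.job.jar.unpack.pattern property.
UNPACK_PATTERN = re.compile(r"(?:classes/|lib/|plugins/).*")


def is_unpacked(entry_name):
    """True if the whole entry name matches the unpack pattern."""
    return UNPACK_PATTERN.fullmatch(entry_name) is not None


if __name__ == "__main__":
    # Hypothetical entry names, for illustration only.
    for name in ["plugins/parse-html/plugin.xml", "lib/example.jar", "nutch-default.xml"]:
        print(name, is_unpacked(name))
```

Under that assumption, only entries whose names start with classes/, lib/ or plugins/ are selected, which is why plugins/ has to appear in the pattern for the plugin folder to be unpacked.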






Re: Nutch 2 and Cassandra

2011-08-01 Thread Alexis
Hi, libthrift is a dependency of cassandra-thrift, as listed here:
http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1

During Nutch build, you have to manually tweak the Ivy configuration
depending on your choice of the Gora store, in this case Cassandra.
Basically you need to add all the dependencies listed there:
http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/ivy/ivy.xml?view=markup

Let's try to add to $NUTCH_HOME/ivy/ivy.xml the following dependencies
and then let's rebuild Nutch (see attached patch):
<dependency org="org.apache.gora" name="gora-cassandra" rev="0.2-incubating" conf="*->compile"/>
<dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
<dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9" conf="*->*,!javadoc,!sources"/>
<dependency org="com.github.stephenc.high-scale-lib" name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
<dependency org="com.google.collections" name="google-collections" rev="1.0" conf="*->*,!javadoc,!sources"/>
<dependency org="com.google.guava" name="guava" rev="r09" conf="*->*,!javadoc,!sources"/>

$ ant clean
$ ant

In your case libthrift should now be downloaded by Ivy and then
bundled into the nutch-2.0-dev.job file. I'm not sure how
apache-cassandra and hector got included in your classpath...

Somehow we need to resolve as well:
<dependency org="org.apache.cassandra" name="apache-cassandra" rev="0.8.1"/>
<dependency org="me.prettyprint" name="hector" rev="0.8.0-1"/>

I don't think these two jars are in the default Maven repository, so
they won't be downloaded; that's why they were commented out in the
Gora Cassandra Ivy config (gora/trunk/gora-cassandra/ivy/ivy.xml).


Since hector jar is not found in my case I get:
~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject
~/java/workspace/Nutch/seeds
11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: urlDir:
/home/alex/java/workspace/Nutch/seeds
11/08/01 14:18:42 INFO security.Groups: Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
11/08/01 14:18:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
11/08/01 14:18:42 ERROR crawl.InjectorJob: InjectorJob:
org.apache.gora.util.GoraException:
java.lang.reflect.InvocationTargetException
at 
org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
at 
org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
at 
org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at 
sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at 
sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at 
org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
at 
org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
... 12 more
Caused by: java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
at 
org.apache.gora.cassandra.store.CassandraStore.init(CassandraStore.java:60)
... 18 more
Caused by: java.lang.ClassNotFoundException:
me.prettyprint.hector.api.Serializer
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 19 more
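
A quick way to check whether a class such as me.prettyprint.hector.api.Serializer actually made it into the job file is to look inside the jars bundled in it. This is a rough sketch that assumes the job file is a zip archive with its dependency jars under lib/; the function name `providers_in_job` is made up for the example.

```python
import io
import zipfile


def providers_in_job(job_path, class_name):
    """List the jars under lib/ of a job file that contain `class_name`."""
    entry = class_name.replace(".", "/") + ".class"
    providers = []
    with zipfile.ZipFile(job_path) as job:
        for name in job.namelist():
            if not (name.startswith("lib/") and name.endswith(".jar")):
                continue
            # Nested jars are read into memory and opened as archives themselves.
            with zipfile.ZipFile(io.BytesIO(job.read(name))) as nested:
                if entry in nested.namelist():
                    providers.append(name)
    return providers


# Example call (the path is a placeholder):
# providers_in_job("nutch-2.0-dev.job", "me.prettyprint.hector.api.Serializer")
```

An empty list would be consistent with the ClassNotFoundException above, while more than one provider would hint at conflicting versions of the same library.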




On Mon, Aug 1, 2011 at 11:59 AM, Tom Davidson tdavid...@covario.com wrote:
 Hi All,



 I am kind of at my wit’s end here, so I am hoping someone here can help.  I
 am trying to use Nutch2 and Cassandra and I have been successful using the
 runtime/local build. I am using the Cloudera CDH3 on CentOs 5 and I do not
 want to 

Re: Nutch 2 and Cassandra

2011-08-01 Thread Alexis
 INFO jvm.JvmMetrics: Cannot initialize JVM Metrics
with processName=JobTracker, sessionId= - already initialized
11/08/01 15:17:52 INFO crawl.InjectorJob: InjectorJob: finished



This is what was added to ivy/ivy.xml:

+   <dependency org="org.apache.gora" name="gora-cassandra" rev="0.2-incubating" conf="*->compile"/>
+   <dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
+   <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9" conf="*->*,!javadoc,!sources"/>
+   <dependency org="com.github.stephenc.high-scale-lib" name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
+   <dependency org="com.google.collections" name="google-collections" rev="1.0" conf="*->*,!javadoc,!sources"/>
+   <dependency org="com.google.guava" name="guava" rev="r09" conf="*->*,!javadoc,!sources"/>
+   <dependency org="org.apache.cassandra" name="apache-cassandra" rev="0.8.1"/>
+   <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2"/>



On Mon, Aug 1, 2011 at 2:55 PM, Tom Davidson tdavid...@covario.com wrote:
 I did something similar to below to add the Cassandra dependencies. Note that 
 I am getting NoSuchMethodErrors not ClassNotFoundExceptions. Can you add the 
 hector jars to your nutch job jar and see what you get? I think I am one step 
 ahead of you. BTW, I just added this line to get the hector dependency:

        <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2" conf="*->default"/>


RE: Nutch 2 and Cassandra

2011-08-01 Thread Tom Davidson
OK... Are you running with a clustered version of Hadoop? I think you have to 
have your HADOOP_HOME env variable set. Otherwise it runs in local mode. I have 
been able to run in local mode, but not in deployed mode.


-----Original Message-----
From: Alexis [mailto:alexis.detregl...@gmail.com] 
Sent: Monday, August 01, 2011 3:25 PM
To: dev@nutch.apache.org
Subject: Re: Nutch 2 and Cassandra

Ok this version of hector was properly resolved. Thanks!

These are the logs:
~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject
~/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: urlDir:
/home/alex/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
11/08/01 15:17:46 INFO connection.CassandraHostRetryService: Downed
Host Retry service started with queue size -1 and retry delay 10s
11/08/01 15:17:46 INFO service.JmxMonitor: Registering JMX
me.prettyprint.cassandra.service_Test
Cluster:ServiceType=hector,MonitorType=hector
11/08/01 15:17:47 INFO store.CassandraClient: Keyspace 'webpage' in
cluster 'Test Cluster' was created on host 'localhost'
11/08/01 15:17:48 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapred.JobClient: Running job: job_local_0001
11/08/01 15:17:49 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapreduce.GoraRecordWriter:
gora.buffer.write.limit = 1
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-alex/hadoop-unjar8045717865743865180/plugins
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugin Auto-activation
mode: [true]
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Plugins:
11/08/01 15:17:49 INFO plugin.PluginRepository: the nutch core
extension points (nutch-extensionpoints)
11/08/01 15:17:49 INFO plugin.PluginRepository: Basic URL
Normalizer (urlnormalizer-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository: Basic Indexing
Filter (index-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository: Html Parse
Plug-in (parse-html)
11/08/01 15:17:49 INFO plugin.PluginRepository: HTTP Framework
(lib-http)
11/08/01 15:17:49 INFO plugin.PluginRepository: Pass-through
URL Normalizer (urlnormalizer-pass)
11/08/01 15:17:49 INFO plugin.PluginRepository: Regex URL
Filter (urlfilter-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository: Http Protocol
Plug-in (protocol-http)
11/08/01 15:17:49 INFO plugin.PluginRepository: Regex URL
Normalizer (urlnormalizer-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository: Tika Parser
Plug-in (parse-tika)
11/08/01 15:17:49 INFO plugin.PluginRepository: OPIC Scoring
Plug-in (scoring-opic)
11/08/01 15:17:49 INFO plugin.PluginRepository: CyberNeko HTML
Parser (lib-nekohtml)
11/08/01 15:17:49 INFO plugin.PluginRepository: Anchor
Indexing Filter (index-anchor)
11/08/01 15:17:49 INFO plugin.PluginRepository: Regex URL
Filter Framework (lib-regex-filter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Extension-Points:
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Protocol
(org.apache.nutch.protocol.Protocol)
11/08/01 15:17:49 INFO plugin.PluginRepository: Parse Filter
(org.apache.nutch.parse.ParseFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch URL
Filter (org.apache.nutch.net.URLFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Content
Parser (org.apache.nutch.parse.Parser)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
11/08/01 15:17:50 INFO conf.Configuration: found resource
regex-normalize.xml at
file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-normalize.xml
11/08/01 15:17:50 INFO conf.Configuration: found resource
regex-urlfilter.txt at
file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-urlfilter.txt
11/08/01 15:17:50 INFO regex.RegexURLNormalizer: can't find rules for
scope 'inject', using default
11/08/01 15:17:50 INFO mapred.JobClient:  map 0% reduce 0%
11/08/01 15:17:51 INFO mapred.TaskRunner:
Task:attempt_local_0001_m_00_0 is done. And is in the process of
commiting
11/08/01 15:17:51 INFO mapred.LocalJobRunner:
11/08/01 15:17:51 INFO mapred.TaskRunner: Task
'attempt_local_0001_m_00_0' done.
11/08/01 15:17:52 INFO mapred.JobClient:  map 100% reduce 0%
11/08/01 15:17:52 INFO mapred.JobClient: Job complete: job_local_0001
11/08/01 15:17:52 INFO mapred.JobClient: Counters: 5
11/08/01 15:17:52 INFO mapred.JobClient:   FileSystemCounters
11/08/01 15:17:52 INFO