RE: Nutch 2 and Cassandra
I found the problem. I am using Cloudera CDH3 and it has a hue plugins jar with an older thrift library in it. I removed the jar from my classpath and all is good. Thanks for your help.

-Original Message-
From: Tom Davidson [mailto:tdavid...@covario.com]
Sent: Monday, August 01, 2011 3:29 PM
To: dev@nutch.apache.org
Subject: RE: Nutch 2 and Cassandra

OK... Are you running with a clustered version of Hadoop? I think you have to have your HADOOP_HOME env variable set. Otherwise it runs in local mode. I have been able to run in local mode, but not in deployed mode.

-Original Message-
From: Alexis [mailto:alexis.detregl...@gmail.com]
Sent: Monday, August 01, 2011 3:25 PM
To: dev@nutch.apache.org
Subject: Re: Nutch 2 and Cassandra

Ok this version of hector was properly resolved. Thanks! These are the logs:

~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject ~/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: urlDir: /home/alex/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/08/01 15:17:46 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
11/08/01 15:17:46 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
11/08/01 15:17:47 INFO store.CassandraClient: Keyspace 'webpage' in cluster 'Test Cluster' was created on host 'localhost'
11/08/01 15:17:48 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapred.JobClient: Running job: job_local_0001
11/08/01 15:17:49 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapreduce.GoraRecordWriter: gora.buffer.write.limit = 1
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-alex/hadoop-unjar8045717865743865180/plugins
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Plugins:
11/08/01 15:17:49 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
11/08/01 15:17:49 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html)
11/08/01 15:17:49 INFO plugin.PluginRepository: HTTP Framework (lib-http)
11/08/01 15:17:49 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass)
11/08/01 15:17:49 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
11/08/01 15:17:49 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
11/08/01 15:17:49 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic)
11/08/01 15:17:49 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml)
11/08/01 15:17:49 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor)
11/08/01 15:17:49 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Extension-Points:
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
11/08/01 15:17:49 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
11/08/01 15:17:50 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-normalize.xml
11/08/01 15:17:50 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-urlfilter.txt
11/08/01 15:17:50 INFO regex.RegexURLNormalizer: can't find rules for scope 'inject', using default
11/08/01 15:17:50 INFO mapred.JobClient: map 0% reduce 0%
11/08/01 15:17:51 INFO mapred.TaskRunner: Task:attempt_local_0001_m_00_0 is done. And is in the process of commiting
11/08/01 15:17:51 INFO
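
A conflict like the thrift one reported in this message (an older copy of a library bundled inside an unrelated jar) can be tracked down by listing which jars hold classes from the package in question. The following is a minimal Python sketch and was not posted in the thread: the function name jars_containing, the default prefix org/apache/thrift/ and the use of HADOOP_CLASSPATH as the source of entries are all assumptions made for illustration.

```python
import os
import zipfile
from pathlib import Path


def jars_containing(classpath_entries, package_prefix="org/apache/thrift/"):
    """Return the jars from classpath_entries that hold classes under package_prefix."""
    hits = []
    for entry in classpath_entries:
        path = Path(entry)
        # Skip directories, missing files and anything that is not a zip archive.
        if not path.is_file() or not zipfile.is_zipfile(path):
            continue
        with zipfile.ZipFile(path) as jar:
            if any(name.startswith(package_prefix) and name.endswith(".class")
                   for name in jar.namelist()):
                hits.append(str(path))
    return hits


if __name__ == "__main__":
    # HADOOP_CLASSPATH is only an example source of entries (an assumption).
    entries = os.environ.get("HADOOP_CLASSPATH", "").split(os.pathsep)
    for jar in jars_containing(entries):
        print(jar)
```

If more than one jar is printed, the copies may be different versions of the same library. Wildcard classpath entries such as lib/* are not expanded by this sketch, so they would have to be listed jar by jar.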
Re: Nutch 2 and Cassandra
Hi,

I've been watching progress on this thread with interest and think it would make a great addition to the wiki under the following page [1]. I am happy to write it up. Is there anything else we need to be aware of beyond the material you have provided, for example information that has been assumed or not explained?

Thank you

[1] http://wiki.apache.org/nutch/ErrorMessagesInNutch2
RE: Nutch 2 and Cassandra
I did run into a couple more problems running Nutch 2 with CDH3. See https://issues.apache.org/jira/browse/NUTCH-937. I added a comment on the thread explaining my additional problem. I worked around the problem by unjarring the nutch-2-dev.job and setting the HADOOP_CLASSPATH (see below) environment variable. Not an ideal solution, but it works.

In order to run Nutch 2 on CDH3 I added the following to nutch-site.xml and rebuilt the nutch-2-dev.job:

<property>
  <name>mapreduce.job.jar.unpack.pattern</name>
  <value>(?:classes/|lib/|plugins/).*</value>
</property>
<property>
  <name>plugin.folders</name>
  <value>${job.local.dir}/../jars/plugins</value>
</property>

And I had to set this environment variable to my expanded plugins folder:

export HADOOP_OPTS=-Djob.local.dir=/MY HOME/nutch/plugins
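
The mapreduce.job.jar.unpack.pattern value quoted in this message is a regular expression over the entry names inside the job file. Below is a minimal Python sketch of what that expression selects. It assumes the pattern has to match the whole entry name (the way Java's Matcher.matches() behaves), which was not stated in the thread, and the entry names are made up for illustration.

```python
import re

# Pattern copied from the nutch-site.xml property quoted in this message.
UNPACK_PATTERN = re.compile(r"(?:classes/|lib/|plugins/).*")


def entries_to_unpack(entry_names, pattern=UNPACK_PATTERN):
    """Keep only the entries whose whole name matches the pattern."""
    return [name for name in entry_names if pattern.fullmatch(name)]


# Hypothetical job-file entries, not taken from a real nutch-2-dev.job.
sample = [
    "classes/org/apache/nutch/crawl/InjectorJob.class",
    "lib/libthrift.jar",
    "plugins/parse-html/plugin.xml",
    "nutch-site.xml",
    "META-INF/MANIFEST.MF",
]

for name in entries_to_unpack(sample):
    print(name)
```

Under that assumption, adding plugins/ to the alternation is what lets the plugin folders be selected alongside classes/ and lib/, while top-level files such as nutch-site.xml are not.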
Re: Nutch 2 and Cassandra
Hi,

libthrift is a dependency of cassandra-thrift, as listed here:
http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1

During Nutch build, you have to manually tweak the Ivy configuration depending on your choice of the Gora store, in this case Cassandra. Basically you need to add all the dependencies listed there:
http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/ivy/ivy.xml?view=markup

Let's try to add to $NUTCH_HOME/ivy/ivy.xml the following dependencies and then let's rebuild Nutch (see attached patch):

<dependency org="org.apache.gora" name="gora-cassandra" rev="0.2-incubating" conf="*->compile"/>
<dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
<dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9" conf="*->*,!javadoc,!sources"/>
<dependency org="com.github.stephenc.high-scale-lib" name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
<dependency org="com.google.collections" name="google-collections" rev="1.0" conf="*->*,!javadoc,!sources"/>
<dependency org="com.google.guava" name="guava" rev="r09" conf="*->*,!javadoc,!sources"/>

$ ant clean
$ ant

In your case libthrift should now be downloaded by Ivy and then bundled into the nutch-2.0-dev.job file.

I'm not sure how apache-cassandra and hector got included in your classpath... Somehow we need to resolve as well:

<dependency org="org.apache.cassandra" name="apache-cassandra" rev="0.8.1"/>
<dependency org="me.prettyprint" name="hector" rev="0.8.0-1"/>

I don't think the following 2 jars are in the default maven repository so they won't be downloaded, that's why they were commented in the Gora Cassandra Ivy config (gora/trunk/gora-cassandra/ivy/ivy.xml)

Since hector jar is not found in my case I get:

~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject ~/java/workspace/Nutch/seeds
11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: urlDir: /home/alex/java/workspace/Nutch/seeds
11/08/01 14:18:42 INFO security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=30
11/08/01 14:18:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/08/01 14:18:42 ERROR crawl.InjectorJob: InjectorJob: org.apache.gora.util.GoraException: java.lang.reflect.InvocationTargetException
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
        at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
        at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
        at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
        at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
        at java.lang.reflect.Method.invoke(Method.java:597)
        at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
Caused by: java.lang.reflect.InvocationTargetException
        at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
        at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
        at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
        at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
        at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
        at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
        ... 12 more
Caused by: java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
        at org.apache.gora.cassandra.store.CassandraStore.<init>(CassandraStore.java:60)
        ... 18 more
Caused by: java.lang.ClassNotFoundException: me.prettyprint.hector.api.Serializer
        at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
        at java.security.AccessController.doPrivileged(Native Method)
        at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
        at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
        ... 19 more

On Mon, Aug 1, 2011 at 11:59 AM, Tom Davidson <tdavid...@covario.com> wrote:

Hi All, I am kind of at my wit’s end here, so I am hoping someone here can help. I am trying to use Nutch2 and Cassandra and I have been successful using the runtime/local build. I am using the Cloudera CDH3 on CentOs 5 and I do not want to
Re: Nutch 2 and Cassandra
INFO jvm.JvmMetrics: Cannot initialize JVM Metrics with processName=JobTracker, sessionId= - already initialized
11/08/01 15:17:52 INFO crawl.InjectorJob: InjectorJob: finished

This is what was added to ivy/ivy.xml:

+ <dependency org="org.apache.gora" name="gora-cassandra" rev="0.2-incubating" conf="*->compile"/>
+ <dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
+ <dependency org="com.ecyrd.speed4j" name="speed4j" rev="0.9" conf="*->*,!javadoc,!sources"/>
+ <dependency org="com.github.stephenc.high-scale-lib" name="high-scale-lib" rev="1.1.2" conf="*->*,!javadoc,!sources"/>
+ <dependency org="com.google.collections" name="google-collections" rev="1.0" conf="*->*,!javadoc,!sources"/>
+ <dependency org="com.google.guava" name="guava" rev="r09" conf="*->*,!javadoc,!sources"/>
+ <dependency org="org.apache.cassandra" name="apache-cassandra" rev="0.8.1"/>
+ <dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2"/>

On Mon, Aug 1, 2011 at 2:55 PM, Tom Davidson <tdavid...@covario.com> wrote:

I did something similar to below to add the Cassandra dependencies. Note that I am getting NoSuchMethodErrors not ClassNotFoundExceptions. Can you add the hector jars to your nutch job jar and see what you get? I think I am one step ahead of you. BTW, I just added this line to get the hector dependency:

<dependency org="me.prettyprint" name="hector-core" rev="0.8.0-2" conf="*->default"/>
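
To check whether an ivy/ivy.xml already declares the Cassandra-related dependencies listed in this message, something like the following Python sketch could be used. It was not part of the thread: the helper name missing_dependencies and the sample document are hypothetical, and only the org and name attributes are compared, not rev or conf.

```python
import xml.etree.ElementTree as ET

# (org, name) pairs taken from the list added to ivy/ivy.xml in this message.
REQUIRED = [
    ("org.apache.gora", "gora-cassandra"),
    ("org.apache.cassandra", "cassandra-thrift"),
    ("com.ecyrd.speed4j", "speed4j"),
    ("com.github.stephenc.high-scale-lib", "high-scale-lib"),
    ("com.google.collections", "google-collections"),
    ("com.google.guava", "guava"),
    ("org.apache.cassandra", "apache-cassandra"),
    ("me.prettyprint", "hector-core"),
]


def missing_dependencies(ivy_xml_text, required=REQUIRED):
    """Return the required (org, name) pairs that the Ivy file does not declare."""
    root = ET.fromstring(ivy_xml_text)
    declared = {(dep.get("org"), dep.get("name")) for dep in root.iter("dependency")}
    return [pair for pair in required if pair not in declared]


# Hypothetical, heavily shortened ivy.xml used only to exercise the helper.
SAMPLE = """<ivy-module version="1.0">
  <dependencies>
    <dependency org="org.apache.gora" name="gora-cassandra" rev="0.2-incubating" conf="*->compile"/>
    <dependency org="org.apache.cassandra" name="cassandra-thrift" rev="0.8.1"/>
  </dependencies>
</ivy-module>"""

for org, name in missing_dependencies(SAMPLE):
    print(f"missing: {org}#{name}")
```

A check like this only confirms that the entries are declared. Whether Ivy can actually resolve them from the configured repositories is a separate question, as the hector case in this thread shows.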
RE: Nutch 2 and Cassandra
OK... Are you running with a clustered version of Hadoop? I think you have to have your HADOOP_HOME env variable set. Otherwise it runs in local mode. I have been able to run in local mode, but not in deployed mode.

-Original Message-
From: Alexis [mailto:alexis.detregl...@gmail.com]
Sent: Monday, August 01, 2011 3:25 PM
To: dev@nutch.apache.org
Subject: Re: Nutch 2 and Cassandra

Ok this version of hector was properly resolved. Thanks! These are the logs:

~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject ~/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: urlDir: /home/alex/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with processName=JobTracker, sessionId=
11/08/01 15:17:46 INFO connection.CassandraHostRetryService: Downed Host Retry service started with queue size -1 and retry delay 10s
11/08/01 15:17:46 INFO service.JmxMonitor: Registering JMX me.prettyprint.cassandra.service_Test Cluster:ServiceType=hector,MonitorType=hector
11/08/01 15:17:47 INFO store.CassandraClient: Keyspace 'webpage' in cluster 'Test Cluster' was created on host 'localhost'
11/08/01 15:17:48 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapred.JobClient: Running job: job_local_0001
11/08/01 15:17:49 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapreduce.GoraRecordWriter: gora.buffer.write.limit = 1
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugins: looking in: /tmp/hadoop-alex/hadoop-unjar8045717865743865180/plugins
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugin Auto-activation mode: [true]
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Plugins:
11/08/01 15:17:49 INFO plugin.PluginRepository: the nutch core extension points (nutch-extensionpoints)
11/08/01 15:17:49 INFO plugin.PluginRepository: Basic URL Normalizer (urlnormalizer-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository: Basic Indexing Filter (index-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository: Html Parse Plug-in (parse-html)
11/08/01 15:17:49 INFO plugin.PluginRepository: HTTP Framework (lib-http)
11/08/01 15:17:49 INFO plugin.PluginRepository: Pass-through URL Normalizer (urlnormalizer-pass)
11/08/01 15:17:49 INFO plugin.PluginRepository: Regex URL Filter (urlfilter-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository: Http Protocol Plug-in (protocol-http)
11/08/01 15:17:49 INFO plugin.PluginRepository: Regex URL Normalizer (urlnormalizer-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository: Tika Parser Plug-in (parse-tika)
11/08/01 15:17:49 INFO plugin.PluginRepository: OPIC Scoring Plug-in (scoring-opic)
11/08/01 15:17:49 INFO plugin.PluginRepository: CyberNeko HTML Parser (lib-nekohtml)
11/08/01 15:17:49 INFO plugin.PluginRepository: Anchor Indexing Filter (index-anchor)
11/08/01 15:17:49 INFO plugin.PluginRepository: Regex URL Filter Framework (lib-regex-filter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Extension-Points:
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch URL Normalizer (org.apache.nutch.net.URLNormalizer)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Protocol (org.apache.nutch.protocol.Protocol)
11/08/01 15:17:49 INFO plugin.PluginRepository: Parse Filter (org.apache.nutch.parse.ParseFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch URL Filter (org.apache.nutch.net.URLFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Indexing Filter (org.apache.nutch.indexer.IndexingFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Content Parser (org.apache.nutch.parse.Parser)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Scoring (org.apache.nutch.scoring.ScoringFilter)
11/08/01 15:17:50 INFO conf.Configuration: found resource regex-normalize.xml at file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-normalize.xml
11/08/01 15:17:50 INFO conf.Configuration: found resource regex-urlfilter.txt at file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-urlfilter.txt
11/08/01 15:17:50 INFO regex.RegexURLNormalizer: can't find rules for scope 'inject', using default
11/08/01 15:17:50 INFO mapred.JobClient: map 0% reduce 0%
11/08/01 15:17:51 INFO mapred.TaskRunner: Task:attempt_local_0001_m_00_0 is done. And is in the process of commiting
11/08/01 15:17:51 INFO mapred.LocalJobRunner:
11/08/01 15:17:51 INFO mapred.TaskRunner: Task 'attempt_local_0001_m_00_0' done.
11/08/01 15:17:52 INFO mapred.JobClient: map 100% reduce 0%
11/08/01 15:17:52 INFO mapred.JobClient: Job complete: job_local_0001
11/08/01 15:17:52 INFO mapred.JobClient: Counters: 5
11/08/01 15:17:52 INFO mapred.JobClient: FileSystemCounters
11/08/01 15:17:52 INFO