[jira] [Updated] (NUTCH-1075) Delegate language identification to Tika

2011-08-01 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1075?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1075:
-

Attachment: NUTCH-1075.patch

Passes the tests but requires some testing

> Delegate language identification to Tika
> 
>
> Key: NUTCH-1075
> URL: https://issues.apache.org/jira/browse/NUTCH-1075
> Project: Nutch
>  Issue Type: Improvement
>  Components: parser
>Affects Versions: 1.4
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.4
>
> Attachments: NUTCH-1075.patch
>
>
> In 2.0 the language identification is delegated to Tika and is done as part 
> of the parsing step (and not during the indexing as done currently).
> The patch attached is a backport from trunk which implements this and adds a 
> new parameter to determine the strategy to use
> {code:xml} 
> <property>
>   <name>lang.extraction.policy</name>
>   <value>detect,identify</value>
>   <description>This determines when the plugin uses detection and
>   statistical identification mechanisms. The order in which the
>   detect and identify are written will determine the extraction
>   policy. Default case (detect,identify)  means the plugin will
>   first try to extract language info from page headers and metadata,
>   if this is not successful it will try using tika language
>   identification. Possible values are:
>     detect
>     identify
>     detect,identify
>     identify,detect
>   </description>
> </property>
> {code} 

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Created] (NUTCH-1075) Delegate language identification to Tika

2011-08-01 Thread Julien Nioche (JIRA)
Delegate language identification to Tika


 Key: NUTCH-1075
 URL: https://issues.apache.org/jira/browse/NUTCH-1075
 Project: Nutch
  Issue Type: Improvement
  Components: parser
Affects Versions: 1.4
Reporter: Julien Nioche
Assignee: Julien Nioche
 Fix For: 1.4


In 2.0 the language identification is delegated to Tika and is done as part of 
the parsing step (and not during the indexing as done currently).
The patch attached is a backport from trunk which implements this and adds a 
new parameter to determine the strategy to use

{code:xml} 
<property>
  <name>lang.extraction.policy</name>
  <value>detect,identify</value>
  <description>This determines when the plugin uses detection and
  statistical identification mechanisms. The order in which the
  detect and identify are written will determine the extraction
  policy. Default case (detect,identify)  means the plugin will
  first try to extract language info from page headers and metadata,
  if this is not successful it will try using tika language
  identification. Possible values are:
    detect
    identify
    detect,identify
    identify,detect
  </description>
</property>
{code} 
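
As a rough illustration of how an ordered policy such as "detect,identify" could be interpreted, here is a minimal Java sketch. It is not taken from NUTCH-1075.patch: the class name LangPolicySketch, the Step enum, and the two Supplier arguments standing in for header/metadata detection and Tika-based statistical identification are all hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Optional;
import java.util.function.Supplier;

/** Hypothetical sketch of an ordered extraction policy; not the actual plugin code. */
final class LangPolicySketch {

    /** The two mechanisms named in the property description. */
    enum Step { DETECT, IDENTIFY }

    /** Parses a value such as "detect,identify" into an ordered list of steps.
     *  An unknown token makes Step.valueOf throw IllegalArgumentException. */
    static List<Step> parsePolicy(String value) {
        List<Step> steps = new ArrayList<>();
        for (String token : value.split(",")) {
            String t = token.trim();
            if (!t.isEmpty()) {
                steps.add(Step.valueOf(t.toUpperCase(Locale.ROOT)));
            }
        }
        return steps;
    }

    /** Tries each step in policy order and returns the first language found. */
    static Optional<String> extract(List<Step> steps,
                                    Supplier<Optional<String>> detect,
                                    Supplier<Optional<String>> identify) {
        for (Step step : steps) {
            Optional<String> lang = (step == Step.DETECT) ? detect.get() : identify.get();
            if (lang.isPresent()) {
                return lang;
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Assumed inputs: the page declares no language, the statistical step says "fr".
        Supplier<Optional<String>> fromHeaders = () -> Optional.empty();
        Supplier<Optional<String>> statistical = () -> Optional.of("fr");
        System.out.println(extract(parsePolicy("detect,identify"), fromHeaders, statistical));
    }
}
```

With "detect,identify" the statistical step only runs when the headers and metadata yield nothing; with "identify,detect" the order is reversed, and a single-value policy never falls back.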





[jira] [Commented] (NUTCH-1044) Redirected URLs and possibly all of their outlinked URLs have invalid scores.

2011-08-01 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13076043#comment-13076043
 ] 

Julien Nioche commented on NUTCH-1044:
--

Will commit soon if there aren't any objections

> Redirected URLs and possibly all of their outlinked URLs have invalid scores.
> -
>
> Key: NUTCH-1044
> URL: https://issues.apache.org/jira/browse/NUTCH-1044
> Project: Nutch
>  Issue Type: Bug
>  Components: fetcher, parser
>Affects Versions: 1.3
>Reporter: Nutch User - 1
>Assignee: Julien Nioche
>Priority: Critical
> Fix For: 1.4
>
> Attachments: NUTCH-1044-1.4.patch
>
>
> 1.: 
> http://lucene.472066.n3.nabble.com/URL-redirection-and-zero-scores-td3085311.html
> 2.: 
> http://lucene.472066.n3.nabble.com/A-possible-solution-to-my-URL-redirection-and-zero-scores-problem-td3162164.html
> Please note that URLs redirected by meta refresh also have invalid scores. 
> For such URLs a CrawlDatum is created on lines 157-177 of 
> ParseOutputFormat.java 
> (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/parse/ParseOutputFormat.java?view=markup).
> The new CrawlDatum's score isn't set anywhere after creation, so it's 
> 1.0f, as can be seen on line 122 of CrawlDatum.java 
> (http://svn.apache.org/viewvc/nutch/branches/branch-1.3/src/java/org/apache/nutch/crawl/CrawlDatum.java?view=markup).
> It's another question whether the redirected URL's score should simply be 
> passed to the new URL, or whether the redirection should be treated as a 
> link, in which case the new URL's score would be 'originalScore' / 
> ('numberOfOutlinks' + 1).
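
To make the two scoring options in the description concrete, here is a small Java sketch. It is only an illustration of the arithmetic, not code from NUTCH-1044-1.4.patch; the class and method names are hypothetical.

```java
/** Hypothetical illustration of the two options discussed for redirect targets. */
final class RedirectScoreSketch {

    /** Option 1: the redirect target simply inherits the original URL's score. */
    static float passThrough(float originalScore) {
        return originalScore;
    }

    /** Option 2: the redirect counts as one extra outlink, so the score is
     *  shared between the page's outlinks and the redirect target. */
    static float asLink(float originalScore, int numberOfOutlinks) {
        return originalScore / (numberOfOutlinks + 1);
    }

    public static void main(String[] args) {
        float originalScore = 1.0f;   // assumed example value
        int numberOfOutlinks = 4;     // assumed example value
        System.out.println("pass-through: " + passThrough(originalScore));
        System.out.println("as link:      " + asLink(originalScore, numberOfOutlinks));
    }
}
```

With a score of 1.0 and four outlinks, the first option gives the target 1.0 and the second gives it 1.0 / 5.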





Build failed in Jenkins: Nutch-trunk #1564

2011-08-01 Thread Apache Jenkins Server
See 

--
[...truncated 925 lines...]
A src/plugin/parse-tika/plugin.xml
A src/plugin/parse-tika/build.xml
A src/plugin/lib-regex-filter
A src/plugin/lib-regex-filter/ivy.xml
A src/plugin/lib-regex-filter/src
A src/plugin/lib-regex-filter/src/test
A src/plugin/lib-regex-filter/src/test/org
A src/plugin/lib-regex-filter/src/test/org/apache
A src/plugin/lib-regex-filter/src/test/org/apache/nutch
A src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter
A src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api
AU src/plugin/lib-regex-filter/src/test/org/apache/nutch/urlfilter/api/RegexURLFilterBaseTest.java
A src/plugin/lib-regex-filter/src/java
A src/plugin/lib-regex-filter/src/java/org
A src/plugin/lib-regex-filter/src/java/org/apache
A src/plugin/lib-regex-filter/src/java/org/apache/nutch
A src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter
A src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api
AU src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexRule.java
AU src/plugin/lib-regex-filter/src/java/org/apache/nutch/urlfilter/api/RegexURLFilterBase.java
AU src/plugin/lib-regex-filter/plugin.xml
AU src/plugin/lib-regex-filter/build.xml
A src/plugin/feed
A src/plugin/feed/sample
A src/plugin/feed/sample/rsstest.rss
A src/plugin/feed/ivy.xml
A src/plugin/feed/src
A src/plugin/feed/src/test
A src/plugin/feed/src/test/org
A src/plugin/feed/src/test/org/apache
A src/plugin/feed/src/test/org/apache/nutch
A src/plugin/feed/src/test/org/apache/nutch/parse
A src/plugin/feed/src/test/org/apache/nutch/parse/feed
A src/plugin/feed/src/test/org/apache/nutch/parse/feed/TestFeedParser.java
A src/plugin/feed/src/java
A src/plugin/feed/src/java/org
A src/plugin/feed/src/java/org/apache
A src/plugin/feed/src/java/org/apache/nutch
A src/plugin/feed/src/java/org/apache/nutch/parse
A src/plugin/feed/src/java/org/apache/nutch/parse/feed
A src/plugin/feed/src/java/org/apache/nutch/parse/feed/FeedParser.java
A src/plugin/feed/src/java/org/apache/nutch/indexer
A src/plugin/feed/src/java/org/apache/nutch/indexer/feed
A src/plugin/feed/src/java/org/apache/nutch/indexer/feed/FeedIndexingFilter.java
A src/plugin/feed/plugin.xml
A src/plugin/feed/build.xml
A src/plugin/subcollection
A src/plugin/subcollection/ivy.xml
A src/plugin/subcollection/src
A src/plugin/subcollection/src/test
A src/plugin/subcollection/src/test/org
A src/plugin/subcollection/src/test/org/apache
A src/plugin/subcollection/src/test/org/apache/nutch
A src/plugin/subcollection/src/test/org/apache/nutch/collection
A src/plugin/subcollection/src/test/org/apache/nutch/collection/TestSubcollection.java
A src/plugin/subcollection/src/java
A src/plugin/subcollection/src/java/org
A src/plugin/subcollection/src/java/org/apache
A src/plugin/subcollection/src/java/org/apache/nutch
A src/plugin/subcollection/src/java/org/apache/nutch/collection
A src/plugin/subcollection/src/java/org/apache/nutch/collection/Subcollection.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/CollectionManager.java
A src/plugin/subcollection/src/java/org/apache/nutch/collection/package.html
A src/plugin/subcollection/src/java/org/apache/nutch/indexer
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection
A src/plugin/subcollection/src/java/org/apache/nutch/indexer/subcollection/SubcollectionIndexingFilter.java
A src/plugin/subcollection/README.txt
A src/plugin/subcollection/plugin.xml
A src/plugin/subcollection/build.xml
A src/plugin/index-more
A src/plugin/index-more/ivy.xml
A src/plugin/index-more/src
A src/plugin/index-more/src/test
A src/plugin/index-more/src/test/org
A src/plugin/index-more/src/test/org/apache
A src/plugin/index-more/src/test/org/apache/nutch
A src/plugin/index-more/src/test/org/apache/nutch/indexer
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more
A src/plugin/index-more/src/test/org/apache/nutch/indexer/more/TestMoreIndexingFilter.java
A src/plugin/index-more/src/java
A src/plugin/index-more/src/java/org
A src/plugin/index-more/src/java/org/apache
A src/plugin/index-more/src/java/org/apache/nutch
A src/plugin/index-more/src/java/org/apache/nutch/indexer
A src/plugin/index-more/src/java/org/apache/nut

RE: Nutch 2 and Cassandra

2011-08-01 Thread Tom Davidson
OK... Are you running with a clustered version of Hadoop? I think you have to 
have your HADOOP_HOME env variable set. Otherwise it runs in local mode. I have 
been able to run in local mode, but not in deployed mode.


-Original Message-
From: Alexis [mailto:alexis.detregl...@gmail.com] 
Sent: Monday, August 01, 2011 3:25 PM
To: dev@nutch.apache.org
Subject: Re: Nutch 2 and Cassandra

Ok this version of hector was properly resolved. Thanks!

These are the logs:
~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject
~/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: urlDir:
/home/alex/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
11/08/01 15:17:46 INFO connection.CassandraHostRetryService: Downed
Host Retry service started with queue size -1 and retry delay 10s
11/08/01 15:17:46 INFO service.JmxMonitor: Registering JMX
me.prettyprint.cassandra.service_Test
Cluster:ServiceType=hector,MonitorType=hector
11/08/01 15:17:47 INFO store.CassandraClient: Keyspace 'webpage' in
cluster 'Test Cluster' was created on host 'localhost'
11/08/01 15:17:48 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapred.JobClient: Running job: job_local_0001
11/08/01 15:17:49 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapreduce.GoraRecordWriter:
gora.buffer.write.limit = 1
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-alex/hadoop-unjar8045717865743865180/plugins
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugin Auto-activation
mode: [true]
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Plugins:
11/08/01 15:17:49 INFO plugin.PluginRepository: the nutch core
extension points (nutch-extensionpoints)
11/08/01 15:17:49 INFO plugin.PluginRepository: Basic URL
Normalizer (urlnormalizer-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository: Basic Indexing
Filter (index-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository: Html Parse
Plug-in (parse-html)
11/08/01 15:17:49 INFO plugin.PluginRepository: HTTP Framework
(lib-http)
11/08/01 15:17:49 INFO plugin.PluginRepository: Pass-through
URL Normalizer (urlnormalizer-pass)
11/08/01 15:17:49 INFO plugin.PluginRepository: Regex URL
Filter (urlfilter-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository: Http Protocol
Plug-in (protocol-http)
11/08/01 15:17:49 INFO plugin.PluginRepository: Regex URL
Normalizer (urlnormalizer-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository: Tika Parser
Plug-in (parse-tika)
11/08/01 15:17:49 INFO plugin.PluginRepository: OPIC Scoring
Plug-in (scoring-opic)
11/08/01 15:17:49 INFO plugin.PluginRepository: CyberNeko HTML
Parser (lib-nekohtml)
11/08/01 15:17:49 INFO plugin.PluginRepository: Anchor
Indexing Filter (index-anchor)
11/08/01 15:17:49 INFO plugin.PluginRepository: Regex URL
Filter Framework (lib-regex-filter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Extension-Points:
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Protocol
(org.apache.nutch.protocol.Protocol)
11/08/01 15:17:49 INFO plugin.PluginRepository: Parse Filter
(org.apache.nutch.parse.ParseFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch URL
Filter (org.apache.nutch.net.URLFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Content
Parser (org.apache.nutch.parse.Parser)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
11/08/01 15:17:50 INFO conf.Configuration: found resource
regex-normalize.xml at
file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-normalize.xml
11/08/01 15:17:50 INFO conf.Configuration: found resource
regex-urlfilter.txt at
file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-urlfilter.txt
11/08/01 15:17:50 INFO regex.RegexURLNormalizer: can't find rules for
scope 'inject', using default
11/08/01 15:17:50 INFO mapred.JobClient:  map 0% reduce 0%
11/08/01 15:17:51 INFO mapred.TaskRunner:
Task:attempt_local_0001_m_00_0 is done. And is in the process of
commiting
11/08/01 15:17:51 INFO mapred.LocalJobRunner:
11/08/01 15:17:51 INFO mapred.TaskRunner: Task
'attempt_local_0001_m_00_0' done.
11/08/01 15:17:52 INFO mapred.JobClient:  map 100% reduce 0%
11/08/01 15:17:52 INFO mapred.JobClient: Job complete: job_local_0001
11/08/01 15:17:52 INFO mapred.JobClient: Counters: 5
11/08/01 15:17:52 INFO mapred.JobClient:   FileSystemCounters
11/08/01 15:17:52 INFO map

Re: Nutch 2 and Cassandra

2011-08-01 Thread Alexis
Ok this version of hector was properly resolved. Thanks!

These are the logs:
~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject
~/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 15:17:45 INFO crawl.InjectorJob: InjectorJob: urlDir:
/home/alex/java/workspace/Nutch/seeds
11/08/01 15:17:45 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
11/08/01 15:17:46 INFO connection.CassandraHostRetryService: Downed
Host Retry service started with queue size -1 and retry delay 10s
11/08/01 15:17:46 INFO service.JmxMonitor: Registering JMX
me.prettyprint.cassandra.service_Test
Cluster:ServiceType=hector,MonitorType=hector
11/08/01 15:17:47 INFO store.CassandraClient: Keyspace 'webpage' in
cluster 'Test Cluster' was created on host 'localhost'
11/08/01 15:17:48 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapred.JobClient: Running job: job_local_0001
11/08/01 15:17:49 INFO input.FileInputFormat: Total input paths to process : 1
11/08/01 15:17:49 INFO mapreduce.GoraRecordWriter:
gora.buffer.write.limit = 1
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugins: looking in:
/tmp/hadoop-alex/hadoop-unjar8045717865743865180/plugins
11/08/01 15:17:49 INFO plugin.PluginRepository: Plugin Auto-activation
mode: [true]
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Plugins:
11/08/01 15:17:49 INFO plugin.PluginRepository: the nutch core
extension points (nutch-extensionpoints)
11/08/01 15:17:49 INFO plugin.PluginRepository: Basic URL
Normalizer (urlnormalizer-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository: Basic Indexing
Filter (index-basic)
11/08/01 15:17:49 INFO plugin.PluginRepository: Html Parse
Plug-in (parse-html)
11/08/01 15:17:49 INFO plugin.PluginRepository: HTTP Framework
(lib-http)
11/08/01 15:17:49 INFO plugin.PluginRepository: Pass-through
URL Normalizer (urlnormalizer-pass)
11/08/01 15:17:49 INFO plugin.PluginRepository: Regex URL
Filter (urlfilter-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository: Http Protocol
Plug-in (protocol-http)
11/08/01 15:17:49 INFO plugin.PluginRepository: Regex URL
Normalizer (urlnormalizer-regex)
11/08/01 15:17:49 INFO plugin.PluginRepository: Tika Parser
Plug-in (parse-tika)
11/08/01 15:17:49 INFO plugin.PluginRepository: OPIC Scoring
Plug-in (scoring-opic)
11/08/01 15:17:49 INFO plugin.PluginRepository: CyberNeko HTML
Parser (lib-nekohtml)
11/08/01 15:17:49 INFO plugin.PluginRepository: Anchor
Indexing Filter (index-anchor)
11/08/01 15:17:49 INFO plugin.PluginRepository: Regex URL
Filter Framework (lib-regex-filter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Registered Extension-Points:
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch URL
Normalizer (org.apache.nutch.net.URLNormalizer)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Protocol
(org.apache.nutch.protocol.Protocol)
11/08/01 15:17:49 INFO plugin.PluginRepository: Parse Filter
(org.apache.nutch.parse.ParseFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch URL
Filter (org.apache.nutch.net.URLFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Indexing
Filter (org.apache.nutch.indexer.IndexingFilter)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Content
Parser (org.apache.nutch.parse.Parser)
11/08/01 15:17:49 INFO plugin.PluginRepository: Nutch Scoring
(org.apache.nutch.scoring.ScoringFilter)
11/08/01 15:17:50 INFO conf.Configuration: found resource
regex-normalize.xml at
file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-normalize.xml
11/08/01 15:17:50 INFO conf.Configuration: found resource
regex-urlfilter.txt at
file:/tmp/hadoop-alex/hadoop-unjar8045717865743865180/regex-urlfilter.txt
11/08/01 15:17:50 INFO regex.RegexURLNormalizer: can't find rules for
scope 'inject', using default
11/08/01 15:17:50 INFO mapred.JobClient:  map 0% reduce 0%
11/08/01 15:17:51 INFO mapred.TaskRunner:
Task:attempt_local_0001_m_00_0 is done. And is in the process of
commiting
11/08/01 15:17:51 INFO mapred.LocalJobRunner:
11/08/01 15:17:51 INFO mapred.TaskRunner: Task
'attempt_local_0001_m_00_0' done.
11/08/01 15:17:52 INFO mapred.JobClient:  map 100% reduce 0%
11/08/01 15:17:52 INFO mapred.JobClient: Job complete: job_local_0001
11/08/01 15:17:52 INFO mapred.JobClient: Counters: 5
11/08/01 15:17:52 INFO mapred.JobClient:   FileSystemCounters
11/08/01 15:17:52 INFO mapred.JobClient: FILE_BYTES_READ=44872735
11/08/01 15:17:52 INFO mapred.JobClient: FILE_BYTES_WRITTEN=45245279
11/08/01 15:17:52 INFO mapred.JobClient:   Map-Reduce Framework
11/08/01 15:17:52 INFO mapred.JobClient: Map input records=3
11/08/01 15:17:52 INFO mapred.JobClient: Spilled Records=0
11/08/01 15:17:52 INFO mapred.JobClient: Map output records=3
11/08/01 15:17:52 INFO

RE: Nutch 2 and Cassandra

2011-08-01 Thread Tom Davidson
I did something similar to below to add the Cassandra dependencies. Note that I 
am getting NoSuchMethodErrors not ClassNotFoundExceptions. Can you add the 
hector jars to your nutch job jar and see what you get? I think I am one step 
ahead of you. BTW, I just added this line to get the hector dependency:



-Original Message-
From: Alexis [mailto:alexis.detregl...@gmail.com] 
Sent: Monday, August 01, 2011 2:28 PM
To: dev@nutch.apache.org
Subject: Re: Nutch 2 and Cassandra

Hi, libthrift is a dependency of cassandra-thrift, as listed here:
http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1

During Nutch build, you have to manually tweak the Ivy configuration depending 
on your choice of the Gora store, in this case Cassandra.
Basically you need to add all the dependencies listed there:
http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/ivy/ivy.xml?view=markup

Let's try to add to $NUTCH_HOME/ivy/ivy.xml the following dependencies and then 
let's rebuild Nutch (see attached patch):







$ ant clean
$ ant

In your case libthrift should now be downloaded by Ivy and then bundled into 
the nutch-2.0-dev.job file. I'm not sure how apache-cassandra and hector got 
included in your classpath...

Somehow we need to resolve as well:



I don't think the following 2 jars are in the default Maven repository, so they 
won't be downloaded; that's why they were commented out in the Gora Cassandra 
Ivy config (gora/trunk/gora-cassandra/ivy/ivy.xml)


Since hector jar is not found in my case I get:
~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject 
~/java/workspace/Nutch/seeds
11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: urlDir:
/home/alex/java/workspace/Nutch/seeds
11/08/01 14:18:42 INFO security.Groups: Group mapping 
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
11/08/01 14:18:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with 
processName=JobTracker, sessionId=
11/08/01 14:18:42 ERROR crawl.InjectorJob: InjectorJob: org.apache.gora.util.GoraException: java.lang.reflect.InvocationTargetException
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
... 12 more
Caused by: java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
at org.apache.gora.cassandra.store.CassandraStore.(CassandraStore.java:60)
... 18 more
Caused by: java.lang.ClassNotFoundException: me.prettyprint.hector.api.Serializer
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 19 more




On Mon, Aug 1, 2011 at 11:59 AM, Tom Davidson  wrote:
> Hi All,
>
>
>
> I am kind of at my wit's end here, so I am hoping someone here can 
> help.  I am trying to use Nutch2 and Cassandra and I have been 
> successful using the runtime/local build. I am using the Cloudera CDH3 
> on CentOS 5 and I do not want to contaminate my hadoop install by 
> dropping in a bunch of Nutch jars, etc. So I am trying to use the 
> nutch-2-dev.job jar. When I try to use the nutc

Re: Nutch 2 and Cassandra

2011-08-01 Thread Alexis
Hi, libthrift is a dependency of cassandra-thrift, as listed here:
http://mvnrepository.com/artifact/org.apache.cassandra/cassandra-thrift/0.8.1

During Nutch build, you have to manually tweak the Ivy configuration
depending on your choice of the Gora store, in this case Cassandra.
Basically you need to add all the dependencies listed there:
http://svn.apache.org/viewvc/incubator/gora/trunk/gora-cassandra/ivy/ivy.xml?view=markup

Let's try to add to $NUTCH_HOME/ivy/ivy.xml the following dependencies
and then let's rebuild Nutch (see attached patch):







$ ant clean
$ ant

In your case libthrift should now be downloaded by Ivy and then
bundled into the nutch-2.0-dev.job file. I'm not sure how
apache-cassandra and hector got included in your classpath...

Somehow we need to resolve as well:



I don't think the following 2 jars are in the default Maven repository,
so they won't be downloaded; that's why they were commented out in the
Gora Cassandra Ivy config (gora/trunk/gora-cassandra/ivy/ivy.xml)


Since hector jar is not found in my case I get:
~/java/workspace/Nutch/trunk/runtime/deploy$ bin/nutch inject
~/java/workspace/Nutch/seeds
11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 14:18:42 INFO crawl.InjectorJob: InjectorJob: urlDir:
/home/alex/java/workspace/Nutch/seeds
11/08/01 14:18:42 INFO security.Groups: Group mapping
impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping;
cacheTimeout=30
11/08/01 14:18:42 INFO jvm.JvmMetrics: Initializing JVM Metrics with
processName=JobTracker, sessionId=
11/08/01 14:18:42 ERROR crawl.InjectorJob: InjectorJob: org.apache.gora.util.GoraException: java.lang.reflect.InvocationTargetException
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:69)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:192)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:513)
at org.apache.gora.util.ReflectionUtils.newInstance(ReflectionUtils.java:76)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:102)
... 12 more
Caused by: java.lang.NoClassDefFoundError: me/prettyprint/hector/api/Serializer
at org.apache.gora.cassandra.store.CassandraStore.(CassandraStore.java:60)
... 18 more
Caused by: java.lang.ClassNotFoundException: me.prettyprint.hector.api.Serializer
at java.net.URLClassLoader$1.run(URLClassLoader.java:202)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:190)
at java.lang.ClassLoader.loadClass(ClassLoader.java:306)
at java.lang.ClassLoader.loadClass(ClassLoader.java:247)
... 19 more




On Mon, Aug 1, 2011 at 11:59 AM, Tom Davidson  wrote:
> Hi All,
>
>
>
> I am kind of at my wit’s end here, so I am hoping someone here can help.  I
> am trying to use Nutch2 and Cassandra and I have been successful using the
> runtime/local build. I am using the Cloudera CDH3 on CentOS 5 and I do not
> want to contaminate my hadoop install by dropping in a bunch of Nutch jars,
> etc. So I am trying to use the nutch-2-dev.job jar. When I try to use the
> nutch2-dev.job jar, I get the error below.  I have double and triple checked
> the classpath and the included jars and the only jar that contains
> FieldValueMetaData is the libthrift-0.6.1.jar which has the method that is
> claimed to be missing. Any ideas?
>
>
>
> Thanks,
>
> Tom
>
>
>
>
>
>
>
>
>
> [tdavidson@nadevsan06 ~]$ bin/nutch inject urls
>
> /opt/jdk1.6.0_21/bin/java -Dproc_jar -Xmx1000m
> -Dhadoop.log.dir=/usr/lib/hadoop-0.20/logs -Dhadoop.log.file=hadoop.log
> -Dhadoop.home.dir=/usr/lib/hadoop-0.

Nutch 2 and Cassandra

2011-08-01 Thread Tom Davidson
Hi All,

I am kind of at my wit's end here, so I am hoping someone here can help.  I am 
trying to use Nutch2 and Cassandra and I have been successful using the 
runtime/local build. I am using the Cloudera CDH3 on CentOS 5 and I do not want 
to contaminate my hadoop install by dropping in a bunch of Nutch jars, etc. So 
I am trying to use the nutch-2-dev.job jar. When I try to use the 
nutch2-dev.job jar, I get the error below.  I have double and triple checked 
the classpath and the included jars and the only jar that contains 
FieldValueMetaData is the libthrift-0.6.1.jar which has the method that is 
claimed to be missing. Any ideas?

Thanks,
Tom
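
When a method is reported missing from a class that is supposedly on the classpath, one common check is to ask the JVM which location the class was actually loaded from. The Java sketch below does that with standard JDK calls only; the WhichJar class is hypothetical and not part of Nutch, Gora, or Hadoop, and it only tells you something useful if it runs with the same classpath as the failing job.

```java
import java.security.CodeSource;

/** Hypothetical diagnostic helper: prints where a class was loaded from. */
final class WhichJar {

    /** Returns the jar or directory a loaded class came from. */
    static String locationOf(Class<?> c) {
        CodeSource src = c.getProtectionDomain().getCodeSource();
        // Classes loaded by the bootstrap loader have no code source.
        return src == null ? "(bootstrap or unknown)" : String.valueOf(src.getLocation());
    }

    /** Looks a class up by fully qualified name without running its static initializers. */
    static String locationOf(String className) {
        try {
            return locationOf(Class.forName(className, false, WhichJar.class.getClassLoader()));
        } catch (ClassNotFoundException | LinkageError e) {
            return "not loadable: " + e;
        }
    }

    public static void main(String[] args) {
        // Pass the fully qualified class name to inspect as the first argument.
        String name = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(name + " -> " + locationOf(name));
    }
}
```

If the printed location is not the jar you expected, an older copy of the class elsewhere on the classpath is a likely suspect.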




[tdavidson@nadevsan06 ~]$ bin/nutch inject urls
/opt/jdk1.6.0_21/bin/java -Dproc_jar -Xmx1000m 
-Dhadoop.log.dir=/usr/lib/hadoop-0.20/logs -Dhadoop.log.file=hadoop.log 
-Dhadoop.home.dir=/usr/lib/hadoop-0.20 -Dhadoop.id.str=tdavidson 
-Dhadoop.root.logger=INFO,console 
-Djava.library.path=/usr/lib/hadoop-0.20/lib/native/Linux-amd64-64 
-Dhadoop.policy.file=hadoop-policy.xml -classpath 
/usr/lib/hadoop-0.20/conf:/opt/jdk1.6.0_21/lib/tools.jar:/usr/lib/hadoop-0.20:/usr/lib/hadoop-0.20/hadoop-core-0.20.2-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/ant-contrib-1.0b3.jar:/usr/lib/hadoop-0.20/lib/aspectjrt-1.6.5.jar:/usr/lib/hadoop-0.20/lib/aspectjtools-1.6.5.jar:/usr/lib/hadoop-0.20/lib/commons-cli-1.2.jar:/usr/lib/hadoop-0.20/lib/commons-codec-1.4.jar:/usr/lib/hadoop-0.20/lib/commons-daemon-1.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-el-1.0.jar:/usr/lib/hadoop-0.20/lib/commons-httpclient-3.0.1.jar:/usr/lib/hadoop-0.20/lib/commons-logging-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-logging-api-1.0.4.jar:/usr/lib/hadoop-0.20/lib/commons-net-1.4.1.jar:/usr/lib/hadoop-0.20/lib/core-3.1.1.jar:/usr/lib/hadoop-0.20/lib/hadoop-fairscheduler-0.20.2-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/hsqldb-1.8.0.10.jar:/usr/lib/hadoop-0.20/lib/hue-plugins-1.2.0-cdh3u1.jar:/usr/lib/hadoop-0.20/lib/jackson-core-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jackson-mapper-asl-1.5.2.jar:/usr/lib/hadoop-0.20/lib/jasper-compiler-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jasper-runtime-5.5.12.jar:/usr/lib/hadoop-0.20/lib/jets3t-0.6.1.jar:/usr/lib/hadoop-0.20/lib/jetty-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-servlet-tester-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jetty-util-6.1.26.jar:/usr/lib/hadoop-0.20/lib/jsch-0.1.42.jar:/usr/lib/hadoop-0.20/lib/junit-4.5.jar:/usr/lib/hadoop-0.20/lib/kfs-0.2.2.jar:/usr/lib/hadoop-0.20/lib/log4j-1.2.15.jar:/usr/lib/hadoop-0.20/lib/mockito-all-1.8.2.jar:/usr/lib/hadoop-0.20/lib/oro-2.0.8.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-20081211.jar:/usr/lib/hadoop-0.20/lib/servlet-api-2.5-6.1.14.jar:/usr/lib/hadoop-0.20/lib/slf4j-api-1.4.3.jar:/usr/lib/hadoop-0.20/lib/slf4j-log4j12-1.4.3.jar:/usr/lib/hadoop-0.20/lib/xmlenc-0.52.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-2.1.jar:/usr/lib/hadoop-0.20/lib/jsp-2.1/jsp-api-2.1.jar
 org.apache.hadoop.util.RunJar /home/SEMDIRECTOR/tdavidson/nutch-2.job 
org.apache.nutch.crawl.InjectorJob urls
11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: starting
11/08/01 11:51:54 INFO crawl.InjectorJob: InjectorJob: urlDir: urls
11/08/01 11:51:55 INFO connection.CassandraHostRetryService: Downed Host Retry 
service started with queue size -1 and retry delay 10s
11/08/01 11:51:55 INFO service.JmxMonitor: Registering JMX 
me.prettyprint.cassandra.service_Test 
Cluster:ServiceType=hector,MonitorType=hector
11/08/01 11:51:55 ERROR crawl.InjectorJob: InjectorJob: org.apache.gora.util.GoraException: java.lang.reflect.InvocationTargetException
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:110)
at org.apache.gora.store.DataStoreFactory.createDataStore(DataStoreFactory.java:93)
at org.apache.nutch.storage.StorageUtils.createWebStore(StorageUtils.java:59)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:243)
at org.apache.nutch.crawl.InjectorJob.inject(InjectorJob.java:268)
at org.apache.nutch.crawl.InjectorJob.run(InjectorJob.java:282)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.InjectorJob.main(InjectorJob.java:292)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:186)
Caused by: java.lang.reflect.InvocationTargetException
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:39)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:27)
at java.lang.reflect.Constructor.newInstance(Constructor.java:5

[jira] [Created] (NUTCH-1074) topN is ignored with maxNumSegments

2011-08-01 Thread Markus Jelsma (JIRA)
topN is ignored with maxNumSegments
---

 Key: NUTCH-1074
 URL: https://issues.apache.org/jira/browse/NUTCH-1074
 Project: Nutch
  Issue Type: Bug
  Components: generator
Affects Versions: 1.3
Reporter: Markus Jelsma
 Fix For: 1.4


When generating segments with topN and maxNumSegments, topN is not respected. 
It looks like the first generated segment contains topN * maxNumSegments 
URLs; at least, the number of map input records roughly matches that figure.
