Build failed in Jenkins: Nutch-trunk #1795

2012-03-22 Thread Apache Jenkins Server
See 

--
Started by timer
Building remotely on solaris1 in workspace 

hudson.util.IOException2: remote file operation failed: 
 at 
hudson.remoting.Channel@b71a03c:solaris1
at hudson.FilePath.act(FilePath.java:828)
at hudson.FilePath.act(FilePath.java:814)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:743)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:685)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1212)
at 
hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:579)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:468)
at hudson.model.Run.run(Run.java:1410)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:238)
Caused by: java.io.IOException: Remote call on solaris1 failed
at hudson.remoting.Channel.call(Channel.java:690)
at hudson.FilePath.act(FilePath.java:821)
... 10 more
Caused by: java.lang.LinkageError: duplicate class definition: 
hudson/model/Descriptor
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.lang.ClassLoader.defineClass(ClassLoader.java:466)
at 
hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:152)
at 
hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:131)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2259)
at java.lang.Class.getDeclaredField(Class.java:1852)
at 
java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1582)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:52)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:408)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.(ObjectStreamClass.java:400)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:297)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:531)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1699)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
at hudson.remoting.UserRequest.deserialize(UserRequest.java:182)
at hudson.remoting.UserRequest.perform(UserRequest.java:98)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:287)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
at java.util.concurrent.FutureTask.run(FutureTask.java:123)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:676)
at java.lang.Thread.run(Thread.java:595)
Retrying after 10 seconds
hudson.util.IOExcepti

Build failed in Jenkins: Nutch-nutchgora #204

2012-03-22 Thread Apache Jenkins Server
See 

--
Started by timer
Building remotely on solaris1 in workspace 

hudson.util.IOException2: remote file operation failed: 
 at 
hudson.remoting.Channel@b71a03c:solaris1
at hudson.FilePath.act(FilePath.java:828)
at hudson.FilePath.act(FilePath.java:814)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:743)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:685)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1212)
at 
hudson.model.AbstractBuild$AbstractRunner.checkout(AbstractBuild.java:579)
at hudson.model.AbstractBuild$AbstractRunner.run(AbstractBuild.java:468)
at hudson.model.Run.run(Run.java:1410)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:238)
Caused by: java.io.IOException: Remote call on solaris1 failed
at hudson.remoting.Channel.call(Channel.java:690)
at hudson.FilePath.act(FilePath.java:821)
... 10 more
Caused by: java.lang.LinkageError: duplicate class definition: 
hudson/model/Descriptor
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.lang.ClassLoader.defineClass(ClassLoader.java:466)
at 
hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:152)
at 
hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:131)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2259)
at java.lang.Class.getDeclaredField(Class.java:1852)
at 
java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1582)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:52)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:408)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.(ObjectStreamClass.java:400)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:297)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:531)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at 
java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1699)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at 
java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at 
java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
at hudson.remoting.UserRequest.deserialize(UserRequest.java:182)
at hudson.remoting.UserRequest.perform(UserRequest.java:98)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:287)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
at java.util.concurrent.FutureTask.run(FutureTask.java:123)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:676)
at java.lang.Thread.run(Thread.java:595)
Retrying after 10 seconds
hudson.uti

[jira] [Commented] (NUTCH-809) Parse-metatags plugin

2012-03-22 Thread Julien Nioche (Commented) (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13235468#comment-13235468
 ] 

Julien Nioche commented on NUTCH-809:
-

Hi Lewis

bq. Can you confirm what you would like to see added to the wiki? I will try 
my best to get this added. Are you referring to the [0]? 

Nope. I meant replacing the wiki page written by Elizabeth with instructions on 
what to do to get the metatags parsed and indexed. What I committed relies on 
another plugin for indexing metadata whereas the old one had its own indexer 
etc...

bq. Also I thought the best thing to do regarding porting to Nutchgora is just 
to add it to the ever growing NUTCH-1104 list, so I have done so. If and when 
this is required over there someone can duly oblige

good thinking

bq. Regarding adding fields to Solr I assume you mean schema and 
solr-mapping.xml?

yes, this will be needed if we want this to be on by default which I think is a 
good idea

bq. Finally, can you expand on 'activate by default'? What exactly is it that 
is not activated by default? I read your README.txt but I can't see any mention 
of it in there.

Plugins have to be listed in plugin.includes in order to be used. Thinking 
about it, it would be good to declare a dependency on index-metatags so that 
the latter is activated automatically (assuming plugin.auto-activation = true).

Thanks

Julien
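
A minimal sketch of the activation check described above, assuming that the 
plugin.includes value is a regular expression that a plugin id must match in 
full. The class name, the method name and the example expression are 
hypothetical and are not taken from the Nutch code base or its default 
configuration.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Nutch: shows how a plugin id could be
// tested against a plugin.includes style regular expression.
class PluginIncludesSketch {

    // Returns true when the whole plugin id matches the expression.
    static boolean isIncluded(String pluginIncludes, String pluginId) {
        return Pattern.compile(pluginIncludes).matcher(pluginId).matches();
    }

    public static void main(String[] args) {
        // Illustrative value only; a real nutch-site.xml lists many more plugins.
        String includes = "protocol-http|parse-(html|metatags)|index-(basic|metatags)";
        String[] ids = {"parse-metatags", "index-metatags", "parse-tika"};
        for (String id : ids) {
            System.out.println(id + " -> " + isIncluded(includes, id));
        }
    }
}
```

A plugin that is not matched by the expression would only be loaded if another 
active plugin declares a dependency on it and auto-activation is enabled, which 
is the behaviour the comment above relies on.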


> Parse-metatags plugin
> -
>
> Key: NUTCH-809
> URL: https://issues.apache.org/jira/browse/NUTCH-809
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Affects Versions: 1.4, nutchgora
>Reporter: Julien Nioche
>Assignee: Julien Nioche
> Fix For: 1.5
>
> Attachments: NUTCH-809-trunk.patch, NUTCH-809.patch, 
> NUTCH-809_metatags_1.3.patch, metatags-plugin+tutorial.zip
>
>
> h2. Parse-metatags plugin
> The parse-metatags plugin consists of a HTMLParserFilter which takes as 
> parameter a list of metatag names with '*' as default value. The values are 
> separated by ';'.
> In order to extract the values of the metatags description and keywords, you 
> must specify in nutch-site.xml
> {code:xml}
> <property>
>   <name>metatags.names</name>
>   <value>description;keywords</value>
> </property>
> {code}
> The MetatagIndexer uses the output of the parsing above to create two fields 
> 'keywords' and 'description'. Note that keywords is multivalued.
> The query-basic plugin is used to include these fields in the search e.g. in 
> nutch-site.xml
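> {code:xml}
> <property>
>   <name>query.basic.description.boost</name>
>   <value>2.0</value>
> </property>
> <property>
>   <name>query.basic.keywords.boost</name>
>   <value>2.0</value>
> </property>
> {code}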
> This code has been developed by DigitalPebble Ltd and offered to the 
> community by ANT.com
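
The handling of the metatags.names value described above can be sketched as 
follows. This is an illustration only, not the plugin's actual code: the class 
and method names are hypothetical, and trimming and lower-casing the names is 
an assumption made for the example.

```java
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not the plugin's implementation.
class MetatagNamesSketch {

    // Splits a ';'-separated value such as "description;keywords" into names.
    // Trimming and lower-casing are assumptions made for this sketch.
    static Set<String> parseNames(String value) {
        Set<String> names = new LinkedHashSet<>();
        for (String part : value.split(";")) {
            String name = part.trim().toLowerCase(Locale.ROOT);
            if (!name.isEmpty()) {
                names.add(name);
            }
        }
        return names;
    }

    // '*' is treated as "keep every metatag", mirroring the default value.
    static boolean isWanted(Set<String> names, String metatag) {
        return names.contains("*") || names.contains(metatag.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        Set<String> names = parseNames("description;keywords");
        System.out.println(isWanted(names, "Keywords"));
        System.out.println(isWanted(names, "author"));
        System.out.println(isWanted(parseNames("*"), "author"));
    }
}
```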

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira




[jira] [Updated] (NUTCH-1319) HostNormalizer

2012-03-22 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1319:
-

Patch Info: Patch Available

> HostNormalizer
> --
>
> Key: NUTCH-1319
> URL: https://issues.apache.org/jira/browse/NUTCH-1319
> Project: Nutch
>  Issue Type: New Feature
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.5
>
> Attachments: NUTCH-1319-1.5-1.patch
>
>
> Nutch would benefit from having a host normalizer. A host normalizer maps a 
> given host to the desired host. A basic example is to map www.apache.org to 
> apache.org. The Apache website is one of many on the internet that has a 
> duplicate website on the same domain just because it allows both www and 
> non-www to return HTTP 200 and proper content.
> It is also able to handle wildcards such as *.example.org to example.org if 
> there are multiple sub domains that actually point to the same website.
> Large internet crawls tend to get polluted very quickly due to these 
> problems. It also leads to skewed scores in the webgraph as different 
> websites link to different versions of the same duplicate website.
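
The mapping described in the issue can be illustrated with a small sketch. It 
is not the code in the attached patch: the class name, the rule format and the 
suffix-walking strategy for wildcards are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a host normalizer, not the attached patch.
class HostNormalizerSketch {
    // Exact rules: source host -> desired host.
    private final Map<String, String> exact = new HashMap<>();
    // Wildcard rules: domain suffix (rule without the leading "*.") -> desired host.
    private final Map<String, String> wildcard = new HashMap<>();

    void addRule(String from, String to) {
        if (from.startsWith("*.")) {
            wildcard.put(from.substring(2), to);
        } else {
            exact.put(from, to);
        }
    }

    // Returns the desired host, or the input host when no rule applies.
    String normalize(String host) {
        String target = exact.get(host);
        if (target != null) {
            return target;
        }
        // Strip one leading label at a time and try the wildcard rules:
        // a.b.example.org -> b.example.org -> example.org -> org
        int dot = host.indexOf('.');
        while (dot != -1) {
            target = wildcard.get(host.substring(dot + 1));
            if (target != null) {
                return target;
            }
            dot = host.indexOf('.', dot + 1);
        }
        return host;
    }

    public static void main(String[] args) {
        HostNormalizerSketch normalizer = new HostNormalizerSketch();
        normalizer.addRule("www.apache.org", "apache.org");
        normalizer.addRule("*.example.org", "example.org");
        System.out.println(normalizer.normalize("www.apache.org"));
        System.out.println(normalizer.normalize("a.b.example.org"));
        System.out.println(normalizer.normalize("nutch.apache.org"));
    }
}
```

Both rule types are plain hash lookups, so the cost per host stays small even 
with a large rule set.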





[jira] [Updated] (NUTCH-1319) HostNormalizer

2012-03-22 Thread Markus Jelsma (Updated) (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1319?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1319:
-

Attachment: NUTCH-1319-1.5-1.patch

Patch for 1.5.





Re: [jira] [Created] (NUTCH-1319) HostNormalizer

2012-03-22 Thread Mathijs Homminga
Hi Markus,

Right. I agree that a simple map implementation would help here.

Mathijs

On Mar 21, 2012, at 22:54 , Markus Jelsma wrote:

> Hi Mathijs,
> 
> We use this in the fetcher (parse=true) and when updating the CrawlDB and 
> with the free generator. We use it in the fetcher because we follow outlinks 
> and make sure we follow the desired host and in the CrawlDB because there we 
> update records for recently added host normalizer rules.
> 
> It is just a URL normalizer like the others, but it only changes the host 
> part. This is not covered by the other standard normalizers. The 
> BasicURLNormalizer cannot do this, and the RegexURLNormalizer is far too 
> heavy to take 20MB of expressions and is harder to auto-generate. A simple 
> map lookup is very fast.
> 
> Cheers,
> 
> On Wed, 21 Mar 2012 22:22:54 +0100, Mathijs Homminga 
>  wrote:
>> Hi Markus,
>> 
>> How (where in the process) would you like to use this normalizer? Isn't
>> this functionality already covered by the URL normalizer(s)?
>> 
>> Mathijs Homminga
>> 
>> On Mar 21, 2012, at 22:06, "Markus Jelsma (Created) (JIRA)"
>>  wrote:
>> 
>>> HostNormalizer
>>> --
>>> 
>>>Key: NUTCH-1319
>>>URL: https://issues.apache.org/jira/browse/NUTCH-1319
>>>Project: Nutch
>>> Issue Type: New Feature
>>>   Reporter: Markus Jelsma
>>>   Assignee: Markus Jelsma
>>>Fix For: 1.5
>>> 
>>> 
>>> Nutch would benefit from having a host normalizer. A host normalizer maps a 
>>> given host to the desired host. A basic example is to map www.apache.org to 
>>> apache.org. The Apache website is one of many on the internet that has a 
>>> duplicate website on the same domain just because it allows both www and 
>>> non-www to return HTTP 200 and proper content.
>>> 
>>> It is also able to handle wildcards such as *.example.org to example.org if 
>>> there are multiple sub domains that actually point to the same website.
>>> 
>>> Large internet crawls tend to get polluted very quickly due to these 
>>> problems. It also leads to skewed scores in the webgraph as different 
>>> websites link to different versions of the same duplicate website.
>>> 
>>> --
>>> This message is automatically generated by JIRA.
>>> If you think it was sent incorrectly, please contact your JIRA 
>>> administrators: 
>>> https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
>>> For more information on JIRA, see: http://www.atlassian.com/software/jira
>>> 
>>> 
>
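
The "URL normalizer that only changes the host part" idea from this thread can 
be sketched as below. This is an illustration under stated assumptions, not 
the proposed Nutch code: the class and method names are hypothetical, the rules 
are a plain exact-match map, and a URL that cannot be parsed is returned 
unchanged.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Collections;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: rewrites only the host of a URL using a map lookup.
class HostOnlyUrlNormalizerSketch {

    static String normalize(String url, Map<String, String> hostMap) {
        URI uri;
        try {
            uri = new URI(url);
        } catch (URISyntaxException e) {
            return url; // assumption: leave unparsable URLs untouched
        }
        String host = uri.getHost();
        if (host == null || uri.getScheme() == null) {
            return url;
        }
        String target = hostMap.get(host.toLowerCase(Locale.ROOT));
        if (target == null) {
            return url; // no rule for this host
        }
        // Rebuild from the raw components so existing escaping is preserved.
        StringBuilder sb = new StringBuilder();
        sb.append(uri.getScheme()).append("://");
        if (uri.getRawUserInfo() != null) {
            sb.append(uri.getRawUserInfo()).append('@');
        }
        sb.append(target);
        if (uri.getPort() != -1) {
            sb.append(':').append(uri.getPort());
        }
        if (uri.getRawPath() != null) {
            sb.append(uri.getRawPath());
        }
        if (uri.getRawQuery() != null) {
            sb.append('?').append(uri.getRawQuery());
        }
        if (uri.getRawFragment() != null) {
            sb.append('#').append(uri.getRawFragment());
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Map<String, String> rules = Collections.singletonMap("www.apache.org", "apache.org");
        System.out.println(normalize("http://www.apache.org/foo/bar.html?x=1", rules));
    }
}
```

Only the host is looked up and replaced; the scheme, port, path, query and 
fragment are copied through as they were.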