Build failed in Jenkins: Nutch-nutchgora #403

2012-11-13 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-nutchgora/403/

--
Started by timer
Building remotely on solaris1 in workspace https://builds.apache.org/job/Nutch-nutchgora/ws/
hudson.util.IOException2: remote file operation failed: https://builds.apache.org/job/Nutch-nutchgora/ws/ at hudson.remoting.Channel@1ea860fb:solaris1
at hudson.FilePath.act(FilePath.java:838)
at hudson.FilePath.act(FilePath.java:824)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:743)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:685)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1256)
at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:589)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:494)
at hudson.model.Run.execute(Run.java:1502)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:236)
Caused by: java.io.IOException: Remote call on solaris1 failed
at hudson.remoting.Channel.call(Channel.java:673)
at hudson.FilePath.act(FilePath.java:831)
... 11 more
Caused by: java.lang.LinkageError: duplicate class definition: hudson/model/Descriptor
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.lang.ClassLoader.defineClass(ClassLoader.java:466)
at hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:152)
at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:131)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2259)
at java.lang.Class.getDeclaredField(Class.java:1852)
at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1582)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:52)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:408)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.init(ObjectStreamClass.java:400)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:297)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:531)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1699)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
at hudson.remoting.UserRequest.deserialize(UserRequest.java:182)
at hudson.remoting.UserRequest.perform(UserRequest.java:98)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:326)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
at java.util.concurrent.FutureTask.run(FutureTask.java:123)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
at 

Build failed in Jenkins: Nutch-trunk #2013

2012-11-13 Thread Apache Jenkins Server
See https://builds.apache.org/job/Nutch-trunk/2013/

--
Started by timer
Building remotely on solaris1 in workspace https://builds.apache.org/job/Nutch-trunk/ws/
hudson.util.IOException2: remote file operation failed: https://builds.apache.org/job/Nutch-trunk/ws/ at hudson.remoting.Channel@1ea860fb:solaris1
at hudson.FilePath.act(FilePath.java:838)
at hudson.FilePath.act(FilePath.java:824)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:743)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:685)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1256)
at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:589)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:494)
at hudson.model.Run.execute(Run.java:1502)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:46)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:236)
Caused by: java.io.IOException: Remote call on solaris1 failed
at hudson.remoting.Channel.call(Channel.java:673)
at hudson.FilePath.act(FilePath.java:831)
... 11 more
Caused by: java.lang.LinkageError: duplicate class definition: hudson/model/Descriptor
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.lang.ClassLoader.defineClass(ClassLoader.java:466)
at hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:152)
at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:131)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2259)
at java.lang.Class.getDeclaredField(Class.java:1852)
at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1582)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:52)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:408)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.init(ObjectStreamClass.java:400)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:297)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:531)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1699)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
at hudson.remoting.UserRequest.deserialize(UserRequest.java:182)
at hudson.remoting.UserRequest.perform(UserRequest.java:98)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:326)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
at java.util.concurrent.FutureTask.run(FutureTask.java:123)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
at 

Build failed in Jenkins: nutch-trunk-maven #492

2012-11-13 Thread Apache Jenkins Server
See https://builds.apache.org/job/nutch-trunk-maven/492/

--
Started by timer
Building remotely on solaris1 in workspace https://builds.apache.org/job/nutch-trunk-maven/ws/
hudson.util.IOException2: remote file operation failed: https://builds.apache.org/job/nutch-trunk-maven/ws/ at hudson.remoting.Channel@1ea860fb:solaris1
at hudson.FilePath.act(FilePath.java:838)
at hudson.FilePath.act(FilePath.java:824)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:743)
at hudson.scm.SubversionSCM.checkout(SubversionSCM.java:685)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1256)
at hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:589)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:88)
at hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:494)
at hudson.model.Run.execute(Run.java:1502)
at hudson.maven.MavenModuleSetBuild.run(MavenModuleSetBuild.java:477)
at hudson.model.ResourceController.execute(ResourceController.java:88)
at hudson.model.Executor.run(Executor.java:236)
Caused by: java.io.IOException: Remote call on solaris1 failed
at hudson.remoting.Channel.call(Channel.java:673)
at hudson.FilePath.act(FilePath.java:831)
... 11 more
Caused by: java.lang.LinkageError: duplicate class definition: hudson/model/Descriptor
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:621)
at java.lang.ClassLoader.defineClass(ClassLoader.java:466)
at hudson.remoting.RemoteClassLoader.loadClassFile(RemoteClassLoader.java:152)
at hudson.remoting.RemoteClassLoader.findClass(RemoteClassLoader.java:131)
at java.lang.ClassLoader.loadClass(ClassLoader.java:307)
at java.lang.ClassLoader.loadClass(ClassLoader.java:252)
at java.lang.ClassLoader.loadClassInternal(ClassLoader.java:320)
at java.lang.Class.getDeclaredFields0(Native Method)
at java.lang.Class.privateGetDeclaredFields(Class.java:2259)
at java.lang.Class.getDeclaredField(Class.java:1852)
at java.io.ObjectStreamClass.getDeclaredSUID(ObjectStreamClass.java:1582)
at java.io.ObjectStreamClass.access$700(ObjectStreamClass.java:52)
at java.io.ObjectStreamClass$2.run(ObjectStreamClass.java:408)
at java.security.AccessController.doPrivileged(Native Method)
at java.io.ObjectStreamClass.init(ObjectStreamClass.java:400)
at java.io.ObjectStreamClass.lookup(ObjectStreamClass.java:297)
at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:531)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1552)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1466)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1699)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:1910)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:1834)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1719)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1305)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:348)
at hudson.remoting.UserRequest.deserialize(UserRequest.java:182)
at hudson.remoting.UserRequest.perform(UserRequest.java:98)
at hudson.remoting.UserRequest.perform(UserRequest.java:48)
at hudson.remoting.Request$2.run(Request.java:326)
at hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:72)
at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
at java.util.concurrent.FutureTask.run(FutureTask.java:123)
at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:651)
at 

[jira] [Updated] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-13 Thread James Sullivan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Sullivan updated NUTCH-1497:
--

Patch Info:   (was: Patch Available)

 Better default gora-sql-mapping.xml with larger field sizes for MySQL
 -

 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
  Labels: MySQL
 Attachments: gora-mysql-mapping.xml


 The current generic default gora-sql-mapping.xml has field sizes that are too 
 small in almost all situations when used with MySQL. I have attached a 
 mapping that works better for MySQL (it takes slightly more space but can 
 handle the larger fields needed for real-world use). It includes the patch 
 from NUTCH-1490 and resolves the non-Unicode part of NUTCH-1473. I believe it 
 is not possible to use the same gora-sql-mapping for both hsqldb and MySQL 
 without settling for a significantly degraded lowest common denominator. Should 
 the user manually rename the attached file to gora-sql-mapping.xml, or is 
 there a way to have Nutch automatically use it when MySQL is selected in 
 other configuration files (ivy.xml or gora.properties)?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira


[jira] [Updated] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-13 Thread James Sullivan (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

James Sullivan updated NUTCH-1497:
--

Attachment: gora-mysql-mapping.xml

 Better default gora-sql-mapping.xml with larger field sizes for MySQL
 -

 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
  Labels: MySQL
 Attachments: gora-mysql-mapping.xml, gora-mysql-mapping.xml


 The current generic default gora-sql-mapping.xml has field sizes that are too 
 small in almost all situations when used with MySQL. I have attached a 
 mapping that works better for MySQL (it takes slightly more space but can 
 handle the larger fields needed for real-world use). It includes the patch 
 from NUTCH-1490 and resolves the non-Unicode part of NUTCH-1473. I believe it 
 is not possible to use the same gora-sql-mapping for both hsqldb and MySQL 
 without settling for a significantly degraded lowest common denominator. Should 
 the user manually rename the attached file to gora-sql-mapping.xml, or is 
 there a way to have Nutch automatically use it when MySQL is selected in 
 other configuration files (ivy.xml or gora.properties)?



[jira] [Commented] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-13 Thread James Sullivan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496122#comment-13496122
 ] 

James Sullivan commented on NUTCH-1497:
---

Nathan, I've made the changes to the lengths and uploaded the file. Could you 
check that it is correct? One note: I left the column as typ; although I agree 
it is odd, I thought consistency was more important.

 Better default gora-sql-mapping.xml with larger field sizes for MySQL
 -

 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
  Labels: MySQL
 Attachments: gora-mysql-mapping.xml, gora-mysql-mapping.xml


 The current generic default gora-sql-mapping.xml has field sizes that are too 
 small in almost all situations when used with MySQL. I have attached a 
 mapping that works better for MySQL (it takes slightly more space but can 
 handle the larger fields needed for real-world use). It includes the patch 
 from NUTCH-1490 and resolves the non-Unicode part of NUTCH-1473. I believe it 
 is not possible to use the same gora-sql-mapping for both hsqldb and MySQL 
 without settling for a significantly degraded lowest common denominator. Should 
 the user manually rename the attached file to gora-sql-mapping.xml, or is 
 there a way to have Nutch automatically use it when MySQL is selected in 
 other configuration files (ivy.xml or gora.properties)?



[jira] [Commented] (NUTCH-1497) Better default gora-sql-mapping.xml with larger field sizes for MySQL

2012-11-13 Thread James Sullivan (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13496131#comment-13496131
 ] 

James Sullivan commented on NUTCH-1497:
---

I agree one standard file for SQL databases would be preferable, but one example 
of why I couldn't stay with one file for both hsql and MySQL is that the text 
column was being turned into a blob, not text, at larger sizes.

 Better default gora-sql-mapping.xml with larger field sizes for MySQL
 -

 Key: NUTCH-1497
 URL: https://issues.apache.org/jira/browse/NUTCH-1497
 Project: Nutch
  Issue Type: Improvement
  Components: storage
Affects Versions: 2.2
 Environment: MySQL Backend
Reporter: James Sullivan
Priority: Minor
  Labels: MySQL
 Attachments: gora-mysql-mapping.xml, gora-mysql-mapping.xml


 The current generic default gora-sql-mapping.xml has field sizes that are too 
 small in almost all situations when used with MySQL. I have attached a 
 mapping that works better for MySQL (it takes slightly more space but can 
 handle the larger fields needed for real-world use). It includes the patch 
 from NUTCH-1490 and resolves the non-Unicode part of NUTCH-1473. I believe it 
 is not possible to use the same gora-sql-mapping for both hsqldb and MySQL 
 without settling for a significantly degraded lowest common denominator. Should 
 the user manually rename the attached file to gora-sql-mapping.xml, or is 
 there a way to have Nutch automatically use it when MySQL is selected in 
 other configuration files (ivy.xml or gora.properties)?



[jira] [Updated] (NUTCH-1495) -normalize and -filter for updatedb command in nutch 2.x

2012-11-13 Thread Nathan Gass (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1495?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Nathan Gass updated NUTCH-1495:
---

Attachment: patch-updatedb-normalize-filter-2012-11-13.txt

The attached patch shows where I currently stand.

-normalize basically works, and possible duplicate entries are handled similarly 
to Nutch 1.x (by taking the newest one).

I'm not at all sure whether this is enough or the best approach. Currently, 
fields like baseUrl are not changed. Should DbUpdater try to adapt them to the 
new URL (by applying the same normalizations)? What about the fetched content? 
Another approach could be to add a new empty entry, so updatedb -normalize would 
actually throw away already fetched and/or parsed content of URLs with new 
normalizations.

More testing is also necessary, but I'm waiting for comments on whether this 
approach is feasible at all before I continue working on this.
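
A minimal sketch of the duplicate handling described above, in plain Java with no Nutch or Gora dependencies. It is not taken from the attached patch: the class `NormalizeDedupSketch`, the `Entry` type, and the toy `normalize` rule (drop a fragment and one trailing slash) are all hypothetical stand-ins, and "newest" is assumed to mean the larger fetch time.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class NormalizeDedupSketch {

  /** Hypothetical stand-in for a stored row: a URL plus its fetch time. */
  static final class Entry {
    final String url;
    final long fetchTime;

    Entry(String url, long fetchTime) {
      this.url = url;
      this.fetchTime = fetchTime;
    }
  }

  /** Toy normalizer (assumption): drops a #fragment and one trailing slash. */
  static String normalize(String url) {
    int hash = url.indexOf('#');
    String u = hash >= 0 ? url.substring(0, hash) : url;
    return u.endsWith("/") ? u.substring(0, u.length() - 1) : u;
  }

  /** Re-keys entries by normalized URL; on a collision the newest entry wins. */
  static Map<String, Entry> normalizeAll(List<Entry> entries) {
    Map<String, Entry> out = new LinkedHashMap<>();
    for (Entry e : entries) {
      Entry candidate = new Entry(normalize(e.url), e.fetchTime);
      // merge() stores the candidate if the key is new, otherwise keeps the newer one.
      out.merge(candidate.url, candidate,
          (old, cur) -> cur.fetchTime > old.fetchTime ? cur : old);
    }
    return out;
  }
}
```

The sketch only re-keys rows; it leaves open the questions raised above about fields such as baseUrl and about already fetched content.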

 -normalize and -filter for updatedb command in nutch 2.x
 

 Key: NUTCH-1495
 URL: https://issues.apache.org/jira/browse/NUTCH-1495
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 2.2
Reporter: Nathan Gass
 Attachments: patch-updatedb-normalize-filter-2012-11-09.txt, 
 patch-updatedb-normalize-filter-2012-11-13.txt


 As far as I can see, in Nutch 1.x you could change your URL filters and 
 normalizers during the crawl and update the db using crawldb -normalize 
 -filter. There does not seem to be a way to achieve the same in Nutch 2.x.
 Anyway, I went ahead and tried to implement -normalize and -filter for the 
 Nutch 2.x updatedb command. I have no experience with any of the technologies 
 used, including Java, so please check the attached code carefully 
 before using it. I'm very interested to hear whether this is the right 
 approach, and in any other comments.



[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-13 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1370:


Attachment: NUTCH-1370-2.x-v2.patch

2nd WIP for 2.x. I'm having difficulty correctly implementing JobClient#runJob, 
as the currentJob param is not correct:
{code}
RunningJob mapJob = JobClient.runJob(currentJob);
{code}

@Seb,
Regarding your patch: it looks great and is much cleaner than my proposal. I've 
tested it and I'm +1 for committing.

 Expose exact number of urls injected @runtime 
 --

 Key: NUTCH-1370
 URL: https://issues.apache.org/jira/browse/NUTCH-1370
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6, 2.2

 Attachments: NUTCH-1370-1.x.patch, NUTCH-1370-2.x.patch, 
 NUTCH-1370-2.x-v2.patch


 Example: When using trunk, currently we see 
 {code}
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 2012-05-22 09:04:00
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
 {code}
 I would like to see
 {code}
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 2012-05-22 09:04:00
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Injected N urls to crawl/crawldb
 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
 {code}
 This would make debugging easier and would help those who end up getting 
 {code}
 2012-05-22 09:04:04,850 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...
 {code}
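
 The requested log line could be produced along these lines. This is only a rough sketch in plain Java, not code from any of the attached patches: `InjectCountSketch`, `countInjected`, and `summary` are hypothetical names, and treating blank lines and lines starting with `#` as skipped is an assumption about the seed file format. The real injector would also apply URL filters and normalizers before counting.

```java
import java.util.List;

final class InjectCountSketch {

  /** Counts seed lines that would be injected (assumption: blanks and '#' comments are skipped). */
  static long countInjected(List<String> seedLines) {
    long n = 0;
    for (String line : seedLines) {
      String t = line.trim();
      if (!t.isEmpty() && !t.startsWith("#")) {
        n++;
      }
    }
    return n;
  }

  /** Formats the message in the shape proposed in this issue. */
  static String summary(long injected, String crawlDb) {
    return "Injector: Injected " + injected + " urls to " + crawlDb;
  }
}
```

 A zero count in this message would point at the seed list or the URL filters before the Generator warning shown above ever appears.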



[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-13 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1370:


Patch Info: Patch Available

 Expose exact number of urls injected @runtime 
 --

 Key: NUTCH-1370
 URL: https://issues.apache.org/jira/browse/NUTCH-1370
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6, 2.2

 Attachments: NUTCH-1370-1.x.patch, NUTCH-1370-2.x.patch, 
 NUTCH-1370-2.x-v2.patch


 Example: When using trunk, currently we see 
 {code}
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 2012-05-22 09:04:00
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
 {code}
 I would like to see
 {code}
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 2012-05-22 09:04:00
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Injected N urls to crawl/crawldb
 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
 {code}
 This would make debugging easier and would help those who end up getting 
 {code}
 2012-05-22 09:04:04,850 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...
 {code}



[jira] [Updated] (NUTCH-1117) JUnit test for index-anchor

2012-11-13 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney updated NUTCH-1117:


Attachment: NUTCH-1117.patch

Trivial patch for the test case. Thank you to both Ferdy and Markus for the info on 
manually simulating Inlinks insertion.

 JUnit test for index-anchor
 ---

 Key: NUTCH-1117
 URL: https://issues.apache.org/jira/browse/NUTCH-1117
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1117.patch


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.



[jira] [Resolved] (NUTCH-1117) JUnit test for index-anchor

2012-11-13 Thread Lewis John McGibbney (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1117?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lewis John McGibbney resolved NUTCH-1117.
-

Resolution: Fixed

Committed @revision 1408898 in trunk

 JUnit test for index-anchor
 ---

 Key: NUTCH-1117
 URL: https://issues.apache.org/jira/browse/NUTCH-1117
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.4
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6

 Attachments: NUTCH-1117.patch


 This issue is part of the larger attempt to provide a JUnit test case for 
 every Nutch plugin.



[jira] [Created] (NUTCH-1498) Make index-basic consistent in trunk and 2.x

2012-11-13 Thread Lewis John McGibbney (JIRA)
Lewis John McGibbney created NUTCH-1498:
---

 Summary: Make index-basic consistent in trunk and 2.x
 Key: NUTCH-1498
 URL: https://issues.apache.org/jira/browse/NUTCH-1498
 Project: Nutch
  Issue Type: Bug
  Components: indexer
Affects Versions: 2.2
Reporter: Lewis John McGibbney
Priority: Minor
 Fix For: 2.2


Currently the index-basic plugin supports more functionality in trunk than it 
does in 2.x. I see no reason why functionality shouldn't be made consistent.
For example 
- 2.x duplicates field values for host and site...
- trunk supports configuration options for indexer.add.domain and 
indexer.max.content.length whereas 2.x does not.
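
To illustrate what the two trunk options named above control, here is a rough, self-contained sketch that uses java.util.Properties in place of the Hadoop Configuration. It is not the plugin's code: the class and method names are hypothetical, and the defaults used here (-1 meaning "no limit", and false for adding the domain) are assumptions.

```java
import java.util.Properties;

final class IndexBasicConfigSketch {

  /** Truncates content to indexer.max.content.length; -1 (assumed default) means no limit. */
  static String truncateContent(String content, Properties conf) {
    int max = Integer.parseInt(conf.getProperty("indexer.max.content.length", "-1"));
    return (max > -1 && content.length() > max) ? content.substring(0, max) : content;
  }

  /** Reads indexer.add.domain; assumed to default to false. */
  static boolean addDomain(Properties conf) {
    return Boolean.parseBoolean(conf.getProperty("indexer.add.domain", "false"));
  }
}
```

Making 2.x consistent would mean having its index-basic plugin read the same two keys.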



Build failed in Jenkins: nutch-trunk-maven #493

2012-11-13 Thread Apache Jenkins Server
See https://builds.apache.org/job/nutch-trunk-maven/493/

--
[...truncated 1190 lines...]
AU src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
AU src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/package.html
A src/plugin/protocol-httpclient/jsp
AU src/plugin/protocol-httpclient/jsp/ntlm.jsp
AU src/plugin/protocol-httpclient/jsp/cookies.jsp
AU src/plugin/protocol-httpclient/jsp/noauth.jsp
AU src/plugin/protocol-httpclient/jsp/digest.jsp
AU src/plugin/protocol-httpclient/jsp/basic.jsp
AU src/plugin/protocol-httpclient/plugin.xml
AU src/plugin/protocol-httpclient/build.xml
A src/plugin/parse-metatags
A src/plugin/parse-metatags/sample
A src/plugin/parse-metatags/sample/testMetatags.html
A src/plugin/parse-metatags/ivy.xml
A src/plugin/parse-metatags/src
A src/plugin/parse-metatags/src/test
A src/plugin/parse-metatags/src/test/org
A src/plugin/parse-metatags/src/test/org/apache
A src/plugin/parse-metatags/src/test/org/apache/nutch
A src/plugin/parse-metatags/src/test/org/apache/nutch/parse
A src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html
A src/plugin/parse-metatags/src/test/org/apache/nutch/parse/html/TestMetatagParser.java
A src/plugin/parse-metatags/src/java
A src/plugin/parse-metatags/src/java/org
A src/plugin/parse-metatags/src/java/org/apache
A src/plugin/parse-metatags/src/java/org/apache/nutch
A src/plugin/parse-metatags/src/java/org/apache/nutch/parse
A src/plugin/parse-metatags/src/java/org/apache/nutch/parse/MetaTagsParser.java
A src/plugin/parse-metatags/README.txt
A src/plugin/parse-metatags/plugin.xml
A src/plugin/parse-metatags/build.xml
A src/plugin/urlfilter-domain
A src/plugin/urlfilter-domain/ivy.xml
A src/plugin/urlfilter-domain/src
A src/plugin/urlfilter-domain/src/test
A src/plugin/urlfilter-domain/src/test/org
A src/plugin/urlfilter-domain/src/test/org/apache
A src/plugin/urlfilter-domain/src/test/org/apache/nutch
A src/plugin/urlfilter-domain/src/test/org/apache/nutch/urlfilter
A src/plugin/urlfilter-domain/src/test/org/apache/nutch/urlfilter/domain
AU src/plugin/urlfilter-domain/src/test/org/apache/nutch/urlfilter/domain/TestDomainURLFilter.java
A src/plugin/urlfilter-domain/src/java
A src/plugin/urlfilter-domain/src/java/org
A src/plugin/urlfilter-domain/src/java/org/apache
A src/plugin/urlfilter-domain/src/java/org/apache/nutch
A src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter
A src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter/domain
AU src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter/domain/DomainURLFilter.java
AU src/plugin/urlfilter-domain/src/java/org/apache/nutch/urlfilter/domain/package.html
A src/plugin/urlfilter-domain/data
AU src/plugin/urlfilter-domain/data/hosts.txt
AU src/plugin/urlfilter-domain/plugin.xml
AU src/plugin/urlfilter-domain/build.xml
A src/plugin/protocol-http
A src/plugin/protocol-http/ivy.xml
A src/plugin/protocol-http/src
A src/plugin/protocol-http/src/test
A src/plugin/protocol-http/src/test/org
A src/plugin/protocol-http/src/test/org/apache
A src/plugin/protocol-http/src/test/org/apache/nutch
A src/plugin/protocol-http/src/test/org/apache/nutch/protocol
A src/plugin/protocol-http/src/test/org/apache/nutch/protocol/http
A src/plugin/protocol-http/src/java
A src/plugin/protocol-http/src/java/org
A src/plugin/protocol-http/src/java/org/apache
A src/plugin/protocol-http/src/java/org/apache/nutch
A src/plugin/protocol-http/src/java/org/apache/nutch/protocol
A src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http
AU src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/Http.java
AU src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/HttpResponse.java
AU src/plugin/protocol-http/src/java/org/apache/nutch/protocol/http/package.html
AU src/plugin/protocol-http/plugin.xml
AU src/plugin/protocol-http/build.xml
A pom.xml
A KEYS
AU README.txt
AU build.xml
 U .
At revision 1408944
no revision recorded for https://svn.apache.org/repos/asf/nutch/trunk in the previous build
Parsing POMs
[trunk] $ /home/hudson/tools/java/latest1.6/bin/java -cp /export/home/hudson/hudson-slave/maven-agent.jar:/export/home/hudson/hudson-slave/classworlds.jar hudson.maven.agent.Main /home/hudson/tools/maven/latest /zonestorage/hudson_solaris/home/hudson/hudson-slave/slave.jar

[jira] [Updated] (NUTCH-1370) Expose exact number of urls injected @runtime

2012-11-13 Thread Sebastian Nagel (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-1370:
---

Attachment: NUTCH-1370-2.x-v3.patch

Hi Lewis, yes, the 1.x patch is not easily transferred to 2.x because of the 
different (old vs. new) map reduce APIs. Here is a trial...
One question: the logged line "number of urls attempting to inject" suggests 
that there is a third count, "urls successfully injected" or similar. What's 
the intention with "attempting"?


 Expose exact number of urls injected @runtime 
 --

 Key: NUTCH-1370
 URL: https://issues.apache.org/jira/browse/NUTCH-1370
 Project: Nutch
  Issue Type: Improvement
  Components: injector
Affects Versions: nutchgora, 1.5
Reporter: Lewis John McGibbney
Assignee: Lewis John McGibbney
Priority: Minor
 Fix For: 1.6, 2.2

 Attachments: NUTCH-1370-1.x.patch, NUTCH-1370-2.x.patch, 
 NUTCH-1370-2.x-v2.patch, NUTCH-1370-2.x-v3.patch


 Example: When using trunk, currently we see 
 {code}
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 2012-05-22 09:04:00
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
 {code}
 I would like to see
 {code}
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: starting at 2012-05-22 09:04:00
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: crawlDb: crawl/crawldb
 2012-05-22 09:04:00,239 INFO  crawl.Injector - Injector: urlDir: urls
 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Injected N urls to crawl/crawldb
 2012-05-22 09:04:00,253 INFO  crawl.Injector - Injector: Converting injected urls to crawl db entries.
 2012-05-22 09:04:00,955 INFO  plugin.PluginRepository - Plugins: looking in:
 {code}
 This would make debugging easier and would help those who end up getting 
 {code}
 2012-05-22 09:04:04,850 WARN  crawl.Generator - Generator: 0 records selected for fetching, exiting ...
 {code}
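
 As a rough illustration of the requested log line, the sketch below tallies 
 seed URLs in plain Java and formats an "Injected N urls" message. It is not 
 taken from any of the attached patches: the class InjectCountSketch, the 
 Counts holder, and the accept() check are hypothetical stand-ins. The real 
 Injector runs as a map reduce job and applies its own URL normalizers and 
 filters. The sketch keeps two tallies (lines attempted vs. urls accepted) 
 only to show how the two counts can differ.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch; not part of Nutch or of the NUTCH-1370 patches. */
public class InjectCountSketch {

    /** Holds the two tallies: seed lines tried and urls actually accepted. */
    public static final class Counts {
        public final int attempted;
        public final int injected;

        Counts(int attempted, int injected) {
            this.attempted = attempted;
            this.injected = injected;
        }

        /** Formats the log message proposed in the issue description. */
        public String logLine(String crawlDb) {
            return String.format("Injector: Injected %d urls to %s", injected, crawlDb);
        }
    }

    /** Stand-in for URL filtering: accept only URIs with a scheme and a host. */
    static boolean accept(String url) {
        try {
            URI uri = new URI(url);
            return uri.getScheme() != null && uri.getHost() != null;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    /** Counts non-blank, non-comment seed lines and how many pass accept(). */
    public static Counts count(List<String> seedLines) {
        int attempted = 0;
        int injected = 0;
        for (String line : seedLines) {
            String url = line.trim();
            if (url.isEmpty() || url.startsWith("#")) {
                continue; // skip blanks and comments; they are not attempts
            }
            attempted++;
            if (accept(url)) {
                injected++;
            }
        }
        return new Counts(attempted, injected);
    }

    public static void main(String[] args) {
        List<String> seeds = Arrays.asList(
            "http://a.example/", "", "# comment", "ftp://b.example/", "not a url");
        Counts c = count(seeds);
        System.out.println("attempted=" + c.attempted + " injected=" + c.injected);
        System.out.println(c.logLine("crawl/crawldb"));
    }
}
```

 With both numbers logged, a run that ends in "0 records selected for 
 fetching" can be traced back to either an empty seed list (attempted is 0) 
 or to seeds that were all rejected (attempted is positive, injected is 0).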
