[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102322#comment-16102322
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

simoncpu commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-318189750
 
 
   Will try this patch while waiting for it to be merged into the official 
repo... thanks, man! :)
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.5
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-07-26 Thread Hudson (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16102106#comment-16102106
 ] 

Hudson commented on NUTCH-2368:
---

SUCCESS: Integrated in Jenkins build Nutch-trunk #3446 (See 
[https://builds.apache.org/job/Nutch-trunk/3446/])
NUTCH-2368 Variable generate.max.count and fetcher.server.delay (markus: 
[https://github.com/apache/nutch/commit/44f7ad973f2017bacde2bf5277f846179eafc6dd])
* (edit) src/java/org/apache/nutch/fetcher/FetchItemQueue.java
* (edit) conf/nutch-default.xml
* (edit) src/java/org/apache/nutch/crawl/Generator.java


> Variable generate.max.count and fetcher.server.delay
> 
>
> Key: NUTCH-2368
> URL: https://issues.apache.org/jira/browse/NUTCH-2368
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch
>
>
> In some cases we need to use host specific characteristics in determining 
> crawl speed and bulk sizes because with our (Openindex) settings we can just 
> recrawl host with up to 800k urls.
> This patch solves the problem by introducing the HostDB to the Generator and 
> providing powerful Jexl expressions. Check these two expressions added to the 
> Generator:
> {code}
> -Dgenerate.max.count.expr='
> if (unfetched + fetched > 80) {
>   return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 
> 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1)
> } else {
>   return conf.getDouble("generate.max.count", 300);
> }'
> -Dgenerate.fetch.delay.expr='
> if (unfetched + fetched > 80) {
>   return (pct95._rs_ + 500);
> } else {
>   return conf.getDouble("fetcher.server.delay", 1000)
> }'
> {code}
> For each large host: select as many records as possible that are possible to 
> fetch based on number of threads, 95th percentile response time of the fetch 
> limit. Or: queueMaxCount = (timelimit / resonsetime) * numThreads.
> The second expression just follows up to that, settings the crawlDelay of the 
> fetch queue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Jenkins build is back to normal : Nutch-trunk #3446

2017-07-26 Thread Apache Jenkins Server
See 




Build failed in Jenkins: Nutch-trunk #3445

2017-07-26 Thread Apache Jenkins Server
See 

--
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building remotely on cassandra14 (cassandra ubuntu) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/nutch.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:812)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1079)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1110)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:560)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:485)
at hudson.model.Run.execute(Run.java:1735)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:405)
Caused by: hudson.plugins.git.GitException: Command "git fetch --tags 
--progress https://github.com/apache/nutch.git 
+refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: fatal: unable to access 'https://github.com/apache/nutch.git/': Could 
not resolve host: github.com

at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1903)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1622)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$300(CliGitAPIImpl.java:71)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:348)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:153)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:146)
at hudson.remoting.UserRequest.perform(UserRequest.java:153)
at hudson.remoting.UserRequest.perform(UserRequest.java:50)
at hudson.remoting.Request$2.run(Request.java:336)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
at ..remote call to cassandra14(Native Method)
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1545)
at hudson.remoting.UserResponse.retrieve(UserRequest.java:253)
at hudson.remoting.Channel.call(Channel.java:830)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.execute(RemoteGitImpl.java:146)
at sun.reflect.GeneratedMethodAccessor864.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.invoke(RemoteGitImpl.java:132)
at com.sun.proxy.$Proxy104.execute(Unknown Source)
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:810)
... 11 more
ERROR: Error fetching remote repo 'origin'
Retrying after 10 seconds
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/nutch.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:812)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1079)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1110)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
at 

Build failed in Jenkins: Nutch-trunk #3444

2017-07-26 Thread Apache Jenkins Server
See 

--
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building remotely on cassandra14 (cassandra ubuntu) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/nutch.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:812)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1079)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1110)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:560)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:485)
at hudson.model.Run.execute(Run.java:1735)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:405)
Caused by: hudson.plugins.git.GitException: Command "git fetch --tags 
--progress https://github.com/apache/nutch.git 
+refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: fatal: unable to access 'https://github.com/apache/nutch.git/': Could 
not resolve host: github.com

at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1903)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1622)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$300(CliGitAPIImpl.java:71)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:348)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:153)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:146)
at hudson.remoting.UserRequest.perform(UserRequest.java:153)
at hudson.remoting.UserRequest.perform(UserRequest.java:50)
at hudson.remoting.Request$2.run(Request.java:336)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
at ..remote call to cassandra14(Native Method)
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1545)
at hudson.remoting.UserResponse.retrieve(UserRequest.java:253)
at hudson.remoting.Channel.call(Channel.java:830)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.execute(RemoteGitImpl.java:146)
at sun.reflect.GeneratedMethodAccessor864.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.invoke(RemoteGitImpl.java:132)
at com.sun.proxy.$Proxy104.execute(Unknown Source)
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:810)
... 11 more
ERROR: Error fetching remote repo 'origin'
Retrying after 10 seconds
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/nutch.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:812)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1079)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1110)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
at 

Build failed in Jenkins: Nutch-trunk #3443

2017-07-26 Thread Apache Jenkins Server
See 

--
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building remotely on cassandra14 (cassandra ubuntu) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/nutch.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:812)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1079)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1110)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:560)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:485)
at hudson.model.Run.execute(Run.java:1735)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:405)
Caused by: hudson.plugins.git.GitException: Command "git fetch --tags 
--progress https://github.com/apache/nutch.git 
+refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: fatal: unable to access 'https://github.com/apache/nutch.git/': Could 
not resolve host: github.com

at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1903)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1622)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$300(CliGitAPIImpl.java:71)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:348)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:153)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:146)
at hudson.remoting.UserRequest.perform(UserRequest.java:153)
at hudson.remoting.UserRequest.perform(UserRequest.java:50)
at hudson.remoting.Request$2.run(Request.java:336)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
at ..remote call to cassandra14(Native Method)
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1545)
at hudson.remoting.UserResponse.retrieve(UserRequest.java:253)
at hudson.remoting.Channel.call(Channel.java:830)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.execute(RemoteGitImpl.java:146)
at sun.reflect.GeneratedMethodAccessor864.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.invoke(RemoteGitImpl.java:132)
at com.sun.proxy.$Proxy104.execute(Unknown Source)
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:810)
... 11 more
ERROR: Error fetching remote repo 'origin'
Retrying after 10 seconds
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/nutch.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:812)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1079)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1110)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
at 

[jira] [Commented] (NUTCH-1129) Any23 Nutch plugin

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1129?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101795#comment-16101795
 ] 

ASF GitHub Bot commented on NUTCH-1129:
---

lewismc commented on issue #205: WIP: NUTCH-1129 microdata for Nutch 1.x
URL: https://github.com/apache/nutch/pull/205#issuecomment-318087705
 
 
   Hi @thilohaas this patch is too large for us to merge into Nutch master 
branch...
   Can you please separate our your code to implement Microdata support? We can 
then review that patch alone.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Any23 Nutch plugin
> --
>
> Key: NUTCH-1129
> URL: https://issues.apache.org/jira/browse/NUTCH-1129
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Minor
> Fix For: 2.5
>
> Attachments: NUTCH-1129.patch
>
>
> This plugin should build on the Any23 library to provide us with a plugin 
> which extracts RDF data from HTTP and file resources. Although as of writing 
> Any23 not part of the ASF, the project is working towards integration into 
> the Apache Incubator. Once the project proves its value, this would be an 
> excellent addition to the Nutch 1.X codebase. 



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2400) Solr 6.6.0 compatibility

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2400?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101788#comment-16101788
 ] 

ASF GitHub Bot commented on NUTCH-2400:
---

lewismc commented on issue #201: NUTCH-2400 Solr 6.6.0 compatibility
URL: https://github.com/apache/nutch/pull/201#issuecomment-318087147
 
 
   Any comments folks?
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Solr 6.6.0 compatibility
> 
>
> Key: NUTCH-2400
> URL: https://issues.apache.org/jira/browse/NUTCH-2400
> Project: Nutch
>  Issue Type: Improvement
>  Components: indexer
>Affects Versions: 1.13
> Environment: Nutch 1.14-SNAPSHOT Solr 6.6.0
>Reporter: Lewis John McGibbney
>Assignee: Lewis John McGibbney
>Priority: Blocker
> Fix For: 1.14
>
> Attachments: managed-schema
>
>
> This issue relates to following mailing list thread 
> http://www.mail-archive.com/user%40nutch.apache.org/msg15574.html
> The schema.xml upgrade works with Solr 6.6.0, please try it out and let me 
> know how things go.
> I've also updated the tutorial at https://wiki.apache.org/nutch/NutchTutorial 
> so please check that out as well.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101785#comment-16101785
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

lewismc closed pull request #195: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/195
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch, 
> NUTCH-1465.patch, NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101783#comment-16101783
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

lewismc commented on issue #202: NUTCH-1465 Support for sitemaps
URL: https://github.com/apache/nutch/pull/202#issuecomment-318086684
 
 
   @sebastian-nagel, Markus' patch made it into master branch... is this 
correct? If so then we can close this issue.
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch, 
> NUTCH-1465.patch, NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-1465) Support sitemaps in Nutch

2017-07-26 Thread ASF GitHub Bot (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101784#comment-16101784
 ] 

ASF GitHub Bot commented on NUTCH-1465:
---

lewismc closed pull request #189: NUTCH-1465 Support sitemaps in Nutch
URL: https://github.com/apache/nutch/pull/189
 
 
   
 

This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Support sitemaps in Nutch
> -
>
> Key: NUTCH-1465
> URL: https://issues.apache.org/jira/browse/NUTCH-1465
> Project: Nutch
>  Issue Type: New Feature
>  Components: parser
>Reporter: Lewis John McGibbney
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-1465.patch, NUTCH-1465.patch, NUTCH-1465.patch, 
> NUTCH-1465.patch, NUTCH-1465-sitemapinjector-trunk-v1.patch, 
> NUTCH-1465-trunk.v1.patch, NUTCH-1465-trunk.v2.patch, 
> NUTCH-1465-trunk.v3.patch, NUTCH-1465-trunk.v4.patch, 
> NUTCH-1465-trunk.v5.patch
>
>
> I recently came across this rather stagnant codebase[0] which is ASL v2.0 
> licensed and appears to have been used successfully to parse sitemaps as per 
> the discussion here[1].
> [0] http://sourceforge.net/projects/sitemap-parser/
> [1] 
> http://lucene.472066.n3.nabble.com/Support-for-Sitemap-Protocol-and-Canonical-URLs-td630060.html



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


Build failed in Jenkins: Nutch-trunk #3442

2017-07-26 Thread Apache Jenkins Server
See 

--
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building remotely on cassandra14 (cassandra ubuntu) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/nutch.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:812)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1079)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1110)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:560)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:485)
at hudson.model.Run.execute(Run.java:1735)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:405)
Caused by: hudson.plugins.git.GitException: Command "git fetch --tags 
--progress https://github.com/apache/nutch.git 
+refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: fatal: unable to access 'https://github.com/apache/nutch.git/': Could 
not resolve host: github.com

at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1903)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1622)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$300(CliGitAPIImpl.java:71)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:348)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:153)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:146)
at hudson.remoting.UserRequest.perform(UserRequest.java:153)
at hudson.remoting.UserRequest.perform(UserRequest.java:50)
at hudson.remoting.Request$2.run(Request.java:336)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
at ..remote call to cassandra14(Native Method)
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1545)
at hudson.remoting.UserResponse.retrieve(UserRequest.java:253)
at hudson.remoting.Channel.call(Channel.java:830)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.execute(RemoteGitImpl.java:146)
at sun.reflect.GeneratedMethodAccessor864.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.invoke(RemoteGitImpl.java:132)
at com.sun.proxy.$Proxy104.execute(Unknown Source)
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:810)
... 11 more
ERROR: Error fetching remote repo 'origin'
Retrying after 10 seconds
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/nutch.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:812)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1079)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1110)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
at 

Build failed in Jenkins: Nutch-trunk #3441

2017-07-26 Thread Apache Jenkins Server
See 

--
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building remotely on cassandra14 (cassandra ubuntu) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/nutch.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:812)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1079)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1110)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:560)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:485)
at hudson.model.Run.execute(Run.java:1735)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:405)
Caused by: hudson.plugins.git.GitException: Command "git fetch --tags 
--progress https://github.com/apache/nutch.git 
+refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: fatal: unable to access 'https://github.com/apache/nutch.git/': Could 
not resolve host: github.com

at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1903)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1622)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$300(CliGitAPIImpl.java:71)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:348)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:153)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:146)
at hudson.remoting.UserRequest.perform(UserRequest.java:153)
at hudson.remoting.UserRequest.perform(UserRequest.java:50)
at hudson.remoting.Request$2.run(Request.java:336)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
at ..remote call to cassandra14(Native Method)
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1545)
at hudson.remoting.UserResponse.retrieve(UserRequest.java:253)
at hudson.remoting.Channel.call(Channel.java:830)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.execute(RemoteGitImpl.java:146)
at sun.reflect.GeneratedMethodAccessor864.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.invoke(RemoteGitImpl.java:132)
at com.sun.proxy.$Proxy104.execute(Unknown Source)
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:810)
... 11 more
ERROR: Error fetching remote repo 'origin'
Retrying after 10 seconds
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/nutch.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:812)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1079)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1110)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
at 

Build failed in Jenkins: Nutch-trunk #3440

2017-07-26 Thread Apache Jenkins Server
See 

--
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building remotely on cassandra14 (cassandra ubuntu) in workspace 

 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/nutch.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:812)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1079)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1110)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:560)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:485)
at hudson.model.Run.execute(Run.java:1735)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:405)
Caused by: hudson.plugins.git.GitException: Command "git fetch --tags 
--progress https://github.com/apache/nutch.git 
+refs/heads/*:refs/remotes/origin/*" returned status code 128:
stdout: 
stderr: fatal: unable to access 'https://github.com/apache/nutch.git/': Could 
not resolve host: github.com

at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1903)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1622)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$300(CliGitAPIImpl.java:71)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:348)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:153)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:146)
at hudson.remoting.UserRequest.perform(UserRequest.java:153)
at hudson.remoting.UserRequest.perform(UserRequest.java:50)
at hudson.remoting.Request$2.run(Request.java:336)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
at ..remote call to cassandra14(Native Method)
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1545)
at hudson.remoting.UserResponse.retrieve(UserRequest.java:253)
at hudson.remoting.Channel.call(Channel.java:830)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.execute(RemoteGitImpl.java:146)
at sun.reflect.GeneratedMethodAccessor864.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.invoke(RemoteGitImpl.java:132)
at com.sun.proxy.$Proxy104.execute(Unknown Source)
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:810)
... 11 more
ERROR: Error fetching remote repo 'origin'
Retrying after 10 seconds
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/nutch.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:812)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1079)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1110)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
at 

Build failed in Jenkins: Nutch-trunk #3439

2017-07-26 Thread Apache Jenkins Server
See 

--
Started by an SCM change
[EnvInject] - Loading node environment variables.
Building remotely on cassandra14 (cassandra ubuntu) in workspace 

Cloning the remote Git repository
Cloning repository https://github.com/apache/nutch.git
 > git init  # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/*
ERROR: Error cloning remote repo 'origin'
hudson.plugins.git.GitException: Command "git fetch --tags --progress 
https://github.com/apache/nutch.git +refs/heads/*:refs/remotes/origin/*" 
returned status code 128:
stdout: 
stderr: fatal: unable to access 'https://github.com/apache/nutch.git/': Could 
not resolve host: github.com

at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandIn(CliGitAPIImpl.java:1903)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.launchCommandWithCredentials(CliGitAPIImpl.java:1622)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl.access$300(CliGitAPIImpl.java:71)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl$1.execute(CliGitAPIImpl.java:348)
at 
org.jenkinsci.plugins.gitclient.CliGitAPIImpl$2.execute(CliGitAPIImpl.java:545)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:153)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler$1.call(RemoteGitImpl.java:146)
at hudson.remoting.UserRequest.perform(UserRequest.java:153)
at hudson.remoting.UserRequest.perform(UserRequest.java:50)
at hudson.remoting.Request$2.run(Request.java:336)
at 
hudson.remoting.InterceptingExecutorService$1.call(InterceptingExecutorService.java:68)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
at ..remote call to cassandra14(Native Method)
at hudson.remoting.Channel.attachCallSiteStackTrace(Channel.java:1545)
at hudson.remoting.UserResponse.retrieve(UserRequest.java:253)
at hudson.remoting.Channel.call(Channel.java:830)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.execute(RemoteGitImpl.java:146)
at sun.reflect.GeneratedMethodAccessor864.invoke(Unknown Source)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.jenkinsci.plugins.gitclient.RemoteGitImpl$CommandInvocationHandler.invoke(RemoteGitImpl.java:132)
at com.sun.proxy.$Proxy142.execute(Unknown Source)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1070)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1110)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:560)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.run(AbstractBuild.java:485)
at hudson.model.Run.execute(Run.java:1735)
at hudson.model.FreeStyleBuild.run(FreeStyleBuild.java:43)
at hudson.model.ResourceController.execute(ResourceController.java:97)
at hudson.model.Executor.run(Executor.java:405)
ERROR: Error cloning remote repo 'origin'
Retrying after 10 seconds
 > git rev-parse --is-inside-work-tree # timeout=10
Fetching changes from the remote Git repository
 > git config remote.origin.url https://github.com/apache/nutch.git # timeout=10
Fetching upstream changes from https://github.com/apache/nutch.git
 > git --version # timeout=10
 > git fetch --tags --progress https://github.com/apache/nutch.git 
 > +refs/heads/*:refs/remotes/origin/*
ERROR: Error fetching remote repo 'origin'
hudson.plugins.git.GitException: Failed to fetch from 
https://github.com/apache/nutch.git
at hudson.plugins.git.GitSCM.fetchFrom(GitSCM.java:812)
at hudson.plugins.git.GitSCM.retrieveChanges(GitSCM.java:1079)
at hudson.plugins.git.GitSCM.checkout(GitSCM.java:1110)
at hudson.scm.SCM.checkout(SCM.java:495)
at hudson.model.AbstractProject.checkout(AbstractProject.java:1276)
at 
hudson.model.AbstractBuild$AbstractBuildExecution.defaultCheckout(AbstractBuild.java:560)
at jenkins.scm.SCMCheckoutStrategy.checkout(SCMCheckoutStrategy.java:86)
at 

[jira] [Resolved] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-07-26 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma resolved NUTCH-2368.
--
Resolution: Fixed

> Variable generate.max.count and fetcher.server.delay
> 
>
> Key: NUTCH-2368
> URL: https://issues.apache.org/jira/browse/NUTCH-2368
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch
>
>
> In some cases we need to use host specific characteristics in determining 
> crawl speed and bulk sizes because with our (Openindex) settings we can just 
> recrawl host with up to 800k urls.
> This patch solves the problem by introducing the HostDB to the Generator and 
> providing powerful Jexl expressions. Check these two expressions added to the 
> Generator:
> {code}
> -Dgenerate.max.count.expr='
> if (unfetched + fetched > 80) {
>   return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 
> 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1)
> } else {
>   return conf.getDouble("generate.max.count", 300);
> }'
> -Dgenerate.fetch.delay.expr='
> if (unfetched + fetched > 80) {
>   return (pct95._rs_ + 500);
> } else {
>   return conf.getDouble("fetcher.server.delay", 1000)
> }'
> {code}
> For each large host: select as many records as possible that are possible to 
> fetch based on number of threads, 95th percentile response time of the fetch 
> limit. Or: queueMaxCount = (timelimit / resonsetime) * numThreads.
> The second expression just follows up to that, settings the crawlDelay of the 
> fetch queue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-07-26 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101558#comment-16101558
 ] 

Markus Jelsma commented on NUTCH-2368:
--

Committed to master in 2de30d2e..44f7ad97  master -> master

> Variable generate.max.count and fetcher.server.delay
> 
>
> Key: NUTCH-2368
> URL: https://issues.apache.org/jira/browse/NUTCH-2368
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch
>
>
> In some cases we need to use host specific characteristics in determining 
> crawl speed and bulk sizes because with our (Openindex) settings we can just 
> recrawl host with up to 800k urls.
> This patch solves the problem by introducing the HostDB to the Generator and 
> providing powerful Jexl expressions. Check these two expressions added to the 
> Generator:
> {code}
> -Dgenerate.max.count.expr='
> if (unfetched + fetched > 80) {
>   return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 
> 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1)
> } else {
>   return conf.getDouble("generate.max.count", 300);
> }'
> -Dgenerate.fetch.delay.expr='
> if (unfetched + fetched > 80) {
>   return (pct95._rs_ + 500);
> } else {
>   return conf.getDouble("fetcher.server.delay", 1000)
> }'
> {code}
> For each large host: select as many records as possible that are possible to 
> fetch based on number of threads, 95th percentile response time of the fetch 
> limit. Or: queueMaxCount = (timelimit / resonsetime) * numThreads.
> The second expression just follows up to that, settings the crawlDelay of the 
> fetch queue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-07-26 Thread Markus Jelsma (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101557#comment-16101557
 ] 

Markus Jelsma commented on NUTCH-2368:
--

Hahahaha sure! Thanks Sebastian!

> Variable generate.max.count and fetcher.server.delay
> 
>
> Key: NUTCH-2368
> URL: https://issues.apache.org/jira/browse/NUTCH-2368
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch
>
>
> In some cases we need to use host specific characteristics in determining 
> crawl speed and bulk sizes because with our (Openindex) settings we can just 
> recrawl host with up to 800k urls.
> This patch solves the problem by introducing the HostDB to the Generator and 
> providing powerful Jexl expressions. Check these two expressions added to the 
> Generator:
> {code}
> -Dgenerate.max.count.expr='
> if (unfetched + fetched > 80) {
>   return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 
> 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1)
> } else {
>   return conf.getDouble("generate.max.count", 300);
> }'
> -Dgenerate.fetch.delay.expr='
> if (unfetched + fetched > 80) {
>   return (pct95._rs_ + 500);
> } else {
>   return conf.getDouble("fetcher.server.delay", 1000)
> }'
> {code}
> For each large host: select as many records as possible that are possible to 
> fetch based on number of threads, 95th percentile response time of the fetch 
> limit. Or: queueMaxCount = (timelimit / resonsetime) * numThreads.
> The second expression just follows up to that, settings the crawlDelay of the 
> fetch queue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Commented] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-07-26 Thread Sebastian Nagel (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16101552#comment-16101552
 ] 

Sebastian Nagel commented on NUTCH-2368:


+1

> Variable generate.max.count and fetcher.server.delay
> 
>
> Key: NUTCH-2368
> URL: https://issues.apache.org/jira/browse/NUTCH-2368
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch
>
>
> In some cases we need to use host specific characteristics in determining 
> crawl speed and bulk sizes because with our (Openindex) settings we can just 
> recrawl host with up to 800k urls.
> This patch solves the problem by introducing the HostDB to the Generator and 
> providing powerful Jexl expressions. Check these two expressions added to the 
> Generator:
> {code}
> -Dgenerate.max.count.expr='
> if (unfetched + fetched > 80) {
>   return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 
> 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1)
> } else {
>   return conf.getDouble("generate.max.count", 300);
> }'
> -Dgenerate.fetch.delay.expr='
> if (unfetched + fetched > 80) {
>   return (pct95._rs_ + 500);
> } else {
>   return conf.getDouble("fetcher.server.delay", 1000)
> }'
> {code}
> For each large host: select as many records as possible that are possible to 
> fetch based on number of threads, 95th percentile response time of the fetch 
> limit. Or: queueMaxCount = (timelimit / resonsetime) * numThreads.
> The second expression just follows up to that, settings the crawlDelay of the 
> fetch queue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


[jira] [Updated] (NUTCH-2368) Variable generate.max.count and fetcher.server.delay

2017-07-26 Thread Markus Jelsma (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-2368?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-2368:
-
Attachment: NUTCH-2368.patch

Of course, good point. Final patch, if still something is wrong, let's delete 
the entire issue. Will commit shortly.

> Variable generate.max.count and fetcher.server.delay
> 
>
> Key: NUTCH-2368
> URL: https://issues.apache.org/jira/browse/NUTCH-2368
> Project: Nutch
>  Issue Type: Improvement
>  Components: generator
>Affects Versions: 1.12
>Reporter: Markus Jelsma
>Assignee: Markus Jelsma
> Fix For: 1.14
>
> Attachments: NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, NUTCH-2368.patch, 
> NUTCH-2368.patch, NUTCH-2368.patch
>
>
> In some cases we need to use host specific characteristics in determining 
> crawl speed and bulk sizes because with our (Openindex) settings we can just 
> recrawl host with up to 800k urls.
> This patch solves the problem by introducing the HostDB to the Generator and 
> providing powerful Jexl expressions. Check these two expressions added to the 
> Generator:
> {code}
> -Dgenerate.max.count.expr='
> if (unfetched + fetched > 80) {
>   return (conf.getInt("fetcher.timelimit.mins", 12) * 60) / ((pct95._rs_ + 
> 500) / 1000) * conf.getInt("fetcher.threads.per.queue", 1)
> } else {
>   return conf.getDouble("generate.max.count", 300);
> }'
> -Dgenerate.fetch.delay.expr='
> if (unfetched + fetched > 80) {
>   return (pct95._rs_ + 500);
> } else {
>   return conf.getDouble("fetcher.server.delay", 1000)
> }'
> {code}
> For each large host: select as many records as possible that are possible to 
> fetch based on number of threads, 95th percentile response time of the fetch 
> limit. Or: queueMaxCount = (timelimit / resonsetime) * numThreads.
> The second expression just follows up to that, settings the crawlDelay of the 
> fetch queue.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)