[jira] [Created] (NUTCH-1801) Fix chain of dependencies between ANT tasks

2014-06-26 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1801:


 Summary: Fix chain of dependencies between ANT tasks
 Key: NUTCH-1801
 URL: https://issues.apache.org/jira/browse/NUTCH-1801
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.8
Reporter: Julien Nioche
 Fix For: 1.9


The chain of dependencies between ANT tasks needs fixing. The main issue is 
that the dependencies with a 'test' scope in Ivy are not resolved properly; or 
rather, the resolution task works fine but is not called from the upper-level 
'test' tasks. 
This can easily be reproduced by marking the junit dependency in ivy.xml as 
conf="test->default".
The 'test-core' task, for instance, relies on the 'job' task, which should not 
be the case.
Ideally we'd want a separate lib dir for the test dependencies so that they do 
not get copied into the job file, where they are absolutely not needed.
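
For illustration, a minimal sketch of the wiring this implies, with 
hypothetical target and property names rather than the actual Nutch build.xml: 
the test targets should depend on a resolution step for the 'test' Ivy conf 
instead of on 'job'.

{code:xml}
<!-- ivy.xml: junit is only needed for the 'test' conf,
     so it stays out of the job file -->
<dependency org="junit" name="junit" rev="4.11" conf="test->default"/>

<!-- build.xml: resolve the test conf before running the tests,
     rather than depending on the 'job' target -->
<target name="resolve-test" depends="init">
  <ivy:resolve file="${ivy.file}" conf="test"/>
  <ivy:retrieve pattern="${build.lib.dir}/[artifact]-[revision].[ext]"/>
</target>

<target name="test-core" depends="resolve-test, compile-core-test">
  <!-- junit invocation goes here -->
</target>
{code}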





[jira] [Updated] (NUTCH-1801) Fix chain of dependencies between ANT tasks

2014-06-26 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1801:
-

Description: 
The chain of dependencies between ANT tasks needs fixing. The main issue is 
that the dependencies with a 'test' scope in Ivy are not resolved properly; or 
rather, the resolution task works fine but is not called from the upper-level 
'test' tasks. 
This can easily be reproduced by marking the junit dependency in ivy.xml as 
conf="test->default".
Ideally we'd want a separate lib dir for the test dependencies so that they do 
not get copied into the job file, where they are absolutely not needed.

  was:
The chain of dependencies between ANT tasks needs fixing. The main issue is 
that the dependencies with a 'test' scope in Ivy are not resolved properly; or 
rather, the resolution task works fine but is not called from the upper-level 
'test' tasks. 
This can easily be reproduced by marking the junit dependency in ivy.xml as 
conf="test->default".
The 'test-core' task, for instance, relies on the 'job' task, which should not 
be the case.
Ideally we'd want a separate lib dir for the test dependencies so that they do 
not get copied into the job file, where they are absolutely not needed.


 Fix chain of dependencies between ANT tasks
 ---

 Key: NUTCH-1801
 URL: https://issues.apache.org/jira/browse/NUTCH-1801
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.8
Reporter: Julien Nioche
 Fix For: 1.9


 The chain of dependencies between ANT tasks needs fixing. The main issue is 
 that the dependencies with a 'test' scope in Ivy are not resolved properly; 
 or rather, the resolution task works fine but is not called from the 
 upper-level 'test' tasks. 
 This can easily be reproduced by marking the junit dependency in ivy.xml as 
 conf="test->default".
 Ideally we'd want a separate lib dir for the test dependencies so that they 
 do not get copied into the job file, where they are absolutely not needed.





[jira] [Created] (NUTCH-1802) Move TestbedProxy to test environment

2014-06-26 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1802:


 Summary: Move TestbedProxy to test environment 
 Key: NUTCH-1802
 URL: https://issues.apache.org/jira/browse/NUTCH-1802
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.8
Reporter: Julien Nioche


The proxy task relies on the test classpath but its code is in 
src/java/org/apache/nutch/tools/proxy. One of the benefits of moving it to 
tests is that its dependencies would not be shipped in the job file where they 
are not needed (e.g. servlet stuff). The Ant task would work as before.





[jira] [Created] (NUTCH-1803) Put test dependencies in a separate lib dir

2014-06-26 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1803:


 Summary: Put test dependencies in a separate lib dir
 Key: NUTCH-1803
 URL: https://issues.apache.org/jira/browse/NUTCH-1803
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.8
Reporter: Julien Nioche


See main issue [NUTCH-1801]. This would mean that these libs do not get 
included in the job file, which provides a cleaner separation.
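
As a rough sketch (directory layout assumed here, not final), the retrieve 
step could split the test conf out of the main lib dir:

{code:xml}
<!-- main dependencies go to build/lib, test-only ones to build/test/lib,
     so the 'job' target can package build/lib untouched -->
<ivy:retrieve pattern="${build.dir}/lib/[artifact]-[revision].[ext]" conf="default"/>
<ivy:retrieve pattern="${build.dir}/test/lib/[artifact]-[revision].[ext]" conf="test"/>
{code}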





[jira] [Updated] (NUTCH-1803) Put test dependencies in a separate lib dir

2014-06-26 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1803?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1803:
-

Attachment: NUTCH-1803.patch

Patch which handles the test dependencies in a separate directory from the 
main deps and fixes the order of dependencies so that the resolution of the 
test libs is done prior to testing.

 Put test dependencies in a separate lib dir
 ---

 Key: NUTCH-1803
 URL: https://issues.apache.org/jira/browse/NUTCH-1803
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.8
Reporter: Julien Nioche
 Fix For: 1.9

 Attachments: NUTCH-1803.patch


 See main issue [NUTCH-1801]. This would mean that these libs do not get 
 included in the job file, which provides a cleaner separation.





[jira] [Updated] (NUTCH-1801) Improve handling of test dependencies in ANT+Ivy

2014-06-26 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1801:
-

Description: 
The chain of dependencies between ANT tasks needs fixing. The main issue is 
that the dependencies with a 'test' scope in Ivy are not resolved properly; or 
rather, the resolution task works fine but is not called from the upper-level 
'test' tasks. This can easily be reproduced by marking the junit dependency in 
ivy.xml as conf="test->default".

Ideally we'd want a separate lib dir for the test dependencies so that they do 
not get copied into the job file, where they are absolutely not needed.

  was:
The chain of dependencies between ANT tasks needs fixing. The main issue is 
that the dependencies with a 'test' scope in Ivy are not resolved properly; or 
rather, the resolution task works fine but is not called from the upper-level 
'test' tasks. 
This can easily be reproduced by marking the junit dependency in ivy.xml as 
conf="test->default".
Ideally we'd want a separate lib dir for the test dependencies so that they do 
not get copied into the job file, where they are absolutely not needed.


 Improve handling of test dependencies in ANT+Ivy
 

 Key: NUTCH-1801
 URL: https://issues.apache.org/jira/browse/NUTCH-1801
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.8
Reporter: Julien Nioche
 Fix For: 1.9


 The chain of dependencies between ANT tasks needs fixing. The main issue is 
 that the dependencies with a 'test' scope in Ivy are not resolved properly; 
 or rather, the resolution task works fine but is not called from the 
 upper-level 'test' tasks. This can easily be reproduced by marking the junit 
 dependency in ivy.xml as conf="test->default".
 Ideally we'd want a separate lib dir for the test dependencies so that they 
 do not get copied into the job file, where they are absolutely not needed.





[jira] [Created] (NUTCH-1804) Move JUnit dependency to test scope

2014-06-26 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1804:


 Summary: Move JUnit dependency to test scope  
 Key: NUTCH-1804
 URL: https://issues.apache.org/jira/browse/NUTCH-1804
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.8
Reporter: Julien Nioche


Should work straight away with core tests after applying [NUTCH-1803], but it 
requires fixing the build for the plugins by either adding the main test 
dependencies to their classpath or forcing them to declare JUnit as a test 
dependency in their own ivy.xml. The latter is probably cleaner, but we need 
to make sure that the test dependencies do not get added to the built version 
of the plugin. 
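
If we go with the second option, each plugin's ivy.xml would carry something 
like the following sketch (conf names assumed; rev illustrative):

{code:xml}
<configurations>
  <conf name="default"/>
  <conf name="test" visibility="private"/>
</configurations>
<dependencies>
  <!-- visible on the plugin's test classpath only,
       never packaged with the built plugin -->
  <dependency org="junit" name="junit" rev="4.11" conf="test->default"/>
</dependencies>
{code}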





[jira] [Updated] (NUTCH-1801) Improve handling of test dependencies in ANT+Ivy

2014-06-26 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-1801?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-1801:
-

Summary: Improve handling of test dependencies in ANT+Ivy  (was: Fix chain 
of dependencies between ANT tasks)

 Improve handling of test dependencies in ANT+Ivy
 

 Key: NUTCH-1801
 URL: https://issues.apache.org/jira/browse/NUTCH-1801
 Project: Nutch
  Issue Type: Bug
  Components: build
Affects Versions: 1.8
Reporter: Julien Nioche
 Fix For: 1.9


 The chain of dependencies between ANT tasks needs fixing. The main issue is 
 that the dependencies with a 'test' scope in Ivy are not resolved properly; 
 or rather, the resolution task works fine but is not called from the 
 upper-level 'test' tasks. 
 This can easily be reproduced by marking the junit dependency in ivy.xml as 
 conf="test->default".
 Ideally we'd want a separate lib dir for the test dependencies so that they 
 do not get copied into the job file, where they are absolutely not needed.





[jira] [Commented] (NUTCH-1802) Move TestbedProxy to test environment

2014-06-26 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1802?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14044542#comment-14044542
 ] 

Julien Nioche commented on NUTCH-1802:
--

We'll just need to rename the main class to something not beginning with Test* 
so that we don't try to call the JUnit test bed on it.
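
(For context: the Ant junit target typically picks up test classes by 
filename pattern, along the lines of the sketch below; the exact fileset in 
Nutch's build.xml may differ.)

{code:xml}
<batchtest todir="${test.build.dir}">
  <!-- anything named Test* is treated as a JUnit test,
       which is why TestbedProxy would need renaming -->
  <fileset dir="${test.src.dir}" includes="**/Test*.java"/>
</batchtest>
{code}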

 Move TestbedProxy to test environment 
 --

 Key: NUTCH-1802
 URL: https://issues.apache.org/jira/browse/NUTCH-1802
 Project: Nutch
  Issue Type: Sub-task
  Components: build
Affects Versions: 1.8
Reporter: Julien Nioche
 Fix For: 1.9


 The proxy task relies on the test classpath but its code is in 
 src/java/org/apache/nutch/tools/proxy. One of the benefits of moving it to 
 tests is that its dependencies would not be shipped in the job file where 
 they are not needed (e.g. servlet stuff). The Ant task would work as before.





[jira] [Created] (NUTCH-1805) Remove unnecessary transitive dependencies from Hadoop core

2014-06-26 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1805:


 Summary: Remove unnecessary transitive dependencies from Hadoop 
core 
 Key: NUTCH-1805
 URL: https://issues.apache.org/jira/browse/NUTCH-1805
 Project: Nutch
  Issue Type: Improvement
  Components: build
Affects Versions: 1.8
Reporter: Julien Nioche


The Hadoop libs are not included in the job file, as a Hadoop cluster must 
already be available in order to use it; however, some of its transitive 
dependencies make it into the job file. We already prevent some but could 
extend that to:
<exclude org="org.mortbay.jetty"/>
<exclude org="com.sun.jersey"/>
<exclude org="tomcat"/>
Note that we need some of the Hadoop classes and dependencies in order to run 
Nutch in local mode.
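
In ivy.xml terms the proposal amounts to nested excludes on the Hadoop 
dependency, roughly as below (module name and rev are illustrative):

{code:xml}
<dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.0"
            conf="*->default">
  <!-- transitive deps not needed in the job file or in local mode -->
  <exclude org="org.mortbay.jetty"/>
  <exclude org="com.sun.jersey"/>
  <exclude org="tomcat"/>
</dependency>
{code}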

Alternatively we could have a separate Ivy profile only for Hadoop and store 
the dependencies in a separate location so that they do not get copied to the 
job jar; however, this is probably overkill if the dependencies above are not 
needed when running in local mode.








[jira] [Created] (NUTCH-1806) Delegate processing of URL domains to crawler commons

2014-06-26 Thread Julien Nioche (JIRA)
Julien Nioche created NUTCH-1806:


 Summary: Delegate processing of URL domains to crawler commons
 Key: NUTCH-1806
 URL: https://issues.apache.org/jira/browse/NUTCH-1806
 Project: Nutch
  Issue Type: Improvement
Affects Versions: 1.8
Reporter: Julien Nioche


We have code in src/java/org/apache/nutch/util/domain and a resource file 
conf/domain-suffixes.xml to handle URL domains. This is used mostly from 
URLUtil.getDomainName.

The resource file is not necessarily up to date, and since crawler commons has 
similar functionality we should use it instead of having to maintain our own 
resources.
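
(Nutch already pulls in crawler commons via Ivy, so this would not add a new 
dependency; the declaration looks roughly like the following, rev 
illustrative.)

{code:xml}
<dependency org="com.google.code.crawler-commons" name="crawler-commons"
            rev="0.3" conf="*->default"/>
{code}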





[jira] [Assigned] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2014-06-26 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche reassigned NUTCH-385:
---

Assignee: Julien Nioche

 Server delay feature conflicts with maxThreadsPerHost
 -

 Key: NUTCH-385
 URL: https://issues.apache.org/jira/browse/NUTCH-385
 Project: Nutch
  Issue Type: Bug
  Components: documentation, fetcher
Reporter: Chris Schneider
Assignee: Julien Nioche

 For some time I've been puzzled by the interaction between two parameters that 
 control how often the fetcher can access a particular host:
 1) The server delay, which comes back from the remote server during our 
 processing of the robots.txt file, and which can be limited by 
 fetcher.max.crawl.delay.
 2) The fetcher.threads.per.host value, particularly when this is greater than 
 the default of 1.
 According to my (limited) understanding of the code in HttpBase.java:
 Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
 ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
 continuously. In other words, it never tries to point 3 at the host, and it 
 always points a second thread at the host before the first thread finishes 
 accessing it. Since HttpBase.unblockAddr never gets called with 
 (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. Thus, the server delay will never be used at all. The fetcher will be 
 continuously retrieving pages from the host, often with 2 fetchers accessing 
 the host simultaneously.
 Suppose instead that the fetcher finally does allow the last thread to 
 complete before it gets around to pointing another thread at the target host. 
 When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. This, in turn, will prevent any threads from accessing this host until 
 the delay is complete, even though zero threads are currently accessing the 
 host.
 I see this behavior as inconsistent. More importantly, the current 
 implementation certainly doesn't seem to answer my original question about 
 appropriate definitions for what appear to be conflicting parameters. 
 In a nutshell, how could we possibly honor the server delay if we allow more 
 than one fetcher thread to simultaneously access the host?
 It would be one thing if whenever (fetcher.threads.per.host > 1), this 
 trumped the server delay, causing the latter to be ignored completely. That 
 is certainly not the case in the current implementation, as it will wait for 
 server delay whenever the number of threads accessing a given host drops to 
 zero.





[jira] [Updated] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2014-06-26 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-385:


Attachment: NUTCH-385.patch

Improved description of the thread related configuration for the Fetcher. The 
Fetcher implementation has changed a lot since this issue was opened and the 
descriptions in nutch-default.xml reflect the current behaviour.

[~schmed] OK with these changes? BTW do you still use Nutch 8 years after 
opening the issue? 
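
A rough sketch of the kind of clarification the patch adds to 
nutch-default.xml (the wording below is illustrative, not the actual patch 
text):

{code:xml}
<property>
  <name>fetcher.threads.per.queue</name>
  <value>1</value>
  <description>Maximum number of threads allowed to access a fetch queue
  (i.e. a host or domain) at one time. Setting this to a value > 1 means
  the configured crawl delays are no longer honoured for that queue.
  </description>
</property>
{code}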

 Server delay feature conflicts with maxThreadsPerHost
 -

 Key: NUTCH-385
 URL: https://issues.apache.org/jira/browse/NUTCH-385
 Project: Nutch
  Issue Type: Bug
  Components: documentation, fetcher
Reporter: Chris Schneider
Assignee: Julien Nioche
 Attachments: NUTCH-385.patch


 For some time I've been puzzled by the interaction between two parameters that 
 control how often the fetcher can access a particular host:
 1) The server delay, which comes back from the remote server during our 
 processing of the robots.txt file, and which can be limited by 
 fetcher.max.crawl.delay.
 2) The fetcher.threads.per.host value, particularly when this is greater than 
 the default of 1.
 According to my (limited) understanding of the code in HttpBase.java:
 Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
 ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
 continuously. In other words, it never tries to point 3 at the host, and it 
 always points a second thread at the host before the first thread finishes 
 accessing it. Since HttpBase.unblockAddr never gets called with 
 (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. Thus, the server delay will never be used at all. The fetcher will be 
 continuously retrieving pages from the host, often with 2 fetchers accessing 
 the host simultaneously.
 Suppose instead that the fetcher finally does allow the last thread to 
 complete before it gets around to pointing another thread at the target host. 
 When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. This, in turn, will prevent any threads from accessing this host until 
 the delay is complete, even though zero threads are currently accessing the 
 host.
 I see this behavior as inconsistent. More importantly, the current 
 implementation certainly doesn't seem to answer my original question about 
 appropriate definitions for what appear to be conflicting parameters. 
 In a nutshell, how could we possibly honor the server delay if we allow more 
 than one fetcher thread to simultaneously access the host?
 It would be one thing if whenever (fetcher.threads.per.host > 1), this 
 trumped the server delay, causing the latter to be ignored completely. That 
 is certainly not the case in the current implementation, as it will wait for 
 server delay whenever the number of threads accessing a given host drops to 
 zero.





[jira] [Commented] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2014-06-26 Thread Chris Schneider (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14044743#comment-14044743
 ] 

Chris Schneider commented on NUTCH-385:
---

Hi Julien,

Thanks for the documentation changes and for investing your time in an issue I 
raised so long ago. Unfortunately (since I haven't used Nutch in the past 5 
years), it would be difficult for me to validate that your description of the 
fetcher behavior is correct and sufficient. I would recommend that you ask 
Andrzej (or perhaps Doug) to review them instead.

Best Regards,

Chris

 Server delay feature conflicts with maxThreadsPerHost
 -

 Key: NUTCH-385
 URL: https://issues.apache.org/jira/browse/NUTCH-385
 Project: Nutch
  Issue Type: Bug
  Components: documentation, fetcher
Reporter: Chris Schneider
Assignee: Julien Nioche
 Attachments: NUTCH-385.patch


 For some time I've been puzzled by the interaction between two parameters that 
 control how often the fetcher can access a particular host:
 1) The server delay, which comes back from the remote server during our 
 processing of the robots.txt file, and which can be limited by 
 fetcher.max.crawl.delay.
 2) The fetcher.threads.per.host value, particularly when this is greater than 
 the default of 1.
 According to my (limited) understanding of the code in HttpBase.java:
 Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
 ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
 continuously. In other words, it never tries to point 3 at the host, and it 
 always points a second thread at the host before the first thread finishes 
 accessing it. Since HttpBase.unblockAddr never gets called with 
 (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. Thus, the server delay will never be used at all. The fetcher will be 
 continuously retrieving pages from the host, often with 2 fetchers accessing 
 the host simultaneously.
 Suppose instead that the fetcher finally does allow the last thread to 
 complete before it gets around to pointing another thread at the target host. 
 When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. This, in turn, will prevent any threads from accessing this host until 
 the delay is complete, even though zero threads are currently accessing the 
 host.
 I see this behavior as inconsistent. More importantly, the current 
 implementation certainly doesn't seem to answer my original question about 
 appropriate definitions for what appear to be conflicting parameters. 
 In a nutshell, how could we possibly honor the server delay if we allow more 
 than one fetcher thread to simultaneously access the host?
 It would be one thing if whenever (fetcher.threads.per.host > 1), this 
 trumped the server delay, causing the latter to be ignored completely. That 
 is certainly not the case in the current implementation, as it will wait for 
 server delay whenever the number of threads accessing a given host drops to 
 zero.





[jira] [Updated] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2014-06-26 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-385:


Fix Version/s: 1.9

 Server delay feature conflicts with maxThreadsPerHost
 -

 Key: NUTCH-385
 URL: https://issues.apache.org/jira/browse/NUTCH-385
 Project: Nutch
  Issue Type: Bug
  Components: documentation, fetcher
Reporter: Chris Schneider
Assignee: Julien Nioche
 Fix For: 1.9

 Attachments: NUTCH-385.patch


 For some time I've been puzzled by the interaction between two parameters that 
 control how often the fetcher can access a particular host:
 1) The server delay, which comes back from the remote server during our 
 processing of the robots.txt file, and which can be limited by 
 fetcher.max.crawl.delay.
 2) The fetcher.threads.per.host value, particularly when this is greater than 
 the default of 1.
 According to my (limited) understanding of the code in HttpBase.java:
 Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
 ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
 continuously. In other words, it never tries to point 3 at the host, and it 
 always points a second thread at the host before the first thread finishes 
 accessing it. Since HttpBase.unblockAddr never gets called with 
 (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. Thus, the server delay will never be used at all. The fetcher will be 
 continuously retrieving pages from the host, often with 2 fetchers accessing 
 the host simultaneously.
 Suppose instead that the fetcher finally does allow the last thread to 
 complete before it gets around to pointing another thread at the target host. 
 When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. This, in turn, will prevent any threads from accessing this host until 
 the delay is complete, even though zero threads are currently accessing the 
 host.
 I see this behavior as inconsistent. More importantly, the current 
 implementation certainly doesn't seem to answer my original question about 
 appropriate definitions for what appear to be conflicting parameters. 
 In a nutshell, how could we possibly honor the server delay if we allow more 
 than one fetcher thread to simultaneously access the host?
 It would be one thing if whenever (fetcher.threads.per.host > 1), this 
 trumped the server delay, causing the latter to be ignored completely. That 
 is certainly not the case in the current implementation, as it will wait for 
 server delay whenever the number of threads accessing a given host drops to 
 zero.





[jira] [Commented] (NUTCH-385) Server delay feature conflicts with maxThreadsPerHost

2014-06-26 Thread Julien Nioche (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14044767#comment-14044767
 ] 

Julien Nioche commented on NUTCH-385:
-

Will commit shortly unless someone objects or proposes a better formulation


 Server delay feature conflicts with maxThreadsPerHost
 -

 Key: NUTCH-385
 URL: https://issues.apache.org/jira/browse/NUTCH-385
 Project: Nutch
  Issue Type: Bug
  Components: documentation, fetcher
Reporter: Chris Schneider
Assignee: Julien Nioche
 Fix For: 1.9

 Attachments: NUTCH-385.patch


 For some time I've been puzzled by the interaction between two parameters that 
 control how often the fetcher can access a particular host:
 1) The server delay, which comes back from the remote server during our 
 processing of the robots.txt file, and which can be limited by 
 fetcher.max.crawl.delay.
 2) The fetcher.threads.per.host value, particularly when this is greater than 
 the default of 1.
 According to my (limited) understanding of the code in HttpBase.java:
 Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
 ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
 continuously. In other words, it never tries to point 3 at the host, and it 
 always points a second thread at the host before the first thread finishes 
 accessing it. Since HttpBase.unblockAddr never gets called with 
 (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. Thus, the server delay will never be used at all. The fetcher will be 
 continuously retrieving pages from the host, often with 2 fetchers accessing 
 the host simultaneously.
 Suppose instead that the fetcher finally does allow the last thread to 
 complete before it gets around to pointing another thread at the target host. 
 When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. This, in turn, will prevent any threads from accessing this host until 
 the delay is complete, even though zero threads are currently accessing the 
 host.
 I see this behavior as inconsistent. More importantly, the current 
 implementation certainly doesn't seem to answer my original question about 
 appropriate definitions for what appear to be conflicting parameters. 
 In a nutshell, how could we possibly honor the server delay if we allow more 
 than one fetcher thread to simultaneously access the host?
 It would be one thing if whenever (fetcher.threads.per.host > 1), this 
 trumped the server delay, causing the latter to be ignored completely. That 
 is certainly not the case in the current implementation, as it will wait for 
 server delay whenever the number of threads accessing a given host drops to 
 zero.





[jira] [Updated] (NUTCH-385) Improve description of thread related configuration for Fetcher

2014-06-26 Thread Julien Nioche (JIRA)

 [ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julien Nioche updated NUTCH-385:


Summary: Improve description of thread related configuration for Fetcher  
(was: Server delay feature conflicts with maxThreadsPerHost)

 Improve description of thread related configuration for Fetcher
 ---

 Key: NUTCH-385
 URL: https://issues.apache.org/jira/browse/NUTCH-385
 Project: Nutch
  Issue Type: Bug
  Components: documentation, fetcher
Reporter: Chris Schneider
Assignee: Julien Nioche
 Fix For: 1.9

 Attachments: NUTCH-385.patch


 For some time I've been puzzled by the interaction between two parameters that 
 control how often the fetcher can access a particular host:
 1) The server delay, which comes back from the remote server during our 
 processing of the robots.txt file, and which can be limited by 
 fetcher.max.crawl.delay.
 2) The fetcher.threads.per.host value, particularly when this is greater than 
 the default of 1.
 According to my (limited) understanding of the code in HttpBase.java:
 Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
 ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
 continuously. In other words, it never tries to point 3 at the host, and it 
 always points a second thread at the host before the first thread finishes 
 accessing it. Since HttpBase.unblockAddr never gets called with 
 (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. Thus, the server delay will never be used at all. The fetcher will be 
 continuously retrieving pages from the host, often with 2 fetchers accessing 
 the host simultaneously.
 Suppose instead that the fetcher finally does allow the last thread to 
 complete before it gets around to pointing another thread at the target host. 
 When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. This, in turn, will prevent any threads from accessing this host until 
 the delay is complete, even though zero threads are currently accessing the 
 host.
 I see this behavior as inconsistent. More importantly, the current 
 implementation certainly doesn't seem to answer my original question about 
 appropriate definitions for what appear to be conflicting parameters. 
 In a nutshell, how could we possibly honor the server delay if we allow more 
 than one fetcher thread to simultaneously access the host?
 It would be one thing if whenever (fetcher.threads.per.host > 1), this 
 trumped the server delay, causing the latter to be ignored completely. That 
 is certainly not the case in the current implementation, as it will wait for 
 server delay whenever the number of threads accessing a given host drops to 
 zero.





[jira] [Commented] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-26 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045224#comment-14045224
 ] 

Aaron Bedward commented on NUTCH-1798:
--

Right... I have made a few observations (I may have misunderstood the 
architecture, but please bear with me).

I have managed to get 2.x indexing with ES by making the following changes to 
the crawl script:

Line 149:  echo "Indexing $CRAWL_ID on SOLR index -> $SOLRURL"
-Line 150:  $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
Line 150:  $bin/nutch solrindex $SOLRURL -all -crawlId $CRAWL_ID

Example call: ./bin/crawl urls test http://localhost:9300 2

However I believe the script should use $bin/nutch index -D 
solr.server.url=$SOLRURL

Hope this helps anybody trying to use ES; I will commit my source code for 
MongoDB over the weekend.

 Unable to get any documents to index in elastic search
 --

 Key: NUTCH-1798
 URL: https://issues.apache.org/jira/browse/NUTCH-1798
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
 Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9
Reporter: Aaron Bedward
 Fix For: 2.3

 Attachments: part-r-0


 Hopefully this is something I am doing wrong. I have checked out 2.x as I 
 would like to use the new metatag extraction features. I have then run ant 
 runtime to build, and I have updated the nutch-site.xml like so:
 <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin. By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please enable
   protocol-httpclient, but be aware of possible intermittent problems with the
   underlying commons-httpclient library.
   </description>
 </property>
 <property>
   <name>elastic.cluster</name>
   <value>elasticsearch</value>
   <description>The cluster name to discover. Either host and port must be
   defined or cluster.</description>
 </property>
  
 I have then created a folder called urls and added seed.txt.
 I ran the following commands:
 bin/nutch inject urls
 bin/nutch generate -topN 1000
 bin/nutch fetch -all
 bin/nutch parse -all
 bin/nutch updatedb
 bin/nutch index -all
 It runs with no errors, however no documents have been indexed.
 I also tried setting this up with Solr and no documents are indexed.
 Log:
 2014-06-24 02:57:57,804 INFO  parse.ParserJob - ParserJob: success
 2014-06-24 02:57:57,805 INFO  parse.ParserJob - ParserJob: finished at 
 2014-06-24 02:57:57, time elapsed: 00:00:06
 2014-06-24 02:57:59,823 INFO  indexer.IndexingJob - IndexingJob: starting
 2014-06-24 02:58:00,815 INFO  basic.BasicIndexingFilter - Maximum title 
 length for indexing set to: 100
 2014-06-24 02:58:00,815 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.basic.BasicIndexingFilter
 2014-06-24 02:58:01,774 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.more.MoreIndexingFilter
 2014-06-24 02:58:01,776 INFO  anchor.AnchorIndexingFilter - Anchor 
 deduplication is: off
 2014-06-24 02:58:01,776 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.anchor.AnchorIndexingFilter
 2014-06-24 02:58:03,946 WARN  util.NativeCodeLoader - Unable to load 
 native-hadoop library for your platform... using builtin-java classes where 
 applicable
 2014-06-24 02:58:04,920 INFO  indexer.IndexWriters - Adding 
 org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] version[1.1.0], 
 pid[21885], build[2181e11/2014-03-25T15:59:51Z]
 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] initializing ...
 2014-06-24 02:58:05,377 INFO  elasticsearch.plugins - [Silver] loaded [], 
 sites []
 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] initialized
 2014-06-24 02:58:08,339 INFO  elasticsearch.node - [Silver] starting ...
 2014-06-24 02:58:08,431 INFO  elasticsearch.transport - [Silver] 
 bound_address {inet[/0:0:0:0:0:0:0:0:9301]}, publish_address 
 {inet[/10.0.2.15:9301]}
 2014-06-24 02:58:11,540 INFO  cluster.service - [Silver] detected_master 
 [Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]], 
 added 
 {[Doughboy][U02ugUDtRZW4ttx6lbyMLg][dev-ElasticSearch][inet[/10.0.2.4:9300]],[Silver
  Squire][2NyU10FARvaL92rU5GqpcA][nutch][inet[/10.0.2.15:9300]],}, reason: 
 zen-disco-receive(from master 
 

[jira] [Comment Edited] (NUTCH-1798) Unable to get any documents to index in elastic search

2014-06-26 Thread Aaron Bedward (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-1798?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045224#comment-14045224
 ] 

Aaron Bedward edited comment on NUTCH-1798 at 6/26/14 9:32 PM:
---

Right... I have made a few observations (I may have misunderstood the 
architecture, but please bear with me).

I have managed to get 2.x indexing with ES by making the following changes to 
the crawl script:

Line 149:  echo "Indexing $CRAWL_ID on SOLR index -> $SOLRURL"
-Line 150:  $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
Line 150:  $bin/nutch solrindex $SOLRURL -all -crawlId $CRAWL_ID

Example call: ./bin/crawl urls test http://localhost:9300 2

However I believe the script should use $bin/nutch index -D 
solr.server.url=$SOLRURL

Hope this helps anybody trying to use ES; I will commit my source code for 
MongoDB over the weekend.


was (Author: mrbedward):
Right... I have made a few observations (I may have misunderstood the 
architecture, but please bear with me).

I have managed to get 2.x indexing with ES by making the following changes to 
the crawl script:

Line 149:  echo "Indexing $CRAWL_ID on SOLR index -> $SOLRURL"
-Line 150:  $bin/nutch solrindex $commonOptions $SOLRURL -all -crawlId $CRAWL_ID
Line 150:  $bin/nutch solrindex $SOLRURL -all -crawlId $CRAWL_ID

Example call: ./bin/crawl urls test http://localhost:9300 2

However I believe the script should use $bin/nutch index -D 
solr.server.url=$SOLRURL

Hope this helps anybody trying to use ES; I will commit my source code MongoDB 
over the weekend.

 Unable to get any documents to index in elastic search
 --

 Key: NUTCH-1798
 URL: https://issues.apache.org/jira/browse/NUTCH-1798
 Project: Nutch
  Issue Type: Bug
Affects Versions: 2.3
 Environment: Ubuntu 13.10, Elasticsearch 1, HBASE 0.94.9
Reporter: Aaron Bedward
 Fix For: 2.3

 Attachments: part-r-0


 Hopefully this is something I am doing wrong. I have checked out 2.x as I 
 would like to use the new metatag extraction features. I have then run ant 
 runtime to build, and I have updated the nutch-site.xml like so:
 <property>
   <name>plugin.includes</name>
   <value>protocol-http|urlfilter-regex|parse-(html|tika|metatags)|index-(basic|anchor|more|metatags)|indexer-elasticsearch|urlnormalizer-(pass|regex|basic)|scoring-opic</value>
   <description>Regular expression naming plugin directory names to
   include.  Any plugin not matching this expression is excluded.
   In any case you need at least include the nutch-extensionpoints plugin. By
   default Nutch includes crawling just HTML and plain text via HTTP,
   and basic indexing and search plugins. In order to use HTTPS please enable
   protocol-httpclient, but be aware of possible intermittent problems with the
   underlying commons-httpclient library.
   </description>
 </property>
 <property>
   <name>elastic.cluster</name>
   <value>elasticsearch</value>
   <description>The cluster name to discover. Either host and port must be
   defined or cluster.</description>
 </property>
  
 I have then created a folder called urls and added seed.txt.
 I ran the following commands:
 bin/nutch inject urls
 bin/nutch generate -topN 1000
 bin/nutch fetch -all
 bin/nutch parse -all
 bin/nutch updatedb
 bin/nutch index -all
 It runs with no errors, however no documents have been indexed.
 I also tried setting this up with Solr and no documents are indexed.
 Log:
 2014-06-24 02:57:57,804 INFO  parse.ParserJob - ParserJob: success
 2014-06-24 02:57:57,805 INFO  parse.ParserJob - ParserJob: finished at 
 2014-06-24 02:57:57, time elapsed: 00:00:06
 2014-06-24 02:57:59,823 INFO  indexer.IndexingJob - IndexingJob: starting
 2014-06-24 02:58:00,815 INFO  basic.BasicIndexingFilter - Maximum title 
 length for indexing set to: 100
 2014-06-24 02:58:00,815 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.basic.BasicIndexingFilter
 2014-06-24 02:58:01,774 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.more.MoreIndexingFilter
 2014-06-24 02:58:01,776 INFO  anchor.AnchorIndexingFilter - Anchor 
 deduplication is: off
 2014-06-24 02:58:01,776 INFO  indexer.IndexingFilters - Adding 
 org.apache.nutch.indexer.anchor.AnchorIndexingFilter
 2014-06-24 02:58:03,946 WARN  util.NativeCodeLoader - Unable to load 
 native-hadoop library for your platform... using builtin-java classes where 
 applicable
 2014-06-24 02:58:04,920 INFO  indexer.IndexWriters - Adding 
 org.apache.nutch.indexwriter.elastic.ElasticIndexWriter
 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] version[1.1.0], 
 pid[21885], build[2181e11/2014-03-25T15:59:51Z]
 2014-06-24 02:58:05,261 INFO  elasticsearch.node - [Silver] initializing ...
 2014-06-24 02:58:05,377 INFO  

[jira] [Commented] (NUTCH-385) Improve description of thread related configuration for Fetcher

2014-06-26 Thread lufeng (JIRA)

[ 
https://issues.apache.org/jira/browse/NUTCH-385?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14045525#comment-14045525
 ] 

lufeng commented on NUTCH-385:
--

Hi Julien

Regarding the description of fetcher.threads.per.queue: we could add that 
setting fetcher.threads.per.queue to a value > 1 will also cause 
fetcher.server.delay to be ignored.

Another issue is that I think the property name fetcher.max.crawl.delay is not 
uniform with fetcher.server.delay and fetcher.server.min.delay. Would changing 
it to fetcher.server.max.delay be more suitable?


 Improve description of thread related configuration for Fetcher
 ---

 Key: NUTCH-385
 URL: https://issues.apache.org/jira/browse/NUTCH-385
 Project: Nutch
  Issue Type: Bug
  Components: documentation, fetcher
Reporter: Chris Schneider
Assignee: Julien Nioche
 Fix For: 1.9

 Attachments: NUTCH-385.patch


 For some time I've been puzzled by the interaction between two parameters that 
 control how often the fetcher can access a particular host:
 1) The server delay, which comes back from the remote server during our 
 processing of the robots.txt file, and which can be limited by 
 fetcher.max.crawl.delay.
 2) The fetcher.threads.per.host value, particularly when this is greater than 
 the default of 1.
 According to my (limited) understanding of the code in HttpBase.java:
 Suppose that fetcher.threads.per.host is 2, and that (by chance) the fetcher 
 ends up keeping either 1 or 2 fetcher threads pointing at a particular host 
 continuously. In other words, it never tries to point 3 at the host, and it 
 always points a second thread at the host before the first thread finishes 
 accessing it. Since HttpBase.unblockAddr never gets called with 
 (((Integer)THREADS_PER_HOST_COUNT.get(host)).intValue() == 1), it never puts 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. Thus, the server delay will never be used at all. The fetcher will be 
 continuously retrieving pages from the host, often with 2 fetchers accessing 
 the host simultaneously.
 Suppose instead that the fetcher finally does allow the last thread to 
 complete before it gets around to pointing another thread at the target host. 
 When the last fetcher thread calls HttpBase.unblockAddr, it will now put 
 System.currentTimeMillis() + crawlDelay into BLOCKED_ADDR_TO_TIME for the 
 host. This, in turn, will prevent any threads from accessing this host until 
 the delay is complete, even though zero threads are currently accessing the 
 host.
 I see this behavior as inconsistent. More importantly, the current 
 implementation certainly doesn't seem to answer my original question about 
 appropriate definitions for what appear to be conflicting parameters. 
 In a nutshell, how could we possibly honor the server delay if we allow more 
 than one fetcher thread to simultaneously access the host?
 It would be one thing if whenever (fetcher.threads.per.host > 1), this 
 trumped the server delay, causing the latter to be ignored completely. That 
 is certainly not the case in the current implementation, as it will wait for 
 server delay whenever the number of threads accessing a given host drops to 
 zero.


