Jenkins build is back to normal : Nutch-trunk #3239
See https://builds.apache.org/job/Nutch-trunk/3239/
Re: GSOC2015- Sitemap crawler roadmap problems
Hi, I am making progress with my work. My code is now integrated into the Nutch life cycle, and sitemap files can be injected and parsed. As you know, a sitemap file has tags such as lastmod, priority and changefreq. First, I put the tag values into the metadata. Then I update the last-modified and fetch-interval fields of the web page according to those tags. I have not used the priority tag yet. I want to calculate a new score from the priority for the list of URLs in the sitemap. However, URLs from the sitemap have a priority value while other web page URLs do not, so there is a mismatch. How do you think this should be implemented? I have attached the latest code as a patch to this email.
2015-07-11 12:10 GMT+03:00 Cihad Guzel cguz...@gmail.com:
Hi Lewis. Thanks for your suggestions. I will think about this.
2015-07-10 3:47 GMT+03:00 Lewis John Mcgibbney lewis.mcgibb...@gmail.com:
Hi Cihad, I'll take a look tonight. My understanding is that this would be implemented as part of core and not as a plugin. Within a plugin we can, at times, have access to less verbose data structures. This is of course not always the case, but generally speaking we see more issues, depending on which interfaces we extend, with appropriate access to the correct data structures. We then have the issue of dependency management. I'll have a look through the various links you have sent and then write back here in due course. Apologies about the delay. Thanks
On Mon, Jul 6, 2015 at 12:20 AM, Cihad Guzel cguz...@gmail.com wrote:
Hi, I have found a patch for my metadata problem [1]. But the problem isn't solved for 2.x [2]. I guess I need to solve it.
[1] https://issues.apache.org/jira/browse/NUTCH-1622
[2] https://issues.apache.org/jira/browse/NUTCH-1816
2015-07-04 15:56 GMT+03:00 Cihad Guzel cguz...@gmail.com:
Hi Lewis, Talat and I talked about the architecture for sitemap support. We thought the problem could be solved within the Nutch life cycle. We don't want to build a different life cycle for sitemap crawling.
So, I have the following problems. If the sitemap file is too large, it cannot be fetched and parsed; it times out. I worked around the timeout temporarily, for parsing by raising the timeout value in nutch-site.xml and for fetching by working with a small file. That is not a good solution. Moreover, as you know, sitemap files have special tags such as loc, lastmod, changefreq and priority. They are parsed by my parse plugin. I want to record them in the crawldb, but the Parse object doesn't support metadata or similar fields. It has only an outlink array, which isn't enough for recording metadata. I want to record each URL in the sitemap file separately, together with its metadata. I have looked at all the patches and comments on NUTCH-1465, and there are some solutions for the same problems there. But a new job for sitemap crawling was created there. Could you show me a way out? Thanks.
-- *Lewis*
diff --git a/conf/gora-hbase-mapping.xml b/conf/gora-hbase-mapping.xml
index eb58819..5bd011b 100644
--- a/conf/gora-hbase-mapping.xml
+++ b/conf/gora-hbase-mapping.xml
@@ -46,6 +46,7 @@ http://gora.apache.org/current/gora-hbase.html
 family name=s maxVersions=1/
 family name=il maxVersions=1/
 family name=ol maxVersions=1/
+family name=stm maxVersions=1/
 family name=h maxVersions=1/
 family name=mtdt maxVersions=1/
 family name=mk maxVersions=1/
@@ -66,6 +67,8 @@ http://gora.apache.org/current/gora-hbase.html
 field name=modifiedTime family=f qualifier=mod/
 field name=prevModifiedTime family=f qualifier=pmod/
 field name=batchId family=f qualifier=bid/
+ field name=sitemaps family=stm/
+
 !-- parse fields --
 field name=title family=p qualifier=t/
@@ -76,6 +79,8 @@ http://gora.apache.org/current/gora-hbase.html
 !-- score fields --
 field name=score family=s qualifier=s/
+field name=stmPriority family=s qualifier=sp/
+
 field name=headers family=h/
 field name=inlinks family=il/
 field name=outlinks family=ol/
diff --git a/conf/parse-plugins.xml b/conf/parse-plugins.xml
index 5b20be6..0551381 100644
--- a/conf/parse-plugins.xml
+++ b/conf/parse-plugins.xml
@@ -68,6 +68,7 @@
 plugin id=feed /
 /mimeType
+
 !-- Types for parse-ext plugin: required for unit tests to pass. --
 mimeType name=application/vnd.nutch.example.cat
diff --git a/src/gora/webpage.avsc b/src/gora/webpage.avsc
index dce0050..0761c08 100644
--- a/src/gora/webpage.avsc
+++ b/src/gora/webpage.avsc
@@ -278,6 +278,26 @@
 ],
 doc: A batchId that this WebPage is assigned to. WebPage's are fetched in batches, called fetchlists. Pages are partitioned but can always be associated and fetched alongside pages of similar value (within a crawl cycle) based on batchId.,
 default: null
+},
+{
+ name: sitemaps,
+ type: {
+
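One possible way to handle the priority question raised in this thread is to give URLs without a sitemap priority the protocol default (sitemaps.org defines 0.5 when the tag is omitted), so that sitemap and non-sitemap URLs are scored on the same scale. The sketch below is only an illustration under that assumption. SitemapHints and all of its method names are hypothetical; they are not part of Nutch or of the attached patch, and the month and year lengths are rough approximations.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of Nutch or of the attached patch. */
final class SitemapHints {
    /** Priority that the sitemaps.org protocol assumes when a URL has no priority tag. */
    static final float DEFAULT_PRIORITY = 0.5f;

    private SitemapHints() {}

    /** Maps a changefreq value to a fetch interval in seconds; anything else keeps the fallback. */
    static int fetchIntervalSeconds(String changeFreq, int fallbackSeconds) {
        if (changeFreq == null) {
            return fallbackSeconds;
        }
        switch (changeFreq.trim().toLowerCase(Locale.ROOT)) {
            case "hourly":  return 60 * 60;
            case "daily":   return 24 * 60 * 60;
            case "weekly":  return 7 * 24 * 60 * 60;
            case "monthly": return 30 * 24 * 60 * 60;   // approximation
            case "yearly":  return 365 * 24 * 60 * 60;  // approximation
            default:        return fallbackSeconds;     // "always", "never" or unrecognised
        }
    }

    /** Parses a priority tag, using the default when it is absent or malformed, clamped to [0, 1]. */
    static float priorityOrDefault(String raw) {
        if (raw == null) {
            return DEFAULT_PRIORITY;
        }
        try {
            float p = Float.parseFloat(raw.trim());
            if (Float.isNaN(p)) {
                return DEFAULT_PRIORITY;
            }
            return Math.max(0.0f, Math.min(1.0f, p));
        } catch (NumberFormatException e) {
            return DEFAULT_PRIORITY;
        }
    }

    /** Scales a score so that a URL with the default priority keeps its score unchanged. */
    static float adjustedScore(float baseScore, String rawPriority) {
        return baseScore * (priorityOrDefault(rawPriority) / DEFAULT_PRIORITY);
    }
}
```

With this choice a page that is not listed in any sitemap behaves exactly like a sitemap URL that omits the priority tag, which is one way to avoid the mismatch described above. Whether scaling the score is the right policy is a design decision for the scoring filter and is not settled in the thread.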
[jira] [Commented] (NUTCH-2059) protocol-httpclient, protocol-http unit test errors on Jenkins
[ https://issues.apache.org/jira/browse/NUTCH-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14651127#comment-14651127 ] Chris A. Mattmann commented on NUTCH-2059: -- ping thoughts here? Doesn't seem to be a broken build in a while but maybe we should push your updates regardless Peter? protocol-httpclient, protocol-http unit test errors on Jenkins -- Key: NUTCH-2059 URL: https://issues.apache.org/jira/browse/NUTCH-2059 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Peter Ciuffetti Assignee: Chris A. Mattmann Fix For: 1.11 This is an occasional error on the build of the Nutch trunk visible in Jenkins builds. It happens on either protocol-http or protocol-httpclient, which can be running at the same time given the multi-threaded test setup. {code} [junit] Running org.apache.nutch.protocol.httpclient.TestProtocolHttpClient [junit] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 2.377 sec [junit] Test org.apache.nutch.protocol.http.TestProtocolHttp FAILED {code} Evidence of failure on Jenkins go back to Failed Console Output #3154Jun 8, 2015 4:00:00 AM https://builds.apache.org/view/All/job/Nutch-trunk/3154/consoleFull And are repeated at... https://builds.apache.org/view/All/job/Nutch-trunk/3190/console https://builds.apache.org/view/All/job/Nutch-trunk/3189/console Some possibly related tickets NUTCH-1836 Timeouts in protocol-httpclient when crawling same host with 2 threads NUTCH-1086 Rewrite protocol-httpclient The unit tests are not failing for me on my sandbox, but there are some exceptions being output to the log related to headers being sent on JSP pages after the response writer is invoked. {code} java.lang.IllegalStateException: STREAM at org.mortbay.jetty.Response.getWriter(Response.java:616) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14651264#comment-14651264 ]
Chris A. Mattmann commented on NUTCH-2062:
--
{noformat}
test:
[echo] Testing plugin: urlnormalizer-slash
[junit] WARNING: multiple versions of ant detected in path for junit
[junit] jar:file:/usr/local/Cellar/ant/1.9.4/libexec/lib/ant.jar!/org/apache/tools/ant/Project.class
[junit] and jar:file:/Users/mattmann/tmp/nutch-trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class
[junit] Running org.apache.nutch.net.urlnormalizer.regex.TestRegexURLNormalizer
[junit] Running org.apache.nutch.net.urlnormalizer.slash.TestSlashURLNormalizer
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.79 sec
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.929 sec
BUILD SUCCESSFUL
Total time: 12 minutes 11 seconds
[chipotle:~/tmp/nutch-trunk] mattmann%
{noformat}
All tests passing, committing this now.
Add Plugin for interacting with Selenium WebDriver
--
Key: NUTCH-2062
URL: https://issues.apache.org/jira/browse/NUTCH-2062
Project: Nutch
Issue Type: Improvement
Components: plugin
Affects Versions: 1.10
Reporter: Michael Joyce
Assignee: Chris A. Mattmann
Labels: memex
Fix For: 1.11
Attachments: NUTCH-2062v2.patch
The protocol-selenium plugin is great for pulling webpages that dynamically load content. However, I've run into use cases where I need to actively interact with a page in Selenium before it becomes useful. For instance, I may need to paginate through a table to get all results that I'm interested in. This plugin will handle that use case.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
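The pagination use case described in the issue can be sketched as a simple loop: read the rows on the current page, move to the next page if there is one, and stop after a page cap so a broken "next" control cannot loop forever. The sketch below deliberately avoids the Selenium API. PagedTable and PaginationSketch are hypothetical names invented for this illustration; a real handler in the plugin would drive a WebDriver instead, and this is not the plugin's code.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical abstraction over a paginated table; a real handler would wrap a Selenium WebDriver. */
interface PagedTable {
    List<String> visibleRows();   // rows shown on the current page
    boolean hasNextPage();        // true if a "next" control is available
    void goToNextPage();          // activates the "next" control
}

/** Hypothetical sketch of the pagination loop; not code from the protocol-interactiveselenium plugin. */
final class PaginationSketch {
    private PaginationSketch() {}

    /** Collects rows from up to maxPages pages, stopping early when no next page exists. */
    static List<String> collectAllRows(PagedTable table, int maxPages) {
        List<String> rows = new ArrayList<>();
        for (int page = 0; page < maxPages; page++) {
            rows.addAll(table.visibleRows());
            // Do not navigate further once the cap is reached or the table is exhausted.
            if (page + 1 >= maxPages || !table.hasNextPage()) {
                break;
            }
            table.goToNextPage();
        }
        return rows;
    }
}
```

The page cap is a defensive choice for crawling, where a single misbehaving page should not stall a fetcher thread.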
[GitHub] nutch pull request: NUTCH-2062 - Interactive Selenium Plugin
Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/46 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14651266#comment-14651266 ]
Chris A. Mattmann commented on NUTCH-2062:
--
Thanks [~mjoyce]! All committed:
{noformat}
[chipotle:~/tmp/nutch-trunk] mattmann% svn commit -m Fix for NUTCH-2062: Add Plugin for interacting with Selenium WebDriver contributed by Michael Joyce mltjo...@gmail.com this closes #46
Sending build.xml
Sending conf/nutch-default.xml
Sending src/plugin/build.xml
Sending src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java
Adding src/plugin/protocol-interactiveselenium
Adding src/plugin/protocol-interactiveselenium/README.md
Adding src/plugin/protocol-interactiveselenium/build-ivy.xml
Adding src/plugin/protocol-interactiveselenium/build.xml
Adding src/plugin/protocol-interactiveselenium/ivy.xml
Adding src/plugin/protocol-interactiveselenium/plugin.xml
Adding src/plugin/protocol-interactiveselenium/src
Adding src/plugin/protocol-interactiveselenium/src/java
Adding src/plugin/protocol-interactiveselenium/src/java/org
Adding src/plugin/protocol-interactiveselenium/src/java/org/apache
Adding src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch
Adding src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol
Adding src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium
Adding src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/Http.java
Adding src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java
Adding src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers
Adding src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultHandler.java
Adding src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/InteractiveSeleniumHandler.java
Adding src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/package.html
Transmitting file data ..
Committed revision 1693837.
[chipotle:~/tmp/nutch-trunk] mattmann%
{noformat}
Add Plugin for interacting with Selenium WebDriver
--
Key: NUTCH-2062
URL: https://issues.apache.org/jira/browse/NUTCH-2062
Project: Nutch
Issue Type: Improvement
Components: plugin
Affects Versions: 1.10
Reporter: Michael Joyce
Assignee: Chris A. Mattmann
Labels: memex
Fix For: 1.11
Attachments: NUTCH-2062v2.patch
The protocol-selenium plugin is great for pulling webpages that dynamically load content. However, I've run into use cases where I need to actively interact with a page in Selenium before it becomes useful. For instance, I may need to paginate through a table to get all results that I'm interested in. This plugin will handle that use case.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Assigned] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1
[ https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann reassigned NUTCH-2072: Assignee: Chris A. Mattmann Deflate encoding support is broken when http.content.limit is set to -1 --- Key: NUTCH-2072 URL: https://issues.apache.org/jira/browse/NUTCH-2072 Project: Nutch Issue Type: Bug Components: plugin, protocol Reporter: Tanguy Moal Assignee: Chris A. Mattmann Priority: Minor The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is not designed to have sizeLimit set to a negative value. The fix can be simply to mimic what's done with gzip encoding : if {{getMaxContent() 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
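The fix proposed in the issue description can be illustrated roughly as follows. The comparison operator in the quoted condition appears to have been lost in this archive; the sketch assumes it was "less than zero", which matches the remark about negative values. DeflateLimitSketch and its methods are hypothetical stand-ins written for this illustration on top of java.util.zip.Inflater; they are not the Nutch DeflateUtils or HttpBase source.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Inflater;

/** Hypothetical stand-in for illustration; not the Nutch DeflateUtils or HttpBase source. */
final class DeflateLimitSketch {
    private DeflateLimitSketch() {}

    /** ASSUMPTION: a negative content limit means "unlimited", so it becomes Integer.MAX_VALUE. */
    static int effectiveSizeLimit(int maxContent) {
        return maxContent < 0 ? Integer.MAX_VALUE : maxContent;
    }

    /** Inflates zlib-wrapped data, returning at most sizeLimit bytes and keeping what was decoded before an error. */
    static byte[] inflateBestEffort(byte[] in, int sizeLimit) {
        Inflater inflater = new Inflater();
        inflater.setInput(in);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            // With a negative sizeLimit this condition is false at once and nothing is decoded.
            while (!inflater.finished() && out.size() < sizeLimit) {
                int want = Math.min(buf.length, sizeLimit - out.size());
                int n = inflater.inflate(buf, 0, want);
                if (n == 0) {
                    break; // more input or a preset dictionary would be needed
                }
                out.write(buf, 0, n);
            }
        } catch (DataFormatException e) {
            // Best effort: fall through and return the bytes decoded so far.
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

A caller would pass effectiveSizeLimit(getMaxContent()) rather than the raw setting, which is the same idea the description attributes to the gzip path.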
[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14651265#comment-14651265 ] ASF GitHub Bot commented on NUTCH-2062: --- Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/46 Add Plugin for interacting with Selenium WebDriver -- Key: NUTCH-2062 URL: https://issues.apache.org/jira/browse/NUTCH-2062 Project: Nutch Issue Type: Improvement Components: plugin Affects Versions: 1.10 Reporter: Michael Joyce Assignee: Chris A. Mattmann Labels: memex Fix For: 1.11 Attachments: NUTCH-2062v2.patch The protocol-selenium plugin is great for pulling webpages that dynamically load content. However, I've run into use cases where I need to actively interact with a page in Selenium before it becomes useful. For instance, I may need to paginate through a table to get all results that I'm interested in. This plugin will handle that use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Work started] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1
[ https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2072 started by Chris A. Mattmann. Deflate encoding support is broken when http.content.limit is set to -1 --- Key: NUTCH-2072 URL: https://issues.apache.org/jira/browse/NUTCH-2072 Project: Nutch Issue Type: Bug Components: plugin, protocol Reporter: Tanguy Moal Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.11 The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is not designed to have sizeLimit set to a negative value. The fix can be simply to mimic what's done with gzip encoding : if {{getMaxContent() 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1
[ https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2072: - Fix Version/s: 1.11 Deflate encoding support is broken when http.content.limit is set to -1 --- Key: NUTCH-2072 URL: https://issues.apache.org/jira/browse/NUTCH-2072 Project: Nutch Issue Type: Bug Components: plugin, protocol Reporter: Tanguy Moal Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.11 The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is not designed to have sizeLimit set to a negative value. The fix can be simply to mimic what's done with gzip encoding : if {{getMaxContent() 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2066) Parameterize Generate REST endpoint
[ https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2066: - Description: Allow user to specify crawldb and segment db in the Generate Job REST endpoint Parameterize Generate REST endpoint --- Key: NUTCH-2066 URL: https://issues.apache.org/jira/browse/NUTCH-2066 Project: Nutch Issue Type: Sub-task Components: REST_api Reporter: Sujen Shah Assignee: Chris A. Mattmann Priority: Minor Labels: memex Fix For: 1.11 Allow user to specify crawldb and segment db in the Generate Job REST endpoint -- This message was sent by Atlassian JIRA (v6.3.4#6332)
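As a rough illustration of what parameterizing the endpoint could mean, the sketch below resolves the crawldb and segment locations from a request argument map and falls back to defaults when the caller supplies nothing. Everything here is hypothetical: GenerateArgsSketch, the key names "crawldb" and "segments_dir", and the default paths are invented for the example and are not taken from the Nutch REST API or from the patch for this issue.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; key names and default paths are invented, not taken from the Nutch REST API. */
final class GenerateArgsSketch {
    static final String CRAWLDB_KEY = "crawldb";           // hypothetical argument name
    static final String SEGMENTS_KEY = "segments_dir";     // hypothetical argument name
    static final String DEFAULT_CRAWLDB = "crawl/crawldb";     // hypothetical default
    static final String DEFAULT_SEGMENTS = "crawl/segments";   // hypothetical default

    private GenerateArgsSketch() {}

    /** Returns the caller-supplied value for key, or the fallback when it is missing or blank. */
    static String pathOrDefault(Map<String, Object> args, String key, String fallback) {
        if (args == null) {
            return fallback;
        }
        Object value = args.get(key);
        if (value == null) {
            return fallback;
        }
        String path = value.toString().trim();
        return path.isEmpty() ? fallback : path;
    }

    /** Resolves both locations a generate job would need from the request arguments. */
    static Map<String, String> resolve(Map<String, Object> args) {
        Map<String, String> resolved = new HashMap<>();
        resolved.put(CRAWLDB_KEY, pathOrDefault(args, CRAWLDB_KEY, DEFAULT_CRAWLDB));
        resolved.put(SEGMENTS_KEY, pathOrDefault(args, SEGMENTS_KEY, DEFAULT_SEGMENTS));
        return resolved;
    }
}
```

Keeping the defaults means existing clients that send no arguments would behave as before, while new clients can point the job at another crawl.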
[jira] [Work started] (NUTCH-2066) Allow user to specify crawldb and segment db in the Generate JOb REST endpoint
[ https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Work on NUTCH-2066 started by Chris A. Mattmann. Allow user to specify crawldb and segment db in the Generate JOb REST endpoint --- Key: NUTCH-2066 URL: https://issues.apache.org/jira/browse/NUTCH-2066 Project: Nutch Issue Type: Sub-task Components: REST_api Reporter: Sujen Shah Assignee: Chris A. Mattmann Priority: Minor Labels: memex Fix For: 1.11 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14651295#comment-14651295 ] Hudson commented on NUTCH-2062: --- SUCCESS: Integrated in Nutch-trunk #3237 (See [https://builds.apache.org/job/Nutch-trunk/3237/]) Changes for NUTCH-2062 (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=revrev=1693838) * /nutch/trunk/CHANGES.txt Fix for NUTCH-2062: Add Plugin for interacting with Selenium WebDriver contributed by Michael Joyce mltjo...@gmail.com this closes #46 (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=revrev=1693837) * /nutch/trunk/build.xml * /nutch/trunk/conf/nutch-default.xml * /nutch/trunk/src/plugin/build.xml * /nutch/trunk/src/plugin/lib-selenium/src/java/org/apache/nutch/protocol/selenium/HttpWebClient.java * /nutch/trunk/src/plugin/protocol-interactiveselenium * /nutch/trunk/src/plugin/protocol-interactiveselenium/README.md * /nutch/trunk/src/plugin/protocol-interactiveselenium/build-ivy.xml * /nutch/trunk/src/plugin/protocol-interactiveselenium/build.xml * /nutch/trunk/src/plugin/protocol-interactiveselenium/ivy.xml * /nutch/trunk/src/plugin/protocol-interactiveselenium/plugin.xml * /nutch/trunk/src/plugin/protocol-interactiveselenium/src * /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java * /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org * /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache * /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch * /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol * /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium * /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/Http.java * 
/nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java * /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers * /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefaultHandler.java * /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/InteractiveSeleniumHandler.java * /nutch/trunk/src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/package.html Add Plugin for interacting with Selenium WebDriver -- Key: NUTCH-2062 URL: https://issues.apache.org/jira/browse/NUTCH-2062 Project: Nutch Issue Type: Improvement Components: plugin Affects Versions: 1.10 Reporter: Michael Joyce Assignee: Chris A. Mattmann Labels: memex Fix For: 1.11 Attachments: NUTCH-2062v2.patch The protocol-selenium plugin is great for pulling webpages that dynamically load content. However, I've run into use cases where I need to actively interact with a page in Selenium before it becomes useful. For instance, I may need to paginate through a table to get all results that I'm interested in. This plugin will handle that use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1
[ https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14651294#comment-14651294 ] Hudson commented on NUTCH-2072: --- SUCCESS: Integrated in Nutch-trunk #3237 (See [https://builds.apache.org/job/Nutch-trunk/3237/]) Fix for NUTCH-2072: Deflate encoding support is broken when http.content.limit is set to -1 contributed by Tanguy Moal tan...@cogniteev.com this closes #48. (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=revrev=1693843) * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java Deflate encoding support is broken when http.content.limit is set to -1 --- Key: NUTCH-2072 URL: https://issues.apache.org/jira/browse/NUTCH-2072 Project: Nutch Issue Type: Bug Components: plugin, protocol Reporter: Tanguy Moal Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.11 The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is not designed to have sizeLimit set to a negative value. The fix can be simply to mimic what's done with gzip encoding : if {{getMaxContent() 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} argument. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Commented] (NUTCH-2059) protocol-httpclient, protocol-http unit test errors on Jenkins
[ https://issues.apache.org/jira/browse/NUTCH-2059?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14651315#comment-14651315 ] Chris A. Mattmann commented on NUTCH-2059: -- we have a failed build - https://builds.apache.org/job/Nutch-trunk/3238/testReport/junit/org.apache.nutch.fetcher/TestFetcher/testFetch/ related? protocol-httpclient, protocol-http unit test errors on Jenkins -- Key: NUTCH-2059 URL: https://issues.apache.org/jira/browse/NUTCH-2059 Project: Nutch Issue Type: Bug Components: fetcher Reporter: Peter Ciuffetti Assignee: Chris A. Mattmann Fix For: 1.11 This is an occasional error on the build of the Nutch trunk visible in Jenkins builds. It happens on either protocol-http or protocol-httpclient, which can be running at the same time given the multi-threaded test setup. {code} [junit] Running org.apache.nutch.protocol.httpclient.TestProtocolHttpClient [junit] Tests run: 2, Failures: 0, Errors: 1, Skipped: 0, Time elapsed: 2.377 sec [junit] Test org.apache.nutch.protocol.http.TestProtocolHttp FAILED {code} Evidence of failure on Jenkins go back to Failed Console Output #3154Jun 8, 2015 4:00:00 AM https://builds.apache.org/view/All/job/Nutch-trunk/3154/consoleFull And are repeated at... https://builds.apache.org/view/All/job/Nutch-trunk/3190/console https://builds.apache.org/view/All/job/Nutch-trunk/3189/console Some possibly related tickets NUTCH-1836 Timeouts in protocol-httpclient when crawling same host with 2 threads NUTCH-1086 Rewrite protocol-httpclient The unit tests are not failing for me on my sandbox, but there are some exceptions being output to the log related to headers being sent on JSP pages after the response writer is invoked. {code} java.lang.IllegalStateException: STREAM at org.mortbay.jetty.Response.getWriter(Response.java:616) {code} -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2066) Parameterize Generate REST endpoint
[ https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2066: - Summary: Parameterize Generate REST endpoint (was: Allow user to specify crawldb and segment db in the Generate Job REST endpoint ) Parameterize Generate REST endpoint --- Key: NUTCH-2066 URL: https://issues.apache.org/jira/browse/NUTCH-2066 Project: Nutch Issue Type: Sub-task Components: REST_api Reporter: Sujen Shah Assignee: Chris A. Mattmann Priority: Minor Labels: memex Fix For: 1.11 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2066) Allow user to specify crawldb and segment db in the Generate Job REST endpoint
[ https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2066: - Summary: Allow user to specify crawldb and segment db in the Generate Job REST endpoint (was: Allow user to specify crawldb and segment db in the Generate JOb REST endpoint ) Allow user to specify crawldb and segment db in the Generate Job REST endpoint --- Key: NUTCH-2066 URL: https://issues.apache.org/jira/browse/NUTCH-2066 Project: Nutch Issue Type: Sub-task Components: REST_api Reporter: Sujen Shah Assignee: Chris A. Mattmann Priority: Minor Labels: memex Fix For: 1.11 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-2062) Add Plugin for interacting with Selenium WebDriver
[ https://issues.apache.org/jira/browse/NUTCH-2062?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved NUTCH-2062. -- Resolution: Fixed Committed, thanks Mike! Add Plugin for interacting with Selenium WebDriver -- Key: NUTCH-2062 URL: https://issues.apache.org/jira/browse/NUTCH-2062 Project: Nutch Issue Type: Improvement Components: plugin Affects Versions: 1.10 Reporter: Michael Joyce Assignee: Chris A. Mattmann Labels: memex Fix For: 1.11 Attachments: NUTCH-2062v2.patch The protocol-selenium plugin is great for pulling webpages that dynamically load content. However, I've run into use cases where I need to actively interact with a page in Selenium before it becomes useful. For instance, I may need to paginate through a table to get all results that I'm interested in. This plugin will handle that use case. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Comment Edited] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1
[ https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14651279#comment-14651279 ]
Chris A. Mattmann edited comment on NUTCH-2072 at 8/2/15 11:39 PM:
---
Fixed, thanks [~tanguy]!
{noformat}
[chipotle:~/tmp/nutch-trunk] mattmann% svn commit -m Fix for NUTCH-2072: Deflate encoding support is broken when http.content.limit is set to -1 contributed by Tanguy Moal tan...@cogniteev.com this closes #48.
Sending CHANGES.txt
Sending src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
Transmitting file data ..
Committed revision 1693843.
[chipotle:~/tmp/nutch-trunk] mattmann%
{noformat}
was (Author: chrismattmann):
Fixed, thanks [~ltanguy]
{noformat}
[chipotle:~/tmp/nutch-trunk] mattmann% svn commit -m Fix for NUTCH-2072: Deflate encoding support is broken when http.content.limit is set to -1 contributed by Tanguy Moal tan...@cogniteev.com this closes #48.
Sending CHANGES.txt
Sending src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
Transmitting file data ..
Committed revision 1693843.
[chipotle:~/tmp/nutch-trunk] mattmann%
{noformat}
Deflate encoding support is broken when http.content.limit is set to -1
---
Key: NUTCH-2072
URL: https://issues.apache.org/jira/browse/NUTCH-2072
Project: Nutch
Issue Type: Bug
Components: plugin, protocol
Reporter: Tanguy Moal
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.11
The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is not designed to have sizeLimit set to a negative value. The fix can be simply to mimic what's done with gzip encoding : if {{getMaxContent() 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} argument.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Updated] (NUTCH-2066) Allow user to specify crawldb and segment db in the Generate Job REST endpoint
[ https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann updated NUTCH-2066: - Labels: memex (was: ) Allow user to specify crawldb and segment db in the Generate Job REST endpoint --- Key: NUTCH-2066 URL: https://issues.apache.org/jira/browse/NUTCH-2066 Project: Nutch Issue Type: Sub-task Components: REST_api Reporter: Sujen Shah Assignee: Chris A. Mattmann Priority: Minor Labels: memex Fix For: 1.11 -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1
[ https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Chris A. Mattmann resolved NUTCH-2072.
--
Resolution: Fixed
Fixed, thanks [~ltanguy]
{noformat}
[chipotle:~/tmp/nutch-trunk] mattmann% svn commit -m Fix for NUTCH-2072: Deflate encoding support is broken when http.content.limit is set to -1 contributed by Tanguy Moal tan...@cogniteev.com this closes #48.
Sending CHANGES.txt
Sending src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpBase.java
Transmitting file data ..
Committed revision 1693843.
[chipotle:~/tmp/nutch-trunk] mattmann%
{noformat}
Deflate encoding support is broken when http.content.limit is set to -1
---
Key: NUTCH-2072
URL: https://issues.apache.org/jira/browse/NUTCH-2072
Project: Nutch
Issue Type: Bug
Components: plugin, protocol
Reporter: Tanguy Moal
Assignee: Chris A. Mattmann
Priority: Minor
Fix For: 1.11
The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is not designed to have sizeLimit set to a negative value. The fix can be simply to mimic what's done with gzip encoding : if {{getMaxContent() 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} argument.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
[GitHub] nutch pull request: Fix for NUTCH-2066 contributed by Sujen Shah
Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/47
[jira] [Commented] (NUTCH-2066) Parameterize Generate REST endpoint
[ https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14651297#comment-14651297 ] Chris A. Mattmann commented on NUTCH-2066: -- All tests pass: {noformat} test: [echo] Testing plugin: urlnormalizer-slash [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/usr/local/Cellar/ant/1.9.4/libexec/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:file:/Users/mattmann/tmp/nutch-trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.886 sec [junit] Running org.apache.nutch.net.urlnormalizer.slash.TestSlashURLNormalizer [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 0.552 sec BUILD SUCCESSFUL Total time: 11 minutes 11 seconds [chipotle:~/tmp/nutch-trunk] mattmann% {noformat} Parameterize Generate REST endpoint --- Key: NUTCH-2066 URL: https://issues.apache.org/jira/browse/NUTCH-2066 Project: Nutch Issue Type: Sub-task Components: REST_api Reporter: Sujen Shah Assignee: Chris A. Mattmann Priority: Minor Labels: memex Fix For: 1.11 Allow user to specify crawldb and segment db in the Generate Job REST endpoint -- This message was sent by Atlassian JIRA (v6.3.4#6332)
Build failed in Jenkins: Nutch-trunk #3238
See https://builds.apache.org/job/Nutch-trunk/3238/changes Changes: [mattmann] Fix for NUTCH-2066: Parameterize Generate REST endpoint contributed by Sujen Shah sujen1...@gmail.com this closes #47. -- [...truncated 4272 lines...] [copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-host copy-generated-lib: [copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-host init: [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/classes [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/test/lib [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-pass init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-pass [javac] Compiling 2 source files to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/classes [javac] Creating empty https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/classes/org/apache/nutch/net/urlnormalizer/pass/package-info.class jar: [jar] Building jar: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-pass/urlnormalizer-pass.jar deps-test: deploy: [copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-pass copy-generated-lib: [copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-pass init: [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring [mkdir] 
Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring/classes [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring/test [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring/test/lib [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-querystring init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-querystring [javac] Compiling 2 source files to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring/classes [javac] Creating empty https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring/classes/org/apache/nutch/net/urlnormalizer/querystring/package-info.class jar: [jar] Building jar: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-querystring/urlnormalizer-querystring.jar deps-test: deploy: [copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-querystring copy-generated-lib: [copy] Copying 1 file to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-querystring [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/test/data [copy] Copying 4 files to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/test/data init: [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/classes [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/test/lib [mkdir] Created dir: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/plugins/urlnormalizer-regex init-plugin: deps-jar: clean-lib: resolve-default: [ivy:resolve] :: 
loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile: [echo] Compiling plugin: urlnormalizer-regex [javac] Compiling 2 source files to https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/classes [javac] Creating empty https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/classes/org/apache/nutch/net/urlnormalizer/regex/package-info.class jar: [jar] Building jar: https://builds.apache.org/job/Nutch-trunk/ws/trunk/build/urlnormalizer-regex/urlnormalizer-regex.jar deps-test: init: init-plugin: clean-lib: resolve-default: [ivy:resolve] :: loading settings :: file = https://builds.apache.org/job/Nutch-trunk/ws/trunk/ivy/ivysettings.xml compile:
[jira] [Commented] (NUTCH-2066) Parameterize Generate REST endpoint
[ https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651314#comment-14651314 ] Hudson commented on NUTCH-2066: --- FAILURE: Integrated in Nutch-trunk #3238 (See [https://builds.apache.org/job/Nutch-trunk/3238/]) Fix for NUTCH-2066: Parameterize Generate REST endpoint contributed by Sujen Shah sujen1...@gmail.com this closes #47. (mattmann: http://svn.apache.org/viewvc/nutch/trunk/?view=rev&rev=1693844) * /nutch/trunk/CHANGES.txt * /nutch/trunk/src/java/org/apache/nutch/crawl/Generator.java Parameterize Generate REST endpoint --- Key: NUTCH-2066 URL: https://issues.apache.org/jira/browse/NUTCH-2066 Project: Nutch Issue Type: Sub-task Components: REST_api Reporter: Sujen Shah Assignee: Chris A. Mattmann Priority: Minor Labels: memex Fix For: 1.11 Allow user to specify crawldb and segment db in the Generate Job REST endpoint
[jira] [Commented] (NUTCH-2072) Deflate encoding support is broken when http.content.limit is set to -1
[ https://issues.apache.org/jira/browse/NUTCH-2072?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651277#comment-14651277 ] Chris A. Mattmann commented on NUTCH-2072: -- Tests pass: {noformat} copy-generated-lib: test: [echo] Testing plugin: urlnormalizer-slash [junit] WARNING: multiple versions of ant detected in path for junit [junit] jar:file:/usr/local/Cellar/ant/1.9.4/libexec/lib/ant.jar!/org/apache/tools/ant/Project.class [junit] and jar:file:/Users/mattmann/tmp/nutch-trunk/build/test/lib/ant-1.6.5.jar!/org/apache/tools/ant/Project.class [junit] Running org.apache.nutch.net.urlnormalizer.slash.TestSlashURLNormalizer [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 2.055 sec [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 11.856 sec [junit] Running org.apache.nutch.tika.TestRTFParser [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 1, Time elapsed: 0.125 sec [junit] Running org.apache.nutch.tika.TestRobotsMetaProcessor [junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 17.994 sec BUILD SUCCESSFUL Total time: 13 minutes 21 seconds {noformat} Committing this now. Thanks. Deflate encoding support is broken when http.content.limit is set to -1 --- Key: NUTCH-2072 URL: https://issues.apache.org/jira/browse/NUTCH-2072 Project: Nutch Issue Type: Bug Components: plugin, protocol Reporter: Tanguy Moal Assignee: Chris A. Mattmann Priority: Minor Fix For: 1.11 The method {{DeflateUtils.inflateBestEffort(byte[] in, int sizeLimit)}} is not designed to have sizeLimit set to a negative value. The fix can be simply to mimic what's done with gzip encoding: if {{getMaxContent() < 0}} then use {{Integer.MAX_VALUE}} for the {{sizeLimit}} argument.
[jira] [Commented] (NUTCH-2066) Parameterize Generate REST endpoint
[ https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14651300#comment-14651300 ] ASF GitHub Bot commented on NUTCH-2066: --- Github user asfgit closed the pull request at: https://github.com/apache/nutch/pull/47 Parameterize Generate REST endpoint --- Key: NUTCH-2066 URL: https://issues.apache.org/jira/browse/NUTCH-2066 Project: Nutch Issue Type: Sub-task Components: REST_api Reporter: Sujen Shah Assignee: Chris A. Mattmann Priority: Minor Labels: memex Fix For: 1.11 Allow user to specify crawldb and segment db in the Generate Job REST endpoint
[jira] [Resolved] (NUTCH-2066) Parameterize Generate REST endpoint
[ https://issues.apache.org/jira/browse/NUTCH-2066?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Chris A. Mattmann resolved NUTCH-2066. -- Resolution: Fixed Committed to trunk: {noformat} [chipotle:~/tmp/nutch-trunk] mattmann% svn commit -m Fix for NUTCH-2066: Parameterize Generate REST endpoint contributed by Sujen Shah sujen1...@gmail.com this closes #47. Sending CHANGES.txt Sending src/java/org/apache/nutch/crawl/Generator.java Transmitting file data .. Committed revision 1693844. [chipotle:~/tmp/nutch-trunk] mattmann% {noformat} Parameterize Generate REST endpoint --- Key: NUTCH-2066 URL: https://issues.apache.org/jira/browse/NUTCH-2066 Project: Nutch Issue Type: Sub-task Components: REST_api Reporter: Sujen Shah Assignee: Chris A. Mattmann Priority: Minor Labels: memex Fix For: 1.11 Allow user to specify crawldb and segment db in the Generate Job REST endpoint
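As a rough illustration of what "allow user to specify crawldb and segment db" could mean for a REST job, the sketch below resolves the two locations from a request argument map and falls back to defaults derived from a crawl id. Everything here is an assumption made for the example: the class `GenerateArgsSketch`, the keys `crawldb` and `segment_dir`, the default layout, and the paths are hypothetical and are not taken from the NUTCH-2066 patch or from Generator.java.

```java
import java.util.HashMap;
import java.util.Map;

final class GenerateArgsSketch {

  /** Returns the caller-supplied value for key, or the fallback when the key is absent. */
  static String resolve(Map<String, Object> args, String key, String fallback) {
    Object value = args.get(key);
    return value == null ? fallback : value.toString();
  }

  public static void main(String[] unused) {
    String crawlId = "crawl-01"; // hypothetical crawl identifier

    // Hypothetical request body: only the crawldb location is overridden.
    Map<String, Object> request = new HashMap<>();
    request.put("crawldb", "/data/custom/crawldb");

    // Keys and default layout are assumptions for illustration only.
    String crawlDb = resolve(request, "crawldb", crawlId + "/crawldb");
    String segments = resolve(request, "segment_dir", crawlId + "/segments");

    System.out.println("crawldb  = " + crawlDb);
    System.out.println("segments = " + segments);
  }
}
```

The point of the fallback is that existing callers that send no paths keep their current behaviour, while callers that do send them can point the job at a different crawldb or segment directory.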