Jenkins build is back to normal : Nutch » Nutch-trunk #104

2023-08-22 Thread Apache Jenkins Server
See 




[jira] [Commented] (NUTCH-2997) Add Override annotations where applicable

2023-08-22 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757333#comment-17757333
 ] 

Hudson commented on NUTCH-2997:
---

FAILURE: Integrated in Jenkins build Nutch » Nutch-trunk #103 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/103/])
NUTCH-2997 Add Override annotations (snagel: 
[https://github.com/apache/nutch/commit/0fae6b59fd85f2ec894a28089c1d086b2604660a])
* (edit) src/java/org/apache/nutch/hostdb/ResolverThread.java
* (edit) src/test/org/apache/nutch/plugin/SimpleTestPlugin.java
* (edit) src/java/org/apache/nutch/parse/ParseImpl.java
* (edit) src/java/org/apache/nutch/parse/ParseOutputFormat.java
* (edit) src/java/org/apache/nutch/service/impl/SeedManagerImpl.java
* (edit) src/java/org/apache/nutch/fetcher/QueueFeeder.java
* (edit) src/java/org/apache/nutch/tools/CommonCrawlFormat.java
* (edit) src/java/org/apache/nutch/metadata/MetaWrapper.java
* (edit) 
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpAuthenticationFactory.java
* (edit) 
src/plugin/urlfilter-automaton/src/test/org/apache/nutch/urlfilter/automaton/TestAutomatonURLFilter.java
* (edit) src/java/org/apache/nutch/crawl/MD5Signature.java
* (edit) src/java/org/apache/nutch/plugin/PluginRepository.java
* (edit) src/java/org/apache/nutch/service/resources/AdminResource.java
* (edit) 
src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/HttpResponse.java
* (edit) src/java/org/apache/nutch/parse/ParseData.java
* (edit) 
src/plugin/protocol-htmlunit/src/java/org/apache/nutch/protocol/htmlunit/DummyX509TrustManager.java
* (edit) 
src/plugin/protocol-ftp/src/java/org/apache/nutch/protocol/ftp/PrintCommandListener.java
* (edit) src/java/org/apache/nutch/parse/ParseText.java
* (edit) src/java/org/apache/nutch/util/EncodingDetector.java
* (edit) 
src/plugin/urlfilter-fast/src/java/org/apache/nutch/urlfilter/fast/FastURLFilter.java
* (edit) src/java/org/apache/nutch/tools/ResolveUrls.java
* (edit) src/java/org/apache/nutch/fetcher/FetcherOutputFormat.java
* (edit) src/java/org/apache/nutch/util/CommandRunner.java
* (edit) 
src/plugin/lib-http/src/java/org/apache/nutch/protocol/http/api/HttpRobotRulesParser.java
* (edit) src/java/org/apache/nutch/crawl/AdaptiveFetchSchedule.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/Node.java
* (edit) src/java/org/apache/nutch/crawl/Signature.java
* (edit) src/java/org/apache/nutch/tools/CommonCrawlFormatSimple.java
* (edit) src/java/org/apache/nutch/tools/DmozParser.java
* (edit) src/java/org/apache/nutch/tools/arc/ArcInputFormat.java
* (edit) src/test/org/apache/nutch/crawl/CrawlDbUpdateUtil.java
* (edit) src/java/org/apache/nutch/metadata/Metadata.java
* (edit) src/java/org/apache/nutch/util/SuffixStringMatcher.java
* (edit) src/java/org/apache/nutch/scoring/webgraph/LinkDumper.java
* (edit) src/java/org/apache/nutch/service/impl/ConfManagerImpl.java
* (edit) 
src/plugin/parse-html/src/java/org/apache/nutch/parse/html/DOMContentUtils.java
* (edit) 
src/plugin/urlfilter-fast/src/test/org/apache/nutch/urlfilter/fast/TestFastURLFilter.java
* (edit) src/java/org/apache/nutch/protocol/Content.java
* (edit) src/java/org/apache/nutch/util/PrefixStringMatcher.java
* (edit) src/java/org/apache/nutch/parse/ParserChecker.java
* (edit) src/java/org/apache/nutch/segment/SegmentPart.java
* (edit) src/java/org/apache/nutch/parse/ParseStatus.java
* (edit) src/java/org/apache/nutch/crawl/TextMD5Signature.java
* (edit) 
src/plugin/protocol-selenium/src/java/org/apache/nutch/protocol/selenium/HttpResponse.java
* (edit) 
src/plugin/parse-tika/src/java/org/apache/nutch/parse/tika/DOMContentUtils.java
* (edit) src/test/org/apache/nutch/crawl/TestAdaptiveFetchSchedule.java
* (edit) src/java/org/apache/nutch/scoring/ScoringFilters.java
* (edit) 
src/plugin/urlfilter-regex/src/test/org/apache/nutch/urlfilter/regex/TestRegexURLFilter.java
* (edit) 
src/plugin/protocol-interactiveselenium/src/java/org/apache/nutch/protocol/interactiveselenium/handlers/DefalultMultiInteractionHandler.java
* (edit) 
src/plugin/protocol-httpclient/src/java/org/apache/nutch/protocol/httpclient/HttpResponse.java
* (edit) src/test/org/apache/nutch/crawl/TestGenerator.java
* (edit) src/java/org/apache/nutch/protocol/ProtocolStatus.java
* (edit) src/java/org/apache/nutch/crawl/AbstractFetchSchedule.java
* (edit) src/test/org/apache/nutch/crawl/CrawlDBTestUtil.java
* (edit) src/java/org/apache/nutch/segment/SegmentMerger.java
* (edit) 
src/plugin/indexer-csv/src/java/org/apache/nutch/indexwriter/csv/CSVIndexWriter.java
* (edit) 
src/plugin/urlnormalizer-regex/src/java/org/apache/nutch/net/urlnormalizer/regex/RegexURLNormalizer.java
* (edit) src/java/org/apache/nutch/net/URLNormalizerChecker.java
* (edit) 
src/plugin/urlnormalizer-regex/src/test/org/apache/nutch/net/urlnormalizer/regex/TestRegexURLNormalizer.java
* (edit) 

[jira] [Commented] (NUTCH-2996) Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)

2023-08-22 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757332#comment-17757332
 ] 

Hudson commented on NUTCH-2996:
---

FAILURE: Integrated in Jenkins build Nutch » Nutch-trunk #103 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/103/])
NUTCH-2996 Use new SimpleRobotRulesParser API entry point crawler-commons 1.4 
(github: 
[https://github.com/apache/nutch/commit/070c115cfadbc937a8ad0add6447461983e92028])
* (edit) src/java/org/apache/nutch/protocol/RobotRulesParser.java
* (edit) 
src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
* (edit) conf/nutch-default.xml


> Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)
> 
>
> Key: NUTCH-2996
> URL: https://issues.apache.org/jira/browse/NUTCH-2996
> Project: Nutch
> Issue Type: Improvement
> Components: robots
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
>
> Crawler-commons 1.4 (#1085) robots.txt parser (SimpleRobotRulesParser) 
> introduces a new [API entry point to parse the robots.txt 
> content|https://crawler-commons.github.io/crawler-commons/1.4/crawlercommons/robots/SimpleRobotRulesParser.html#parseContent(java.lang.String,byte%5B%5D,java.lang.String,java.util.Collection)]:
> - it's more efficient by accepting a collection of lower-cased, single-word 
> user-agent product tokens, without the need to tokenize a (comma-separated) 
> list of user-agent strings again with every robots.txt
> - user-agent matching is compliant with [RFC 9309 (section 
> 2.2.1)|https://www.rfc-editor.org/rfc/rfc9309.html#name-the-user-agent-line] 
> only if the new API method is used
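
A minimal sketch of the first point, assuming a hypothetical helper class `AgentTokens` (not part of Nutch or crawler-commons): the comma-separated agent list is split and lower-cased once, and the resulting collection is what would be passed to the new `parseContent(String, byte[], String, Collection<String>)` entry point for every robots.txt. The crawler-commons call itself appears only as a comment, and the variable names in it are illustrative. The helper does not check that each token is a single word.

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.stream.Collectors;

// Hypothetical helper, for illustration only.
class AgentTokens {

  /** Splits a comma-separated agent list into trimmed, lower-cased tokens (done once). */
  static Collection<String> tokenize(String agentNames) {
    return Arrays.stream(agentNames.split(","))
        .map(String::trim)
        .filter(s -> !s.isEmpty())
        .map(s -> s.toLowerCase(Locale.ROOT))
        .collect(Collectors.toCollection(LinkedHashSet::new));
  }

  public static void main(String[] args) {
    Collection<String> agents = tokenize("MyBot, Nutch ,nutch");
    System.out.println(agents); // prints [mybot, nutch]
    // With crawler-commons 1.4 on the classpath, the same collection would be
    // reused for each robots.txt (url, content, contentType are placeholders):
    // new SimpleRobotRulesParser().parseContent(url, content, contentType, agents);
  }
}
```

Keeping the tokens in a set also removes duplicates that differ only in case, so the per-robots.txt work is limited to matching.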



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


Build failed in Jenkins: Nutch » Nutch-trunk #103

2023-08-22 Thread Apache Jenkins Server
See 


Changes:

[github] NUTCH-2996 Use new SimpleRobotRulesParser API entry point 
crawler-commons 1.4

[Sebastian Nagel] NUTCH-2997 Add Override annotations


--
[...truncated 825.45 KB...]
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-basic

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 


jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-basic
[junit] Running 
org.apache.nutch.net.urlnormalizer.ajax.TestAjaxURLNormalizer
[junit] Tests run: 2, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.126 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-host

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 


jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-host
[junit] Running 
org.apache.nutch.net.urlnormalizer.basic.TestBasicURLNormalizer
[junit] Tests run: 6, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.515 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-pass

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 


jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-pass
[junit] Running 
org.apache.nutch.net.urlnormalizer.host.TestHostURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.53 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-protocol

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 

[junit] Running 
org.apache.nutch.net.urlnormalizer.pass.TestPassURLNormalizer

jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-protocol
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.334 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-querystring

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 


jar:

deps-test:

deploy:

copy-generated-lib:

test:
 [echo] Testing plugin: urlnormalizer-querystring
[junit] Running 
org.apache.nutch.net.urlnormalizer.protocol.TestProtocolURLNormalizer
[junit] Running 
org.apache.nutch.net.urlnormalizer.querystring.TestQuerystringURLNormalizer
[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
0.307 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-regex

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 

[junit] Tests run: 1, Failures: 0, Errors: 0, Skipped: 0, Time elapsed: 
1.014 sec

init:

init-plugin:

deps-jar:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:
 [echo] Compiling plugin: urlnormalizer-slash

deps-test-compile:

compile-test:
[javac] Compiling 1 source file to 


jar:

deps-test:

init:

init-plugin:

clean-lib:

resolve-default:
[ivy:resolve] :: loading settings :: file = 


compile:

jar:

deps-test:

deploy:

copy-generated-lib:

deploy:


[jira] [Resolved] (NUTCH-2997) Add Override annotations where applicable

2023-08-22 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2997.

Resolution: Implemented

> Add Override annotations where applicable
> -
>
> Key: NUTCH-2997
> URL: https://issues.apache.org/jira/browse/NUTCH-2997
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Trivial
> Fix For: 1.20
>
>






[jira] [Commented] (NUTCH-2997) Add Override annotations where applicable

2023-08-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757320#comment-17757320
 ] 

ASF GitHub Bot commented on NUTCH-2997:
---

sebastian-nagel merged PR #767:
URL: https://github.com/apache/nutch/pull/767




> Add Override annotations where applicable
> -
>
> Key: NUTCH-2997
> URL: https://issues.apache.org/jira/browse/NUTCH-2997
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Trivial
> Fix For: 1.20
>
>






[GitHub] [nutch] sebastian-nagel merged pull request #767: NUTCH-2997 Add Override annotations

2023-08-22 Thread via GitHub


sebastian-nagel merged PR #767:
URL: https://github.com/apache/nutch/pull/767


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: dev-unsubscr...@nutch.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[jira] [Assigned] (NUTCH-2997) Add Override annotations where applicable

2023-08-22 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel reassigned NUTCH-2997:
--

Assignee: Sebastian Nagel

> Add Override annotations where applicable
> -
>
> Key: NUTCH-2997
> URL: https://issues.apache.org/jira/browse/NUTCH-2997
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 1.19
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Trivial
> Fix For: 1.20
>
>






[jira] [Commented] (NUTCH-2996) Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)

2023-08-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757318#comment-17757318
 ] 

ASF GitHub Bot commented on NUTCH-2996:
---

sebastian-nagel merged PR #766:
URL: https://github.com/apache/nutch/pull/766




> Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)
> 
>
> Key: NUTCH-2996
> URL: https://issues.apache.org/jira/browse/NUTCH-2996
> Project: Nutch
> Issue Type: Improvement
> Components: robots
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
>
> Crawler-commons 1.4 (#1085) robots.txt parser (SimpleRobotRulesParser) 
> introduces a new [API entry point to parse the robots.txt 
> content|https://crawler-commons.github.io/crawler-commons/1.4/crawlercommons/robots/SimpleRobotRulesParser.html#parseContent(java.lang.String,byte%5B%5D,java.lang.String,java.util.Collection)]:
> - it's more efficient by accepting a collection of lower-cased, single-word 
> user-agent product tokens, without the need to tokenize a (comma-separated) 
> list of user-agent strings again with every robots.txt
> - user-agent matching is compliant with [RFC 9309 (section 
> 2.2.1)|https://www.rfc-editor.org/rfc/rfc9309.html#name-the-user-agent-line] 
> only if the new API method is used





[GitHub] [nutch] sebastian-nagel merged pull request #766: NUTCH-2996 Use new SimpleRobotRulesParser API entry point crawler-commons 1.4

2023-08-22 Thread via GitHub


sebastian-nagel merged PR #766:
URL: https://github.com/apache/nutch/pull/766





[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-08-22 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757316#comment-17757316
 ] 

Hudson commented on NUTCH-2993:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #102 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/102/])
NUTCH-2993 ScoringDepth plugin to skip depth check based on URL Pattern 
(snagel: 
[https://github.com/apache/nutch/commit/eae3c52a8140344dff46c448664a2467d631cefc])
* (edit) 
src/plugin/scoring-depth/src/java/org/apache/nutch/scoring/depth/DepthScoringFilter.java
* (edit) conf/nutch-default.xml


> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
> Issue Type: Improvement
> Components: plugin, scoring
> Affects Versions: 1.19
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch
>
>
> We do not want some crawls to go deep and broad, but instead to focus on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth 
> value.
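
The behaviour described above can be sketched roughly as follows. This is not the code of `DepthScoringFilter`; the class `DepthOverride`, its fields, and the example pattern are hypothetical, chosen only to show the idea of assigning a different max depth to outlinks whose source URL matches a configured regular expression.

```java
import java.util.regex.Pattern;

// Hypothetical illustration; these names do not come from Nutch.
class DepthOverride {
  private final int defaultMaxDepth;
  private final int overrideMaxDepth;
  private final Pattern overridePattern;

  DepthOverride(int defaultMaxDepth, int overrideMaxDepth, String regex) {
    this.defaultMaxDepth = defaultMaxDepth;
    this.overrideMaxDepth = overrideMaxDepth;
    this.overridePattern = Pattern.compile(regex);
  }

  /** Max depth to assign to the outlinks of the given source URL. */
  int maxDepthFor(String fromUrl) {
    // find() matches anywhere in the URL; anchor the regex to restrict it.
    return overridePattern.matcher(fromUrl).find() ? overrideMaxDepth : defaultMaxDepth;
  }

  public static void main(String[] args) {
    DepthOverride d = new DepthOverride(3, 10, "^https?://example\\.org/docs/");
    System.out.println(d.maxDepthFor("https://example.org/docs/intro.html")); // prints 10
    System.out.println(d.maxDepthFor("https://example.org/blog/post.html"));  // prints 3
  }
}
```

Compiling the pattern once in the constructor avoids recompiling it for every URL that is scored.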





[jira] [Commented] (NUTCH-2995) Upgrade to crawler-commons 1.4

2023-08-22 Thread Hudson (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757315#comment-17757315
 ] 

Hudson commented on NUTCH-2995:
---

SUCCESS: Integrated in Jenkins build Nutch » Nutch-trunk #102 (See 
[https://ci-builds.apache.org/job/Nutch/job/Nutch-trunk/102/])
NUTCH-2995 Upgrade to crawler-commons 1.4 (github: 
[https://github.com/apache/nutch/commit/a24ec5c5b761476897c7fff0bfd3d5107995fedc])
* (edit) 
src/plugin/lib-http/src/test/org/apache/nutch/protocol/http/api/TestRobotRulesParser.java
* (edit) ivy/ivy.xml


> Upgrade to crawler-commons 1.4
> --
>
> Key: NUTCH-2995
> URL: https://issues.apache.org/jira/browse/NUTCH-2995
> Project: Nutch
> Issue Type: Improvement
> Components: robots
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
>






[jira] [Resolved] (NUTCH-2996) Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)

2023-08-22 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2996.

Resolution: Implemented

> Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)
> 
>
> Key: NUTCH-2996
> URL: https://issues.apache.org/jira/browse/NUTCH-2996
> Project: Nutch
> Issue Type: Improvement
> Components: robots
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
>
> Crawler-commons 1.4 (#1085) robots.txt parser (SimpleRobotRulesParser) 
> introduces a new [API entry point to parse the robots.txt 
> content|https://crawler-commons.github.io/crawler-commons/1.4/crawlercommons/robots/SimpleRobotRulesParser.html#parseContent(java.lang.String,byte%5B%5D,java.lang.String,java.util.Collection)]:
> - it's more efficient by accepting a collection of lower-cased, single-word 
> user-agent product tokens, without the need to tokenize a (comma-separated) 
> list of user-agent strings again with every robots.txt
> - user-agent matching is compliant with [RFC 9309 (section 
> 2.2.1)|https://www.rfc-editor.org/rfc/rfc9309.html#name-the-user-agent-line] 
> only if the new API method is used





[jira] [Commented] (NUTCH-2996) Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)

2023-08-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757255#comment-17757255
 ] 

ASF GitHub Bot commented on NUTCH-2996:
---

sebastian-nagel commented on PR #766:
URL: https://github.com/apache/nutch/pull/766#issuecomment-1687747418

   Rebased on top of current master which already includes NUTCH-2995.




> Use new SimpleRobotRulesParser API entry point (crawler-commons 1.4)
> 
>
> Key: NUTCH-2996
> URL: https://issues.apache.org/jira/browse/NUTCH-2996
> Project: Nutch
> Issue Type: Improvement
> Components: robots
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
>
> Crawler-commons 1.4 (#1085) robots.txt parser (SimpleRobotRulesParser) 
> introduces a new [API entry point to parse the robots.txt 
> content|https://crawler-commons.github.io/crawler-commons/1.4/crawlercommons/robots/SimpleRobotRulesParser.html#parseContent(java.lang.String,byte%5B%5D,java.lang.String,java.util.Collection)]:
> - it's more efficient by accepting a collection of lower-cased, single-word 
> user-agent product tokens, without the need to tokenize a (comma-separated) 
> list of user-agent strings again with every robots.txt
> - user-agent matching is compliant with [RFC 9309 (section 
> 2.2.1)|https://www.rfc-editor.org/rfc/rfc9309.html#name-the-user-agent-line] 
> only if the new API method is used





[GitHub] [nutch] sebastian-nagel commented on pull request #766: NUTCH-2996 Use new SimpleRobotRulesParser API entry point crawler-commons 1.4

2023-08-22 Thread via GitHub


sebastian-nagel commented on PR #766:
URL: https://github.com/apache/nutch/pull/766#issuecomment-1687747418

   Rebased on top of current master which already includes NUTCH-2995.





[jira] [Resolved] (NUTCH-2995) Upgrade to crawler-commons 1.4

2023-08-22 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2995?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2995.

Resolution: Implemented

> Upgrade to crawler-commons 1.4
> --
>
> Key: NUTCH-2995
> URL: https://issues.apache.org/jira/browse/NUTCH-2995
> Project: Nutch
> Issue Type: Improvement
> Components: robots
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
>






[jira] [Commented] (NUTCH-2995) Upgrade to crawler-commons 1.4

2023-08-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2995?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757253#comment-17757253
 ] 

ASF GitHub Bot commented on NUTCH-2995:
---

sebastian-nagel merged PR #765:
URL: https://github.com/apache/nutch/pull/765




> Upgrade to crawler-commons 1.4
> --
>
> Key: NUTCH-2995
> URL: https://issues.apache.org/jira/browse/NUTCH-2995
> Project: Nutch
> Issue Type: Improvement
> Components: robots
> Affects Versions: 1.20
> Reporter: Sebastian Nagel
> Assignee: Sebastian Nagel
> Priority: Major
> Fix For: 1.20
>
>






[GitHub] [nutch] sebastian-nagel merged pull request #765: NUTCH-2995 Upgrade to crawler-commons 1.4

2023-08-22 Thread via GitHub


sebastian-nagel merged PR #765:
URL: https://github.com/apache/nutch/pull/765





[jira] [Updated] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-08-22 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel updated NUTCH-2993:
---
Component/s: plugin, scoring
Affects Version/s: 1.19

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
> Issue Type: Improvement
> Components: plugin, scoring
> Affects Versions: 1.19
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch
>
>
> We do not want some crawls to go deep and broad, but instead to focus on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth 
> value.





[jira] [Resolved] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-08-22 Thread Sebastian Nagel (Jira)


 [ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Sebastian Nagel resolved NUTCH-2993.

Resolution: Implemented

Committed/merged. Thanks, [~markus17]!

> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
> Issue Type: Improvement
> Components: plugin, scoring
> Affects Versions: 1.19
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch
>
>
> We do not want some crawls to go deep and broad, but instead to focus on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth 
> value.





[jira] [Commented] (NUTCH-2993) ScoringDepth plugin to skip depth check based on URL Pattern

2023-08-22 Thread ASF GitHub Bot (Jira)


[ 
https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17757250#comment-17757250
 ] 

ASF GitHub Bot commented on NUTCH-2993:
---

sebastian-nagel merged PR #764:
URL: https://github.com/apache/nutch/pull/764




> ScoringDepth plugin to skip depth check based on URL Pattern
> 
>
> Key: NUTCH-2993
> URL: https://issues.apache.org/jira/browse/NUTCH-2993
> Project: Nutch
> Issue Type: Improvement
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.20
>
> Attachments: NUTCH-2993-1.15-1.patch, NUTCH-2993-1.15.patch
>
>
> We do not want some crawls to go deep and broad, but instead to focus on a 
> narrow section of sites.
> This patch overrides maxDepth for outlinks of URLs matching a configured 
> pattern. URLs not matching the pattern get the configured default max depth 
> value.





[GitHub] [nutch] sebastian-nagel merged pull request #764: NUTCH-2993 ScoringDepth plugin to skip depth check based on URL Pattern

2023-08-22 Thread via GitHub


sebastian-nagel merged PR #764:
URL: https://github.com/apache/nutch/pull/764

