[ https://issues.apache.org/jira/browse/NUTCH-2993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17733442#comment-17733442 ]
Sebastian Nagel commented on NUTCH-2993:
----------------------------------------

Hi [~markus17], the patch actually applies to master, so there is no need to create a specific one. A few comments:

- Because both pattern matching and URL.toString() are computationally non-trivial (an integer comparison is): maybe do the skip-depth check only if {{curDepth >= curMaxDepth}}. This saves computation time for all pages which are below {{curMaxDepth}}.
- [Configuration.get(name)|https://hadoop.apache.org/docs/stable/api/org/apache/hadoop/conf/Configuration.html#get-java.lang.String-] returns null as the default value, so there is no need to pass it. But maybe pass an empty string as the default value and adapt the check.
- Add the property to nutch-default.xml with a description.
- Maybe catch potential exceptions in {{Pattern.compile(...)}} and log them as errors. The pattern comes from the configuration, and an invalid pattern would otherwise stop the crawl.
- Typo in the variable name {{depthOverrridePattern}}.

> ScoringDepth plugin to skip depth check based on URL Pattern
> ------------------------------------------------------------
>
>                 Key: NUTCH-2993
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2993
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Minor
>             Fix For: 1.20
>
>         Attachments: NUTCH-2993-1.15.patch
>
>
> We do not want some crawls to go deep and broad, but instead to focus on a
> narrow section of sites. This patch skips the depth check if the current URL
> matches some regular expression.
>
> Initially we tried to set a custom maxDepth based on a Pattern match, but
> this didn't work. The crawler still managed to creep too deep due to having
> links everywhere.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
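The review points above (do the cheap integer depth comparison before any regex work, and compile the configured pattern defensively so an invalid value is logged rather than crashing the crawl) could be sketched roughly as below. This is a minimal illustration, not the actual patch: the class name, method names, and the {{depthOverridePattern}}/{{skipByDepth}} identifiers are hypothetical.

```java
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

public class DepthOverrideSketch {

    // Compile the configured regex defensively. The value comes from the
    // configuration, so an invalid pattern is logged as an error and
    // disables the override instead of stopping the crawl.
    static Pattern compileOrNull(String regex) {
        if (regex == null || regex.isEmpty()) {
            return null; // property unset or empty-string default: no override
        }
        try {
            return Pattern.compile(regex);
        } catch (PatternSyntaxException e) {
            System.err.println("Invalid depth-override pattern: " + e.getMessage());
            return null;
        }
    }

    // Returns true if the page should be cut off by the depth filter.
    static boolean skipByDepth(int curDepth, int curMaxDepth,
                               Pattern depthOverridePattern, String url) {
        if (curDepth < curMaxDepth) {
            return false; // cheap integer check first; no regex or toString() cost
        }
        // Only pages at/over the depth limit pay for the pattern match:
        // a matching URL is exempted from the depth cutoff.
        if (depthOverridePattern != null
                && depthOverridePattern.matcher(url).find()) {
            return false;
        }
        return true;
    }
}
```

With this ordering, pages below {{curMaxDepth}} never touch the regex engine, and a broken pattern in the configuration degrades to "no override" rather than an aborted job.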