[
https://issues.apache.org/jira/browse/NUTCH-3107?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18004767#comment-18004767
]
Sebastian Nagel commented on NUTCH-3107:
----------------------------------------
Hi [~markus], a very late review...
thanks! A useful feature and reasonable use case.
- I can apply NUTCH-3107-1.patch, but applying NUTCH-3107-2.patch does not
succeed, neither on a clean checkout nor on top of the first patch. Could you
elaborate on how to apply the second patch?
- also: NUTCH-3107-2.patch introduces two properties which are not documented
in nutch-default.xml
- two patch files include a change in FetchOverdueCrawlDatumProcessor - I
assume that's unintended.
- since the default rules are empty, maybe do not add urlnormalizer-querystring
to the default value of {{plugin.includes}}. Even with empty rules the plugin
still sorts the URL query parameters, so leaving it out avoids unintended changes.
- ideally, the description in package-info.java should be extended to cover the
new feature
> QueryString normalizer to support per-host removal of qstr params
> -----------------------------------------------------------------
>
> Key: NUTCH-3107
> URL: https://issues.apache.org/jira/browse/NUTCH-3107
> Project: Nutch
> Issue Type: Improvement
> Components: urlnormalizer
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Minor
> Fix For: 1.21
>
> Attachments: NUTCH-3107-1.patch, NUTCH-3107-2.patch, NUTCH-3107.patch
>
>
> The QueryString normalizer currently only sorts query string key/value pairs.
> It could also support removal of per-host configurable keys.
> Normally this can be done in the regex normalizer, but having a few million XML
> entries in the config parsed every time, and millions of regular expressions
> executed, is not very convenient.
> Updated patch also adds support for global ignorable params, and some other
> checks on query string keys.
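
The behaviour described above (sorting query parameters, plus dropping globally and per-host ignorable keys) could look roughly like the following. This is a minimal sketch, not the code from the attached patches: the class name QueryStringSketch, the method normalize, and the two rule collections are hypothetical, and the sketch assumes the URL carries no fragment and that keys are compared without decoding.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; not the Nutch plugin class.
public class QueryStringSketch {

  /**
   * Removes ignorable query parameters and sorts the remaining ones.
   * globalIgnored applies to every host, perHostIgnored only to the
   * host it is keyed by. Both rule sets are assumptions of this sketch.
   */
  static String normalize(String url, Set<String> globalIgnored,
      Map<String, Set<String>> perHostIgnored) {
    int q = url.indexOf('?');
    if (q < 0) {
      return url; // no query string, nothing to do
    }

    // Look up the per-host rules with a plain hash lookup, no regex needed.
    String host = URI.create(url).getHost();
    Set<String> hostIgnored = Set.of();
    if (host != null && perHostIgnored.containsKey(host)) {
      hostIgnored = perHostIgnored.get(host);
    }

    List<String> kept = new ArrayList<>();
    for (String pair : url.substring(q + 1).split("&")) {
      if (pair.isEmpty()) {
        continue; // skip empty segments such as "a=1&&b=2"
      }
      int eq = pair.indexOf('=');
      String key = eq < 0 ? pair : pair.substring(0, eq);
      if (globalIgnored.contains(key) || hostIgnored.contains(key)) {
        continue; // drop ignorable parameter
      }
      kept.add(pair);
    }

    // Sorting happens even when both rule sets are empty.
    Collections.sort(kept);
    String base = url.substring(0, q);
    return kept.isEmpty() ? base : base + "?" + String.join("&", kept);
  }

  public static void main(String[] args) {
    Set<String> global = Set.of("utm_source");
    Map<String, Set<String>> perHost = Map.of("example.org", Set.of("sid"));
    System.out.println(normalize(
        "https://example.org/p?b=2&utm_source=x&a=1&sid=9", global, perHost));
    System.out.println(normalize(
        "https://example.org/p?b=2&a=1", Set.of(), Map.of()));
  }
}
```

Note that the second call in main uses empty rule sets and still reorders the parameters, which is the effect the review comment about {{plugin.includes}} refers to.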
--
This message was sent by Atlassian Jira
(v8.20.10#820010)