[ https://issues.apache.org/jira/browse/NUTCH-1969?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14514187#comment-14514187 ]
Chris A. Mattmann commented on NUTCH-1969:
------------------------------------------
Thanks Markus, seen. Someone marked it for 1.10 and I was trying to get all the
1.10 stuff in so that we can make a release. No worries about it not being
fully done. That way we've got a copy of it in trunk and in the release, and
that way it will be easier to make improvements since they will be more
incremental. Feel free to open new issues to enhance this in e.g., 1.11 and I
think we're good. Cheers.
> URL Normalizer properly handling slashes
> ----------------------------------------
>
> Key: NUTCH-1969
> URL: https://issues.apache.org/jira/browse/NUTCH-1969
> Project: Nutch
> Issue Type: New Feature
> Components: plugin
> Affects Versions: 1.9
> Reporter: Markus Jelsma
> Assignee: Chris A. Mattmann
> Fix For: 1.10
>
> Attachments: NUTCH-1969.patch
>
>
> This is a URL normalizer we use that is simple to use and generate for
> dealing with hosts that mix up slash-suffixed URLs with non-slash-suffixed
> URLs.
> It is similar to the host normalizer, reducing the number of duplicates while
> crawling. It takes newline-delimited rules, each consisting of a host,
> separated by either a tab or whitespace from a + (PLUS) or - (MINUS) sign
> denoting whether or not a slash is to be added to the path.
> The normalizer ignores pages that look like files with extensions, see tests.
> Note: the normalizer must be enhanced to not take hosts as first argument of
> a rule, but host/path prefixes because some hosts need different rules
> depending on the root path. For example,
> * example.org/cms/news/1/2/3/4 is a CMS that doesn't accept slashes; if they
> are suffixed, the user is redirected to a non-slash page;
> * example.org/files/a/b/ wants to do it just the other way around.
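For readers who want to see the rule format and its effect in concrete terms, here is a minimal Python sketch of the behaviour described above. It is not the code from NUTCH-1969.patch. The function names (parse_rules, normalize) and the regular expression used to decide what "looks like a file with an extension" are assumptions made for illustration only. Rules are keyed by host alone, as in the current description, not by host/path prefix.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumed heuristic: the last path segment ends in a short ".ext" suffix.
FILE_LIKE = re.compile(r"\.[A-Za-z0-9]{1,5}$")


def parse_rules(text):
    """Parse newline-delimited rules of the form '<host><tab or spaces><+|->'.

    Returns a dict mapping host -> True (add a slash) or False (remove it).
    Lines that do not match this shape are skipped.
    """
    rules = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] in ("+", "-"):
            rules[parts[0].lower()] = parts[1] == "+"
    return rules


def normalize(url, rules):
    """Add or strip the trailing slash of the path according to the host's rule."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in rules:
        return url  # no rule for this host

    path = parts.path
    last_segment = path.rsplit("/", 1)[-1]
    # Leave the root path and file-like pages untouched.
    if path in ("", "/") or FILE_LIKE.search(last_segment):
        return url

    if rules[host] and not path.endswith("/"):
        path += "/"
    elif not rules[host] and path.endswith("/"):
        path = path.rstrip("/")

    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))


rules = parse_rules("example.org\t-\nfiles.example.org +\n")
print(normalize("http://example.org/cms/news/1/2/3/4/", rules))
print(normalize("http://files.example.org/files/a/b", rules))
```

Supporting the host/path prefixes mentioned in the note would mean matching the longest rule prefix against host plus path instead of a plain host lookup. The sketch deliberately leaves that out.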