[ 
https://issues.apache.org/jira/browse/NUTCH-2858?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17309931#comment-17309931
 ] 

ASF GitHub Bot commented on NUTCH-2858:
---------------------------------------

sebastian-nagel opened a new pull request #575:
URL: https://github.com/apache/nutch/pull/575


   - if a URL includes a port, the protocol is not normalized
   
   Note that
   - urlnormalizer-basic removes default ports: `https://example.com:443/` is
normalized to `https://example.com/`. Because normalizers are chained, there is
no need to handle default ports in urlnormalizer-protocol.
   - non-default ports can always be mapped by urlnormalizer-regex. There
shouldn't be many such mappings, so the price of more complex rules and slower
execution is acceptable.
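
The rule described above can be illustrated with a minimal sketch. This is not the plugin's actual code: the class `ProtocolNormalizerSketch`, the method `normalize`, and the host-to-protocol map are hypothetical names chosen for illustration. It assumes that default ports have already been stripped by an earlier normalizer in the chain, so any port still present is treated as explicit and the URL is left untouched.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Map;

// Illustrative sketch only; names and structure are assumptions,
// not the urlnormalizer-protocol implementation.
class ProtocolNormalizerSketch {

  /**
   * Returns the URL with its protocol replaced by the one configured for
   * the host, unless the URL carries an explicit port. In that case both
   * port and protocol are kept as they are.
   */
  static String normalize(String url, Map<String, String> protocolByHost) {
    URI uri;
    try {
      uri = new URI(url);
    } catch (URISyntaxException e) {
      return url; // leave unparsable URLs untouched
    }
    String scheme = uri.getScheme();
    String host = uri.getHost();
    if (scheme == null || host == null) {
      return url;
    }
    // URI.getPort() returns -1 when no port is given in the URL.
    if (uri.getPort() != -1) {
      return url; // explicit port: keep port and protocol unchanged
    }
    String target = protocolByHost.get(host);
    if (target == null || target.equalsIgnoreCase(scheme)) {
      return url; // no rule for this host, or already normalized
    }
    // Swap only the scheme; the rest of the URL is kept verbatim.
    return target + url.substring(scheme.length());
  }

  public static void main(String[] args) {
    // Hypothetical rule: example.com should be fetched via https.
    Map<String, String> rules = Map.of("example.com", "https");
    System.out.println(normalize("http://example.com/", rules));
    System.out.println(normalize("http://example.com:8080/", rules));
  }
}
```

With the hypothetical rule above, `http://example.com/` becomes `https://example.com/`, while `http://example.com:8080/` is returned unchanged.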


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


> urlnormalizer-protocol: URL port is lost during normalization
> -------------------------------------------------------------
>
>                 Key: NUTCH-2858
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2858
>             Project: Nutch
>          Issue Type: Bug
>          Components: plugin, urlnormalizer
>    Affects Versions: 1.18
>            Reporter: Sebastian Nagel
>            Assignee: Sebastian Nagel
>            Priority: Minor
>             Fix For: 1.19
>
>
> If a URL includes a port, e.g. {{http://example.com:8080/}} or 
> {{https://example.com:8443/}}, the port is removed when normalizing with 
> urlnormalizer-protocol.
> Instead, if the port is set,
> - the port should be kept as is and
> - the protocol should be unchanged
>    - keeping the port and changing the protocol might result in a connection 
> failure
>    - unlike the default port mappings (80 (http) <> 443 (https)), 
> non-default port mappings (8080 <> 8443) are risky and unlikely to work on 
> every server setup.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
