[ 
https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718708#comment-16718708
 ] 

Sebastian Nagel edited comment on NUTCH-2678 at 12/12/18 10:04 AM:
-------------------------------------------------------------------

Hi [~markus17], I've successfully tested the latest patch. It works: the 
protocol implementations are selected as configured in the 
host-protocol-mapping.txt.

Two important points to consider:
 # ??if no mapping is defined the method findExtension() is called just once, 
as it currently also is.??
 Looks like it is now called for every URL without a host mapping. Before, it 
was called only when a protocol (http://, https://, ftp://) was seen for the 
first time; the result was then kept in the objectCache under the key 
"{{Protocol.X_POINT_ID + protocolName}}".
 # one substantial problem: we need a way to explicitly define the default 
protocol plugin (as fall-back if no explicit host mapping exists)
 ** of course, all protocol plugin implementations used in the 
host-protocol-mapping.txt also need to be activated via {{plugin.includes}}
 ** consequently multiple protocol plugins are now active, all of them 
supporting http/https, and there is no control over which plugin is selected by 
findExtension()
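
To illustrate the first point, here is a minimal sketch of a lookup that consults the host mapping first and otherwise caches the per-protocol default, so the expensive search runs only once per protocol. It is not the code from the patch and not the actual ProtocolFactory: the class {{CachedProtocolLookup}}, the {{resolve}} method and the counter are hypothetical names, and the {{findExtension}} function passed in is only a stand-in for the real method.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical illustration only; not the actual Nutch ProtocolFactory. */
class CachedProtocolLookup {
  // host name -> plugin ID, as read from host-protocol-mapping.txt
  private final Map<String, String> hostMapping;
  // protocol name -> default plugin ID, filled lazily (plays the role of the objectCache)
  private final Map<String, String> defaultsByProtocol = new ConcurrentHashMap<>();
  // stand-in for the expensive findExtension() search (assumption)
  private final Function<String, String> findExtension;
  // counts how often the expensive search actually runs
  final AtomicInteger findExtensionCalls = new AtomicInteger();

  CachedProtocolLookup(Map<String, String> hostMapping,
                       Function<String, String> findExtension) {
    this.hostMapping = hostMapping;
    this.findExtension = findExtension;
  }

  String resolve(String protocolName, String host) {
    // 1. an explicit host mapping wins
    String mapped = hostMapping.get(host);
    if (mapped != null) {
      return mapped;
    }
    // 2. otherwise use the per-protocol default, searching only on first use
    return defaultsByProtocol.computeIfAbsent(protocolName, p -> {
      findExtensionCalls.incrementAndGet();
      return findExtension.apply(p);
    });
  }
}
```

With this shape, any number of unmapped URLs sharing a protocol trigger a single search, which matches the behaviour before the patch.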

Adding explicit mappings by protocols could solve both points, e.g.
{noformat}
protocol:http <tab> org.apache.nutch.protocol.http.Http
{noformat}
Just need to make sure that the identifier (such as {{protocol:http}}) is never 
a valid hostname. But maybe there is a more elegant way to define the fall-back 
protocol implementations?
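
A rough sketch of how such a file could be read, assuming tab-separated lines and "#" comments as in the proposed mapping file. The class {{HostProtocolMapping}} and its methods are hypothetical and not part of the patch. Since a colon cannot occur in a hostname, the {{protocol:}} prefix should not collide with a host entry.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical parser for the proposed mapping format; not part of the patch. */
class HostProtocolMapping {
  // Assumed prefix marking a per-protocol default entry
  static final String PROTOCOL_PREFIX = "protocol:";

  final Map<String, String> byHost = new HashMap<>();
  final Map<String, String> byProtocol = new HashMap<>();

  static HostProtocolMapping parse(String content) {
    HostProtocolMapping mapping = new HostProtocolMapping();
    for (String rawLine : content.split("\\r?\\n")) {
      String line = rawLine.trim();
      if (line.isEmpty() || line.startsWith("#")) {
        continue; // skip blank lines and comments
      }
      String[] parts = line.split("\t+");
      if (parts.length != 2) {
        continue; // skip malformed lines
      }
      String key = parts[0].trim();
      String pluginId = parts[1].trim();
      if (key.startsWith(PROTOCOL_PREFIX)) {
        // default for a whole protocol, e.g. "protocol:http"
        mapping.byProtocol.put(key.substring(PROTOCOL_PREFIX.length()), pluginId);
      } else {
        mapping.byHost.put(key, pluginId);
      }
    }
    return mapping;
  }

  /** Returns the plugin ID for the host, else the protocol default, else null. */
  String resolve(String protocolName, String host) {
    String pluginId = byHost.get(host);
    return pluginId != null ? pluginId : byProtocol.get(protocolName);
  }
}
```

A null result would mean that neither a host entry nor a protocol default exists, in which case the caller could still fall back to findExtension().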

Minor points:
 - the host-protocol-mapping.txt should not define any mappings by default and 
should also be committed as host-protocol-mapping.txt.template so that 
modifications are not accidentally overwritten
 - when applying the patch git reports many lines with trailing white space
 - there is a System.out.println(...) call (shouldn't be there)


> Allow for per-host configurable protocol plugin
> -----------------------------------------------
>
>                 Key: NUTCH-2678
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2678
>             Project: Nutch
>          Issue Type: Improvement
>          Components: protocol
>    Affects Versions: 1.15
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>            Priority: Major
>             Fix For: 1.16
>
>         Attachments: NUTCH-2678.patch, NUTCH-2678.patch, NUTCH-2678.patch
>
>
> Introduces new configuration file for mapping protocol plugins to hostnames.
> {code}
> # This file defines a hostname to protocol plugin mapping. Each line takes a
> # host name followed by a tab, followed by the ID of the protocol plugin. You
> # can find the ID in the protocol plugin's plugin.xml file.
> # 
> # <hostname>\t<plugin_id>\n
> # nutch.apache.org      org.apache.nutch.protocol.httpclient.Http
> # tika.apache.org       org.apache.nutch.protocol.http.Http
> #
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
