[
https://issues.apache.org/jira/browse/NUTCH-2678?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16718708#comment-16718708
]
Sebastian Nagel edited comment on NUTCH-2678 at 12/12/18 10:04 AM:
-------------------------------------------------------------------
Hi [~markus17], I've successfully tested the latest patch. It works: the
protocol implementations are selected as configured in the
host-protocol-mapping.txt.
Two important points to consider:
# ??if no mapping is defined the method findExtension() is called just once,
as it currently also is.??
It looks like it is now called for every URL without a host mapping. Before the
patch, it is called only when a protocol (http://, https://, ftp://) is seen for
the first time; the result is then kept in the objectCache under the key
"{{Protocol.X_POINT_ID + protocolName}}".
# one substantial problem: we need a way to explicitly define the default
protocol plugin (as fall-back if no explicit host mapping exists)
** of course, all protocol plugin implementations used in the
host-protocol-mapping.txt also need to be activated via {{plugin.includes}}
** consequently, multiple protocol plugins are now active, all of them
supporting http/https, and there is no control over which plugin is selected by
findExtension()
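To illustrate the first point, here is a minimal sketch of the once-per-protocol caching that should be preserved. The class, the {{lookups}} counter and the lambda standing in for findExtension() are hypothetical, and the value of {{X_POINT_ID}} is assumed; this is not the code from ProtocolFactory or from the patch.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

/** Hypothetical stand-in for the cached lookup; not the actual Nutch code. */
class ProtocolLookupCacheSketch {
  // Assumed value, for illustration only.
  static final String X_POINT_ID = "org.apache.nutch.protocol.Protocol";

  private final Map<String, String> cache = new ConcurrentHashMap<>();
  private final Function<String, String> findExtension;
  // Counts how often the (expensive) extension lookup really runs.
  final AtomicInteger lookups = new AtomicInteger();

  ProtocolLookupCacheSketch(Function<String, String> findExtension) {
    this.findExtension = findExtension;
  }

  /** Resolves the plugin for a protocol, calling findExtension at most once per protocol. */
  String pluginFor(String protocolName) {
    return cache.computeIfAbsent(X_POINT_ID + protocolName, key -> {
      lookups.incrementAndGet();
      return findExtension.apply(protocolName);
    });
  }

  public static void main(String[] args) {
    ProtocolLookupCacheSketch c =
        new ProtocolLookupCacheSketch(p -> "plugin-for-" + p);
    c.pluginFor("http");
    c.pluginFor("http");   // served from the cache
    c.pluginFor("https");
    System.out.println(c.lookups.get()); // prints 2
  }
}
```

URLs without a host mapping would go through a path like {{pluginFor()}}, so the extension lookup runs once per protocol rather than once per URL.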
Adding explicit mappings by protocols could solve both points, e.g.
{noformat}
protocol:http <tab> org.apache.nutch.protocol.http.Http
{noformat}
We just need to make sure that the identifier (such as {{protocol:http}}) can
never be a valid hostname. But maybe there is a more elegant way to define the
fall-back protocol implementations?
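A rough sketch of how such a mapping file could be read and consulted, assuming the {{protocol:}} prefix proposed above (a colon is not a legal character in a DNS hostname, so the two kinds of keys cannot collide). The class and method names are hypothetical and not taken from the patch.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the proposed lookup order; not the code from the patch. */
class HostProtocolMappingSketch {
  // Assumed marker for per-protocol defaults, as suggested in the comment.
  static final String PROTOCOL_PREFIX = "protocol:";

  final Map<String, String> hostMapping = new HashMap<>();
  final Map<String, String> protocolDefaults = new HashMap<>();

  HostProtocolMappingSketch(List<String> lines) {
    for (String line : lines) {
      String trimmed = line.trim();
      // Skip blank lines and comments.
      if (trimmed.isEmpty() || trimmed.startsWith("#")) continue;
      String[] parts = trimmed.split("\t");
      if (parts.length != 2) continue; // ignore malformed lines
      String key = parts[0].trim();
      String pluginId = parts[1].trim();
      if (key.startsWith(PROTOCOL_PREFIX)) {
        protocolDefaults.put(key.substring(PROTOCOL_PREFIX.length()), pluginId);
      } else {
        hostMapping.put(key, pluginId);
      }
    }
  }

  /**
   * Host mapping first, then the per-protocol default. Returns null if
   * neither is defined, so the caller can fall back to findExtension().
   */
  String resolve(String protocolName, String host) {
    String pluginId = hostMapping.get(host);
    return pluginId != null ? pluginId : protocolDefaults.get(protocolName);
  }

  public static void main(String[] args) {
    HostProtocolMappingSketch m = new HostProtocolMappingSketch(Arrays.asList(
        "# comment line",
        "nutch.apache.org\torg.apache.nutch.protocol.httpclient.Http",
        "protocol:http\torg.apache.nutch.protocol.http.Http"));
    System.out.println(m.resolve("http", "nutch.apache.org"));
    System.out.println(m.resolve("http", "tika.apache.org"));
    System.out.println(m.resolve("ftp", "tika.apache.org"));
  }
}
```

With this input, the first call yields the httpclient plugin from the host mapping, the second falls back to the {{protocol:http}} default, and the third returns null because nothing is configured for ftp.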
Minor points:
- the host-protocol-mapping.txt should not define any mappings by default, and
it should also be committed as host-protocol-mapping.txt.template so that
modifications are not accidentally overwritten
- when applying the patch, git reports many lines with trailing whitespace
- there is a System.out.println(...) call (it shouldn't be there)
> Allow for per-host configurable protocol plugin
> -----------------------------------------------
>
> Key: NUTCH-2678
> URL: https://issues.apache.org/jira/browse/NUTCH-2678
> Project: Nutch
> Issue Type: Improvement
> Components: protocol
> Affects Versions: 1.15
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Priority: Major
> Fix For: 1.16
>
> Attachments: NUTCH-2678.patch, NUTCH-2678.patch, NUTCH-2678.patch
>
>
> Introduces new configuration file for mapping protocol plugins to hostnames.
> {code}
> # This file defines a hostname to protocol plugin mapping. Each line takes a
> # host name followed by a tab, followed by the ID of the protocol plugin. You
> # can find the ID in the protocol plugin's plugin.xml file.
> #
> # <hostname>\t<plugin_id>\n
> # nutch.apache.org org.apache.nutch.protocol.httpclient.Http
> # tika.apache.org org.apache.nutch.protocol.http.Http
> #
> {code}