[
https://issues.apache.org/jira/browse/NUTCH-366?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13249803#comment-13249803
]
Yangxiaolong commented on NUTCH-366:
------------------------------------
Hello, I'm also interested in this issue.
I have submitted a proposal for GSoC. Link:
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2012/yangxiaolong/1
My initial solution:
(1) The issue's reporter suggests making URLFilters and URLNormalizers
implement the same interface. Instead, I think the URLFilter and URLNormalizer
interfaces should both extend a common interface. We can then implement a
class, URLManglingers, that uses a "chained mangling" pattern to run the
configured normalizers and filters. This keeps changes to the existing code
to a minimum. For example:
older code:
  public interface URLNormalizer extends Configurable {
    /* Extension ID */
    public static final String X_POINT_ID = URLNormalizer.class.getName();
    /* Interface for URL normalization */
    public String normalize(String urlString, String scope)
        throws MalformedURLException;
  }
new code:
  public interface URLMangling extends Configurable {
    public String mangling(String[] args);
  }
  public interface URLNormalizer extends URLMangling {
    /* Extension ID */
    public static final String X_POINT_ID = URLNormalizer.class.getName();
    /* Interface for URL normalization */
    public String normalize(String urlString, String scope)
        throws MalformedURLException;
  }
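To make the "chained mangling" idea concrete, here is a rough, self-contained sketch. It does not use the Nutch plugin classes: the UrlMangler interface, the two stand-in manglers, and the convention that returning null rejects a URL are all assumptions made for illustration only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

public class ManglingChainSketch {

  /** Hypothetical common interface: returns the (possibly rewritten) URL, or null to reject it. */
  interface UrlMangler {
    String mangle(String url, String scope);
  }

  /** Stand-in for a prefix-based filter: cheap, never changes the URL. */
  static UrlMangler prefixFilter(String... allowedPrefixes) {
    return (url, scope) -> {
      for (String prefix : allowedPrefixes) {
        if (url.startsWith(prefix)) {
          return url;
        }
      }
      return null; // rejected
    };
  }

  /** Stand-in for a normalizer: lower-cases scheme and host, leaves the path alone. */
  static UrlMangler lowerCaseHostNormalizer() {
    return (url, scope) -> {
      int schemeEnd = url.indexOf("://");
      if (schemeEnd < 0) {
        return url;
      }
      int pathStart = url.indexOf('/', schemeEnd + 3);
      if (pathStart < 0) {
        pathStart = url.length();
      }
      return url.substring(0, pathStart).toLowerCase(Locale.ROOT) + url.substring(pathStart);
    };
  }

  /** Runs the manglers in the given order and stops at the first rejection. */
  static final class UrlManglers implements UrlMangler {
    private final List<UrlMangler> chain;

    UrlManglers(List<UrlMangler> chain) {
      this.chain = new ArrayList<>(chain);
    }

    @Override
    public String mangle(String url, String scope) {
      String current = url;
      for (UrlMangler mangler : chain) {
        if (current == null) {
          break; // an earlier step rejected the URL, skip the remaining steps
        }
        current = mangler.mangle(current, scope);
      }
      return current;
    }
  }

  public static void main(String[] args) {
    // Filter first, normalizer second: rejected URLs never reach the normalizer.
    UrlMangler chain = new UrlManglers(Arrays.asList(
        prefixFilter("http://", "https://"),
        lowerCaseHostNormalizer()));
    System.out.println(chain.mangle("http://Example.COM/Path", "default"));
    System.out.println(chain.mangle("ftp://example.com/file", "default"));
  }
}
```

Because filters and normalizers share one method here, they can be mixed in any order, which is the flexibility the reporter asks for in option 1.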
(2) Use the property "url.mangling.order". In my solution this property does
not decide whether filters or normalizers run first; it decides whether the
new code path is used at all. For example:
older code:
  if (urlNormalizers) {
    try {
      url = normalizers.normalize(url, scope); // normalize the url
    } catch (Exception e) {
      LOG.warn("Skipping " + url + ":" + e);
      url = null;
    }
  }
  if (url != null && urlFiltering) {
    try {
      url = filters.filter(url); // filter the url
    } catch (Exception e) {
      LOG.warn("Skipping " + url + ":" + e);
      url = null;
    }
  }
new code:
  if (urlmangling == null) {
    // property not set: keep the existing behaviour
    if (urlNormalizers) {
      try {
        url = normalizers.normalize(url, scope); // normalize the url
      } catch (Exception e) {
        LOG.warn("Skipping " + url + ":" + e);
        url = null;
      }
    }
    if (url != null && urlFiltering) {
      try {
        url = filters.filter(url); // filter the url
      } catch (Exception e) {
        LOG.warn("Skipping " + url + ":" + e);
        url = null;
      }
    }
  } else {
    ... // run filters and normalizers in the order we have defined
  }
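One possible shape for the elided else branch is sketched below. This is only an illustration under several assumptions: java.util.Properties stands in for the Nutch Configuration, the normalizers and filters are passed in as plain functions that return null to reject a URL, and the property value format (a comma-separated list of the step names "normalizers" and "filters") is hypothetical.

```java
import java.util.Properties;
import java.util.function.UnaryOperator;

public class ManglingOrderSketch {

  static final String ORDER_KEY = "url.mangling.order";

  /**
   * Applies normalizers and filters to a URL. When the property is unset the
   * legacy order is used (normalizers, then filters); otherwise the steps run
   * in the order listed in the property. Returns null if the URL is rejected.
   */
  static String process(String url, Properties conf,
                        UnaryOperator<String> normalizers,
                        UnaryOperator<String> filters) {
    String order = conf.getProperty(ORDER_KEY);
    if (order == null || order.trim().isEmpty()) {
      // Legacy path: always normalize first, then filter.
      String normalized = normalizers.apply(url);
      return normalized == null ? null : filters.apply(normalized);
    }
    String current = url;
    for (String step : order.split(",")) {
      if (current == null) {
        break; // already rejected, skip the remaining steps
      }
      switch (step.trim()) {
        case "normalizers":
          current = normalizers.apply(current);
          break;
        case "filters":
          current = filters.apply(current);
          break;
        default:
          throw new IllegalArgumentException(
              "Unknown step in " + ORDER_KEY + ": " + step);
      }
    }
    return current;
  }

  public static void main(String[] args) {
    Properties conf = new Properties();
    conf.setProperty(ORDER_KEY, "filters,normalizers");
    UnaryOperator<String> normalizers = u -> u.toLowerCase();
    UnaryOperator<String> filters = u -> u.startsWith("http://") ? u : null;
    System.out.println(process("http://A/B", conf, normalizers, filters));
    System.out.println(process("ftp://x", conf, normalizers, filters));
  }
}
```

With "filters,normalizers" a rejected URL never reaches the normalizers, which is the cost saving described in the issue. Leaving the property unset keeps the current behaviour.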
> Merge URLFilters and URLNormalizers
> -----------------------------------
>
> Key: NUTCH-366
> URL: https://issues.apache.org/jira/browse/NUTCH-366
> Project: Nutch
> Issue Type: Improvement
> Reporter: Andrzej Bialecki
> Labels: gsoc2012
>
> Currently Nutch uses two subsystems related to url validation and
> normalization:
> * URLFilter: this interface checks if URLs are valid for further processing.
> Input URL is not changed in any way. The output is a boolean value.
> * URLNormalizer: this interface brings URLs to their base ("normal") form, or
> removes unneeded URL components, or performs any other URL mangling as
> necessary. Input URLs are changed, and are returned as result.
> However, various Nutch tools run filters and normalizers in pre-determined
> order, i.e. normalizers first, and then filters. In some cases, where
> normalizers are complex and running them is costly (e.g. numerous regex
> rules, DNS lookups) it would make sense to run some of the filters first
> (e.g. prefix-based filters that select only certain protocols, or
> suffix-based filters that select only known "extensions"). This is currently
> not possible - we always have to run normalizers, only to later throw away
> urls because they failed to pass through filters.
> I would like to solicit comments on the following two solutions, and work on
> implementation of one of them:
> 1) we could make URLFilters and URLNormalizers implement the same interface,
> and basically make them interchangeable. This way users could configure their
> order arbitrarily, even mixing filters and normalizers out of order. This is
> more complicated, but gives much more flexibility - and NUTCH-365 already
> provides sufficient framework to implement this, including the ability to
> define different sequences for different steps in the workflow.
> 2) we could use a property "url.mangling.order" ;) to define whether
> normalizers or filters should run first. This is simple to implement, but
> provides only limited improvement - because either all filters or all
> normalizers would run, they couldn't be mixed in arbitrary order.
> Any comments?