[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435175 ]
Sami Siren commented on NUTCH-365:
----------------------------------
Looks OK to me.
The ugly regexps (the ones with an escaped &) could perhaps be put inside <![CDATA[ ]]> sections.
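To illustrate the CDATA suggestion, here is a minimal sketch showing that an XML parser returns the same regexp text whether the ampersand is entity-escaped or wrapped in a CDATA section. The element name "pattern" and the regexp itself are made up for the example and are not taken from the patch.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Sketch only: "pattern" and the regexp are hypothetical, not from patch.txt.
final class CdataSketch {

    // The regexp written with an entity-escaped ampersand.
    static final String ESCAPED =
        "<pattern>(\\?|&amp;)sid=[^&amp;]*</pattern>";

    // The same regexp inside a CDATA section, where & needs no escaping.
    static final String CDATA =
        "<pattern><![CDATA[(\\?|&)sid=[^&]*]]></pattern>";

    // Parses a one-element XML document and returns the element's text content.
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("Not well-formed XML: " + xml, e);
        }
    }

    public static void main(String[] args) {
        System.out.println(textOf(ESCAPED));
        System.out.println(textOf(CDATA));
    }
}
```

Both forms hand the same string to the regexp compiler, so the change would only affect how readable the config file is.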
In the generator there's:

+    try {
+      host = normalizers.normalize(host, URLNormalizers.SCOPE_GENERATE_HOST_COUNT);
+      host = new URL(host).getHost().toLowerCase();
+    } catch (Exception e) {
+      LOG.warn("Malformed URL: '" + host + "', skipping");
+    }

Why isn't the .toLowerCase() also done in the normalizer?
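
As a rough sketch of what I mean, the lower-casing could live in a normalizer step along these lines. The class and method names are hypothetical and this is not the Nutch URLNormalizer interface; it also assumes that dropping any user-info part of the URL is acceptable.

```java
import java.net.MalformedURLException;
import java.net.URL;
import java.util.Locale;

// Sketch only: a hypothetical stand-in for a normalizer step, not Nutch code.
final class HostCaseSketch {

    // Returns the URL with its host lower-cased; path, query and fragment
    // are kept as they are. Any user-info part is NOT preserved.
    static String lowerCaseHost(String url) {
        try {
            URL u = new URL(url);
            String host = u.getHost().toLowerCase(Locale.ROOT);
            String file = u.getFile();            // path plus query, if any
            if (u.getRef() != null) {
                file = file + "#" + u.getRef();   // re-attach the fragment
            }
            return new URL(u.getProtocol(), host, u.getPort(), file).toString();
        } catch (MalformedURLException e) {
            throw new IllegalArgumentException("Malformed URL: '" + url + "'", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(lowerCaseHost("http://WWW.Example.COM/Path?Q=1"));
    }
}
```

With something like this in the chain for the host-count scope, the generator would only need new URL(host).getHost().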
> Flexible URL normalization
> --------------------------
>
> Key: NUTCH-365
> URL: http://issues.apache.org/jira/browse/NUTCH-365
> Project: Nutch
> Issue Type: Improvement
> Affects Versions: 0.9.0
> Reporter: Andrzej Bialecki
> Assigned To: Andrzej Bialecki
> Fix For: 0.9.0
>
> Attachments: patch.txt
>
>
> This patch is a heavily restructured version of the patch in NUTCH-253, so
> much that I decided to create a separate issue. It changes the URL
> normalization from a selectable single class to a flexible and context-aware
> chain of normalization filters.
> Highlights:
> * rename all *UrlNormalizer* to *URLNormalizer* for consistency.
> * use a "chained filter" pattern for running several normalizers in sequence
> * the order in which normalizers are executed is defined by
> "urlnormalizer.order" property, which lists space-separated implementation
> classes. If there are more normalizers active than explicitly named on this
> list, they will be run in random order after the ones specified on the list
> are executed.
> * define a set of contexts (or scopes) in which normalizers may be called.
> Each scope can have its own list of normalizers (via
> "urlnormalizer.scope.<scope_name>" property) and its own order (via
> "urlnormalizer.order.<scope_name>" property). If any of these properties are
> missing, default settings are used.
> * each normalizer may further select among many configurations, depending on
> the context in which it is called, using a modified API:
> URLNormalizer.normalize(String url, String scope);
> * if a config for a given scope is not defined, then the default config will
> be used.
> * several standard contexts / scopes have been defined, and various
> applications have been modified to try to use the appropriate normalizers
> in their context.
> * all JUnit tests have been modified, and run successfully.
> NUTCH-363 suggests to me that further changes may be required in this area.
> Perhaps we should combine urlfilters and urlnormalizers into a single
> subsystem of url munging. Now that we have support for scopes and flexible
> combinations of normalizers, we could turn URLFilters into a special case of
> normalizers (or vice versa, depending on the point of view) ...
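
For readers who have not opened the patch, the chained, scope-aware design described in the quoted issue can be sketched roughly as below. This is an illustration under my own assumptions, not the code in patch.txt: the class name, the nested Normalizer interface, the scope strings and the two toy normalizers are all hypothetical. Only the normalize(String url, String scope) shape and the fallback to default settings come from the description.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Sketch only: hypothetical names throughout, not the classes from the patch.
final class ScopedChainSketch {

    // Same shape as the modified API named in the issue description.
    interface Normalizer {
        String normalize(String url, String scope);
    }

    static final String SCOPE_DEFAULT = "default";  // assumed scope name

    // One ordered chain of normalizers per scope.
    private final Map<String, List<Normalizer>> chains = new HashMap<>();

    void register(String scope, List<Normalizer> ordered) {
        chains.put(scope, new ArrayList<>(ordered));
    }

    // Runs the chain for the given scope, feeding each normalizer the output
    // of the previous one. Falls back to the default chain if the scope has
    // no chain of its own.
    String normalize(String url, String scope) {
        List<Normalizer> chain = chains.get(scope);
        if (chain == null) {
            chain = chains.getOrDefault(SCOPE_DEFAULT, Collections.<Normalizer>emptyList());
        }
        String result = url;
        for (Normalizer n : chain) {
            result = n.normalize(result, scope);
        }
        return result;
    }

    // Builds a small example configuration with two toy normalizers.
    static ScopedChainSketch example() {
        Normalizer trim = (url, scope) -> url.trim();
        Normalizer dropFragment = (url, scope) -> {
            int i = url.indexOf('#');
            return i < 0 ? url : url.substring(0, i);
        };
        ScopedChainSketch s = new ScopedChainSketch();
        s.register(SCOPE_DEFAULT, Arrays.asList(trim));
        s.register("generate_host_count", Arrays.asList(trim, dropFragment));
        return s;
    }

    public static void main(String[] args) {
        ScopedChainSketch s = example();
        System.out.println(s.normalize(" http://a/b#x ", "generate_host_count"));
        System.out.println(s.normalize(" http://a/b#x ", "some_other_scope"));
    }
}
```

A scope with its own chain gets exactly that chain in the listed order, while any other scope falls through to the default one.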