[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435449 ]
Doug Cook commented on NUTCH-365:
-
It still seems to me that iterative normalization is useful and not risky. By
definition, a normalizer is something which
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435185 ]
Andrzej Bialecki commented on NUTCH-365:
-
Lowercasing is done here because we can't rely on each normalizer to do it, and
having uniform host names is
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435175 ]
Sami Siren commented on NUTCH-365:
-
looks ok to me,
the ugly (with &amp;) regexps could perhaps be put inside <![CDATA[ ]]> elements
in generator there's
+ try {
+
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433890 ]
Andrzej Bialecki commented on NUTCH-365:
-
Running several iterations of filters/normalizers may be risky ... We would
have to ensure that match/replace
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433613 ]
Doug Cook commented on NUTCH-365:
-
Hi, Andrzej.
Sounds very cool. Haven't had a chance to check out the patch yet to see if it
supports this, but attaching a
[
http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433617 ]
Doug Cook commented on NUTCH-365:
-
PS. I like your idea of combining URL filters and normalization. In a sense, a
filter is just a normalizer that happens to