[Nutch-dev] [jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-18 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435449 ] Doug Cook commented on NUTCH-365: - It still seems to me that iterative normalization is useful and not risky. By definition, a normalizer is something which

[Nutch-dev] [jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-16 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435185 ] Andrzej Bialecki commented on NUTCH-365: - Lowercasing is done here because we can't rely on each normalizer to do it, and having uniform host names is

[Nutch-dev] [jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-15 Thread Sami Siren (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12435175 ] Sami Siren commented on NUTCH-365: -- looks ok to me, the ugly (with &amp;) regexps could perhaps be put inside <![CDATA[ ]]> elements in generator there's + try { +

[Nutch-dev] [jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-11 Thread Andrzej Bialecki (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433890 ] Andrzej Bialecki commented on NUTCH-365: - Running several iterations of filters/normalizers may be risky ... We would have to ensure that match/replace

[Nutch-dev] [jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-09 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433613 ] Doug Cook commented on NUTCH-365: - Hi, Andrzej. Sounds very cool. Haven't had a chance to check out the patch yet to see if it supports this, but attaching a

[Nutch-dev] [jira] Commented: (NUTCH-365) Flexible URL normalization

2006-09-09 Thread Doug Cook (JIRA)
[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433617 ] Doug Cook commented on NUTCH-365: - PS. I like your idea of combining URL filters and normalization. In a sense, a filter is just a normalizer that happens to