Hi, I've run across a few URLs where applying one normalization pattern leaves the URL in a form that matches another normalization pattern (or even the same one again). That second match never gets applied, because the patterns are applied only once.
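
To make this concrete, here is a rough sketch in Java. The rule, the class and method names, and the cap of 10 are all made up for illustration; this is not Nutch's actual regex normalizer code. It shows a single pass leaving one of two session IDs behind, next to a capped loop that repeats the pass until the URL stops changing:

```java
import java.util.regex.Pattern;

class IterativeNormalizer {

    /** A hypothetical rule: a regex plus its replacement string. */
    static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    // Made-up session ID rule, not taken from any real Nutch configuration.
    // It consumes the delimiter in front of "sid=" and the "&" after the value,
    // so a second "sid=" directly behind the first is skipped in the same pass.
    static final Rule[] RULES = {
        new Rule("([?&])sid=[^&]*&?", "$1")
    };

    // Assumed safety cap so that a badly written rule cannot loop forever.
    static final int MAX_ITERATIONS = 10;

    /** Applies every rule exactly once, in order. */
    static String normalizeOnce(String url) {
        String result = url;
        for (Rule rule : RULES) {
            result = rule.pattern.matcher(result).replaceAll(rule.replacement);
        }
        return result;
    }

    /** Repeats the single pass until the URL stops changing or the cap is hit. */
    static String normalizeIterated(String url) {
        String current = url;
        for (int i = 0; i < MAX_ITERATIONS; i++) {
            String next = normalizeOnce(current);
            if (next.equals(current)) {
                return current; // fixed point: no rule changed anything
            }
            current = next;
        }
        return current; // cap reached; return the last result
    }

    public static void main(String[] args) {
        String url = "http://example.com/page?sid=abc&sid=def&x=1";
        // Single pass leaves the second session ID:
        // http://example.com/page?sid=def&x=1
        System.out.println(normalizeOnce(url));
        // Iterating removes both:
        // http://example.com/page?x=1
        System.out.println(normalizeIterated(url));
    }
}
```

Comparing the output of each pass with its input is the simplest stopping test I could think of; it costs one extra pass that changes nothing before the loop exits.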

Should normalization iterate until no patterns match (with, perhaps, a limit on the number of iterations to guard against infinite loops caused by pattern mistakes)?

It's a minor problem for things like session ID removal: finding two session IDs in the same URL is rare, so few URLs are affected (but it does happen, and that's how I noticed this). I could imagine it being much more significant, however, if other Nutch users out there are using "broader" normalization patterns.

Any philosophical/practical objections? (It's early, I've only had 1 coffee, and I've probably missed something obvious!) I'll file an issue and add it to my queue of things to do if people think it's a good idea.

-Doug
