[ http://issues.apache.org/jira/browse/NUTCH-365?page=comments#action_12433613 ]

Doug Cook commented on NUTCH-365:
---------------------------------

Hi, Andrzej.

Sounds very cool. I haven't had a chance to check out the patch yet to see 
if it supports this, but I'm attaching a related discussion from the mailing 
list...

------

Neal Richter wrote:

Doug, 

I think it sounds like a good idea.  It eliminates the need to order the 
rules precisely... 

We don't iterate them in HtDig and it's been on my todo list for a while as 
well. 

I would iterate until no rule matches, a maximum iteration count is reached, 
or the URL is obviously junk. 

For the max iteration number I would use the number of rewrite rules you 
have.  So if you have 10 rules, you iterate on all 10 rules 10 times.  That 
will cover the case where your rules 'chain' in a 10-step sequence.  Sure, 
it's an edge case to do that, but I can see rule sets where you construct 
3-step chains (like swapping strings or something). 

Thanks 

Neal 
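
The stopping conditions Neal describes can be sketched roughly as follows. 
This is only an illustration under stated assumptions, not the Nutch or HtDig 
code: the class IterativeNormalizer, its methods, and the two example rules 
are hypothetical. It repeats full passes over the rule list until a pass 
changes nothing, and caps the number of passes at the number of rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Hypothetical sketch of iterated regex normalization; not the Nutch API. */
final class IterativeNormalizer {

    /** One rewrite rule: a compiled pattern plus its replacement string. */
    private static final class Rule {
        final Pattern pattern;
        final String replacement;

        Rule(String regex, String replacement) {
            this.pattern = Pattern.compile(regex);
            this.replacement = replacement;
        }
    }

    private final List<Rule> rules = new ArrayList<>();

    /** Appends a rule; rules are applied in the order they were added. */
    IterativeNormalizer add(String regex, String replacement) {
        rules.add(new Rule(regex, replacement));
        return this;
    }

    /** A single pass: every rule is applied exactly once. */
    String applyOnce(String url) {
        for (Rule rule : rules) {
            url = rule.pattern.matcher(url).replaceAll(rule.replacement);
        }
        return url;
    }

    /** Repeats passes until nothing changes, capped at the number of rules. */
    String normalize(String url) {
        return normalize(url, rules.size());
    }

    /** Same as above, with an explicit cap to guard against looping rules. */
    String normalize(String url, int maxPasses) {
        for (int pass = 0; pass < maxPasses; pass++) {
            String next = applyOnce(url);
            if (next.equals(url)) {
                return url; // fixed point: no rule matched in this pass
            }
            url = next;
        }
        return url; // cap reached; return the best effort so far
    }
}
```

With the rules "/\./" -> "/" and ";jsessionid=[^?#]*" -> "", a single pass 
over http://example.com/a/././b leaves one "/./" behind, because the first 
match consumes the slash the second would need; the second pass removes it. 
Note that the cap is a heuristic: a single rule that has to re-match itself 
several times may need more passes than there are rules.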

On 8/30/06, Doug Cook <[EMAIL PROTECTED]> wrote: 
> 
> 
> Hi, 
> 
> I've run across a few patterns in URLs where applying a normalization puts 
> the URL in a form matching another normalization pattern (or even the same 
> one). But that pattern won't get executed because the patterns are applied 
> only once. 
> 
> Should normalization iterate until no patterns match (with, perhaps, some 
> limit to the number of iterations to prevent loops from pattern mistakes)? 
> 
> It's a minor problem; it doesn't seem to affect too many URLs for things 
> like session ID removal, since finding two session IDs in the same URL is 
> rare (but does happen -- that's how I noticed this). I could imagine it 
> being much more significant, however, if other Nutch users out there are 
> using "broader" normalization patterns. 
> 
> Any philosophical/practical objections? (It's early, I've only had 1 
> coffee, and I've probably missed something obvious!) 
> 
> I'll file an issue and add it to my queue of things to do if people think 
> it's a good idea. 
> 
> -Doug 
> -- 
> View this message in context: 
> http://www.nabble.com/Should-URL-normalization-iterate--tf2190244.html#a6059957
>  
> Sent from the Nutch - Dev forum at Nabble.com. 
> 

> Flexible URL normalization
> --------------------------
>
>                 Key: NUTCH-365
>                 URL: http://issues.apache.org/jira/browse/NUTCH-365
>             Project: Nutch
>          Issue Type: Improvement
>    Affects Versions: 0.9.0
>            Reporter: Andrzej Bialecki 
>         Assigned To: Andrzej Bialecki 
>             Fix For: 0.9.0
>
>         Attachments: patch.txt
>
>
> This patch is a heavily restructured version of the patch in NUTCH-253, so 
> much so that I decided to create a separate issue. It changes the URL 
> normalization from a selectable single class to a flexible and context-aware 
> chain of normalization filters.
> Highlights:
> * rename all *UrlNormalizer* to *URLNormalizer* for consistency.
> * use a "chained filter" pattern for running several normalizers in sequence
> * the order in which normalizers are executed is defined by 
> "urlnormalizer.order" property, which lists space-separated implementation 
> classes. If there are more normalizers active than explicitly named on this 
> list, they will be run in random order after the ones specified on the list 
> are executed.
> * define a set of contexts (or scopes) in which normalizers may be called. 
> Each scope can have its own list of normalizers (via 
> "urlnormalizer.scope.<scope_name>" property) and its own order (via 
> "urlnormalizer.order.<scope_name>" property). If any of these properties are 
> missing, default settings are used.
> * each normalizer may further select among many configurations, depending on 
> the context in which it is called, using a modified API:
>    URLNormalizer.normalize(String url, String scope);
> * if a config for a given scope is not defined, then the default config will 
> be used.
> * several standard contexts / scopes have been defined, and various 
> applications have been modified to use the appropriate normalizers in 
> their context.
> * all JUnit tests have been modified, and run successfully.
> NUTCH-363 suggests to me that further changes may be required in this area. 
> Perhaps we should combine urlfilters and urlnormalizers into a single 
> subsystem of URL munging: now that we have support for scopes and flexible 
> combinations of normalizers, we could turn URLFilters into a special case of 
> normalizers (or vice versa, depending on the point of view) ... 
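
For readers who have not opened patch.txt, the "chained filter" idea with 
per-scope ordering might look roughly like the sketch below. It is an 
assumption-laden illustration, not the patch itself: only the 
normalize(String url, String scope) signature and the "urlnormalizer.order" 
and "urlnormalizer.order.<scope_name>" property names come from the issue 
text. The NormalizerChain class, the register method, and the use of short 
names instead of implementation class names are hypothetical, and the 
"urlnormalizer.scope.<scope_name>" property is left out for brevity.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

/** Normalizer contract using the signature quoted in the issue text. */
interface URLNormalizer {
    String normalize(String url, String scope);
}

/** Hypothetical chain runner; the class and its methods are illustrative. */
final class NormalizerChain {
    private final Map<String, URLNormalizer> active = new LinkedHashMap<>();
    private final Properties conf;

    NormalizerChain(Properties conf) {
        this.conf = conf;
    }

    /** Activates a normalizer under a name (assumption: short names as keys). */
    NormalizerChain register(String name, URLNormalizer normalizer) {
        active.put(name, normalizer);
        return this;
    }

    /** Resolves the execution order for a scope, falling back to the default. */
    List<String> orderFor(String scope) {
        String order = conf.getProperty("urlnormalizer.order." + scope,
                conf.getProperty("urlnormalizer.order", ""));
        List<String> result = new ArrayList<>();
        // Explicitly listed normalizers run first, in the listed order.
        for (String name : order.trim().split("\\s+")) {
            if (active.containsKey(name) && !result.contains(name)) {
                result.add(name);
            }
        }
        // Unlisted active normalizers run afterwards. The issue says their
        // order is random; this sketch uses registration order instead.
        for (String name : active.keySet()) {
            if (!result.contains(name)) {
                result.add(name);
            }
        }
        return result;
    }

    /** Runs each normalizer in sequence, feeding its output to the next. */
    String normalize(String url, String scope) {
        for (String name : orderFor(scope)) {
            url = active.get(name).normalize(url, scope);
        }
        return url;
    }
}
```

A scope without its own "urlnormalizer.order.<scope_name>" entry simply 
inherits the default order, which mirrors the fallback behaviour described 
in the highlights above.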

-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
