Hi Guys,

I've configured my Nutch engine to crawl all dynamic links but exclude all JS, CSS and image files. I can see in my logs that the Suffix URL Filter plugin (urlfilter-suffix) is loaded, as shown below:

Registered Plugins:
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - CyberNeko HTML Parser (lib-nekohtml)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Site Query Filter (query-site)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Basic URL Normalizer (urlnormalizer-basic)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Html Parse Plug-in (parse-html)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Pass-through URL Normalizer (urlnormalizer-pass)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Regex URL Filter Framework (lib-regex-filter)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Basic Indexing Filter (index-basic)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Basic Summarizer Plug-in (summary-basic)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Text Parse Plug-in (parse-text)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - JavaScript Parser (parse-js)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Regex URL Filter (urlfilter-regex)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - Basic Query Filter (query-basic)
2007-05-03 15:48:08,277 INFO plugin.PluginRepository - HTTP Framework (lib-http)
2007-05-03 15:48:08,293 INFO plugin.PluginRepository - URL Query Filter (query-url)
2007-05-03 15:48:08,293 INFO plugin.PluginRepository - Regex URL Normalizer (urlnormalizer-regex)
2007-05-03 15:48:08,293 INFO plugin.PluginRepository - Http Protocol Plug-in (protocol-http)
2007-05-03 15:48:08,293 INFO plugin.PluginRepository - the nutch core extension points (nutch-extensionpoints)
2007-05-03 15:48:08,293 INFO plugin.PluginRepository - Suffix URL Filter (urlfilter-suffix)
2007-05-03 15:48:08,293 INFO plugin.PluginRepository - OPIC Scoring Plug-in (scoring-opic)

However, when I checked the logs I saw that the fetcher kept trying to fetch URLs with JS and CSS extensions, such as:

2007-05-03 15:52:25,586 INFO fetcher.Fetcher - fetching http://www.toto.com/css/media.css?8
2007-05-03 15:56:12,224 INFO fetcher.Fetcher - fetching http://www.toto.com/ros/form.js?jashka8

It should not do that, as I've explicitly configured my URL filter to exclude those files. I looked at the code and I think the plugin does not correctly handle dynamic URLs that have a "?" and parameters after the file extension.

Could you please help me with this and confirm whether I'm right?

Thanks E Regards, ~E~
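
P.S. To illustrate what I suspect is happening, here is a minimal Java sketch. It is not the actual urlfilter-suffix code; the class name SuffixCheck and both method names are made up for this example. It only assumes that the filter compares the configured suffix against the end of the whole URL string, in which case anything after the "?" would hide the extension, whereas comparing against the path alone would not.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical illustration only; this is NOT the urlfilter-suffix source.
class SuffixCheck {

    // Assumed behaviour: the suffix is compared against the whole URL string,
    // so a trailing query string ("?8") hides the file extension.
    static boolean matchesOnFullUrl(String url, String suffix) {
        return url.endsWith(suffix);
    }

    // Alternative: compare the suffix against the path component only,
    // ignoring the query string and fragment.
    static boolean matchesOnPath(String url, String suffix) {
        try {
            String path = new URI(url).getPath();
            return path != null && path.endsWith(suffix);
        } catch (URISyntaxException e) {
            // Treat unparsable URLs as non-matching in this sketch.
            return false;
        }
    }

    public static void main(String[] args) {
        String[][] cases = {
            {"http://www.toto.com/css/media.css?8", ".css"},
            {"http://www.toto.com/ros/form.js?jashka8", ".js"}
        };
        for (String[] c : cases) {
            System.out.println(c[0]
                + " fullUrl=" + matchesOnFullUrl(c[0], c[1])
                + " pathOnly=" + matchesOnPath(c[0], c[1]));
        }
    }
}
```

If the plugin really behaves like the first method, that would explain why both URLs from my fetcher log get through the filter.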
