On Wed, May 19, 2004 at 02:59:24PM -0700, Byron Miller wrote:
> I just did a generate DB to create a fresh segment to
> fetch and i have this setup in the urlfilter
>
>
> # skip 'file:' urls
> -^file:
> -^ftp:
> -^gopher:
> -^mailto:
> -^https:
>
> is that the correct way to define those? I added FTP

The regex filter ignores anything not covered by some "+" expression,
so I assume you have "+." or similar after those lines. By default,
regex-urlfilter.txt follows the so-called "deny specific, allow any"
style. You can change it into an "allow specific, deny any" one.

> since FTP slows the crawler to a stand still (doesn't
> seem to gracefully end or it fills up all the

I always separate ftp fetches from http ones, because the
responsiveness is different and, as I have observed, the current
fetcher (I use RequestScheduler.java) doesn't seem to mix them well
(performance-wise). If you limit ftp to fetching directory lists only,
it may behave better alongside http in the same round of fetching.

> threads), didn't want a bunch of spam addresses in
> mailto's and since there is no parser for https by
> default (or i didn't have it enabled) i set that up.

There are only http and ftp clients.

>
> I'm still seeing https urls come alone..

It may have been a redirect from http. There are quite a few
naughty http/ftp admins out there.

John

_______________________________________________
Nutch-developers mailing list
https://lists.sourceforge.net/lists/listinfo/nutch-developers
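
To illustrate the point about "+" expressions, here is a minimal Python sketch of a first-match filter. It is not Nutch code: the names parse_rules and accepts are hypothetical, the trailing "+." line is the catch-all assumed in the reply rather than something shown in the quoted file, and Python's re module stands in for Java regular expressions. It also assumes that the first matching rule decides and that a URL matching no rule is rejected.

```python
import re

# The rules quoted above, plus an ASSUMED catch-all "+." line at the end.
RULES_TEXT = """\
# skip 'file:' urls
-^file:
-^ftp:
-^gopher:
-^mailto:
-^https:
+.
"""


def parse_rules(text):
    """Turn filter lines into (allow, compiled_pattern) pairs (hypothetical helper)."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Return True if the first rule matching the URL is a '+' rule."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False  # no rule matched: the URL is ignored


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in ("http://example.com/", "https://example.com/", "ftp://example.com/pub/"):
        print(url, accepts(rules, url))
```

Without the final "+." line every URL falls through to the last return and is dropped, which is the "allow specific, deny any" style.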
