Adult/Non-adult: Is there an existing content clustering plugin (e.g. Carrot2) that can do this?
The scenario I'm thinking about is when an adult-related search query allows adult-related results and a non-adult query filters them out. E.g. Query 'toys' - you don't want adult-related results turning up ....

-Euan

Andrzej Bialecki wrote:
> Webmaster wrote:
>> Hi Otis,
>>
>> So far so good..
>
> [...]
>
> Thank you for sharing this information with us! This sounds exciting.
>
>> When this next round of fetching is done I'm going to inject 10m
>> valid urls
>> from my fresh fetch lists and crawl to a depth of 10 to see what
>> happens.
>> My guess is it will return about 200m urls, this should be an adequate
>> stress test of my sad cluster of outdated machines :)
>>
>> I am however still looking into filtering the results for adult content
>> before I move it off the hadoop cluster and put it on the distributed
>> search
>> nodes local file systems.
>
> In my experience a simple word-list based classifier works well
> enough, or a rule-based classifier similar in concept to SpamAssassin.
> This tends to mark 80+ % of adult pages. You may want to keep them in
> CrawlDb to prevent their re-discovery, just mark them with something
> that prevents indexing and/or generating.
>
> Also, often a good strategy to collect higher-quality pages first is
> to concentrate only on plain-looking URLs, i.e. without too many
> strange characters, or all-numerical subdirectories, or too many
> non-letter characters.
>

--
Euan Clark
Software Developer - Search and Spider
NZS.com : New Zealand Search
Email: [EMAIL PROTECTED]
Web: http://www.nzs.com/
Phone: +64 3 943 5447
Fax: +64 3 379 4886
Mobile: +64 27 5390483
Address: Level 1, 93 Manchester St, Christchurch, New Zealand
Post: PO Box 13300, Christchurch 8011, New Zealand
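For anyone wanting to experiment with the two heuristics quoted above (a word-list classifier and a preference for plain-looking URLs), here is a rough Python sketch. It is not Nutch code and does not use any Nutch or Carrot2 API. The term list, weights, threshold, the 0.3 ratio and all function names are made up for illustration, and the weighted scoring is only my guess at what "similar in concept to SpamAssassin" could look like. The `should_filter` function covers the 'toys' case from my question: adult pages are hidden unless the query itself scores as adult.

```python
import re
from urllib.parse import urlsplit

# Hypothetical weighted word list; a real one would need careful curation.
ADULT_TERMS = {"porn": 3.0, "xxx": 3.0, "escort": 1.5, "adult": 1.0}
ADULT_THRESHOLD = 3.0  # assumed cut-off, to be tuned on labelled pages

_TOKEN = re.compile(r"[a-z0-9]+")


def adult_score(text):
    """Sum the weights of listed terms present in the text (each counted once)."""
    tokens = set(_TOKEN.findall(text.lower()))
    return sum(weight for term, weight in ADULT_TERMS.items() if term in tokens)


def is_adult(text):
    """Classify text as adult when its score reaches the threshold."""
    return adult_score(text) >= ADULT_THRESHOLD


def should_filter(query, page_text):
    """Hide an adult page unless the query itself looks adult-related."""
    return is_adult(page_text) and not is_adult(query)


def is_plain_url(url, max_non_letter_ratio=0.3):
    """Reject URLs with all-numeric path segments or many non-letter characters."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split("/") if s]
    if any(s.isdigit() for s in segments):
        return False  # all-numerical subdirectory
    rest = parts.path + parts.query
    if not rest:
        return True
    # Path separators are not counted as "strange" characters.
    non_letters = sum(1 for c in rest if not c.isalpha() and c != "/")
    return non_letters / len(rest) <= max_non_letter_ratio
```

In a crawl, something like `is_adult` would be run over the parsed page text and the result stored as a flag alongside the URL, so the page stays known but is skipped at indexing time, as suggested in the quoted reply. A single weak term such as "adult" stays below the threshold on its own, which keeps pages about adult education from being marked.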
