Adult/non-adult: is there an existing content-clustering plugin (e.g.
Carrot2) that can do this?

The scenario I'm thinking about is one where an adult-related search
query allows adult-related results and a non-adult query filters them out.

E.g. for the query 'toys', you don't want adult-related results turning up.
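
To make the scenario concrete, here is a minimal sketch of the query-time
behaviour I have in mind. It is not an existing plugin: the term list, the
function names and the per-result "adult" flag (assumed to be set at index
time by some classifier) are all hypothetical.

```python
# Hypothetical list of terms that mark a query as adult-related.
ADULT_QUERY_TERMS = {"adult", "xxx", "porn"}


def is_adult_query(query):
    """Return True if any whitespace-separated query token is in the term list."""
    return any(tok in ADULT_QUERY_TERMS for tok in query.lower().split())


def filter_results(query, results):
    """Keep adult-flagged results only when the query itself is adult-related.

    `results` is assumed to be a list of dicts carrying a boolean "adult"
    flag; a missing flag is treated as non-adult.
    """
    if is_adult_query(query):
        return list(results)
    return [r for r in results if not r.get("adult", False)]
```

With this, a query such as 'toys' would drop every flagged result, while
'adult toys' would return the full list.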

-Euan


Andrzej Bialecki wrote:
> Webmaster wrote:
>> Hi Otis,
>>
>> So far so good..
>
> [...]
>
> Thank you for sharing this information with us! This sounds exciting.
>
>> When this next round of fetching is done I'm going to inject 10m
>> valid URLs from my fresh fetch lists and crawl to a depth of 10 to
>> see what happens. My guess is it will return about 200m URLs; this
>> should be an adequate stress test of my sad cluster of outdated
>> machines :)
>>
>> I am however still looking into filtering the results for adult
>> content before I move it off the Hadoop cluster and put it on the
>> distributed search nodes' local file systems.
>
> In my experience a simple word-list based classifier works well
> enough, or a rule-based classifier similar in concept to SpamAssassin.
> This tends to mark 80+% of adult pages. You may want to keep them in
> CrawlDb to prevent their re-discovery, just mark them with something
> that prevents indexing and/or generating.
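
A minimal sketch of the word-list idea quoted above, scored in the spirit of
SpamAssassin (each matching term adds a weight, and a page is marked once
the total reaches a threshold). The terms, weights, threshold and function
names are illustrative assumptions, not a tuned list and not part of Nutch.

```python
import re

# Hypothetical term weights; a real list would be much longer and tuned.
TERM_WEIGHTS = {"xxx": 2.0, "porn": 3.0, "explicit": 1.0}
# Hypothetical cut-off: total weight at or above this marks the page.
THRESHOLD = 3.0

TOKEN_RE = re.compile(r"[a-z0-9]+")


def adult_score(text):
    """Sum the weights of all listed terms found in the lower-cased text."""
    tokens = TOKEN_RE.findall(text.lower())
    return sum(TERM_WEIGHTS.get(tok, 0.0) for tok in tokens)


def is_adult(text, threshold=THRESHOLD):
    """Mark the page as adult when its score reaches the threshold."""
    return adult_score(text) >= threshold
```

The boolean result is what would be stored alongside the page so that later
indexing and generating steps can skip it; how that mark is stored is left
open here.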
>
> Also, often a good strategy to collect higher-quality pages first is
> to concentrate only on plain-looking URLs, i.e. without too many
> strange characters, or all-numerical subdirectories, or too many
> non-letter characters.
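
The "plain-looking URL" heuristic quoted above could be sketched roughly as
follows. The character set, both limits and the function name are
assumptions picked for illustration; sensible values would have to come
from looking at real crawl data.

```python
from urllib.parse import urlsplit

# Hypothetical set of characters counted as "strange".
STRANGE_CHARS = set("?&=%;~+$!*'(),")
# Hypothetical limits.
MAX_STRANGE = 3
MAX_NON_LETTER_RATIO = 0.3


def looks_plain(url):
    """Return True if the path and query of the URL look plain."""
    parts = urlsplit(url)
    rest = parts.path + ("?" + parts.query if parts.query else "")

    # Reject all-numerical subdirectories such as /2006/12/.
    segments = [s for s in parts.path.split("/") if s]
    if any(s.isdigit() for s in segments):
        return False

    # Reject URLs with too many strange characters.
    if sum(c in STRANGE_CHARS for c in rest) > MAX_STRANGE:
        return False

    # Reject URLs where non-letters dominate (ignoring "/" and ".").
    chars = [c for c in rest if c not in "/."]
    if chars:
        non_letters = sum(not c.isalpha() for c in chars)
        if non_letters / len(chars) > MAX_NON_LETTER_RATIO:
            return False
    return True
```

A check like this would be applied to candidate URLs before they are
fetched, so that plain URLs are crawled first.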
>


-- 
_____________________________________________________________

Euan Clark
Software Developer - Search and Spider
NZS.com : New Zealand Search
Email: [EMAIL PROTECTED]
Web: http://www.nzs.com/

Phone: +64 3 943 5447
Fax: +64 3 379 4886
Mobile: +64 27 5390483
Address: Level 1, 93 Manchester St, Christchurch, New Zealand
Post: PO Box 13300, Christchurch 8011, New Zealand
______________________________________________________________

