In terms of scale, I would expect a server to handle not much more than 500,000 messages in a full Declude/IMail environment, and with an average of more than 10 pieces of spam per address per day, a solution of this sort would need to resolve against 50,000 or so E-mail addresses.  While I'm not sure how best to index this information for rapid use, I do know that you could split each address into user and domain parts and query the domain first, then the user.  For the most part that would mean one query (full string match) against about 1,000 domains, followed by another query against an average of maybe 50 user addresses.  Pete over at Sniffer has figured out how to search the entire source of a message against tens of thousands of rules, complete with wildcards, and he does so quite efficiently considering that the application loads the entire rule base every time it is handed a message.  I don't think a capable programmer would be bothered at all by these demands; there's absolutely no reason why this couldn't be done.
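The two-level lookup described above can be sketched quickly. This is only an illustration, not anything from Declude or Sniffer: the names `build_index` and `is_valid` are mine, and it assumes plain `user@domain` addresses with case-insensitive matching.

```python
# Sketch of the domain-first, then-user lookup: index ~1,000 domains,
# each holding its own (much smaller) set of user/local parts.

from collections import defaultdict

def build_index(addresses):
    """Map each domain to the set of local parts (users) seen for it."""
    index = defaultdict(set)
    for addr in addresses:
        user, _, domain = addr.rpartition("@")
        index[domain.lower()].add(user.lower())
    return index

def is_valid(index, addr):
    """Query the domain first; only if it exists, check the user set."""
    user, _, domain = addr.rpartition("@")
    users = index.get(domain.lower())
    return users is not None and user.lower() in users
```

Both steps are hash lookups, so even 50,000 addresses spread over about 1,000 domains would resolve in constant time per message; the two-step split mainly keeps the per-domain user sets small.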

If you have a recommendation for how best to handle the task when the data is initially sourced from a text file, please share it and I will pass it on.
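One minimal approach, sketched here purely as a suggestion: since the real file format isn't specified, this assumes one address per line, with blank lines and `#` comments ignored (the helper name `load_addresses` is mine).

```python
# Hypothetical loader for a plain-text list of valid addresses,
# one per line; blank lines and '#'-comment lines are skipped.

def load_addresses(path):
    """Read address entries from a text file, normalized to lowercase."""
    addresses = []
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if line and not line.startswith("#"):
                addresses.append(line.lower())
    return addresses
```

In practice you would load this once at startup and rebuild the in-memory index only when the file's modification time changes, rather than re-reading it per message.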

Speaking of Sniffer - one thing you might consider is creating a special rulebase (we do contracts like that) that would contain 50K rules to match, well, practically any text you wish. We regularly match 50K heuristics these days in under 100 ms. Perhaps there is a special solution to be worked out here. We have tools that make this kind of thing feasible... Depending upon the rate of change, it might not require any unique software. We have a prototype Java-based utility for scripting updates to any rulebase in our system. Contact me off list if you'd like to pursue this direction.

_M

RESCU (REmote SCripted Updater) accepts an XML file representing changes/commands for the rulebase and produces a matching XML result file. Not quite ready for release into the wild, but close.
