https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5845
Summary: Hadoopify large-scale mass-check infrastructure
Product: Spamassassin
Version: unspecified
Platform: Other
OS/Version: other
Status: NEW
Severity: normal
Priority: P5
Component: Masses
AssignedTo: [email protected]
ReportedBy: [EMAIL PROTECTED]
We have an increasingly complex infrastructure to perform batch jobs -- our
mass-checks. These are as follows:
1. nightly mass-checks, one per day, with all rules and no network accesses
2. weekly mass-checks, with all rules and network accesses permitted
3. per-commit mass-checks, several per day, with a small number of rules and a
small number of messages, to give fast feedback on how rules perform. (Right
now these cannot run with a large enough corpus of messages to be particularly
worthwhile, unfortunately.)
There is a multi-GB corpus of messages used to perform these. (The nightly and
weekly mass-checks also use private corpora, but let's just consider the public
stuff for this ticket.) The hardest part of distributing the mass-checks has
been working out how to distribute these messages.
I met up last week with Ian Holsman, who works on Hadoop (among other ASF
projects), and had a bit of a chat; he was pretty sure we'd have no problem
getting comparable or better performance out of Hadoop [1] than out of our
homegrown ssh/scp/https-based mass-check infrastructure. In particular, our
mass-check message distribution code is probably not as good as HDFS [2] ;)
There's a rough sketch below of how this could fit together.
[1]: http://hadoop.apache.org/
[2]: http://hadoop.apache.org/core/docs/current/hdfs_design.html
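To make this a bit more concrete, here's a rough, untested sketch of what a
mass-check mapper could look like under Hadoop Streaming, which would let us
wrap existing tools rather than rewrite anything in Java. It assumes the job
input is a plain text list of HDFS paths to individual messages (one per
line), fetches each message with "hadoop fs -cat", and uses an ordinary
"spamassassin -t" run as a stand-in for the real mass-check script; the
script name, paths and output parsing below are illustrative guesses, not our
actual interfaces.

  #!/usr/bin/env python
  # masscheck_mapper.py (hypothetical name): Hadoop Streaming mapper sketch.
  # Each input line is an HDFS path to one message; for each message we fetch
  # it from HDFS, run it through SpamAssassin, and emit "rulename<TAB>1" so a
  # trivial summing reducer can produce per-rule hit counts.  The real
  # mass-check script has its own log format; this is only an illustration.

  import re
  import subprocess
  import sys

  def check_message(hdfs_path):
      """Fetch one message from HDFS and return the list of rules it hit."""
      # "hadoop fs -cat" streams the message body; "spamassassin -t" scans it.
      cat = subprocess.Popen(["hadoop", "fs", "-cat", hdfs_path],
                             stdout=subprocess.PIPE)
      sa = subprocess.Popen(["spamassassin", "-t"],
                            stdin=cat.stdout, stdout=subprocess.PIPE)
      cat.stdout.close()
      output = sa.communicate()[0].decode("utf-8", "replace")
      cat.wait()
      # The X-Spam-Status header lists the rules that fired, e.g.
      #   X-Spam-Status: Yes, score=7.2 ... tests=BAYES_99,URIBL_BLACK ...
      match = re.search(r"tests=([A-Z0-9_,\s]+)", output)
      if not match:
          return []
      return [r for r in re.split(r"[,\s]+", match.group(1)) if r]

  def main():
      for line in sys.stdin:
          path = line.strip()
          if path:
              for rule in check_message(path):
                  # Streaming output is simply "key<TAB>value" on stdout.
                  sys.stdout.write("%s\t1\n" % rule)

  if __name__ == "__main__":
      main()

A small reducer that sums the 1s per rule would then give hit counts across
the whole corpus, and per-message output (closer to a real mass-check log
line) would just mean emitting the message path as the key instead.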
I'd be interested in giving it a try, if we could rustle up enough
compute/storage nodes to do so.
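Getting a trial off the ground would then mostly amount to loading the public
corpus into HDFS once and submitting the streaming job. Another rough sketch,
in which every local path, HDFS path and the streaming jar location is a
placeholder rather than anything real:

  #!/usr/bin/env python
  # Illustrative driver for a trial run: push the public corpus into HDFS,
  # then submit the streaming job so the per-message checks fan out across
  # whatever nodes we can rustle up.  All names below are placeholders.

  import subprocess

  CORPUS_LOCAL = "/path/to/public-corpus"          # local corpus directory
  CORPUS_HDFS = "/masscheck/corpus"                # HDFS destination
  STREAMING_JAR = "/path/to/hadoop-streaming.jar"  # location varies by release

  def run(args):
      """Run a command, echoing it and failing loudly on error."""
      print(" ".join(args))
      subprocess.check_call(args)

  def main():
      # One-time load of the corpus into HDFS; later runs would only need to
      # push new or changed messages.
      run(["hadoop", "fs", "-mkdir", CORPUS_HDFS])
      run(["hadoop", "fs", "-put", CORPUS_LOCAL, CORPUS_HDFS])

      # The job input is a text file listing one message path per line
      # (assumed to have been generated beforehand), so the framework can
      # split the list across mappers.
      run(["hadoop", "fs", "-put", "message-list.txt", "/masscheck/input"])

      # Submit the streaming job; masscheck_mapper.py is the sketch above,
      # and aggregate_reducer.py stands for a trivial per-rule summing
      # reducer (both hypothetical names).
      run(["hadoop", "jar", STREAMING_JAR,
           "-input", "/masscheck/input",
           "-output", "/masscheck/output",
           "-mapper", "masscheck_mapper.py",
           "-reducer", "aggregate_reducer.py",
           "-file", "masscheck_mapper.py",
           "-file", "aggregate_reducer.py"])

  if __name__ == "__main__":
      main()

HDFS would then look after replicating the corpus blocks across the nodes --
which is exactly the part our current ssh/scp setup has to do by hand.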