https://issues.apache.org/SpamAssassin/show_bug.cgi?id=5845

           Summary: Hadoopify large-scale mass-check infrastructure
           Product: Spamassassin
           Version: unspecified
          Platform: Other
        OS/Version: other
            Status: NEW
          Severity: normal
          Priority: P5
         Component: Masses
        AssignedTo: [email protected]
        ReportedBy: [EMAIL PROTECTED]


We have an increasingly complex infrastructure for performing batch jobs -- our
mass-checks.  These are as follows:

1. nightly mass-checks, one per day, with all rules and no network accesses

2. weekly mass-checks, with all rules and network accesses permitted

3. per-commit mass-checks, several per day, with a small number of rules and a
small number of messages, to give fast feedback on how rules perform. (Right
now these cannot run with a large enough corpus of messages to be particularly
worthwhile, unfortunately.)

There is a multi-GB corpus of messages used to perform these.  (The nightly and
weekly mass-checks also use private corpora, but let's just consider the public
stuff for this ticket.) The hardest part of distributing the mass-checks has
been working out how to distribute these messages.

Last week I met up with Ian Holsman, who works on Hadoop (among other ASF
projects), and had a bit of a chat; he was pretty sure we'd have no problem
getting comparable or better performance out of Hadoop [1] than out of our
own homegrown ssh/scp/https-based mass-check infrastructure. In particular,
our mass-check message distribution code is probably not as good as HDFS [2] ;)

[1]: http://hadoop.apache.org/

[2]: http://hadoop.apache.org/core/docs/current/hdfs_design.html
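
Getting the public corpus into HDFS in the first place looks pretty
straightforward, for what it's worth.  Here's a minimal, untested sketch
using the Hadoop FileSystem API -- the class name and the local/HDFS paths
are made up purely for illustration:

  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.fs.FileSystem;
  import org.apache.hadoop.fs.Path;

  // Copies a local corpus directory into HDFS, where the data nodes can
  // then serve it to distributed mass-check tasks.
  public class CorpusLoader {
      public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration();  // reads core-site.xml etc.
          FileSystem fs = FileSystem.get(conf);
          // e.g. args[0] = /local/corpus/ham, args[1] = /corpus/ham
          // (hypothetical paths)
          fs.copyFromLocalFile(new Path(args[0]), new Path(args[1]));
          fs.close();
      }
  }

HDFS would then handle replication and locality for us, instead of the
current rsync/scp shuffling.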

I'd be interested in giving it a try, if we could rustle up enough
compute/storage nodes to do so.
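
To make that a little more concrete: one way a "try" might look is a map
task per slice of a corpus manifest, with each task shelling out to the
existing mass-check script and emitting its log lines.  This is a rough,
untested sketch -- the "class path" input format and the mass-check target
spec are invented for illustration, not a design:

  import java.io.BufferedReader;
  import java.io.InputStreamReader;
  import org.apache.hadoop.io.LongWritable;
  import org.apache.hadoop.io.NullWritable;
  import org.apache.hadoop.io.Text;
  import org.apache.hadoop.mapreduce.Mapper;

  // Each input line is assumed to be "<ham|spam> <path>" pointing at
  // messages the task can reach.  The mapper runs mass-check over that
  // target and emits whatever log lines it produces.
  public class MassCheckMapper
          extends Mapper<LongWritable, Text, Text, NullWritable> {

      @Override
      protected void map(LongWritable key, Text value, Context context)
              throws java.io.IOException, InterruptedException {
          String[] fields = value.toString().split("\\s+", 2);
          // Exact mass-check arguments elided; this just shows the shape.
          ProcessBuilder pb = new ProcessBuilder(
                  "./mass-check", fields[0] + ":dir:" + fields[1]);
          pb.redirectErrorStream(true);
          Process p = pb.start();
          BufferedReader out = new BufferedReader(
                  new InputStreamReader(p.getInputStream()));
          String line;
          while ((line = out.readLine()) != null) {
              context.write(new Text(line), NullWritable.get());
          }
          p.waitFor();
      }
  }

The driver would just point TextInputFormat at a manifest file in HDFS; the
default identity reducer should be enough to collect the per-message hit
lines back into a single log for hit-frequencies / the GA.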


