On 27 Feb 2005, at 10:29, John Peacock wrote:

What I've always wanted was a fast parser library (probably an XS interface to a C library) which implemented the full BNF for both RFC-2822 and RFC-1341 (and any subsidiary extensions in wide usage). It seems like everyone writes their own parser; this may be because those RFCs are not a complete definition (that's my impression anyway). Or at least take something like the best of both ripmime's and reformime's code and create a single library to do the mechanical parsing, and then overlay that with a Perl OO-ish interface to make it easy to examine the content without having to know anything about the physical message itself.

The only interface to a C library I know of is Mail::Cclient, which interfaces to pine's UW mail parser. But I don't think we want to impose that requirement on our users.


For example:

        my $msg = SuperMail::Parser($file, CLEANUP => 1);
        next unless $msg->multipart;
        foreach my $part ( $msg->body_parts() ) {
                next if $part->type("text/*");
                $dspam->scan($part->file);
                # if virus found, DECLINE or add header
        }

and have the attached file(s) be extracted at the moment they are requested, and the temporary files deleted when the $msg object goes out of scope. Does your mail parser do anything like this?

Sort of... Mine parses everything to temporary file handles at parse time, and the temp files go away when the message goes out of scope. But the version that SpamAssassin adopted parses to memory, so we can switch between the two. Personally I'd like to have a version that went to disk only once a certain size threshold is exceeded, but allowed you to treat the object as transparently the same regardless of backend storage.
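
A rough sketch of that size-threshold idea, to make the discussion concrete. Everything here is an assumption for illustration: the SpillBuffer class, its method names, and the 64 KB default are hypothetical and not taken from any existing parser. It relies only on the core File::Temp and Fcntl modules. The buffer keeps data in a scalar until it passes the threshold, then moves it to a temporary file that is unlinked when the object goes out of scope; callers use the same methods either way.

```perl
use strict;
use warnings;

# Hypothetical sketch: a message-part body that stays in memory until it
# grows past a threshold, then spills to a temporary file.
package SpillBuffer;

use File::Temp ();
use Fcntl qw(SEEK_SET SEEK_END);

sub new {
    my ($class, %args) = @_;
    return bless {
        threshold => $args{threshold} // 64 * 1024,  # max bytes held in memory
        data      => '',                             # in-memory storage
        tmp       => undef,                          # File::Temp object once spilled
    }, $class;
}

# Add a chunk; switch to disk storage once the threshold is exceeded.
sub append {
    my ($self, $chunk) = @_;
    if ($self->{tmp}) {
        print { $self->{tmp} } $chunk;
        return;
    }
    $self->{data} .= $chunk;
    $self->_spill if length($self->{data}) > $self->{threshold};
    return;
}

# Move the in-memory data into a temp file (removed when the object dies).
sub _spill {
    my ($self) = @_;
    my $tmp = File::Temp->new(UNLINK => 1);
    binmode $tmp;
    print {$tmp} $self->{data};
    $self->{tmp}  = $tmp;
    $self->{data} = '';
    return;
}

sub on_disk { return defined $_[0]{tmp} ? 1 : 0 }

# Return the whole body as a string, whichever backend holds it.
sub content {
    my ($self) = @_;
    return $self->{data} unless $self->{tmp};
    my $fh = $self->{tmp};
    $fh->flush;
    seek($fh, 0, SEEK_SET) or die "seek failed: $!";
    my $all = do { local $/; <$fh> };
    seek($fh, 0, SEEK_END) or die "seek failed: $!";  # ready for more appends
    return $all;
}

# Return a path for tools that want a filename, spilling first if needed.
sub filename {
    my ($self) = @_;
    $self->_spill unless $self->{tmp};
    $self->{tmp}->flush;
    return $self->{tmp}->filename;
}

package main;

my $small = SpillBuffer->new(threshold => 16);
$small->append("hello");

my $large = SpillBuffer->new(threshold => 16);
$large->append("x" x 10) for 1 .. 3;

for my $buf ($small, $large) {
    printf "%d bytes, on disk: %s\n",
        length($buf->content), $buf->on_disk ? "yes" : "no";
}
```

A scanner that wants a bytestream could call content(), while one that insists on a path could call filename(), which forces a spill only when it is actually needed.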


One of the biggest problems will be what the AV scanners actually want to see - do they want to scan filenames or do they want to scan a bytestream, or what? I guess we can get around that using temporary files either way though, but for high performance we want to avoid doing extra syscalls writing to disk if possible.

Matt.


