Hi All

At the moment, to detect RFC822 emails, we first check for a bunch of common header lines right at the start. If none of those match, we check for a few prefixes that "could be an unusual header, could be some text", and then check for common headers in a larger area of text below.

For example, it matches if it starts with "Received:", or if it starts with "X-" and has "\nReceived:" near that. In the mime-magic it's
https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L6100

After a recent bug, we now have 3 different "could be a header, not sure" blocks at the start (X-, DKIM- or ARC-), all with exactly the same block of possible real headers below. These need to be kept in sync between the 3 initial matches, and if they drift apart that could cause bugs.

Ideally, I'd like to group those three together to avoid that, simplify things, and make it easier to understand.


One option might be to make the first bit a regexp, so we can do eg ^((X-)|(DKIM-)|(ARC-)) to match all of them. I'm not sure if that's clearer, nor what the performance would be like. We could maybe even then add the other headers to check in after, if that doesn't make it too hard to understand?
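
To make the regexp idea concrete, here's a rough sketch of the logic in Python (just for brevity, this is not Tika code and not the mime-magic syntax). The list of "real" headers, the 1024 byte window and the function name are all made up for illustration; the actual list lives in tika-mimetypes.xml.

```python
import re

# One pattern for all three "could be a header, not sure" prefixes.
# Used with .match(), so it only matches at the very start of the data.
AMBIGUOUS_START = re.compile(rb"^(X-|DKIM-|ARC-)")

# Illustrative subset of "real" headers to look for further down.
# Assumption: this is NOT the actual list used in tika-mimetypes.xml.
REAL_HEADER = re.compile(rb"\n(Received|Message-ID|From|Date|Subject):")


def looks_like_rfc822(data: bytes, window: int = 1024) -> bool:
    """Hypothetical helper: window size is an arbitrary choice."""
    head = data[:window]
    # Unambiguous case: a common header right at the start.
    if head.startswith(b"Received:"):
        return True
    # Ambiguous prefix: only accept if a real header follows nearby.
    return bool(AMBIGUOUS_START.match(head)) and bool(REAL_HEADER.search(head))
```

The point is that the shared block of real headers appears once, rather than once per prefix. Whether a regexp is cheap enough at this point in detection is a separate question this sketch doesn't answer.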

Alternatively, we could maybe tweak the xml to support an "or" construct, so you could give multiple patterns to match at one level, with the multiple "normal or's" below them?
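
As a rough model of what I mean by the "or" construct (again Python only to show the semantics, with a hypothetical OrGroup class and an invented follower list; the xml syntax itself is still to be decided):

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OrGroup:
    """Hypothetical model: any one start AND any one follower must match."""
    starts: tuple      # alternatives, any one must match at offset 0
    followers: tuple   # alternatives, any one must appear within the window
    window: int = 1024  # arbitrary size, for illustration only

    def matches(self, data: bytes) -> bool:
        head = data[: self.window]
        # bytes.startswith accepts a tuple of alternatives
        return head.startswith(self.starts) and any(f in head for f in self.followers)


# One rule replaces three near-identical blocks; follower list is illustrative.
RULE = OrGroup(
    starts=(b"X-", b"DKIM-", b"ARC-"),
    followers=(b"\nReceived:", b"\nFrom:"),
)
```

Adding a fourth ambiguous prefix would then be a one-line change to the starts, with nothing to keep in sync.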

Or something else?

Any thoughts anyone?

Thanks
Nick
