Mime type magic and repeated similar blocks - thoughts?

Nick Burch Tue, 09 Jun 2020 04:05:18 -0700

Hi All

At the moment, to detect RFC822 emails, we try and check for a bunch of common header lines right at the start. If none of those match, we check for a few "could be an unusual header, could be some text" prefixes, followed by checking for common headers in a larger area of text below.

For example, starts with "Received:", or starts with "X-" and has "\nReceived:" near that. In mime-magic it's

https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L6100

After a recent bug, we now have 3 different "could be a header, not sure" blocks at the start (X-, DKIM- or ARC-), all with exactly the same block of possible real headers below. These need to be kept in sync between the 3 initial matches, and if they drift apart that could cause bugs

Ideally, I'd like to group those three together to avoid that, simplify things, and make it easier to understand

One option might be to make the first bit a regexp, so we can do eg ^((X-)|(DKIM-)|(ARC-)) to match all of them. Not sure if that's clearer, nor what the performance would be like? Could maybe even then add the other headers to check in after, if that doesn't make it too hard to understand?
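
As a rough sketch of what that single regexp would do, here it is with plain java.util.regex rather than the mime-magic matcher itself. The class name, method name and the follow-up header pattern are made up for illustration; only "Received:" comes from the real block, the rest are stand-ins.

```java
import java.util.regex.Pattern;

public class HeaderPrefixSketch {
    // The three "could be a header, not sure" prefixes folded into one pattern.
    // Without MULTILINE, ^ only matches at the very start of the input.
    private static final Pattern UNSURE_PREFIX =
            Pattern.compile("^(?:X-|DKIM-|ARC-)");

    // Assumption: a stand-in for the shared block of "real" headers.
    // Only Received: is taken from the existing magic; From: and Subject:
    // are illustrative.
    private static final Pattern REAL_HEADER =
            Pattern.compile("\n(?:Received|From|Subject):");

    // True if the text starts with one of the unsure prefixes AND a real
    // header shows up at the start of some later line.
    public static boolean looksLikeEmail(String head) {
        return UNSURE_PREFIX.matcher(head).find()
                && REAL_HEADER.matcher(head).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeEmail("DKIM-Signature: v=1\nReceived: from a"));
        System.out.println(looksLikeEmail("X-notes for today\nhello"));
    }
}
```

The point is just that the list of real headers then lives in one place, whichever prefix matched. It says nothing about how the regexp would perform inside the real detector.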

Alternatively, we could maybe tweak the xml to support an "or" construct, so you could give multiple ones to match at one level, with multiple "normal or's" below?
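
To show the shape I mean, here is a small Java model of such an "or" group: several alternatives at one level sharing a single list of checks below. This is not how the xml or the detector is written today; the class and field names are hypothetical, and "\nFrom:" is only there as a second example header.

```java
import java.util.Arrays;
import java.util.List;

public class OrGroupSketch {
    // Any one of these may match at the start (the proposed "or" level).
    private final List<String> startsWithAny;
    // The single shared block below it: any one of these must appear.
    private final List<String> thenContainsAny;

    public OrGroupSketch(List<String> startsWithAny, List<String> thenContainsAny) {
        this.startsWithAny = startsWithAny;
        this.thenContainsAny = thenContainsAny;
    }

    public boolean matches(String head) {
        boolean prefixMatched = false;
        for (String prefix : startsWithAny) {
            if (head.startsWith(prefix)) {
                prefixMatched = true;
                break;
            }
        }
        if (!prefixMatched) {
            return false;
        }
        for (String header : thenContainsAny) {
            if (head.contains(header)) {
                return true;
            }
        }
        return false;
    }

    // Assumption: illustrative values only, not the real header list.
    public static OrGroupSketch emailGuess() {
        return new OrGroupSketch(
                Arrays.asList("X-", "DKIM-", "ARC-"),
                Arrays.asList("\nReceived:", "\nFrom:"));
    }
}
```

Adding a fourth unsure prefix would then be a one-line change, with nothing below it to copy.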


Or something else?

Any thoughts anyone?

Thanks
Nick
