I like the regex option, and I _think_ that the anchor at the beginning (along with the lack of backtracking) shouldn't cause horrible performance degradation.
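
To make that concrete, here is a minimal sketch in plain Java of the two-step check, not the actual mime-magic XML or Tika's detector code. The class name, the 1024-character window, and the short list of "real" headers are assumptions for illustration only; the full header list is the one in tika-mimetypes.xml.

```java
import java.util.regex.Pattern;

public class Rfc822Sniff {
    // One anchored alternation stands in for the three separate
    // "could be a header, not sure" prefixes (X-, DKIM-, ARC-).
    private static final Pattern MAYBE_HEADER =
            Pattern.compile("^(?:X-|DKIM-|ARC-)");

    // Hypothetical subset of the "real header" checks, for illustration only.
    private static final Pattern REAL_HEADER =
            Pattern.compile("\\n(?:Received|Message-ID|From):");

    // Hypothetical size of the area searched for a real header.
    private static final int WINDOW = 1024;

    static boolean looksLikeRfc822(String head) {
        // lookingAt() only tries a match at offset 0, so a non-matching
        // input is rejected after inspecting a handful of characters.
        if (!MAYBE_HEADER.matcher(head).lookingAt()) {
            return false;
        }
        // A single shared block of real-header checks follows, so there is
        // nothing to keep in sync between the three prefixes.
        String window = head.substring(0, Math.min(head.length(), WINDOW));
        return REAL_HEADER.matcher(window).find();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeRfc822("X-Mailer: foo\nReceived: from a\n"));
        System.out.println(looksLikeRfc822("DKIM-Signature: v=1\nReceived: by x\n"));
        System.out.println(looksLikeRfc822("X-ray notes\nnothing else here"));
    }
}
```

Because the alternation is three short literals pinned to the start of the input, the first step either matches or fails almost immediately, which is why I would not expect it to hurt performance much.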

On Tue, Jun 9, 2020 at 7:04 AM Nick Burch <[email protected]> wrote:

> Hi All
>
> At the moment, to detect RFC822 emails, we try and check for a bunch of
> common header lines right at the start. If not, we check for a few "could
> be an unusual header, could be some text", followed by checking for common
> headers in a larger area of text below.
>
> For example, starts with "Received:" or starts with "X-" and has
> "\nReceived:" near that, in mime-magic it's
>
> https://github.com/apache/tika/blob/master/tika-core/src/main/resources/org/apache/tika/mime/tika-mimetypes.xml#L6100
>
> After a recent bug, we now have 3 different "could be a header not sure"
> blocks at the start (X-, DKIM- or ARC-), all with exactly the same block
> of possible real headers below. These need to be kept in sync between the
> 3 initial matches, and if not could cause bugs
>
> Ideally, I'd like to group those three together to avoid that + simplify +
> make it easier to understand
>
> One option might be to make the first bit a regexp, so we can do e.g.
> ^((X-)|(DKIM-)|(ARC-)) to match all of them. Not sure if that's clearer,
> nor the performance? Could maybe even then add the other headers to check
> in after, if that doesn't make it too hard to understand?
>
> Alternately, we could maybe tweak the xml to support an or construct, so
> you could give multiple ones to match at one level with multiple "normal
> or's" below?
>
> Or something else?
>
> Any thoughts anyone?
>
> Thanks
> Nick
