Here's a patch to add PDB detection. It has a ten-character magic
sequence at the start ("HEADER "), and I make sure that it at least
looks something like a PDB file before concluding that it is. There
are regex tests, but they don't actually run unless the initial string
match succeeds, so I don't think the performance hit is particularly
severe.
I don't have a good source of PDB files, so if this magic fails
(either false positive *or* false negative), please attach one or more
samples to this bug report and I'll try to adapt the patch.
Adam Buchbinder
--- file/magic/Magdir/scientific 2009-02-16 10:59:52.000000000 -0500
+++ file/magic/Magdir/scientific 2009-02-18 16:34:11.000000000 -0500
@@ -69,3 +69,32 @@
0 string \060\000\040\000\110\000\105\000\101\000\104\000 GEDCOM data
0 string \376\377\000\060\000\040\000\110\000\105\000\101\000\104 GEDCOM data
0 string \377\376\060\000\040\000\110\000\105\000\101\000\104\000 GEDCOM data
+
+# PDB: Protein Data Bank files
+#
+# Adam Buchbinder <[email protected]>
+#
+# http://www.wwpdb.org/documentation/format32/sect2.html
+# http://www.ch.ic.ac.uk/chemime/
+#
+# The PDB file format is fixed-field, 80 columns. From the spec:
+#
+# COLS DATA
+# 1 - 6 "HEADER"
+# 11 - 50 String(40)
+# 51 - 59 Date
+# 63 - 66 IDcode
+#
+# Thus, positions 7-10, 60-62 and 67-80 are spaces. The Date must be in the
+# format DD-MMM-YY, e.g., 01-JAN-70, and the IDcode consists of numbers and
+# uppercase letters. However, examples have been seen without the date string,
+# e.g., the example on the chemime site.
+
+0 string HEADER\ \ \ \
+>&0 regex/1 \^.{40}
+>>&0 regex/1 [0-9]{2}-[A-Z]{3}-[0-9]{2}\ {3}
+>>>&0 regex/1s [A-Z0-9]{4}.{14}$
+>>>>&0 regex/1 [A-Z0-9]{4} Protein Data Bank data, ID Code %s
+!:mime chemical/x-pdb
+>>>>0 regex/1 [0-9]{2}-[A-Z]{3}-[0-9]{2} \b, %s
+