Riccardo Vianello: > I suppose that if the correctness of the parser is confirmed, then a change > could be suggested for the writer, consisting in raising an error if blank > lines are present inside the data item.
Yes, the SD tag data is not a general purpose data field. It's not possible, for example, to embed the contents of an SD file as a value. According to the spec: A [Data] value may extend over multiple lines containing up to 200 characters each. A blank line terminates each data item. The failure cases are: a data value that starts with a newline, or ends with a newline, or contains two or more successive newline characters. Some file readers, like Python's "universal" mode, will normalize "\r\n" to "\n" on input, so "\r\n\r\n" and variants are also problematic. However, experience shows that there are many incorrectly written SD parsers. Some think that the data extends until the next line that starts with '>' or line that is '$$$$'. For example, one organization decided to use '$', '$$', '$$$', and '$$$$' as rough estimates of the cost of a compound. It might have looked like this: .... M END > <cost> $$$$ > <MW> 123.45 ... Some non-compliant parsers interpret the '$$$$' as the end of the compound record, rather than the data value it's supposed to be. In practice, enough SD parsers are broken in this regards that the best practice for a workflow is to avoid those byte sequences if at all possible. A common solution to store free-form data is to use base64 encoding. >>> data = "\nHello\n\n> <and>\ngoodbye"*10 >>> print(data.encode("base64")) CkhlbGxvCgo+IDxhbmQ+Cmdvb2RieWUKSGVsbG8KCj4gPGFuZD4KZ29vZGJ5ZQpIZWxsbwoKPiA8 YW5kPgpnb29kYnllCkhlbGxvCgo+IDxhbmQ+Cmdvb2RieWUKSGVsbG8KCj4gPGFuZD4KZ29vZGJ5 ZQpIZWxsbwoKPiA8YW5kPgpnb29kYnllCkhlbGxvCgo+IDxhbmQ+Cmdvb2RieWUKSGVsbG8KCj4g PGFuZD4KZ29vZGJ5ZQpIZWxsbwoKPiA8YW5kPgpnb29kYnllCkhlbGxvCgo+IDxhbmQ+Cmdvb2Ri eWU= This also ensures that the line length never exceeds 200 characters. (I have no experience on what sort of errors might occur should the line length exceed 200 characters.) The SD format has no way to indicate the encoding, so the information about how tag data is encoded must be passed through other means. This is unfortunate. And while I'm here, it's also a bad idea to have a "\0" (NUL) in the data, as some tools use C's string functions on the assumption that a "\0" does not exist. RDKit will write the "\0": >>> from rdkit import Chem >>> mol = Chem.MolFromSmiles("C") >>> mol.SetProp("abc", "x\0z") >>> mol.GetProp("abc") 'x\x00z' >>> writer = Chem.SDWriter("tmp.sdf") >>> writer.write(mol) >>> writer.close() >>> content = open("tmp.sdf").read() >>> "\0" in content True On Apr 29, 2015, at 12:27 PM, Tuomo Kalliokoski wrote: > I suppose the current functionality of RDKit, irrespective to the SDF file > format specifications, is a bit odd: SDWriter produces file that > SDMolSupplier can't handle. All of the toolkits I've used have the same behavior. They trust that the user of the API knows to not pass arbitrary data as the value. ChemAxon's toolkit, as you saw, can produce invalid SD files. You might see what happens if you add "x\n\n> <data2>\nvalue" as the value; I suspect you'll end up with a new "data2" tag. I know that Open Babel and OEChem will also forward the value unchanged. I can see the argument that RDKit should check for "\n\n" in the data. What should it do? It's built on the GetProp/SetProp mechanism, which allows arbitrary string data, and it's reasonable to SetProp() a value containing a "\n\n" for purposes other than writing to an SD file, so it has to be done in the reader or writer layer: Here are some possibilities for changing the writer: - stop writing to the file, with an error - skip records which contain tags with forbidden values - skip tags which contains forbidden values - convert multiple newlines into one (including the edge cases) - also enforce the 200 character restriction - also enforce a check for well-known legal but ill-advised character sequences like a line starting with ">", or starting with "$$$$", or containing a "\0". Paolo's suggestion is to change the reader to be more lenient: - throw FileParseException("Problems encountered parsing data fields"); + BOOST_LOG(rdWarningLog) + << "Ignoring spurious lines encountered parsing data fields" + << std::endl; While it's true that this would be lenient, it wouldn't handle Tuomo's problem. Tuomo has the following: > <data1> X Y > <data2> Whatever Tuomo wants data1 to become "X\n\n\nY". Paolo's patch will set data1 to "X", and generate a warning for the spurious Y but otherwise ignore it. I do not believe this is any better for Tuomo that what RDKit does now -- actually, it's worse because it's data loss and people tend to ignore warnings. I think the problem is, what should the writer do if given data which cannot be represented as an SD data value? Suppose that one of Tuomo's users added "X\n\nY\n><data1>\nabc" for the value of data1. The output would be the following, which is not a valid SD file: > <data1> X Y ><data1> abc With Paolo's patch, the reader would see "data1" as the value "X", generate a warning for the spurious "Y" but do nothing with it, then see a redefinition of "data1" with content "abc". Thus, I do not think it's an appropriate patch. I don't have a good solution. Were it me, I would have the writer fail should any unsupported value be present in the output, including those which are allowed by the SD specification but will cause problems in practice, like embedded "\0" and leading "$$$$". This is because my "threat model" - the circumstances I'm trying to prevent - is to prevent malicious people like me from crashing a system written by people who don't have intimate knowledge of how to work with SD files through a chemistry API. In that case, the API needs to be very strict about what it does. That's a very unusual threat model. The RDKit threat model, shared by the other toolkits, is that people might generate this data by accident, but will be able to investigate and fix the error. In this case it's fine to simply stop processing, with an error report, rather than to warn and try to continue processing. Cheers, Andrew da...@dalkescientific.com ------------------------------------------------------------------------------ One dashboard for servers and applications across Physical-Virtual-Cloud Widest out-of-the-box monitoring support with 50+ applications Performance metrics, stats and reports that give you Actionable Insights Deep dive visibility with transaction tracing using APM Insight. http://ad.doubleclick.net/ddm/clk/290420510;117567292;y _______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss