Re: [Rdkit-discuss] SDF tags and "->"

Andrew Dalke Wed, 29 Apr 2015 06:12:08 -0700

Riccardo Vianello:
>  I suppose that if the correctness of the parser is confirmed, then a change 
> could be suggested for the writer, consisting in raising an error if blank 
> lines are present inside the data item.

Yes, the SD tag data is not a general purpose data field. It's not possible, 
for example, to embed the contents of an SD file as a value.

According to the spec:

   A [Data] value may extend over multiple lines containing up to 200
   characters each. A blank line terminates each data item.

The failure cases are: a data value that starts with a newline, or ends with a 
newline, or contains two or more successive newline characters. Some file 
readers, like Python's "universal" mode, will normalize "\r\n" to "\n" on 
input, so "\r\n\r\n" and variants are also problematic.

However, experience shows that there are many incorrectly written SD parsers. 
Some think that the data extends until the next line that starts with '>' or 
line that is '$$$$'. For example, one organization decided to use '$', '$$', 
'$$$', and '$$$$' as rough estimates of the cost of a compound. It might have 
looked like this:

     ....
  M  END
  > <cost>
  $$$$

  > <MW>
  123.45

   ...

Some non-compliant parsers interpret the '$$$$' as the end of the compound 
record, rather than the data value it's supposed to be.

In practice, enough SD parsers are broken in this regards that the best 
practice for a workflow is to avoid those byte sequences if at all possible.

A common solution to store free-form data is to use base64 encoding.

  >>> data = "\nHello\n\n> <and>\ngoodbye"*10
  >>> print(data.encode("base64"))
  CkhlbGxvCgo+IDxhbmQ+Cmdvb2RieWUKSGVsbG8KCj4gPGFuZD4KZ29vZGJ5ZQpIZWxsbwoKPiA8
  YW5kPgpnb29kYnllCkhlbGxvCgo+IDxhbmQ+Cmdvb2RieWUKSGVsbG8KCj4gPGFuZD4KZ29vZGJ5
  ZQpIZWxsbwoKPiA8YW5kPgpnb29kYnllCkhlbGxvCgo+IDxhbmQ+Cmdvb2RieWUKSGVsbG8KCj4g
  PGFuZD4KZ29vZGJ5ZQpIZWxsbwoKPiA8YW5kPgpnb29kYnllCkhlbGxvCgo+IDxhbmQ+Cmdvb2Ri
  eWU=

This also ensures that the line length never exceeds 200 characters. (I have no 
experience on what sort of errors might occur should the line length exceed 200 
characters.)

The SD format has no way to indicate the encoding, so the information about how 
tag data is encoded must be passed through other means. This is unfortunate.

And while I'm here, it's also a bad idea to have a "\0" (NUL) in the data, as 
some tools use C's string functions on the assumption that a "\0" does not 
exist. RDKit will write the "\0":

  >>> from rdkit import Chem
  >>> mol = Chem.MolFromSmiles("C")
  >>> mol.SetProp("abc", "x\0z")
  >>> mol.GetProp("abc")
  'x\x00z'
  >>> writer = Chem.SDWriter("tmp.sdf")
  >>> writer.write(mol)
  >>> writer.close()
  >>> content = open("tmp.sdf").read()
  >>> "\0" in content
  True

On Apr 29, 2015, at 12:27 PM, Tuomo Kalliokoski wrote:
> I suppose the current functionality of RDKit, irrespective to the SDF file 
> format specifications, is a bit odd: SDWriter produces file that 
> SDMolSupplier can't handle.

All of the toolkits I've used have the same behavior. They trust that the user 
of the API knows to not  pass arbitrary data as the value.

ChemAxon's toolkit, as you saw, can produce invalid SD files. You might see 
what happens if you add "x\n\n> <data2>\nvalue" as the value; I suspect you'll 
end up with a new "data2" tag.

I know that Open Babel and OEChem will also forward the value unchanged.

I can see the argument that RDKit should check for "\n\n" in the data. What 
should it do? It's built on the GetProp/SetProp mechanism, which allows 
arbitrary string data, and it's reasonable to SetProp() a value containing a 
"\n\n" for purposes other than writing to an SD file, so it has to be done in 
the reader or writer layer:

Here are some possibilities for changing the writer:

  - stop writing to the file, with an error
  - skip records which contain tags with forbidden values
  - skip tags which contains forbidden values
  - convert multiple newlines into one (including the edge cases)
  - also enforce the 200 character restriction
  - also enforce a check for well-known legal but ill-advised
      character sequences like a line starting with ">", or
      starting with "$$$$", or containing a "\0".

Paolo's suggestion is to change the reader to be more lenient:

-    throw FileParseException("Problems encountered parsing data fields");    
+    BOOST_LOG(rdWarningLog)    
+        << "Ignoring spurious lines encountered parsing data fields"    
+        << std::endl;

While it's true that this would be lenient, it wouldn't handle Tuomo's problem. 
Tuomo has the following:

  > <data1>
  X

  Y

  > <data2>
  Whatever

Tuomo wants data1 to become "X\n\n\nY".

Paolo's patch will set data1 to "X", and generate a warning for the spurious Y 
but otherwise ignore it. I do not believe this is any better for Tuomo that 
what RDKit does now -- actually, it's worse because it's data loss and people 
tend to ignore warnings.

I think the problem is, what should the writer do if given data which cannot be 
represented as an SD data value? Suppose that one of Tuomo's users added 
"X\n\nY\n><data1>\nabc" for the value of data1. The output would be the 
following, which is not a valid SD file:

  > <data1>
  X

  Y
  ><data1>
  abc

With Paolo's patch, the reader would see "data1" as the value "X", generate a 
warning for the spurious "Y" but do nothing with it, then see a redefinition of 
"data1" with content "abc".

Thus, I do not think it's an appropriate patch.

I don't have a good solution. Were it me, I would have the writer fail should 
any unsupported value be present in the output, including those which are 
allowed by the SD specification but will cause problems in practice, like 
embedded "\0" and leading "$$$$".

This is because my "threat model" - the circumstances I'm trying to prevent - 
is to prevent malicious people like me from crashing a system written by people 
who don't have intimate knowledge of how to work with SD files through a 
chemistry API. In that case, the API needs to be very strict about what it does.

That's a very unusual threat model. The RDKit threat model, shared by the other 
toolkits, is that people might generate this data by accident, but will be able 
to investigate and fix the error. In this case it's fine to simply stop 
processing, with an error report, rather than to warn and try to continue 
processing.

Cheers,

                                Andrew
                                da...@dalkescientific.com

------------------------------------------------------------------------------
One dashboard for servers and applications across Physical-Virtual-Cloud 
Widest out-of-the-box monitoring support with 50+ applications
Performance metrics, stats and reports that give you Actionable Insights
Deep dive visibility with transaction tracing using APM Insight.
http://ad.doubleclick.net/ddm/clk/290420510;117567292;y
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

Re: [Rdkit-discuss] SDF tags and "->"

Reply via email to