On Apr 30, 2015, at 6:08 AM, Greg Landrum wrote:
> I still need to put some thought into patching the SDWriter so that it can 
> recognize things like consecutive line endings in property values. The big 
> question is what it should do when it encounters such a case. Is that an 
> error? Should it just write the output up to the blank line?

I think it's a question of your goal. Who is the target audience? How much base 
knowledge should they have? Then use that to guide which checks are worthwhile 
and which aren't.

I interpret the current RDKit design as being meant for people who understand 
the limits of the underlying format, and who won't do things to break it; or if 
they break it, will be able to understand how to resolve the problem. (E.g., with 
base64 encoding, or going on this list to ask for help.)
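
As a rough illustration of what "the limits of the underlying format" means for property values, here is a minimal sketch of a pre-write check. The function name sd_value_problems is hypothetical, it is not part of RDKit, and the three conditions it tests are only the ones that come up in this thread, not a complete list.

```python
def sd_value_problems(value):
    """Return a list of reasons why `value` is risky as an SD data value.

    Hypothetical helper for illustration only; not an RDKit function.
    """
    problems = []
    if "\0" in value:
        problems.append("contains a NUL character")
    lines = value.splitlines()
    # A blank line terminates a data item, so the rest of the value is lost
    # or misread when the file is parsed again.
    if any(line.strip() == "" for line in lines):
        problems.append("contains a blank line, which ends the data item")
    # "$$$$" on its own line is the record delimiter.
    if any(line.startswith("$$$$") for line in lines):
        problems.append("contains a '$$$$' line, which ends the record")
    return problems
```

An empty list means none of these particular problems were found; what to do with a non-empty list (raise, warn, or encode) is exactly the policy question above.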

It's also possible to have a goal of preventing people from using RDKit to 
create an invalid SD file. For example, here is another way to create a 
corrupt, or at least ambiguous, file: 

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("C")
>>> mol.SetProp("a>b", "xyz")
>>> writer = Chem.SDWriter("tmp.sdf")
>>> writer.write(mol)
>>> writer.close()
>>> content = open("tmp.sdf").read()
>>> print(content)

     RDKit          

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
>  <a>b>  (1) 
xyz

$$$$

The spec says:

    Note: The > sign is a reserved character. A field name cannot
    contain hyphen (-), period (.), less than (<), greater than
    (>), equal sign (=), percent sign (%) or blank space ( ). Field
    names must begin with an alpha character and can contain alpha
    and numeric characters after that, including underscore.

while RDKit allows a hyphen and other characters in the tag.
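
The field-name rule quoted above is simple enough to check with a regular expression. This is a minimal sketch, assuming "alpha" means the ASCII letters A-Z and a-z; the helper name is_valid_sd_field_name is hypothetical and is not an RDKit function.

```python
import re

# Rule as quoted from the spec: begin with a letter, then letters,
# digits, or underscores only. ASCII-only is an assumption.
_SD_FIELD_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

def is_valid_sd_field_name(name):
    """Return True if `name` satisfies the quoted field-name rule."""
    return _SD_FIELD_NAME.fullmatch(name) is not None
```

With this check, "MELTING_POINT" passes, while "a>b", "a-b", and "__computedProps" are all rejected.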

This also causes a problem on input, because RDKit intermingles tag data with 
internal properties. Even though "__computedProps" is not a legal SD tag name, 
RDKit will read it, and give an error when it ends up trying to use that value 
as if it were real data:

>>> from rdkit import Chem
>>> print(open("tmp.sdf").read())

     RDKit          

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
>  <__computedProps>  (1) 
Never

$$$$

>>> for mol in Chem.ForwardSDMolSupplier("tmp.sdf"):
...   print(Chem.MolToSmiles(mol))
... 
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
RuntimeError: boost::bad_any_cast: failed conversion using boost::any_cast

This means that users of an RDKit-backed web service can easily cause the 
back-end to raise an exception when it tries to do chemistry. RDKit is not 
designed with malicious users in mind.
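
One way a service could defend itself is to scan the incoming SD text for suspicious tag names before handing it to RDKit. The following is only a sketch: suspect_sd_tags is a hypothetical helper, its idea of where a data header line can appear (after "M  END" or after a blank line) is a simplification, and it uses the field-name rule from the spec with "alpha" assumed to mean ASCII letters.

```python
import re

_HEADER_TAG = re.compile(r"<(.*)>")                # text between the outermost < and >
_SPEC_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")  # field-name rule, ASCII assumed

def suspect_sd_tags(sd_text):
    """Return field names from data header lines that break the spec rule.

    Hypothetical pre-filter for illustration; not an RDKit function.
    """
    suspects = []
    expect_header = False
    for line in sd_text.splitlines():
        # A data header can follow the end of the molfile or a blank line.
        if line.startswith("M  END") or line.strip() == "":
            expect_header = True
            continue
        if expect_header and line.startswith(">"):
            match = _HEADER_TAG.search(line)
            if match and not _SPEC_NAME.fullmatch(match.group(1)):
                suspects.append(match.group(1))
        expect_header = False
    return suspects
```

A caller could reject the upload, or strip the offending data items, when the returned list is not empty. Wrapping the RDKit calls in a try/except for RuntimeError would be the other obvious safeguard.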


When I come across these things now in RDKit, I follow the "Doctor, doctor, it 
hurts when I hit myself in the head with a hammer" principle - "Well, don't do 
that."

I think it's worthwhile to have tools which are designed for knowledgeable users 
who know not to do certain things.

I also think it's worthwhile to have tools where people don't need that 
knowledge, but these are harder to develop.

A surprisingly common refrain over the decades has been for the new generation 
of users to complain how the tools were developed for the "priesthood" of the 
experienced people in the previous generation. I empathize with that.

So long as we use SD files, all we can do is add extra sanity checks.

Once you figure out the new goal, then I can start filing new classes of bugs. 
;)

Here's a wild thought. The tag line also allows:

  - The field number DTn
  - The compound’s external and internal registry numbers.
  - Any combination of information

There's a hodgepodge of examples.
 
> <MELTING_POINT>
> 55     (MD-08974)     <BOILING_POINT>   DT12
> DT12   55
> (MD-0894)   <BOILING_POINT>   FROM ARCHIVES

All the tools I know about ignore terms which aren't in "<>"s, or "()"s.

We could say that the presence of the word "BASE64" is the new convention that 
the tag values are base64 encoded, folded across multiple lines, and "BASE64T" 
means that the title is base64 encoded, without newline folding. "BASE64B" would 
mean that both the title and the body were base64 encoded.

Then when you have non-conformant data, automatically encode it, and label the 
field appropriately.
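
To make the idea concrete, here is a minimal sketch of the "BASE64" case only (encoded value, plain title). Everything in it is hypothetical: the convention itself is just the proposal above, the function names are made up, the 76-character fold width is an arbitrary choice, and UTF-8 is an assumed text encoding.

```python
import base64
import re

def encode_data_item(tag, value, width=76):
    """Format one SD data item with a base64-encoded, line-folded value.

    The header carries the word BASE64 outside the <> part, as proposed.
    """
    encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
    folded = [encoded[i:i + width] for i in range(0, len(encoded), width)]
    return ">  <%s>  BASE64\n%s\n\n" % (tag, "\n".join(folded))

def decode_data_item(header, value_lines):
    """Recover the value from a header line and its data lines."""
    # Look for the marker only outside the <...> field name.
    words = re.sub(r"<.*>", "", header).split()
    if "BASE64" in words:
        return base64.b64decode("".join(value_lines)).decode("utf-8")
    return "\n".join(value_lines)
```

Because base64 output never contains a blank line, ">" or "$", a value such as "x\n\nz" survives the round trip, and tools that ignore unknown words on the tag line still see a well-formed data item.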

Cheers,


                                Andrew
                                da...@dalkescientific.com



_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
