On Apr 30, 2015, at 6:08 AM, Greg Landrum wrote:
> I still need to put some thought into patching the SDWriter so that it can
> recognize things like consecutive line endings in property values. The big
> question is what it should do when it encounters such a case. Is that an
> error? Should it just write the output up to the blank line?
I think it's a question of your goal. Who is the target audience? How much base
knowledge should they have? Then use that to guide which checks are worthwhile
and which aren't.
I interpret the current RDKit design as being meant for people who understand
the limits of the underlying format, and who won't do things to break it; or if
they break it, will be able to understand how to resolve the problem. (Eg, with
base64 encoding, or going on this list to ask for help.)
It's also possible to have a goal of preventing people from using RDKit to
create an invalid SD file. For example, here is another way to create a
corrupt, or at least ambiguous, file:
>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("C")
>>> mol.SetProp("abc", "x\0z")
>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("C")
>>> mol.SetProp("a>b", "xyz")
>>> writer = Chem.SDWriter("tmp.sdf")
>>> writer.write(mol)
>>> writer.close()
>>> content = open("tmp.sdf").read()
>>> print(content)
RDKit
1 0 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
M END
> <a>b> (1)
xyz
$$$$
The spec says:
Note: The > sign is a reserved character. A field name cannot
contain hyphen (-), period (.), less than (<), greater than
(>), equal sign (=), percent sign (%) or blank space ( ). Field
names must begin with an alpha character and can contain alpha
and numeric characters after that, including underscore.
while RDKit allows a hyphen and other characters in the tag.
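A sanity check based on that paragraph of the spec could look something like the sketch below. The name is_legal_sd_field_name is made up for illustration and is not an RDKit function, and it assumes "alpha" means the ASCII letters A-Z and a-z.

```python
import re

# Hypothetical helper, not part of RDKit. It encodes the rule quoted
# above: start with a letter, then only letters, digits, and underscores.
# Assumption: "alpha" means ASCII A-Z / a-z.
_FIELD_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*")


def is_legal_sd_field_name(name):
    """Return True if 'name' satisfies the quoted field-name rule."""
    return _FIELD_NAME.fullmatch(name) is not None


if __name__ == "__main__":
    for name in ("MELTING_POINT", "a>b", "a-b", "__computedProps"):
        print(name, is_legal_sd_field_name(name))
```

Under this reading, "a>b" and "a-b" are rejected because of the reserved characters, and "__computedProps" is rejected because it does not begin with a letter.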
This also causes a problem on input, because RDKit intermingles tag data with
internal properties. Even though "__computedProps" is not a legal SD tag name,
RDKit will read it, and give an error when it ends up trying to use that value
as if it were real data:
>>> from rdkit import Chem
>>> print(open("tmp.sdf").read())
RDKit
1 0 0 0 0 0 0 0 0 0999 V2000
0.0000 0.0000 0.0000 C 0 0 0 0 0 0 0 0 0 0 0 0
M END
> <__computedProps> (1)
Never
$$$$
>>> for mol in Chem.ForwardSDMolSupplier("tmp.sdf"):
...     print(Chem.MolToSmiles(mol))
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
RuntimeError: boost::bad_any_cast: failed conversion using boost::any_cast
This means that users of an RDKit-backed web service can easily cause the
back-end to raise an exception when it tries to do chemistry. RDKit is not
designed with malicious users in mind.
When I come across these things now in RDKit, I follow the "Doctor, doctor, it
hurts when I hit myself in the head with a hammer" principle - "Well, don't do
that."
I think it's worthwhile to have tools which are designed for knowledgeable users
who know not to do certain things.
I also think it's worthwhile to have tools where people don't need that
knowledge, but these are harder to develop.
A surprisingly common refrain over the decades has been for the new generation
of users to complain how the tools were developed for the "priesthood" of the
experienced people in the previous generation. I empathize with that.
So long as we use SD files, all we can do is add extra sanity checks.
Once you figure out the new goal, then I can start filing new classes of bugs.
;)
Here's a wild thought. The tag line also allows:
- The field number DTn
- The compound’s external and internal registry numbers.
- Any combination of information
There's a hodgepodge of examples.
> <MELTING_POINT>
> 55 (MD-08974) <BOILING_POINT> DT12
> DT12 55
> (MD-0894) <BOILING_POINT> FROM ARCHIVES
All the tools I know about ignore terms which aren't in "<>"s, or "()"s.
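To make that behavior concrete, here is a minimal sketch of a tag-line parser that keeps only the "<>" and "()" terms and drops everything else. It is not how any particular toolkit does it; parse_data_header and both regular expressions are assumptions made up for this example.

```python
import re

# Hypothetical parser for an SD data header ("tag") line. It keeps the
# first "<...>" term and the first "(...)" term and ignores the rest,
# such as "DT12" or "FROM ARCHIVES".
_ANGLE = re.compile(r"<([^<>]*)>")
_PAREN = re.compile(r"\(([^()]*)\)")


def parse_data_header(line):
    """Return (field_name, paren_term); either may be None if absent."""
    if not line.startswith(">"):
        raise ValueError("not a data header line: %r" % (line,))
    rest = line[1:]  # skip the leading ">" that marks a data header
    angle = _ANGLE.search(rest)
    paren = _PAREN.search(rest)
    return (angle.group(1) if angle else None,
            paren.group(1) if paren else None)


if __name__ == "__main__":
    for line in ("> <MELTING_POINT>",
                 "> 55 (MD-08974) <BOILING_POINT> DT12",
                 "> DT12 55",
                 "> (MD-0894) <BOILING_POINT> FROM ARCHIVES"):
        print(parse_data_header(line))
```

With this sketch a tag line like "> <a>b> (1)" gives the field name "a", not "a>b", which is one way to see why ">" in a field name makes the file ambiguous.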
We could say that the presence of the word "BASE64" is the new convention that
the tag values are base64 encoded, folded across multiple lines, and "BASE64T"
means that the title is base64 encoded, without newline folding. "BASE64B" would
mean that both the title and the body were base64 encoded.
Then when you have non-conformant data, automatically encode it, and label the
field appropriately.
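A rough sketch of that "encode only when needed" idea, using just the Python standard library, might look like the following. Everything here is an assumption for illustration: the function names, the 64-character fold width, the UTF-8 encoding, and the particular list of things treated as non-conformant (a blank line, a line starting with "$$$$", or a NUL character). It does not check the field name itself, and it is not an existing RDKit feature.

```python
import base64


def needs_encoding(value):
    """Guess whether 'value' would break or confuse a plain SD data item."""
    if "\0" in value:
        return True
    for line in value.splitlines():
        # A blank line ends the data item; "$$$$" ends the record.
        if line.strip() == "" or line.startswith("$$$$"):
            return True
    return False


def format_data_item(name, value, width=64):
    """Format one data item, adding the proposed BASE64 marker if needed."""
    if not needs_encoding(value):
        return "> <%s>\n%s\n\n" % (name, value)
    encoded = base64.b64encode(value.encode("utf-8")).decode("ascii")
    # Fold the encoded text across multiple lines of at most 'width' chars.
    folded = "\n".join(encoded[i:i + width]
                       for i in range(0, len(encoded), width))
    return "> <%s> BASE64\n%s\n\n" % (name, folded)


def decode_folded(body):
    """Undo the folding and the base64 encoding of a BASE64 data item."""
    return base64.b64decode("".join(body.split())).decode("utf-8")


if __name__ == "__main__":
    print(format_data_item("abc", "xyz"), end="")
    print(format_data_item("abc", "x\n\nz"), end="")
```

A reader that sees "BASE64" on the tag line would join the value lines and pass them to something like decode_folded; readers that don't know the convention would still see a syntactically ordinary data item.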
Cheers,
Andrew
[email protected]