On Apr 30, 2015, at 6:08 AM, Greg Landrum wrote:
> I still need to put some thought into patching the SDWriter so that it can
> recognize things like consecutive line endings in property values. The big
> question is what it should do when it encounters such a case. Is that an
> error? Should it just write the output up to the blank line?

I think it's a question of your goal. Who is the target audience? How much base knowledge should they have? Then use that to guide which checks are worthwhile and which aren't.

I interpret the current RDKit design as being meant for people who understand the limits of the underlying format, and who won't do things to break it; or if they break it, will be able to understand how to resolve the problem. (E.g., with base64 encoding, or going on this list to ask for help.)

It's also possible to have a goal of preventing people from using RDKit to create an invalid SD file. For example, here is another way to create a corrupt, or at least ambiguous, file:

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("C")
>>> mol.SetProp("abc", "x\0z")

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("C")
>>> mol.SetProp("a>b", "xyz")
>>> writer = Chem.SDWriter("tmp.sdf")
>>> writer.write(mol)
>>> writer.close()
>>> content = open("tmp.sdf").read()
>>> print(content)

     RDKit

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
>  <a>b>  (1)
xyz

$$$$

The spec says:

  Note: The > sign is a reserved character. A field name cannot
  contain hyphen (-), period (.), less than (<), greater than (>),
  equal sign (=), percent sign (%) or blank space ( ). Field names
  must begin with an alpha character and can contain alpha and
  numeric characters after that, including underscore.

while RDKit allows a hyphen and other characters in the tag.

This also causes a problem on input, because RDKit intermingles tag data with internal properties. Even though "__computedProps" is not a legal SD tag name, RDKit will read it, and give an error when it ends up trying to use that value as if it were real data:

>>> from rdkit import Chem
>>> print(open("tmp.sdf").read())

     RDKit

  1  0  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
M  END
>  <__computedProps>  (1)
Never

$$$$

>>> for mol in Chem.ForwardSDMolSupplier("tmp.sdf"):
...   print(Chem.MolToSmiles(mol))
...
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
RuntimeError: boost::bad_any_cast: failed conversion using boost::any_cast

This means that users of an RDKit-backed web service can easily cause the back-end to raise an exception when it tries to do chemistry. RDKit is not designed with malicious users in mind.

When I come across these things now in RDKit, I follow the "Doctor, doctor, it hurts when I hit myself in the head with a hammer" principle - "Well, don't do that."

I think it's worthwhile to have tools which are designed for knowledgeable users who know not to do certain things. I also think it's worthwhile to have tools where people don't need that knowledge, but these are harder to develop.

A surprisingly common refrain over the decades has been for the new generation of users to complain how the tools were developed for the "priesthood" of the experienced people in the previous generation. I empathize with that. So long as we use SD files, all we can do is add extra sanity checks.

Once you figure out the new goal, then I can start filing new classes of bugs. ;)

Here's a wild thought. The tag line also allows:

  - The field number DTn
  - The compound’s external and internal registry numbers.
  - Any combination of information

There's a hodgepodge of examples.

> <MELTING_POINT>
> 55 (MD-08974) <BOILING_POINT> DT12
> DT12 55
> (MD-0894) <BOILING_POINT> FROM ARCHIVES

All the tools I know about ignore terms which aren't in "<>"s, or "()"s. We could say that the presence of the word "BASE64" is the new convention that the tag values are base64 encoded, folded across multiple lines, and "BASE64T" means that the title is base64 encoded, without newline folding. "BASE64B" would mean that both the title and the body were base64 encoded.

Then when you have non-conformant data, automatically encode it, and label the field appropriately.

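A rough sketch of what that fallback could look like, in plain Python with no RDKit calls. The function names (is_spec_tag_name, is_plain_value, format_data_item) are made up for illustration, the checks for "non-conformant" values are an assumption and certainly not a complete list, and the "> <name>" header layout is simplified. Note that base64 output can itself contain "=", "+" and "/", which the field-name rule quoted earlier does not allow, so an encoded title is only guaranteed to be free of "<" and ">".

```python
import base64
import re

# Field-name rule as quoted from the spec: an alpha character first,
# then alpha, numeric, or underscore characters.
_TAG_NAME = re.compile(r"[A-Za-z][A-Za-z0-9_]*\Z")


def is_spec_tag_name(name):
    """True if the name follows the quoted field-name rule."""
    return _TAG_NAME.match(name) is not None


def is_plain_value(value):
    """True if the value looks safe to write as-is (illustrative checks only)."""
    if "\0" in value or "\r" in value:
        return False
    # A blank line would end the data item early, and a line starting
    # with "$$$$" would look like the end of the record.
    return all(line.strip() and not line.startswith("$$$$")
               for line in value.split("\n"))


def _b64(text):
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def format_data_item(name, value, width=76):
    """Return one data item as text, encoding whichever part needs it."""
    encode_title = not is_spec_tag_name(name)
    encode_body = not is_plain_value(value)
    if encode_title and encode_body:
        marker = " BASE64B"
    elif encode_title:
        marker = " BASE64T"
    elif encode_body:
        marker = " BASE64"
    else:
        marker = ""
    if encode_title:
        # Titles are encoded on one line, without folding.
        name = _b64(name)
    if encode_body:
        # Bodies are encoded and folded across multiple lines.
        encoded = _b64(value)
        value = "\n".join(encoded[i:i + width]
                          for i in range(0, len(encoded), width))
    return "> <%s>%s\n%s\n\n" % (name, marker, value)
```

With this, format_data_item("a>b", "xyz") produces a header line ending in BASE64T, and a value containing a blank line is written under a BASE64 header. A reader that knows the convention can decode either one; hooking this into the actual SDWriter is not shown here.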
Cheers,

				Andrew
				da...@dalkescientific.com

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss