On Sat, Nov 27, 2010 at 9:47 AM, [email protected] <[email protected]> wrote:
> On Sat, Nov 27, 2010 at 6:49 AM, Greg Landrum <[email protected]> wrote:
>> At the moment there isn't a particularly satisfying way of doing an
>> equality search aside from adding a smiles column to the database and
>> just doing a straight equality search on that.
>
> Ok.
>
>> To that end it's probably useful to know that the smiles generated by
>> the cartridge when you convert a molecule to text is canonical.
>
> If I'm not getting fooled, it seems the structure is also stored in
> canonical format; e.g. if I store:
>
> 'COc(cc1)ccc1C#N'
>
> then I "select * from molecules;" I get back 'COc1ccc(C#N)cc1'

It's not quite that straightforward. The molecules are stored in a
blob column in the standard RDKit binary form (what you get when you
use the .ToBinary() method in Python). The cartridge provides a rule
that can cast from this type to a string; this is done by loading the
binary object and then generating canonical smiles from it.

> If this is correct I should be able to search with the "=" operator
> directly, provided I prepare the query smiles with Chem.CanonSmiles,
> isn't it?

You actually don't even have to do that; simply doing
'COc(cc1)ccc1C#N'::mol::text will give you the canonical smiles.

> That would avoid adding a specific smiles column.

Yeah, but it is, unfortunately, very expensive since the canonical
smiles will be generated for each database molecule at query time.
Here's an example I just ran querying the chembl example database
(214K rows):

chembl=# select * from mols where m<@'CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C' and
         m@>'CC(=O)c1ccc2c(c1)C(=O)C(=O)N2' and
         m::text='CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C'::mol::text;
 regno  |               m
--------+--------------------------------
 246028 | CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C
(1 row)

Time: 34.449 ms

chembl=# select * from mols where
         m::text='CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C'::mol::text;
 regno  |               m
--------+--------------------------------
 246028 | CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C
(1 row)

Time: 175219.805 ms

I think the argument against the second query is clear. :-)

>>
>> Without adding the smiles column, another option that should be
>> correct, though it's somewhat ugly, is:
>> select * from mols where m<@'CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C' and
>> m@>'CC(=O)c1ccc2c(c1)C(=O)C(=O)N2' and
>> m::text='CC(=O)c1ccc2c(c1)C(=O)C(=O)N2C'::mol::text;
>>
>> If the molecule column is indexed, this will use the index so it's
>> actually reasonably efficient. If you don't care about stereochemistry
>> you can leave the last bit (SMILES comparison) out.
>>
>
> Yeah, ugly but I just tried and it actually works.

Glad to hear it.
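
For comparison, here is a minimal sketch of the other option from the
start of the thread: a dedicated smiles column with a straight equality
search on it. It assumes the mols table and the m column from the
example above and the mol-to-text cast as described; the column name
"smiles" and the index name "mols_smiles_idx" are hypothetical, and
this has not been timed against the chembl database.

```sql
-- Hypothetical column name; stores the canonical smiles once per row
-- instead of regenerating it for every molecule at query time.
alter table mols add column smiles text;

-- Populate it using the cartridge's cast from mol to text.
update mols set smiles = m::text;

-- Hypothetical index name; an ordinary index on the text column lets
-- the equality search avoid a sequential scan.
create index mols_smiles_idx on mols (smiles);

-- Canonicalize the query smiles with the same cast, then compare strings.
select regno, m from mols
 where smiles = 'COc(cc1)ccc1C#N'::mol::text;
```

The column has to be kept in sync with m whenever rows are inserted or
updated, which is the price of not computing the smiles at query time.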

>> Having a less ugly way of doing equality querying would be useful;
>> that would be a good feature request.
>
> Ok, so where should I report it ? ;-)

Now that's an easy one:
http://sourceforge.net/tracker/?group_id=160139&atid=814653

-greg

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

