[Rdkit-discuss] mmpdb 3.1
Hi everyone, We have released mmpdb 3.1, which you can get from https://github.com/rdkit/mmpdb . mmpdb 3.0, released May 2023, merged three development tracks: - create and query 1-cut med chem transformations as described in Awale et al., The Playbooks of Medicinal Chemistry Design Moves, J. Chem. Inf. Model. 2021, 61, 2, 729–742 - support indexing large datasets on a distributed cluster - replace the hash-based fingerprint environment with a SMARTS/pseudo-SMILES description Version 3.1 adds support for the 2- and 3-cut med chem transformations described by Awale et al. There are also a few feature improvements, some performance tuning, and bug fixes. See the CHANGELOG for details. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] SDMolSupplier warning 2023.9.2
Hi Mandar, > On Dec 13, 2023, at 03:39, Mandar Kulkarni > wrote: > I could not figure out how Rdkit is guessing it as 2D structure, as there is > no such information in SDF. Line 2 of the SDF record looks something like: RDKit 2D This line has the format (quoting from the documentation): IIMMDDYYHHmmddSSssRR A2<--A8--><---A10-->A2I2<--F10.5-><---F12.5--><-I6-> ) where "User's first and last initials (l), program name (P), date/time (M/D/Y,H:m), dimensional codes (d) such as 2D or 3D, scaling factors (S, s), energy (E) if modeling program input, internal registry number (R)" In the example I gave, most of these fields are blank, except for the program name ("RDKit") and the dimension code "2D" in columns 21-22 (if I counted right). The dimension code indicates the structure is expected to be in 2D. > Is there any more information need to provide SDMolSupplier to make it > understand it's a 3D molecule? It is only a warning. RDKit interprets the molecule as 3D despite the warning. The file format documentation also says: "The “dimensional code” is maintained explicitly. Thus “3D” really means 3D, although “2D” will be interpreted as 3D if any non-zero Z-coordinates are found. Best regards, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] mol properties in SDWriter
On Sep 26, 2023, at 01:17, Ling Chan wrote: > >(1) > 4.099 .. > Just wonder what was the rationale behind this extra "(1)" on the property > field lines (pKa and logP in the above example)? > > And is there a way to get rid of these? I am not sure if this extra "(1)" is > part of the standard sd format. RDKit uses the increasing value as a sort of per-file registry number. This is follows the part of the standard which says "External registry numbers must be enclosed in parentheses." The relevant code is in Code/GraphMol/FileParsers/SDWriter.cpp : if (d_molid >= 0) { (*dp_ostream) << "(" << d_molid + 1 << ") "; } There is no way to suppress this output. No only is there no direct way to change the d_molid, but d_molid cannot be negative as Code/GraphMol/FileParsers/MolWriters.h declares it as: unsigned int d_molid; // the number of the molecules we wrote so far Wim suggested a post-processing approach. Another is to write the SD data items yourself, that is, use MolToMolBlock() to generate the connection table/molfile as a string, then iterate through the properties and generate the data items. import sys from rdkit import Chem def MolToSDFRecord( mol, includeStereo: bool = True, confId: int = -1, kekulize: bool = True, forceV3000: bool = False): mol_block = Chem.MolToMolBlock(mol, includeStereo, confId, kekulize, forceV3000) lines = [] for prop_name in mol.GetPropNames(): if "\n" in prop_name or ">" in prop_name or "<" in prop_name: sys.stderr.write(f"WARNING: Skipping property {prop_name!r} because the " "name includes an unsupported character.\n") continue prop_value = mol.GetProp(prop_name) if "\n" in prop_value: if "\n\n" in prop_value or "\r\n\r\n" in prop_value: sys.stderr.write(f"WARNING: Skipping property {prop_name!r} because the " "value includes an embedded newline.\n") continue if prop_value.endswith("\r\n"): prop_value = prop_value[:-2] elif prop_value.endswith("\n"): prop_value = prop_value[:-1] lines.append(f"> <{prop_name}>\n{prop_value}\n\n") lines.append("\n") return mol_block + "".join(lines) mol = Chem.MolFromSmiles("CCO") mol.SetProp("pKa","3.3\r\n") print(MolToSDFRecord(mol)) Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] question on complexity of cannonization
On Jun 16, 2023, at 03:15, S Joshua Swamidass wrote: > In graph theory, a planar graph is a graph that can be embedded in the plane, > i.e., it can be drawn on the plane in such a way that its edges intersect > only at their endpoints. In other words, it can be drawn in such a way that > no edges cross each other. Years ago at http://www.dalkescientific.com/writings/diary/archive/2012/05/18/nonplanar_compounds.html I did a search of a subset of 28.5 million PubChem structures and found 224 topologically non-planar examples, like https://pubchem.ncbi.nlm.nih.gov/compound/50919058 . I also gave some literature citations, like "Synthesis of the first topologically non-planar molecule" (1981) at https://www.sciencedirect.com/science/article/abs/pii/0040403981800779 . > On Jun 15, 2023, at 22:50, S Joshua Swamidass wrote: > > Have any other libraries adopted your approach? It's clever. It isn't my approach. It depends on which part you consider clever. I've been told some of the ideas can be traced back to Bernhard Rohde's PhD thesis, citation 20 in the paper ("who employed a stable numbering for equivalence classes instead of a sequential index"). Rohde is also thanked in the acknowledgements. And Rohde has/had his own in-house library at Novartis which included canonicalization. I wouldn't be surprised if NextMove has their own implementation, given how Roger Sayle is a co-author of the paper. > the covalent bonds of proteins of arbitrary size are a planar graph too, even > though most (all?) proteins have a 3D structure. FWIW, due to disulfide bonds in cystines, proteins can be topologically non-planar. https://academic.oup.com/nar/article/47/D1/D367/5223942?login=false give PDB entry 1AOC as an example, with their database entry at https://knotprot.cent.uw.edu.pl/view/1aoc/A/ . Best regards, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] question on complexity of cannonization
On Jun 15, 2023, at 20:49, S Joshua Swamidass wrote: > > And what (generally speaking) is the algorithm used by rdkit? Do we know it's > complexity? https://pubs.acs.org/doi/abs/10.1021/acs.jcim.5b00543 "Get Your Atoms in Order—An Open-Source Implementation of a Novel and Robust Molecular Canonicalization Algorithm" Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] question on complexity of cannonization
On Jun 15, 2023, at 18:20, S Joshua Swamidass wrote: > It's well known that the graph-isomorphism problem is NP While P is contained in NP, I don't think that's the NP you mean. I suspect you may be thinking of subgraph isomorphism, which is NP-hard. Graph isomorphism may be quasi-polynomial time, if Babai's (unpublished) claim is correct. Also, "Isomorphism of graphs of bounded valence can be tested in polynomial time" - https://www.sciencedirect.com/science/article/pii/002282900095 . > So here is my question. What are the cases that are very difficult to > canonize a graph? As I recall, handling chirality and other non-local properties is difficult. I have not worked on this problem. Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] ANN: chemfp 4.1
Hi everyone, I've just released chemfp 4.1. To install the pre-compiled package for Linux-based OSes do: python -m pip install chemfp -i https://chemp.com/packages/ For a detailed description of what's new, see: https://chemfp.readthedocs.io/en/latest/whats_new_in_41.html As a summary, the new features in this release include: - Supports RDKit 2023.03.1 and Python 3.8 through 3.11 - Interprets input SMILES as CXSMILES by default, with an option to turn that off - Can save/load similarity search results to a NumPy file in a form compatible with SciPy compressed sparse matrices - Implements Butina clustering, with several variations. While building the similarity matrix may take an hour, the result can be saved to an npz file that the Butina implementation can use as input. This can be useful when tuning the Butina parameters because the NxN matrix can be constructed once, at the lowest reasonable threshold, while the Butina clustering can use a higher threshold. It takes only a few seconds to cluster ChEMBL at a threshold of 0.6. - Sphere exclusion ("spherex") has been parallelized, with new options for specifying directed sphere exclusion ranking and a new output format compatible with the Butina output - The new "chemfp csv2fps" tool for generating fingerprints from CSV files containing identifiers and molecules. - The new "chemfp translate" tool for structure file format conversion. These are available for no cost under the Chemfp Base License Agreement at https://chemfp.com/BaseLicense.txt . For other licensing options, including no-cost license key for academic use, see https://chemfp.com/license/ . Best regards, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Can a bond index be associated with order in explicit SMILES?
On May 17, 2023, at 02:31, Vincent Scalfani wrote: > I thought that this might also be the case for bond indices, but that does > not appear to be correct (see example below). Is it possible to get a bond > index in the order of the SMILES? This may help you understand why that's a difficult question. What does the bond index mean in something like C12.OC23.C3.C1 ? Does the bond for closure 1 come first in the bond list, because that's where it start, or is it last, because that's where it ends? It looks like you think it should be the closure position. Here's your SMILES labelled by atom index: ┌1 1 1 1 1 1 1 11 1 atoms│ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 78 9 └ | | | | | | | | | | | | | | | | | || | SMILES[ C-C-c1:c:c:[nH]:c:1-C-C-C1-C-C-C(-c2:c:c:[nH]:c:2)-C-C-1 I used the program at the end of this email to print the information in bond list order: In bondlist order i Bnd# a1 ~ a2 frag 0 0 0 - 1 C-C 1 1 1 - 2 C-c 2 2 2 : 3 c:c 3 3 3 : 4 c:c 4 4 4 : 5 c:[nH] 5 5 5 : 6 [nH]:c 6 6 6 - 7 c-C 7 7 7 - 8 C-C 8 8 8 - 9 C-C 9 9 9 - 10 C-C 10 10 10 - 11 C-C 11 11 11 - 12 C-C 12 12 12 - 13 C-c 13 13 13 : 14 c:c 14 14 14 : 15 c:c 15 15 15 : 16 c:[nH] 16 16 16 : 17 [nH]:c 17 17 12 - 18 C-C 18 18 18 - 19 C-C 19 19 6 : 2 c:c 20 20 19 - 9 C-C 21 21 17 : 13 c:c If you step through them you'll see that the closure atoms (2-6, 9-19, and 13-17) are added to the bond list at the end, after processing the atoms which make up the spanning tree. It appears the closure bond have the begin and end atom indices with the largest first, which makes it possible to tell that a given bond is a closure bond. In principle then it should be possible to reorder the bonds to get the order you want. This proved trickier than I could manage in the time I have. Perhaps the better question is, why do you need the bond indices in a specific order? Cheers, Andrew da...@dalkescientific.com from rdkit import Chem bond_symbols = { Chem.BondType.SINGLE: "-", Chem.BondType.DOUBLE: "=", Chem.BondType.TRIPLE: "#", Chem.BondType.AROMATIC: ":", } smi = "CCc1cc[nH]c1CCC1CCC(CC1)c1cc[nH]c1" #smi = "[C@@](F)(Cl)(Br)O" mol1 = Chem.MolFromSmiles(smi) smi_explicit = Chem.MolToSmiles(mol1, allBondsExplicit=True) mol2 = Chem.MolFromSmiles(smi_explicit) def show(bonds): print(" i Bnd# a1 ~ a2 frag") for i, b in enumerate(bonds): a1, a2 = b.GetBeginAtomIdx(), b.GetEndAtomIdx() symbol = bond_symbols[b.GetBondType()] s = Chem.MolFragmentToSmiles(mol2, atomsToUse=[a1, a2], rootedAtAtom=a1, allBondsExplicit=True) print(f"{i:2d} {b.GetIdx():2d} {a1:2d} {symbol} {a2:2d} {s.center(8)}") print(smi_explicit) print("In bondlist order") show(mol2.GetBonds()) ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] how to get indexes and atoms with H from smiles
On May 9, 2023, at 07:55, Haijun Feng wrote: > Can anyone help me figure out how to get each atom with H from the smiles as > above. Thanks so much! Try using Chem.MolFragmentToSmiles to get the SMILES for each atom, with all hydrogens explicit, then strip off the leading and trailing []s. from rdkit import Chem mol=Chem.MolFromSmiles('c1c(C(N)=O)1') for i, atom in enumerate(mol.GetAtoms()): atom_smi = Chem.MolFragmentToSmiles(mol, allHsExplicit=True, atomsToUse=[atom.GetIdx()]) print(i, atom_smi.strip("[]")) This prints 0 cH 1 cH 2 cH 3 cH 4 cH 5 c 6 C 7 NH2 8 O Your code showed you using atom.SetProp('molAtomMapNumber',str(i)) In the following, I'll set that property *after* getting the atom SMILES, so the map is not included as part of the output: from rdkit import Chem mol=Chem.MolFromSmiles('c1c(C(N)=O)1') for i, atom in enumerate(mol.GetAtoms()): atom_smi = Chem.MolFragmentToSmiles(mol, allHsExplicit=True, atomsToUse=[atom.GetIdx()]) print(i, atom_smi.strip("[]")) atom.SetIntProp("molAtomMapNumber", i) print(Chem.MolToSmiles(mol)) which gives the output 0 cH 1 cH 2 cH 3 cH 4 cH 5 c 6 C 7 NH2 8 O [cH:0]1[cH:1][cH:2][cH:3][cH:4][c:5]1[C:6]([NH2:7])=[O:8] > the output is: [cH:0]1[cH:1][cH:2][cH:3][cH:4][c:5]1C:6=[O:8] For what it's worth, I get the slightly different: [cH:0]1[cH:1][cH:2][cH:3][cH:4][c:5]1[C:6]([NH2:7])=[O:8] You should be aware that the input order and the output SMILES order might be different. Because of the simpler structure of your preferred output SMILES format, you can alternatively extract the atom terms from the output string by looking for the substrings inside of the []s, as in the following: import re >>> re.compile(r'\[[^]]+\]').findall("[cH:0]1[cH:1][cH:2][cH:3][cH:4][c:5]1[C:6]([NH2:7])=[O:8]") ['[cH:0]', '[cH:1]', '[cH:2]', '[cH:3]', '[cH:4]', '[c:5]', '[C:6]', '[NH2:7]', '[O:8]'] This list will exactly match the output SMILES atom order. Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Permutation of multiple enumeration
Hi Carsten, How are the fragments expressed? With attachment points marked with "[*:1]", "[*:2]" and "[*:3]" atoms? One technique is to rewrite the SMILES to use closures. (See https://onlinelibrary.wiley.com/doi/10.1002/qsar.200310008 or http://www.dalkescientific.com/writings/diary/archive/2005/05/07/attachment_points.html ). For example, if your core SMILES are: [*:1]c1ncc([*:2])cn1 CC([*:2])O[*:1] and your R1 contains *F Cl* Br* and your R2 contains *CCO CO* then you could rewrite these to use "%91" to connect the [*:1] with the R1 "*" and use "%92" to connect the [*:2] with the R2 "*", using dot-disconnected terms. For example: [*:1]c1ncc([*:2])cn1 + *F + *CCO can be rewritten as c%911ncc%92cn1.F%91.C%92CO which is parsed and canonicalized to: OCCc1cnc(F)nc1 Rewriting the SMILES this way is a bit tricky. I've attached a program which does it for you. Running it on the above gives: % cat core.smi [*:1]c1ncc([*:2])cn1 CC([*:2])N[*:1] % cat r1.smi *F Cl* Br* % cat r2.smi *CCO CO* % python enumerate.py --R1 r1.smi --R2 r2.smi core.smi c1%91ncc%92cn1.F%91.C%92CO -> OCCc1cnc(F)nc1 c1%91ncc%92cn1.F%91.CO%92 -> COc1cnc(F)nc1 c1%91ncc%92cn1.Cl%91.C%92CO -> OCCc1cnc(Cl)nc1 c1%91ncc%92cn1.Cl%91.CO%92 -> COc1cnc(Cl)nc1 c1%91ncc%92cn1.Br%91.C%92CO -> OCCc1cnc(Br)nc1 c1%91ncc%92cn1.Br%91.CO%92 -> COc1cnc(Br)nc1 CC(O%91)%92.F%91.C%92CO -> CC(CCO)OF CC(O%91)%92.F%91.CO%92 -> COC(C)OF CC(O%91)%92.Cl%91.C%92CO -> CC(CCO)OCl CC(O%91)%92.Cl%91.CO%92 -> COC(C)OCl CC(O%91)%92.Br%91.C%92CO -> CC(CCO)OBr CC(O%91)%92.Br%91.CO%92 -> COC(C)OBr It also supports --R3 if your core has 3 R-groups, with the third core point labeled [*:3]. Best regards Andrew da...@dalkescientific.com """Enumerate a library with a core and up to 3 sets of R-groups The core and R-group files contain one SMILES per line. The core SMILES must use labeled "*" wildcards, like: [*:1]c1ncc([*:2])cn1 where [*:1] is the attachment point for R1, [*:2] is the attachment point for R3, and [*:3] is the attachment point for R3. The R-group SMILES must have a single unlabled "*" wildcard, like: CO* -or- C(*)CO The program is used like this: python enumerate.py --R1 r1.smi --R2 r2.smi --R3 r3.smi core.smi """ # Written by Andrew Dalke # 6 July 2022 import argparse from rdkit import Chem import itertools # Generate a SMILES string such that the wildcard is NOT the first atom. # This makes it easier to do an in-place string substitution of the # wildcard term with a ring closure. # # The approach finds a non-wildcard atom and uses that as the root # to generate the SMILES. # # Also verify the number of wildcard atoms is correct, and that # they have one (and only one) single bond, no isotope, and no charge. # # Also verify that if it's supposed to be labeled (as for the core # structures) then it uses [*:1], [*:2], or [*:3], and if it's # not labeled (as for the R-groups) then it does not use labels def get_reordered_smiles(mol, num_wildcards_required, is_labeled): num_wildcards = 0 non_wildcard_atom = None for atom in mol.GetAtoms(): # Do some validation if atom.GetAtomicNum() == 0: bonds = atom.GetBonds() if len(bonds) != 1: raise ValueError("'*' atom must have only one bond") if bonds[0].GetBondType() != Chem.BondType.SINGLE: raise ValueError("'*' atom must have only one single bond") if atom.GetFormalCharge() != 0: raise ValueError("'*' atom must be uncharged") if atom.GetIsotope() != 0: raise ValueError("'*' atom must have no isotope") num_wildcards += 1 elif non_wildcard_atom is None: # found a non-wildcard non_wildcard_atom = atom if num_wildcards != num_wildcards_required: raise ValueError( f"expecting {num_wildcards_required} '*' atoms, found {num_wildcards}") if non_wildcard_atom is None: raise ValueError( "R-group must have at least one non-'*' atom") # The generated SMILES does not begin with a wildcard. smiles = Chem.MolToSmiles(mol, rootedAtAtom=non_wildcard_atom.GetIdx()) # Double check some assertion if is_labeled: # Ensure enough labeled wildcards are present for substr in ("[*:1]", "[*:2]", "[*:3]")[:num_wildcards_required]: if substr not in smiles: raise ValueError(f"Expecting {substr}") else: # Ensure it isn't using labeled wildcards if "[*:" in smiles: raise ValueError(f"Must not be labeled wildcards")
[Rdkit-discuss] chemfp 4.0
Hi all, I've recently released chemfp 4.0, with support for several diversity selection algorithms, and an improved API for interactive use in a notebook environment. Chemfp is an analytics package for cheminformatics fingerprints. It contains command-line tools and an extensive Python library for fingerprint generation, high-performance similarity search, diversity selection, and exploratory research. The new diversity selection algorithms are MaxMin, sphere exclusion (both random and directed), and HeapSweep. Of special note, The MaxMin algorithm has improved support for selecting diverse fingerprints from a set of candidates (eg, a vendor catalog) which must also be diverse from a set of references (eg, a corporate collection), which is over an order of magnitude faster than the standard MaxMin algorithm. People who live in the Jupyter notebook will likely enjoy the new chemfp user experience. Most long-term actions support progress bars, chemfp's Python objects have more informative repr()s, search results added Pandas integration, and there are new high-level APIs that let you express a lot of functionality compactly. See https://chemfp.readthedocs.io/en/latest/whatsnew.html for more details. To install the pre-compiled package for Linux at no cost, use: python -m pip install chemfp -i https://chemfp.com/packages/ The Base License covers most in-house use of chemfp, though a few features are either limited or disabled and require a license key to unlock. For alternative licenses, including source code and no-cost academic licensing, see https://chemfp.com/license/ -- or try one of the re-formatted ChEMBL datasets at https://chemfp.com/datasets/ which include an embedded authorization key. Chemfp is not a cheminformatics toolkit. Instead, it knows how to use RDKit (or Open Babel, OpenEye's OEChem/OEGraphSim, or the CDK) for molecule I/O and fingerprint generation. For more information see the chemfp home page at https://chemfp.com/ or contact me directly. Best regards, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] how to report SDF records for which Chem.ForwardSDMolSupplier returns None?
On Apr 14, 2022, at 12:57, Ivan Tubert-Brohman wrote: > How about splitting the file on lines consisting of "", and then parsing > each record? If the parsing fails, you can write out the bad record for > future inspection. (This addresses the basic use case, but not the "even > better" one.) Yes, if you know your data is "clean", then you can do that. I wrote an essay at http://www.dalkescientific.com/writings/diary/archive/2020/09/18/handling_the_sdf_record_delimiter.html about some of the ways that can cause problems. They do occur in real-world data sets. And they do cause problems in some processing pipelines. Public data sets like PubChem, ChEMBL, etc. don't have these problems. They are mostly in in-house data sets. Though it's not common to have a problem. > def read_record(fh): > lines = [] > for line in fh: > lines.append(line) > if line.rstrip() == '': > return ''.join(lines) See also https://baoilleach.blogspot.com/2020/05/python-patterns-for-processing-large.html . The reasons I think there should be a low-level library for this sort of work are: 1) the edge cases are tricky to handle, 2) the simple readers like this are slow 3) I believe good error reporting needs things like the starting line number and/or starting byte position for the record. Implementing that is a bit tricky (and boring), and tracking that information in a compiled extension has a much lower overhead than doing it in Python. Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] how to report SDF records for which Chem.ForwardSDMolSupplier returns None?
On Apr 14, 2022, at 09:16, Gyro Funch wrote: > I don't know the sdf format well, so please excuse my ignorance, but instead > of a custom parser, would it be possible to write a preprocessor to eliminate > the offending information? Perhaps something using regular expressions in > python, perl, sed, or awk? The SDF format is too complicated to be parsed with a regular expression[1], and the failure modes often cannot be detected at the syntax level[2]. I suggest people may consider using chemfp for this [3]. [1] For example, in a V2000-formatted record, the number of atom records and the number of bond records are given by a repeat count. A traditional/formal regular expression does not support counts where the count from the pattern matching. Most regular expression engines have more powerful capabilities than formal regular expression, such as matches to back-reference captured groups. However, few support using a backreference as a repeat count. I wrote one that did, which would let you specify (?P...)(?P...) and so on) (?P(?P.{10})(?P.{10}) and so on){atom_count} (?P(?P.{3})(?P.{3}) and so on){bond_count} but in practice, defining the grammar through a regular expression grammar was decidedly not easy! I've wanted to experiment with using WUFFS to make a low-level SDF parser library, see https://github.com/google/wuffs [2] For example, RDKit by default rejects atoms where the valence is too high. Detecting this in filter code calls for reverse-engineering what RDKit already does. [3] Chemfp is best known as a fingerprint generation and search program. However, there are a few use cases where I wanted to have access to the input record (eg, to detect toolkit failures, or to add fingerprint data to the input record rather than round-tripping the SDF through a toolkit.) I did this by writing my own SDF record reader (in the "text_toolkit"), and writing a wrapper to the RDKit toolkit (in the "rdkit_toolkit"), and using a error handler which can decide how to handle errors (ignore, report, raise an exception, log, etc.). That error handler has access to location information, which includes the record number, the record text, the line number of the start of the record, and more. Here's what it looks like for Giovanni's use case: from chemfp import rdkit_toolkit as T from chemfp import text_toolkit filename = "/Users/dalke/databases/ChEBI_complete_3star.sdf.gz" class ErrorHandler: def __init__(self): self.error_ids = [] def error(self, msg, location, extra=None): record = location.record chebi_id = text_toolkit.get_sdf_tag(location.record, "ChEBI ID") print(f"!!! Error reading record {location.recno} with ID: {chebi_id!r}") print(f"at {location.where()}") self.error_ids.append(chebi_id) errors = ErrorHandler() count = 0 num_atoms = 0 for mol in T.read_molecules(filename, errors=errors): count += 1 num_atoms += mol.GetNumAtoms() # This is a RDMol. print(f"Parsed {count} records ({num_atoms} atoms), skipped {len(errors.error_ids)}.") This functionality is available in the pre-compiled version of chemfp for Linux-base OSes, available from https://chemfp.com/download/ . The default license agreement (that is, you can use it without a license key) lets you use it for any internal purpose. If anyone is interested in working on a stand-alone SDF parsing library under a free software license, I can provide some pointers and feedback, and will contribute chemfp's SDF parser under the MIT license. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] mmpdb 3.0b1
Hi all, The combination of crowd-funding and contract work for me, and methods + software development by Mahendra Awale, has resulted in a new version of mmpdb. More specifically, version 3.0 beta 1 is available on GitHub at: https://github.com/adalke/mmpdb/tree/v3-dev The CHANGELOG summary is at the bottom of the email. For many people the biggest improvement is probably the support for large data sets, and the switch to a more human-understandable SMARTS/"pseudo-SMILES" for the environment fingerprints. The documentation is available through the program, starting with 'mmpdb --help'. Try it out, kick the tires, and let me know what fell off! Cheers, Andrew da...@dalkescientific.com A large number of changes to merge three different development tracks and add new features. The "fragments" file format has been replaced with a SQLite-based "fragdb" file format. This makes it much easier to develop tools to work on fragment data sets instead of processing a JSON-Lines file. New functionality to create an MMP data set in a distributed compute environment. Some of the features are: - split a SMILES file into a set of smaller SMILES files - the default "fragment" file output is now based on the input name - fragment files can be re-partitioned by constant fragments: - the "fragdb_constants" file generates fragment information - the "fragdb_partition" create re-partitioned fragdb files - the default "index" file output is now based on the input name - there are tools to merge fragdb and mmpdb files into one As a result, mmpdb can now handle significantly larger data sets. Added support for Postgres for direct index database creation. (The new distributed compute tools require SQLite.) Added a new "generate" command to apply 1-cut transforms to a structure, using MMP rules as a playbook. Replaced the SHA256-based Morgan fingerprint signature with a canonical SMARTS representing the Morgan fingerprint environment. This is difficult to understand or depict, so also include a "pseudo" SMILES that can be parsed by RDKit (if sanitize is disabled) and drawn. The new environment fingerprint also include the SMARTS of its parent, that is, the SMARTS with a smaller radius. Switched to 'click' for command-line parsing, removed the vendered version of the peewee ORM, and switched to a modern "pyproject.toml" project configuration with a setup.cfg which declares its dependencies. ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] generating smiles using RDKit
Hi Gyro, > On Dec 8, 2021, at 11:02, Gyro Funch wrote: > > My work is in the area of toxicology and I am interested in generating SMILES > for molecules referred to as 'short chain chlorinated paraffins' (SCCP). > > A general definition that is sometimes used is that an SCCP is given by the > molecular formula > > C_{x} H_{2x-y+2} Cl_{y} > > where > > x = 10-13 > y = 3-12 > > and the average chlorine content ranges from 40-70% by mass. > > - > > Can anyone provide guidance on how to generate the list of SMILES > corresponding to the above rules? Here's an alternate approach to the ones presented so far. https://gist.github.com/adalke/e62df8774032560fef750fa9c88b6516 Like Wim's version, it also generates the SMILES as the syntax level, though by default it use RDKit to generate canonical SMILES output. (use --no-canonical to disable the canonicalization step, which is also faster.) Here it is with 4 carbons and 3 chlorines. % python sccp_smiles.py --C 4 --Cl 3 Content range not specified. Using --min-content 0.4 and --max-content 0.7. (Cl)(Cl)Cl CCC(Cl)C(Cl)Cl CCC(Cl)(Cl)CCl CC(Cl)CC(Cl)Cl CC(Cl)C(Cl)CCl CC(Cl)C(C)(Cl)Cl CC(Cl)(Cl)CCCl Cl(Cl)Cl ClCCC(Cl)CCl The "--C" and "--Cl" are aliases for "--min-C" and "--min-Cl"; if the maximums are not specified then the maximum is set to the minimum. Here's a range using all the bells and whistles: % time python sccp_smiles.py --min-C 10 --max-C 13 --min-Cl 3 --max-Cl 12 --max-Cl-per-atom 2 --min-content 0.4 --max-content 0.7 --no-canonicalize > example.smi 2.030u 0.156s 0:02.44 89.3% 0+0k 0+0io 0pf+0w % wc -l example.smi 440334 example.smi Wim reported 437001 for the same configuration. I haven't figured out if the difference is due simply to differences in the molecular weight values. I couldn't canonicalize and pin down the differences in part because Wim's output generates SMILES strings that RDKit cannot parse: % grep '^[(]' CSSP.smi | head -4 (Cl)C(Cl)(Cl)C(Cl)(Cl) (Cl)C(Cl)(Cl)(Cl)C(Cl)(Cl) (Cl)C(Cl)(Cl)(Cl)(Cl)C(Cl)(Cl) (Cl)C(Cl)(Cl)CCC(Cl)CC(Cl)(Cl) >>> from rdkit import Chem >>> Chem.CanonSmiles("(Cl)C(Cl)(Cl)CCC(Cl)CC(Cl)(Cl)") [14:31:12] SMILES Parse Error: syntax error while parsing: (Cl)C(Cl)(Cl)CCC(Cl)CC(Cl)(Cl) My code isn't well tested, but perhaps enough to get you on the way. Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Reading text records from SDF from gzipped files
Hi Tim, You might also consider using chemfp, which has this sort of functionality available through its toolkit wrapper API: from chemfp import rdkit_toolkit as T import itertools with T.read_ids_and_molecules("chembl_28.sdf.gz") as reader: loc = reader.location for id, mol in itertools.islice(reader, 5): print(f"Record: {loc.recno} ({id}) line: {loc.lineno} offsets: {loc.offsets}") counts_line = loc.record.splitlines()[3] num_atoms, num_bonds = int(counts_line[:3]), int(counts_line[3:6]) print(f" counts line #atoms: {num_atoms} #bonds: {num_bonds}") print(f"RDKit #atoms: {mol.GetNumAtoms()} #bonds: {mol.GetNumBonds()}") The output in this case is: Record: 1 (CHEMBL153534) line: 1 offsets: (0, 1458) counts line #atoms: 16 #bonds: 17 RDKit #atoms: 16 #bonds: 17 Record: 2 (CHEMBL440060) line: 43 offsets: (1458, 18699) counts line #atoms: 206 #bonds: 208 RDKit #atoms: 202 #bonds: 204 Record: 3 (CHEMBL440245) line: 466 offsets: (18699, 39688) counts line #atoms: 251 #bonds: 254 RDKit #atoms: 251 #bonds: 254 Record: 4 (CHEMBL440249) line: 980 offsets: (39688, 56050) counts line #atoms: 194 #bonds: 205 RDKit #atoms: 185 #bonds: 196 Record: 5 (CHEMBL405398) line: 1388 offsets: (56050, 58447) counts line #atoms: 27 #bonds: 30 RDKit #atoms: 27 #bonds: 30 You can also work more directly to the record tokenization level, and pass each record to the rdkit_toolkit wrapper: from chemfp import text_toolkit with text_toolkit.read_sdf_records("chembl_28.sdf.gz") as reader: for rec in itertools.islice(reader, 5): mol = T.parse_molecule(rec, "sdf") print(mol.GetProp("chembl_id"), "has", len(rec), "bytes") which prints CHEMBL153534 has 1458 bytes CHEMBL440060 has 17241 bytes CHEMBL440245 has 20989 bytes CHEMBL440249 has 16362 bytes CHEMBL405398 has 2397 bytes Andrew da...@dalkescientific.com > On Nov 4, 2021, at 17:55, Tim Dudgeon wrote: > > Thanks Paolo, that's fantastic. > The first option was what I needed. > Tim > > On Thu, Nov 4, 2021 at 4:36 PM Paolo Tosco wrote: > Hi Tim, > > if you need access to the original text, you'll have to do the chunking > yourself, e.g.: > > import gzip > > def molgen(hnd): > mol_text_tmp = "" > while 1: > line = hnd.readline() > if not line: > return > line = line.decode("utf-8") > mol_text_tmp += line > if line.startswith(""): > mol_text = mol_text_tmp > mol_text_tmp = "" > yield mol_text ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] MolToSmiles atom ordering
Hi Ling, If there are symmetries then a substructure search like will only give you one mapping, and that might not be the canonical mapping. What you're looking for is the special property _smilesAtomOutputOrder >>> from rdkit import Chem >>> mol = Chem.MolFromSmiles("O=C(NCc1cc(OC)c(O)cc1)/C=C/C(C)C") >>> Chem.MolToSmiles(mol) 'COc1cc(CNC(=O)/C=C/C(C)C)ccc1O' >>> mol.GetProp("_smilesAtomOutputOrder") '[8,7,6,5,4,3,2,1,0,13,14,15,16,17,18,19,20,21,12,11,9,10,]' Here are the atom indices of the original SMILES: ┌ 1 11 1 1 1 2 2 atoms│ 0 1 234 56 78 9 0 12 3456 7 8 9 0 1 └ | | ||| || || | | || | | | | | SMILES[ O=C(NCc1cc(OC)c(O)cc1)/C=C/C(C)C You can see the first atom of the output is a "C", which is mapped to position 8 in the _smilesAtomOutputOrder, which is the "...C)..." in the original SMILES, etc. Cheers, Andrew da...@dalkescientific.com > On Nov 3, 2021, at 00:18, Ling Chan wrote: > > O.K. Problem solved. Sorry about the spam, folks. > > I can use GetSubstructMatch, as follows. > > # sinput is the input smiles > # scanon is the output smiles > > minput = Chem.MolFromSmiles(sinput) > scanon=Chem.MolToSmiles(minput) > mcanon=Chem.MolFromSmiles(scanon) > map_forward = minput.GetSubstructMatch(mcanon) > map_backward = mcanon.GetSubstructMatch(minput) > > > > > Ling Chan 於 2021年11月2日週二 下午3:55寫道: > Dear colleagues, > > Just wonder if I can obtain a mapping of the atom indices upon > canonicalization by MolToSmiles ? I am aware that canonicalization (and hence > atom reordering) can be suppressed in MolToSmiles, but I do want to > canonicalize the output smiles. > > If you are interested, here is a bit more details of my problem. For each > molecule, I want to delete one or two side chains, and obtain a smiles of > what is left. Just that I want to know what are the atoms that bonded to the > deleted side chains. I know, by suppressing canonicalization things will > work. But I would like to canonicalize the smiles so that I can know if there > are duplicates. > > I tried marking the atoms. But I believe that properties that got carried > over to the output smiles, e.g. Isotope, affect the canonicalization, while > properties that do not affect canonicalization, e.g, IntProp, are lost upon > the conversion to smiles. > > Thank you for your insight. > > Ling > ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] MolToSmiles
> On Oct 21, 2021, at 04:50, Ling Chan wrote: > > I got the attached sdf. When I did a MolToSmiles, it gives me the following. > > >>> for m in Chem.SDMolSupplier("pdb_structures/1q6k_ligand.sdf"): > ... print (Chem.MolToSmiles(m)) > ... > [CH3:0][C:0]([CH3:0])([CH3:0])[O:0][C:0](=[O:0])[NH:0][CH:0]([CH:0]=[O:0])[CH:0]1[CH2:0][CH2:0][CH2:0][CH2:0][CH2:0]1 > > Just wonder why does it not give something like > O=C(OC(C)(C)C)NC(C=O)C1C1 The terms after the atom symbol in your atom block lines are center-justified (or left-justified, in the 2-digit mass difference term 'dd') instead of right-justified. Here's a comparison of your first atom line, compared with the ctfile spec, and then compared with the round-trip through RDKit: 74.0060 -9.5770 134.8660 N 0 0 0 0 0 0 0 0 0 0 0 0<-- yours x.y.z. aaaddcccssshhhbbbvvvHHHrrriiimmmnnneee <-- spec 74.0060 -9.5770 134.8660 N 0 0 0 0 0 0 0 0 0 0 0 0 <-- RDKit Add a space after the atom symbol field ("aaa") and everything works. What happened? The ":0" in the SMILES string derives from the atom-atom mapping number, "mmm", in the SDF. The relevant code from Code/GraphMol/FileParsers/MolFileParser.cpp::ParseMolFileAtomLine() is: if (text.size() >= 63 && text.substr(60, 3) != " 0") { int atomMapNumber = 0; try { atomMapNumber = FileParserUtils::toInt(text.substr(60, 3), true); } catch (boost::bad_lexical_cast &) { std::ostringstream errout; errout << "Cannot convert '" << text.substr(60, 3) << "' to int on line " << line; delete res; throw FileParseException(errout.str()); } res->setProp(common_properties::molAtomMapNumber, atomMapNumber); } This says that if the field isn't exactly " 0" then parse it as an integer and store it in the atom's molAtomMapNumber. Since your " 0 " field isn't exactly " 0", it gets converted into the atom map value of 0. I don't see an explicit statement in the spec about alignment in fields. It's clear the spec comes from a Fortran background, so these should be interpreted as "I2" and "I3", and right-justified. By the way, if you pass your file through CDK you get: org.openscience.cdk.io.MDLV2000Reader ERROR: Error while parsing line 5: 74.0060 -9.5770 134.8660 N 0 0 0 0 0 0 0 0 0 0 0 0 -> invalid line length, 68:74.0060 -9.5770 134.8660 N 0 0 0 0 0 0 0 0 0 0 0 0 org.openscience.cdk.io.iterator.IteratingSDFReader ERROR: Error while reading next molecule: invalid line length, 68:74.0060 -9.5770 134.8660 N 0 0 0 0 0 0 0 0 0 0 0 0 CDK's storage/ctab/src/main/java/org/openscience/cdk/io/MDLV2000Reader.java::readAtomFast() requires that either all characters of a field be present, or the end of line. Your line is 68 characters long because your last field is " 0" instead of the " 0 " needed to match the exact charge flag "eee". Best regards, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Maximum Common Substructure using SMARTS
On Jul 23, 2021, at 06:42, Andrew Dalke wrote: > > No, there's no way to do that. > > The best I can suggest is to go back to the original Python implementation > and change the code leading up to Alternatively, since your template is small, you can brute-force enumerate all possible matching SMARTS patterns, and test them from largest to smallest. I believe the following patterns are correct for your template. These are ordered by number of bonds, then number of atoms, then ASCII-betically. (Note: these many contain duplicates because Chem.MolToSmarts doesn't produce canonical SMARTS.) [n,c,o]1(-S(-*)(=O)=O):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:1 [n,c,o](-S(-*)(=O)=O):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o] [n,c,o](-S(-*)(=O)=O)(:[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]):[n,c,o] [n,c,o](-S(-*)(=O)=O)(:[n,c,o]:[n,c,o]:[n,c,o]):[n,c,o]:[n,c,o] [n,c,o](-S(-*)(=O)=O)(:[n,c,o]:[n,c,o]):[n,c,o]:[n,c,o]:[n,c,o] [n,c,o](-S(-*)(=O)=O)(:[n,c,o]):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o] [n,c,o]1(-S(=O)=O):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:1 [n,c,o](-S(=O)=O):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o] [n,c,o](-S(=O)=O)(:[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]):[n,c,o] [n,c,o](-S(=O)=O)(:[n,c,o]:[n,c,o]:[n,c,o]):[n,c,o]:[n,c,o] [n,c,o](-S(=O)=O)(:[n,c,o]:[n,c,o]):[n,c,o]:[n,c,o]:[n,c,o] [n,c,o](-S(=O)=O)(:[n,c,o]):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o] [n,c,o](-S(-*)(=O)=O):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o] [n,c,o](-S(-*)(=O)=O)(:[n,c,o]:[n,c,o]:[n,c,o]):[n,c,o] [n,c,o](-S(-*)(=O)=O)(:[n,c,o]:[n,c,o]):[n,c,o]:[n,c,o] [n,c,o](-S(-*)(=O)=O)(:[n,c,o]):[n,c,o]:[n,c,o]:[n,c,o] [n,c,o](-S(=O)=O):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o] [n,c,o](-S(=O)=O)(:[n,c,o]:[n,c,o]:[n,c,o]):[n,c,o] [n,c,o](-S(=O)=O)(:[n,c,o]:[n,c,o]):[n,c,o]:[n,c,o] [n,c,o](-S(=O)=O)(:[n,c,o]):[n,c,o]:[n,c,o]:[n,c,o] [n,c,o](-S(-*)(=O)=O):[n,c,o]:[n,c,o]:[n,c,o] [n,c,o](-S(-*)(=O)=O)(:[n,c,o]:[n,c,o]):[n,c,o] [n,c,o](-S(-*)(=O)=O)(:[n,c,o]):[n,c,o]:[n,c,o] [n,c,o](-S(=O)=O):[n,c,o]:[n,c,o]:[n,c,o] [n,c,o](-S(=O)=O)(:[n,c,o]:[n,c,o]):[n,c,o] [n,c,o](-S(=O)=O)(:[n,c,o]):[n,c,o]:[n,c,o] [n,c,o](-S(-*)(=O)=O):[n,c,o]:[n,c,o] [n,c,o](-S(-*)(=O)=O)(:[n,c,o]):[n,c,o] [n,c,o](-S(=O)=O):[n,c,o]:[n,c,o] [n,c,o](-S(=O)=O)(:[n,c,o]):[n,c,o] [n,c,o](-S(-*)(=O)=O):[n,c,o] [n,c,o]-S(-*)(=O)=O [n,c,o](-S(=O)=O):[n,c,o] [n,c,o]-S(=O)=O S(-*)(=O)=O S(=O)=O I generated it with the following: === from rdkit import Chem import itertools # Must have the atoms marked with an atom map (the atom map value is ignored). template = '[n,c,o]1(-[S:1](-*)(=[O:1])=[O:1]):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:1' mol = Chem.MolFromSmarts(template) # Figure out which bonds to keep bond_atom_indices = [] for bond in mol.GetBonds(): if all(atom.HasProp("molAtomMapNumber") for atom in (bond.GetBeginAtom(), bond.GetEndAtom())): continue bond_atom_indices.append((bond.GetBeginAtomIdx(), bond.GetEndAtomIdx())) # Remove the atom maps for atom in mol.GetAtoms(): if atom.HasProp("molAtomMapNumber"): atom.ClearProp("molAtomMapNumber") seen = set() # Enumerate all possible bonds to delete (should be 2**n) for r in range(0, len(bond_atom_indices)+1): for delete_indices in itertools.combinations(bond_atom_indices, r): tmp_mol = Chem.RWMol(mol) # Remove the selected bonds for atom1_idx, atom2_idx in delete_indices: tmp_mol.RemoveBond(atom1_idx, atom2_idx) # Remove any singletons. Start from the end so the indices are stable. for atom in list(tmp_mol.GetAtoms())[::-1]: if not atom.GetBonds(): tmp_mol.RemoveAtom(atom.GetIdx()) # Get the corresponding SMARTS tmp_smarts = Chem.MolToSmarts(tmp_mol) # Ensure it's singly connected if "." in tmp_smarts: continue # Ensure it's unique; track the number of bonds and atoms for later sorting key = (tmp_mol.GetNumBonds(), tmp_mol.GetNumAtoms(), tmp_smarts) seen.add(key) for num_bonds, num_atoms, smarts in sorted(seen, reverse=True): print(smarts) === Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Maximum Common Substructure using SMARTS
On Jul 23, 2021, at 01:01, Gustavo Seabra wrote: > I actually want the sulfone to be found, if it is there. My problem is that I > also want flexibility to change the ring atoms and still find the ring as a > match, while considering a match on the sulfone only if it really is there. > (e.g., CF3 should *not* match.) Does it make sense? Ahh, I see. No, there's no way to do that. The best I can suggest is to go back to the original Python implementation and change the code leading up to https://hg.sr.ht/~dalke/fmcs/browse/fmcs.py?rev=tip#L1929 so the initial seed is the sulfone instead of an (atom, bond, atom). Then use that to the the MCS with the sulfone, and if that fails, use RDKit's existing method. I point to my repository only because that's in Python and I know it better. If your C++ skills are better than mine, you might change the corresponding implementation in RDKit. Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Maximum Common Substructure using SMARTS
Hi Gustavo, > template = > Chem.MolFromSmarts('[a]1(-[S](-*)(=[O])=[O]):[a]:[a]:[a]:[a]:[a]:1') Unless things have changed since I last looked at the algorithm, you can't meaningfully pass a SMARTS-based query molecule into the MCS program, outside of a few simple cases. It generates a SMARTS pattern based on the properties of the molecule. You asked it to CompareElements, but those [a] terms all have an atomic number of 0. >>> template = Chem.MolFromSmarts('[a#1]1(-[S](-*)(=[O])=[O]):[a#1]:[a#1]:[a#1]:[a#1]:[a#1]:1') >>> [a.GetAtomicNum() for a in template.GetAtoms()] [0, 16, 0, 8, 8, 0, 0, 0, 0, 0] That's why your CompareAny search returns the #0 terms, like: '[#16,#6](-[#0,#6])(=,-[#8,#9])(=,-[#8,#9])-[#0,#6]1:[#0,#6]:[#0,#6]:[#0,#6]:[#0,#6]:[#0,#7]:1' > I would appreciate some pointers on how it would be possible to find the > maximum common substructure of 2 molecules, where in the template structure > some atoms may be *any*, but some other atoms must be fixed. Perhaps with isotope labelling? That is, label the "any" atoms as isotope 1, and label your -[S](=[O])(=[O])- as -[2S](=[3O])(=[3O])- Then use rdFMCS.AtomCompare.CompareIsotopes . If there's anything you don't want to match at all, give each atom a unique isotope value. Best regards, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Shape Tanimoto distance question
> On Jun 30, 2021, at 04:20, Francois Berenger wrote: > > On 29/06/2021 12:26, Greg Landrum wrote: >> Hi Leon, >> You can convert the tanimoto distance to similarity, but the formula >> is: >> Similarity = 1 - Distance > > In other words: > > Tanimoto_distance = 1.0 - Tanimoto_score As a side note, Rogers and Tanimoto (1960) define their distance as distance = - log_2 (similarity) The Jaccard distance is 1 - similarity https://en.wikipedia.org/wiki/Jaccard_index#Other_definitions_of_Tanimoto_distance Yes, this confusing set of terms is an annoying nuisance. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] off_coverage, Z3, and test set reduction
Hi all, I'm excited about a tool I developed for the Open Force Field Initiative and thought to share a bit about it here. It's called "off_coverage", currently in a pull-request at https://github.com/openforcefield/cheminformatics-toolkit-equivalence and also available from my Sourcehut repo at https://hg.sr.ht/~dalke/off_coverage . The overall goal was to improve test case selection. Quite a few of the tests use an SDF test set with 371 records. These were created to test atom type assignment, and re-purposed as general-purpose test cases. At the beginning this was fine, but now the unittests take a long time to run. It's also likely that some of the records do not add additional testing strength. For example, several of them have valences that RDKit does not accept, and there doesn't appear useful to have multiple test cases for that. My solution casts this into a feature space problem. There are any number of features: can RDKit parse the structure? Can the OpenFF toolkit convert the structure into its internal molecule representation? Can it convert the structure back to an RDKit molecule and generate a SMILES? Does the resulting SMILES match the original? These might be expressed as a mapping from id to a list of features: id123 parse_ok to_openff_ok from_openff_ok id456 parse_ok to_openff_ok from_openff_err id789 parse_err id999 parse_ok to_openff_ok from_openff_ok Assuming all useful features can be detected, test set minimization reduces to the set cover problem - https://en.wikipedia.org/wiki/Set_cover_problem . These can be solved with off-the-shelf tools. I used Z3, https://github.com/Z3Prover/z3 , which solves it by expressing the problem as funding a solution to a Boolean-and of a set of Boolean-ors: (and (or id123 id456 id999); parse_ok id789 ; parse_err (or id123 id456 id999); to_openff_ok (or id123 id999) ; from_openff_ok id456 ; from_openff_err ) while minimizing sum(id123 + id456 + id789 + id999). In this case id789 and id456 must be selected, as well as one id123 or id999. This general approach could be used for all sorts of things, like finding a reduced set of fingerprints whose union still sets all of the bits set by a larger set of fingerprints, selecting a subset of structures which contain all of the atom types in the larger structure, etc. (It could also be modified to require at least 2 representatives, etc.) The novel part (to me) is that code coverage - that is, the lines of code that Python executes for a given test - can also be used to generate features. This information is available via Python's sys.settrace() hook. For example, you might track the sequence of [(module name, line number), ] for every function call. If two functions have the same sequence, then they execute exactly the same code. This is a very strict definition of equivalence. Or, you might track the set of lines executed for each module, and compare those sets. This is more like comparing the overall code coverage for a module. Or, you might track things at the function call level, rather than the module level, in case a test calls a function multiple times, each time possibly executing a different code path. Or you might look at the pairs of (previous line number, current line number), which captures some of the branching behavior. I implemented these variations. Assuming I did it correctly, the minimal subset can be as small as 27 records, out of 371! • mod-cover selected: 72/371 • mod-sequence selected: 342/371 • mod-pairs selected: 72/371 • func-cover selected: 31/371 • func-sequence selected: 336/371 • func-pairs selected: 27/371 In addition, the tool lets you specify an alternative weighting scheme. The default is that each entry has a weight of 1, but you might prefer to minimize the number of atoms, or minimize the number of features in each entry, or to prefer entries from one data set over another. I designed the tool to usable even outside of the OpenFF installation. If you're only interested in the selection part, the "minimize" subcommand takes a line-oriented text file with feature information (id, feature labels, and feature weights) as input, and a list of selected ids as output. It depends only on the "z3-solver" Python package from PyPI. I hope people find set cover tool and/or the idea of coverage-directed test set minimization to be interesting. And perhaps someone is interested enough in the latter to see if I implemented my settrace handlers correctly? ;) Best regards, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Are the path-based fingerprints formally described in the scientific literature?
On May 20, 2021, at 03:17, Francois Berenger wrote: > Weren't the path-based FPs formally described somewhere? What does "formally" mean? Daylight was rarely participated in the academic literature tradition. They instead preferred to publish their information directly, as Pat mentions: On May 21, 2021, at 02:26, Patrick Walters wrote: > There's also some information on path fingerprints in the Daylight Theory > Manual > https://www.daylight.com/dayhtml/doc/theory/theory.finger.html If you're looking for a citation, you can try something like: Daylight Theory Manual 4.9, Daylight Chemical Information Systems, Inc., Laguna Niguel, CA. https://www.daylight.com/dayhtml/doc/theory/theory.finger.html You can see this page is the source for the general understanding. Compare Daylight's: The fingerprinting algorithm examines the molecule and generates the following: • a pattern for each atom • a pattern representing each atom and its nearest neighbors (plus the bonds that join them) • a pattern representing each group of atoms and bonds connected by paths up to 2 bonds long • ... atoms and bonds connected by paths up to 3 bonds long • ... continuing, with paths up to 4, 5, 6, and 7 bonds long. ... the output of which is a set of bits (typically 4 or 5 bits per pattern); the set of bits thus produced is added (with a logical OR) to the fingerprint. with, for example, the textbook "An Introduction to Chemoinformatics" by Leach and Gillet: It is produced by generating all possible linear paths of connected atoms through the molecule containing between one and a defined number of atoms (typically seven). ... Each of these paths in turn serves as the input to a second program that uses a hashing procedure to set a small number of bits (typically four or five) to “1” in the fingerprint bitstring. The "typically" in the latter is because that's Daylight's algorithm's default. What I've learned, in researching this detail, is that Daylight's "typically 4 or 5 bits per pattern" used a size 'n' which was a function of the path length. That detail was known to a few Daylight users that I've talked to, and I've been told it was guided by information theoretical reasoning. In RDKit and Open Babel, 'n' is fixed for the given fingerprint type. In Indigo, which includes non-linear subgraphs, 'n' is a function of both the size and the internal symmetry. The Indigo authors said that was more effective in their experiments. (I would have to dig up old emails to confirm that memory.) I know of no papers which explore this detail. I always figured it would be a good Master's paper for someone interested in old-school cheminformatics. Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] How to prevent a SMILES from starting with a specific atom?
On May 12, 2021, at 05:08, Francois Berenger wrote: > Or, more generally, flag a given atom in a molecule > and ask rdkit to not start the corresponding SMILES with > this atom, any unflagged atom being fine. Perhaps do the opposite and use rootedAtAtom to have RDKit start with a specific atom that you know is fine? https://www.rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html#rdkit.Chem.rdmolfiles.MolToSmiles • rootedAtAtom: (optional) if non-negative, this forces the SMILES to start at a particular atom. Defaults to -1. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] rejoining pairs of fragments after fragmenting a molecule
Hi Ling, > On Apr 2, 2021, at 16:23, Ling Chan wrote: > > Thank you Francois, I took a look at your code and borrowed parts of it to > rejoin two molecules. It seems like my problem is solved. I eventually > arrived at something like example 4 in > https://www.programcreek.com/python/example/123334/rdkit.Chem.CombineMols > (which I discovered a bit late). > > Still, I am not sure if the code is safe. In particular, I wonder if the > following conditions are always valid. > • Chem.CombineMols simply concatenates the atomic indices from the > input molecules. > • The Chem.EditableMol constructor preserves atom ordering from the > input. > • RemoveAtom in EditableMol results in all indices above the deleted to > decrease by one, i.e. atom ordering is preserved. I've found that it's very hard to work with molecular graphs and preserve stereochemistry. Consider F/C=C/Cl breaking on the first bond, and the code I pointed you to. FragmentOnBonds() using '9' as the labels gives: [9*]/C=C/Cl.[9*]F My "smiles_weld" code converts that to: CC\%99=C/Cl.F%99 which can be re-canonicalized to the original: F/C=C/Cl . Or, with F[C@H](Cl)Br again, breaking on the first bond. FragmentOnBonds() gives [9*]F.[9*][C@H](Cl)Br smiles_weld converts that to F%99.[C@@H]%99(Cl)Br which is re-canonicalized as F[C@H](Cl)Br Handling this correctly in the molecule API requires paying careful attention to the bond direction, and bond attachment order around the atom, which changes with RemoveAtom() calls. I didn't see stereochemistry support in Francois's "bind_molecules()" nor in the connect_mols() at https://github.com/molecularsets/moses/blob/master/moses/baselines/combinatorial.py (one of the examples from the programcreek.com link you gave). If you don't need to support or preserve stereochemistry, then of course there's no problem. Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] rejoining pairs of fragments after fragmenting a molecule
On Mar 31, 2021, at 21:55, Ling Chan wrote: > I am trying to do something that I think is quite simple, but I have not > figured out a simple way. Don't know if I am missing something. I am sure > that ultimately I can figure it out, but I wonder if there is a good way. If you can work in SMILES space rather than molecule space, then try: http://dalkescientific.com/smiles_weld.py It's derived from a technique I developed for the mmpdb package. I called it 'welding' the SMILES strings. What I do is convert the wildcards into closures, then let RDKit merge the closures. (There are a few tricky parts, like support for double-bond stereo chemistry.) Here's an example, where I use a dictionary to tell the program that [1*] should be bonded to [2*]. >>> from rdkit import Chem >>> smi = "N#Cc1ccncc1" >>> mol = Chem.MolFromSmiles(smi) >>> frag_mol = Chem.FragmentOnBonds(mol, [1]) >>> frag_smi = Chem.MolToSmiles(frag_mol) >>> frag_smi '[1*]c1ccncc1.[2*]C#N' >>> import smiles_weld >>> smiles_weld.convert_wildcards_to_closures(frag_smi, {1: 1, 2: 1}) 'c%991ccncc1.C%99#N' >>> Chem.CanonSmiles('c%991ccncc1.C%99#N') 'N#Cc1ccncc1' If you use matching dummy labels then you can omit the conversion table: >>> frag_mol = Chem.FragmentOnBonds(mol, [1], dummyLabels=((4,4),)) >>> frag_smi = Chem.MolToSmiles(frag_mol) >>> frag_smi '[4*]C#N.[4*]c1ccncc1' >>> smiles_weld.convert_wildcards_to_closures(frag_smi) 'C%99#N.c%991ccncc1' >>> Chem.CanonSmiles('C%99#N.c%991ccncc1') 'N#Cc1ccncc1' Note: while the mmpdb code is well-tested, I modified it this morning to handle what I think you want, and I haven't fully tested the new code. The program assumes the SMILES is a canonical SMILES generated by RDKit, and that the wildcard labels don't have a charge, hydrogen count, or other attribute. Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] inter-classes Tanimoto similarity
> On Mar 13, 2021, at 20:29, Marawan Hussien via Rdkit-discuss > wrote: > my question is if this is the valid approach of comparison, particularly if > the class sizes vary widely and the average similarity will be inevitably > affected by the size of each item in each pair. As a check, it looks that the > diagonal is having the highest inter-classes similarity overall, which is > anyway expected. > > I am also wondering if a size-weighted normalization approach could handle > this situation? What about a Z-score? That is: zscore = (score - background_score) / background_standard_deviation rather than using the mean score. I worked out something like that a few years ago, using chemfp, at http://www.dalkescientific.com/writings/diary/archive/2017/03/27/chembl_target_sets_association_network.html . If that's a reasonable approach, then it could all be done in RDKit, if you don't want to use chemfp. Best regards, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] chemfp 3.5.1 and ChEMBL 27 FPB distributions
Hi all, I've just released chemfp 3.5.1 with support for "licensed FPB files". These are fingerprint datasets which can be used under the terms of chemfp's base license agreement even without a chemfp license key or source code distribution. As the first (and so far only) data set, I've converted the RDKit Morgan fingerprints from the ChEMBL 27 release into FPB format and made it available from https://chemfp.com/datasets/ . If you are on a Linux-based OS and RDKit is already installed then here are the steps to get started: 1) Install a pre-compiled version of chemfp for Linux: python -m pip install chemfp -i https://chemfp.com/packages/ 2) Download the ChEMBL data set in FPB format using one of the following: wget https://chemfp.com/datasets/chembl_27.fpb.gz -or- curl -O https://chemfp.com/datasets/chembl_27.fpb.gz 3) (Optional but recommended) Uncompress it: gunzip chembl_27.fpb.gz 4) Do a similarity search, for example, with a query SMILES or query file: simsearch --query c1c1O chembl_27.fpb simsearch --queries queryfile.sdf chembl_27.fpb 5) View the ChEMBL license agreement and legal notices included with the dataset. python -m chemfp fpb_text chembl_27.fpb Chemfp is a Python package for cheminformatics fingerprints. It can be used to: - generate fingerprints using the RDKit, Open Babel, CDK, and OEChem/OEGraphSim toolkits; - extract pre-computed fingerprints from SDF tag data; - do high-performance Tanimoto and Tversky similarity search; - integrate fingerprints and search results with NumPy/SciPy. - ... and much more! It includes an extensive and well-documented Python API for working with fingerprints and a set of command-line tools for fingerprint generation, conversion, and similarity search. Chemfp natively supports two fingerprint file formats. The FPS format is a text format which is easy to read and write. The FPB format is a binary format which is quick to load. For more information see https://chemfp.com/ . License keys for academic use are available at no cost. Best regards, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Partial substructure match?
0.409 0.333 0.265 0.200 CHEBI:1148 7 6 6 5 0.273 0.208 0.261 0.200 CHEBI:1734 19 21 16 15 0.727 0.625 0.640 0.500 CHEBI:1895 9 9 9 8 0.409 0.333 0.409 0.320 ... Finally, nearly all of the MCS parameters can be configured on the command-line. This program was written by Andrew Dalke . """ import sys import argparse from rdkit import Chem from rdkit.Chem import rdFMCS def get_chemfp_rdkit_toolkit(option): try: import chemfp except ImportError: sys.stderr.write("chemfp must be installed to use the %s argument.\n" % (option,)) sys.stderr.write("Install it with:\n") sys.stderr.write(" python -m pip install chemfp -i https://chemfp.com/packages\n;) raise SystemExit(10) if not hasattr(chemfp, "__version_info__") or chemfp.__version_info__[0] < 3: sys.stderr.write("A newer version of chemfp must be installed to use the %s argument." % (option,)) sys.stderr.write("Upgrade your version with:\n") sys.stderr.write(" python -m pip install chemfp -i https://chemfp.com/packages --upgrade\n") raise SystemExit(10) from chemfp import rdkit_toolkit return rdkit_toolkit parser = argparse.ArgumentParser( description="Compute MCS similarity scores between a query and target structure or structures", epilog = EPILOG, formatter_class=argparse.RawDescriptionHelpFormatter, ) # Query / target parser.add_argument("--query", metavar="SMILES", help="default query SMILES (default: naphthalene)") target_group = parser.add_mutually_exclusive_group() target_group.add_argument("--target", metavar="SMILES", help="default target SMILES (default: phenol)") target_group.add_argument("--targets", metavar="FILENAME", help="read target structures from FILENAME; use '-' for stdin") # MCS arguments parser.add_argument("--maximize-atoms", action="store_true", help="maximize the number of atoms (the default maximizes the number of atoms") parser.add_argument("--match-valences", action="store_true", help="atom matches must have the same valence") parser.add_argument("--ring-matches-ring-only", action="store_true", help="ring atoms must only match ring atoms") parser.add_argument("--complete-rings-only", action="store_true", help="only allow complete rings to match") parser.add_argument("--match-chiral-tag", action="store_true", help="atom matches must have the same chiral tag") parser.add_argument("--atom-compare", choices=("Any", "AnyHeavyAtom", "Elements", "Isotopes"), default="Elements", help="method for comparing two atoms (default: Elements)") parser.add_argument("--bond-compare", choices=("Any", "Order", "OrderExact"), default="Order", help="method for comparing two bonds (default: Order") parser.add_argument("--ring-compare", choices=("IgnoreRingFusion", "PermissiveRingFusion", "StrictRingFusion"), default="IgnoreRingFusion", help="method for handling ring fusion (default: 'IgnoreRingFusion'") _atom_comparisons = { "Any": rdFMCS.AtomCompare.CompareAny, "AnyHeavyAtom": rdFMCS.AtomCompare.CompareAnyHeavyAtom, "Elements": rdFMCS.AtomCompare.CompareElements, "Isotopes": rdFMCS.AtomCompare.CompareIsotopes, } _bond_comparisons = { "Any": rdFMCS.BondCompare.CompareAny, "Order": rdFMCS.BondCompare.CompareOrder, "OrderExact": rdFMCS.BondCompare.CompareOrderExact, } _ring_comparisons = { "IgnoreRingFusion": rdFMCS.RingCompare.IgnoreRingFusion, "PermissiveRingFusion": rdFMCS.RingCompare.PermissiveRingFusion, "StrictRingFusion": rdFMCS.RingCompare.StrictRingFusion, } # Configure how to parse the query and target(s) parser.add_argument("--query-format", metavar="FORMAT", help="query format (default is 'smi' for SMILES)") parser.add_argument("--target-format", metavar="FORMAT", help="target format (default: based on filename extension, defaults to SMILES") parser.add_argument("--id-tag", metavar="STR", h
Re: [Rdkit-discuss] Morgan FP atom numbering
On Oct 26, 2020, at 17:41, Cyrus Maher wrote: > I’m wondering if there is an easy way to retrieve the atom numbers that the > morgan fingerprints algorithm assigns as its first step. Many of the fingerprint function support an optional "bitInfo" parameter. If it's a dictionary then the keys are the bit that was set, and the value is at tuple of the (atom index, radius) which set it. Here's an example with theobromine using r=0, which lets you see the initial invariants: >>> from rdkit import Chem >>> from rdkit.Chem import rdMolDescriptors >>> mol = Chem.MolFromSmiles("Cn1cnc2c1c(=O)[nH]c(=O)n2C") >>> bitInfo = {} >>> fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius=0, >>> useFeatures=1, bitInfo=bitInfo) >>> for bitno, pairs in sorted(bitInfo.items()): ... print(f"Bitno: {bitno}") ... for atom_idx, r in pairs: ... print(f" atom {atom_idx} ({mol.GetAtomWithIdx(atom_idx).GetSymbol()}) with radius {r}") ... Bitno: 0 atom 0 (C) with radius 0 atom 12 (C) with radius 0 Bitno: 2 atom 7 (O) with radius 0 atom 10 (O) with radius 0 Bitno: 4 atom 2 (C) with radius 0 atom 4 (C) with radius 0 atom 5 (C) with radius 0 atom 6 (C) with radius 0 atom 9 (C) with radius 0 Bitno: 5 atom 8 (N) with radius 0 Bitno: 6 atom 1 (N) with radius 0 atom 3 (N) with radius 0 atom 11 (N) with radius 0 If I follow the code correctly, when useFeatures == 1 then the intial invariants are set by getFeatureInvariants() in ./Code/GraphMol/Fingerprints/FingerprintUtil.cpp , available at: https://github.com/rdkit/rdkit/blob/d594998dda2803a15aa0066e06f2477b71ed3be6/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L221 A few lines up, at https://github.com/rdkit/rdkit/blob/d594998dda2803a15aa0066e06f2477b71ed3be6/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L182 , you'll see the features patterns defined in smartsPatterns They appear to be identical to the list you gave. I reimplemented the initialization function (copied at the end of this email). Running the program shows that it produces the same invariants which are used as the bit numbers in the Morgan feature fingerprint: Invariant: 0 atom 0 (C) atom 12 (C) Invariant: 2 atom 7 (O) atom 10 (O) Invariant: 4 atom 2 (C) atom 4 (C) atom 5 (C) atom 6 (C) atom 9 (C) Invariant: 5 atom 8 (N) Invariant: 6 atom 1 (N) atom 3 (N) atom 11 (N) I believe that gives you two ways to get the information you want! Best regards, Andrew da...@dalkescientific.com # Python re-implementation of RDKit's getFeatureInvariants() from # ./Code/GraphMol/Fingerprints/FingerprintUtil.cpp from rdkit import Chem smartsPatterns = [ "[$([N;!H0;v3,v4&+1]),\ $([O,S;H1;+0]),\ n&+0]", # Donor "[$([O,S;H1;v2;!$(*-*=[O,N,P,S])]),\ $([O,S;H0;v2]),\ $([O,S;-]),\ $([N;v3;!$(N-*=[O,N,P,S])]),\ n&+0,\ $([o,s;+0;!$([o,s]:n);!$([o,s]:c:n)])]",# Acceptor "[a]", # Aromatic "[F,Cl,Br,I]", # Halogen "[#7;+,\ $([N;H2&+0][$([C,a]);!$([C,a](=O))]),\ $([N;H1&+0]([$([C,a]);!$([C,a](=O))])[$([C,a]);!$([C,a](=O))]),\ $([N;H0&+0]([C;!$(C(=O))])([C;!$(C(=O))])[C;!$(C(=O))])]", # Basic "[$([C,S](=[O,S,P])-[O;H1,-1])]"# Acidic ] mol = Chem.MolFromSmiles("Cn1cnc2c1c(=O)[nH]c(=O)n2C") invariants = [0] * mol.GetNumAtoms() for pattern_idx, smartsPattern in enumerate(smartsPatterns): pat = Chem.MolFromSmarts(smartsPattern) for (atom_idx,) in mol.GetSubstructMatches(pat): invariants[atom_idx] |= (1
Re: [Rdkit-discuss] sd file format question
ou can also use the --verify flag to generate and compare the SMILES strings before and after the conversion. Best regards, Andrew da...@dalkescientific.com # Copy charges from the "M CHG" data lines to the atom block # Written by Andrew Dalke, 2 October 2020 import argparse import sys import gzip # This requires chemfp. See https://chemfp.com/license/ for licenses. # The default license is the "chemfp Base License" from chemfp import text_toolkit # For Linux: python -m pip install chemfp -i https://chemfp.com/packages/ # From the documentation: # 0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1, # 4 = doublet radical, 5 = -1, 6 = -2, 7 = -3 _chg_to_ccc_table = { +3: 1, +2: 2, +1: 3, -1: 5, -2: 6, -3: 7, } class ChargeError(Exception): pass def set_atom_block_charges(record, verify=False): lines = record.splitlines(True) num_atoms = int(lines[3][:3]) num_bonds = int(lines[3][3:6]) for line in lines[3 + num_atoms + num_bonds:]: if line.startswith(b"M END"): break if not line.startswith(b"M CHG"): continue # Process each charge assignment on this line try: num_fields = int(line[6:9]) for field_index in range(num_fields): offset = 10 + 8*field_index atom_index = int(line[offset:offset+3]) chg = int(line[offset+4:offset+7]) # 0 means uncharged or unsupported charge ccc = _chg_to_ccc_table.get(chg, 0) # Update the correct atom line to have the charge atom_line = lines[3 + atom_index] atom_line = b"%s%3d%s" % (atom_line[:36], ccc, atom_line[39:]) lines[3 + atom_index] = atom_line except (ValueError, IndexError) as err: # This shouldn't happen unless I did something wrong. raise ChargeError("Unable to process charges in %r" % (line,)) if verify: # Remove the "M CHG" line i = lines.index(line) lines[i] = b"" return b"".join(lines) ### Handle command-line processing parser = argparse.ArgumentParser( description="copy charge information from the 'M CHG' data line to the atom block") parser.add_argument("--output", "-o", metavar="FILENAME", help="output file name (default: stdout)") parser.add_argument("--roundtrip", action="store_true", help="use RDKit to parse the record and regenerate the SDF record") parser.add_argument("--verify", action="store_true", help="ensure the input and output SMILES match") parser.add_argument("--no-set", action="store_true", help="don't set the charges (useful if you want to see the round-trip output)") parser.add_argument("input", nargs="?", metavar="FILENAME", help="input filename (default: stdin)") def open_binary_output(filename): if filename is None: return getattr(sys.stdout, "buffer", sys.stdout) # support Python 2 and 3 elif filename.endswith(".gz"): return gzip.open(filename, "wb") else: return open(filename, "wb") def main(): args = parser.parse_args() verify = args.verify roundtrip = args.roundtrip no_set = args.no_set if verify or roundtrip: from chemfp import rdkit_toolkit as T with text_toolkit.read_sdf_records(args.input) as record_reader: outfile = open_binary_output(args.output) for record in record_reader: if roundtrip: mol = T.parse_molecule(record, "sdf", errors="ignore") if mol is None: sys.stderr.write("Warning: Could not process molecule record: %s -- skipping\n" % ( record_reader.location.where(),)) continue record = T.create_bytes(mol, "sdf") if verify: # Get a reference SMILES string smi1 = T.create_string(T.parse_molecule(record, "sdf"), "smistring") if no_set: output = record else: try: output = set_atom_block_charges(record, verify) except ChargeError as err: sys.stderr.write("ERROR: %s: %s\n" % (err, record_reader.location.where())) raise SystemExit("
Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory
On Sep 9, 2020, at 04:00, Lewis Martin wrote: > I'd like to keep it FOSS since its for academic publication and hopefully to > be re-used. Chemfp is amazing but brute-forcing 100million by 100million > would surely still take a long time compared with an approximate nearest > neighbor approach. As a rough guideline, the chemfp paper Fig. 3 says that searching 100M PubChem fingerprints for the k=10 neighbors averages about 100 ms on a Haswell machine (See https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0398-8 ). That was for chemfp 3.3 using 881 bits. You have 512 bits, which is faster (0.6x time, according to my timings). Chemfp 1.6 is about the same performance as chemfp 3.4 for this case. So that's 60ms per query, on average, or 70 days for a single-processor query. (It's hard to compare directly though as the BitBound algorithm is sublinear; I found the PubChem fingerprint kNN search scales at O(n** ~0.6) while Morgan 2048-bit scales at O(n** ~0.8). Then again, the Haswell machine is several years old, so I'm leaving the numbers as-is for an estimate. This is easily parallelized yourself, or you can use chemfp's built-in parallelized NxN kNN searches. (The amount of scaleup on a single multi-core machine depends on your memory architecture.) This means that it shouldn't be hard to get a <1 week performance out of chemfp. Performance-wise, chemfp 3.4.1 (the latest version of the commercial track) is the same as chemfp 1.6.1 (the latest *no-cost*/open-source version). One caveat - chemfp 1.x has a limit of 2**31 bytes for the fingerprints, or 33.6M 512-bit fingerprints. (chemfp 2.x or later use 64-bit indexing and don't have this limitation.) To get around that, with 1.6.1, you'll need split your data into 4 chucks, and search each independently. This also reduces the effect of sublinear scaling. Or, there's no-cost academic licensing for chemfp 3.x using pre-compiled binaries which work on most Linux-based OSes. Still, I think an exact search is entirely doable using FOSS software. Switching to a smaller fingerprint size is a form of approximate nearest neighbors. If you take Greg's observation and drop the fingerprint size from 512 bits to 128 bits then you can get about a factor of 3 faster, and have everything fit into chemfp 1.6's 2GB limit. Best regards, Andrew da...@dalkescientific.com P.S. On Wed, Sep 9, 2020 at 11:29 AM Francois Berenger wrote: > There are many published methods, some open-source software (like > Dalke's chemfp) and even some commercial ones > which claim they are lightning fast (even reaching real-time search > speed!). Dalke's chemfp also claims real-time search speed. ;) It's always a claim of what size that means. Also, at least a couple of those systems use multiple threads to speed up single-query search, which interferes with multi-threaded batch search, as with NxN search on a single machine. ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Rdkit-discuss] MACCS keys - revisited
On Sep 8, 2020, at 14:30, Mike Mazanetz wrote: > Does anyone know whether it’s possible to obtain not just a fingerprint keys > for MACCS (binary values) but the number of occurrences of the keys, > particularly these details: The SMARTS patterns for most of the MACCS keys is available by: >>> from rdkit.Chem import MACCSkeys >>> for key, smarts in MACCSkeys.smartsPatts.items(): ... print("[%s] %s" % (key, smarts)) ... [1] ('?', 0) [2] ('[#104]', 0) [3] ('[#32,#33,#34,#50,#51,#52,#82,#83,#84]', 0) [4] ('[Ac,Th,Pa,U,Np,Pu,Am,Cm,Bk,Cf,Es,Fm,Md,No,Lr]', 0) [5] ('[Sc,Ti,Y,Zr,Hf]', 0) [6] ('[La,Ce,Pr,Nd,Pm,Sm,Eu,Gd,Tb,Dy,Ho,Er,Tm,Yb,Lu]', 0) ... There are two parts to the right-hand-side: SMARTS pattern and count. If the SMARTS pattern is a "?", that means the pattern is not defined at the SMARTS level. There must be at least count+1 matches. That is, if the count is 0 then there must be at least one match. You write "the number of occurrences of the keys". I don't know how that makes sense for all the keys. You have things like: 140: (key(164)-3 if key(164)>3; else 0) 141: (key(160)-2 if key(160)>2; else 0) 142: (key(161)-2 if key(161)>1; else 0) These correspond to RDKit's definitions: [140] ('[#8]', 3) [141] ('[CH3]', 2) [142] ('[#7]', 1) How do you count those number of occurrences? > On Sep 8, 2020, at 21:56, Mike Mazanetz wrote: > The KNIME node does a lot of double counting for the RDKit Substructure > Counter, so it’s not a useful tool for counting MACCS keys. Something like [11] ('*1~*~*~*~1', 0) has many matches due to symmetry. You have to decide if you think this should be counted once, or if all 8 matches should be counted. The molecule method 'GetSubstructMatches()' has a uniquify option; by default it only returns unique counts. ("Unique" is based on unique atoms, not unique atoms and bonds. I don't think that distinction affect the MACCS patterns.) Regards, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] chemfp 1.6 and 3.4 releases
On Jun 25, 2020, at 16:27, Andrew Dalke wrote: > > See https://chemfp.com/license/ for details, or to get started: > > python -m pip install chemfp -i https://chemp.com/packages/ --upgrade That should be python -m pip install chemfp -i https://chemfp.com/packages/ --upgrade Oops! Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] chemfp 1.6 and 3.4 releases
Hi RDKit'ers, I've just released new versions of chemfp. Version 1.6 is the no-cost/open source version, and 3.4 is the commercial version. The goal of chemfp 1.6 is to provide a good performance baseline for evaluating new Tanimoto search programs. This release is about 10-20% faster than chemfp 1.5. I believe it is the fastest no-cost/open source search program available for CPUs. It can be installed with: python -m pip install chemfp --upgrade However, it only supports Python 2.7, which means it's only useful for legacy RDKit installations. On the other hand, chemfp 3.4 supports Python 2.7 and Python 3.6+, is faster, and has many more features than chemfp 1.6, including a "toolkit" I/O abstraction to handle reading/writing structures to/from a file or a string. The comprehensive documentation is at: https://chemfp.readthedocs.io/en/latest/ See the "What's New" section for a list of changes over time. Chemfp 3.4 is available with academic and commercial (proprietary) licenses, both for pre-compiled binary packages which work on most Linux-based OSes, and for an option also get source code access. Open source licensing is still available, as the most expensive option. With this release I am making those pre-compiled packages available for anyone to download under a base license agreement for in-house use, which limits or restricts access to some features: • fingerprint arenas may not be larger than 50,000 fingerprints; • in-memory arena searches may not have more than 50,000 queries or targets; • FPS searches may not have more than 20 queries; • Tversky search is disabled; • writing FPB files is disabled. These can be enabled with a valid license key. There's a special waiver which lets you use this version of chemfp to generate FPS files for any purpose, including commercial use. (This also gives a way for chemfp 1.6 to be a useful benchmark baseline even with new data sets and fingerprint types.) See https://chemfp.com/license/ for details, or to get started: python -m pip install chemfp -i https://chemp.com/packages/ --upgrade A one-year no-cost license key for the pre-compiled packages is also now available for academics. Contact me if you have questions or are interested in evaluating or licensing chemfp. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Number of sp3 atoms
On May 31, 2020, at 15:23, Chris Swain via Rdkit-discuss wrote: > I’d like to include the number of sp3 atoms, is there an easy way to do this? I don't easily see a function for that. There's rdMolDescriptors.CalcFractionCSP3() which "returns the fraction of C atoms that are SP3 hybridized". You can do it yourself by looking at the atom's hybridization: >>> mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C") >>> sum((a.GetHybridization() == Chem.HybridizationType.SP3) for a in >>> mol.GetAtoms()) 3 Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] SMILES/SMARTS codes that match multiple atoms
On Feb 8, 2020, at 17:55, Janusz Petkowski wrote: > > If not how can I match cases where in a given position there can be C or H > with rdkit? I believe you should use #1 instead of H. >>> from rdkit import Chem >>> mols = [Chem.MolFromSmiles(s) for s in ["C(=O)OC", "C(=O)OCC", "C(=O)OCCC"]] >>> hmols = [Chem.AddHs(mol) for mol in mols] Your pattern: >>> pat1 = Chem.MolFromSmarts("[H]C(=O)OC([C,H])([H])[H]") >>> [mol.HasSubstructMatch(pat1) for mol in hmols] [False, True, True] Using #1 instead of H: >>> pat2 = Chem.MolFromSmarts("[H]C(=O)OC([C,#1])([#1])[#1]") >>> [mol.HasSubstructMatch(pat2) for mol in hmols] [True, True, True] "H" has an odd interpretation. https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html says: Note that atomic primitive H can have two meanings, implying a property or the element itself. [H] means hydrogen atom. [*H2] means any atom with exactly two hydrogens attached I believe the goal of having [H] match a hydrogen atom is to allow a SMILES, when interpreted as a SMARTS, to be able to match the SMILES when interpreted as a molecule. I'm not sure about that though. Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] last call for mmpdb funding
Hi all, This is the last email I'll send asking for people and organizations to join the current mmpdb crowdsourcing effort. I've discussed it several times before here. In summary, I'm looking for crowdfunding for the matched molecular pair program 'mmpdb'. This is part of a test to find alternative ways to raise money for open source development in cheminformatics. See http://mmpdb.dalkescientific.com/ for details. Currently I've pleased to say the effort has raised EUR 17 500. This is enough to fully finish off Postgres support and the new 'proprulecat', and pay back for the time taken to organize this funding effort. In addition, it passed the EUR 16 000 goal, which means that next month I'll work change mmpdb so it stores the environment as a more easily interpretable fragment SMILES, rather than a hashed version of the circular environments. The next funding goal is EUR 23 000, which is EUR 5 500 away. If I reach that funding goal, I will commit make a public/no-cost release of the new code in Oct. 2020. I've extended the deadline to join to 15 February because some additional marketing will go out at the end of this month, and I want to give recipients a chance to participate. People can still purchase mmpdb after the deadline is reached. Those goals merely represent specific commitments from me to work on mmpdb. For those interested in budget, or who think that EUR 17 500 is already a large amount of money. EUR 17 500 is corporate income, not salary income. About 50% of that goes to payroll taxes. The average software developer salary here in Sweden is about EUR 50 000, so EUR 9 000 is about 2 months of time. (I would be making above average salary should I decide to go corporate.) I spent several weeks working on the web site - which is essential marketing for crowdsourcing - and paid a web designer to help. I've also spent a few weeks working on improvements already delivered to customers. Even invoicing takes time. But the goal of this effort isn't for me to be rich - though that would be nice. It's to see if an open-source project like mmpdb can be economically self-sustaining though funding by users interested in paying for specific new features and support. Best regards, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] acepentalene aromaticity perception
On Jan 22, 2020, at 14:12, Greg Landrum wrote: > As an aside: it's not particularly relevant to this discussion, but I don't > understand why the wikipedia page says that the compound is anti-aromatic. I > think the standard definition of anti-aromaticity (agrees with the one linked > to from the acepentalene page) requires the ring system to have 4n electrons. > That definitely doesn't apply here to either the individual rings or the > system as a whole. The system as a whole has 10 electrons (4n+2), the > individual rings each have 5 (neither aromatic nor anti-aromatic), and the > outer envelope has 9 (again, neither aromatic nor anti-aromatic). Because I didn't know either, I looked into it. I think that's because (to quote "Towards experimental determination of conical intersection properties:a twin state based comparison with bound excited states", Phys. Chem. Chem. Phys., 2011,13, 11872–11877 [*] ) > A Hückel MO analysis[21] leads to the conclusion that the ground state of the > conjugated tricyclic acepentalene I is a triplet state. DFT calculations > corrected this picture and showed a singlet global minimum distorted to C_s > symmetry with alternated single and double bonds,[22] which are well > described by the Lewis structures A(B,C). According to a B3LYP/6-31G* > calculation the lowest triplet state has also a high symmetric C_3v > configuration and lies 3.9 kcal/mol above the singlet ground state minimum. > Acepentalene I was characterized as an antiaromatic system [23] despite being > formally an aromatic 10 electron system: the resonance between each pair of > Kekule structures in this case involves only 4 electron pairs of the > pentalene fragments and it averts the resonance with the additional fifth > electron pair common for both the structures. Such a resonance is described > as an anti-combination of two Kekule structures: (A–B), (C–B) and (C–A). Just need to add B3LYP/6-31G* calculations to RDKit's aromaticity perception algorithm and everything will be fine. :) The "characterized as an antiaromatic system[23]" is "T. K. Zywietz, H. Jiao, P. v. R. Schleyer and A. de Meijere, J. Org.Chem., 1998, 63, 3417" at https://pubs.acs.org/doi/abs/10.1021/jo980089f . Cheers, Andrew da...@dalkescientific.com [*] https://www.researchgate.net/profile/Shmuel_Zilberg/publication/51175586_Towards_experimental_determination_of_conical_intersection_properties_A_twin_state_based_comparison_with_bound_excited_states/links/561bb5bc08ae6d17308b037f/Towards-experimental-determination-of-conical-intersection-properties-A-twin-state-based-comparison-with-bound-excited-states.pdf ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] acepentalene aromaticity perception
Hi all, Could someone explain the following, which uses the SMILES from https://en.wikipedia.org/wiki/Acepentalene : >>> from rdkit import Chem >>> Chem.CanonSmiles("C1=CC2=CC=C3C2=C1C=C3") 'c1cc2ccc3ccc1-c=3-2' >>> import rdkit >>> rdkit.__version__ '2019.09.1' I don't understand the aromatic "c" in the fused center of the 3 5-membered rings. It's connected by non-aromatic bonds to the rest of the system. This broke some code of mine which expects that every aromatic atom must have at least two aromatic bonds. I thought that all aromatic atoms had to be in aromatic rings, and that all aromatic rings had to have aromatic bond. (I'm ignoring RDKit's support for aromatic triple bonds in this description.) I searched for "acepentalene" and "antiaromatic" in the issue tracker and the mailing list but found nothing relevant. Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Clearing isotope info
On Dec 12, 2019, at 17:39, Rafal Roszak wrote: > I also had situation when I need to generate smiles with either > isotopes or stereochemistry but not both. Maybe it is worth to add two > options to ChemMolToSmiles function: > > dontIncludeStereochemistry=True/False > dontIncludeIsotopes=True/False > > Right now it is not straightforward to generate smiles w/o isotopes > (but with stereochemistry) - one need to remove isotope, export to > smiles and restore isotopes. Bear in mind a few complications. I believe the following correctly implements what you describe: from rdkit import Chem def MolWithoutIsotopesToSmiles(mol): atom_data = [(atom, atom.GetIsotope()) for atom in mol.GetAtoms()] for atom, isotope in atom_data: if isotope: atom.SetIsotope(0) smiles = Chem.MolToSmiles(mol) for atom, isotope in atom_data: if isotope: atom.SetIsotope(isotope) return smiles >>> mol = Chem.MolFromSmiles("[19F][13C@H]([16OH])[35Cl]") >>> MolWithoutIsotopesToSmiles(mol) 'O[C@@H](F)Cl' Testing reveals two problems with my implementation: 1) isotopic hydrogens Consider the same structure with tritium instead of fluorine: >>> mol = Chem.MolFromSmiles("[3H][13C@H]([16OH])[35Cl]") >>> MolWithoutIsotopesToSmiles(mol) '[H][C@H](O)Cl' That output should be 'OCCl'. 2) stereochemistry assignment needs to be recalculated after the isotopes have been removed: >>> mol = Chem.MolFromSmiles("C[C@H]([13CH3])CI") >>> MolWithoutIsotopesToSmiles(mol) 'C[C@@H](C)CI' >>> Chem.CanonSmiles("C[C@H]([CH3])CI") 'CC(C)CI' 2b) This includes directional bonds >>> mol = Chem.MolFromSmiles("C/C(=C/CO)/[11CH3]") >>> MolWithoutIsotopesToSmiles(mol) 'C/C(C)=C/CO' >>> Chem.CanonSmiles("C/C(=C/CO)/[CH3]") 'CC(C)=CCO' Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] assign all bond directions in SMILES
Hi all, Is there any way to assign all bond directions (E/Z stereochemistry) to the output SMILES string? For example, here's a structure: >>> mol = Chem.MolFromSmiles(r"F/C(Cl)=C(O)/N") >>> Chem.MolToSmiles(mol) 'N/C(O)=C(/F)Cl' It's a minimal definition, in that I could have specified the directions for all of the bonds: >>> mol = Chem.MolFromSmiles(r"F/C(/Cl)=C(\O)/N") >>> Chem.MolToSmiles(mol) 'N/C(O)=C(/F)Cl' Note that RDKit figured out which bond directions were minimal. The underlying code checks for conflicting assignments: >>> mol = Chem.MolFromSmiles(r"F/C(/Cl)=C(/O)/N") [18:25:25] Conflicting single bond directions around double bond at index 2. [18:25:25] BondStereo set to STEREONONE and single bond directions set to NONE. >>> Chem.MolToSmiles(mol) 'NC(O)=C(F)Cl' What I want is some way to go from N/C(O)=C(/F)Cl to a fully specified F/C(/Cl)=C(\O)/N Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] MolToSmiles preserve atom order
On Nov 18, 2019, at 17:40, David Cosgrove wrote: > > Point taken. I don’t think you’d be able to get RDKit to spit such SMILES > strings out unless you tortured it pretty hard, however. Did someone mention one of my favorite things to do? :) See: http://dalkescientific.com/writings/diary/archive/2010/12/28/reordering_smiles.html Note that that code does not preserve stereochemistry. It's for Python 2, so change the: available_closures = range(100) to available_closures = list(range(100)) to make it work under Python 3. Here's what it looks like: >>> from x import reordered_smiles >>> from rdkit import Chem >>> mol = Chem.MolFromSmiles("OCCl") >>> atoms = list(mol.GetAtoms()) >>> reordered_smiles(mol, [atoms[1], atoms[0], atoms[2]]) '[CH2]12.[OH1]1.[Cl]2' Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] Second call for mmpdb funding
Hi all, The end of the year is coming up. Perhaps there's extra money in your budget which can go to support open source development in cheminformatics? As many of you know, I started a crowdfunding effort to fund improvements to the matched molecular pair program "mmpdb". I want to see if crowdfunding could help pay for long-term project sustainability in our field. At the institutional level, this effort is very much like a standard consortium. If you join the crowdfunding effort, you get new features which are not available to the general public, such as a command-line tool to export the rules and Postgres support. These are available right now. As more people join the consortium, I will commit to adding new features. One which interests many people is to replace the existing opaque hash environment fingerprint with a fragment SMILES. That will be funded if two additional corporate members join. Everyone who joins gets the new features under a permissive open source license. If enough people join, then I will commit to releasing the software to the public in the fall of next year. If you want mmpdb to be supported and improved, then you should get your organization to join the funding effort. Full details at http://mmpdb.dalkescientific.com/ . Best regards, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] missing MolFromSmiles error output in Jupyter
Dear Stéphane, > On Oct 16, 2019, at 19:39, Téletchéa Stéphane > wrote: > Did you 'by chance' transmit your presentation in PDF? Yes, I exported my Keynote.app presentation to PDF. However, I also sent the specific commands in email as plain text, as part of the process of trying to diagnose the problem. > This may come from a weird bug in quote management from a PDF file when you > copy/paste double quotes, The student sent me screen grabs. I can see that quotes are not the issue. If they were, I would expect to see a "SyntaxError: invalid character in identifier" as Python doesn't recognize left-handed or right-handed double quotes in a way that doesn't cause an error. Instead, the MolFromSmiles() appears to be evaluated. For what it's worth, I've long since disabled smart quotes in Keynote.app, and haven't had this problem with other students, which includes students working on Windows. The only thing I can see that's different is that my previous students used a 2018 release of RDKit, not 2019. And of course an older version of Jupyter. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] missing MolFromSmiles error output in Jupyter
Hi all, I wasn't able to give my RDKit training session at the last UGM, so I passed out the presentation materials to the students who signed up. One of them wrote to me asking why the following didn't display an error message in the notebook. from rdkit import Chem from rdkit.Chem.Draw import IPythonConsole mol = Chem.MolFromSmiles("Q") On my machine I get a red message from stderr: RDKit ERROR: [15:41:35] SMILES Parse Error: syntax error for input: 'Q' so I don't know why it doesn't work for the student (who sent me screenshots demonstrating the lack). sys.version says: 3.7.4 (default, Aug 9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)] (which means it's an Anaconda build for MS Windows) and rdkit.__version__ is 2019.03.4 (I have an older version of RDKit on my laptop, which I used to verify that I get the message.) Any idea on what's going on, or what I can suggest trying? Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] SubstructMatch of identical Mols returns different results
On Oct 3, 2019, at 20:34, Ondrej Gutten via Rdkit-discuss wrote: > # MCS is a benzene > my_mcs = Chem.MolFromSmiles(res.smarts) The res.smarts (or res.smartsString if you use the rdFMCS module) returns a SMARTS string, not a SMILES string. You should be using Chem.MolFromSmarts() in the code I quoted. More specifically, the MCS SMARTS pattern is: [#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1 This can be interpreted as a SMILES, because RDKit supports "#6" as part of its extensions to the SMILES grammar. The "#6" means an element with atomic number 6. >>> Chem.CanonSmiles("[#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1") '[c]1[c][c][c][c][c]1' However, benzene has a different SMILES: >>> Chem.CanonSmiles("c1c1") 'c1c1' >>> Chem.MolToSmiles(Chem.MolFromSmiles("c1c1"), allBondsExplicit=True, allHsExplicit=True) '[cH]1:[cH]:[cH]:[cH]:[cH]:[cH]:1' Each of the benzene atoms has a hydrogen on it. The difference appears when you call: mol1.HasSubstructMatch(my_mcs) There is a difference if my_mcs is a molecule from a SMILES vs. from a SMARTS. They have different definitions of what it means to match. One of the differences is that a SMILES-made molecule considers the hydrogen counts: >>> mol = Chem.MolFromSmiles("OCc1c1") >>> query = Chem.MolFromSmiles("OC");mol.GetSubstructMatches(query) ((0, 1),) >>> query = Chem.MolFromSmiles("[O]C");mol.GetSubstructMatches(query) () >>> query = Chem.MolFromSmiles("[OH]C");mol.GetSubstructMatches(query) ((0, 1),) while if I made the query from a SMARTS: >>> query = Chem.MolFromSmarts("[O]C");mol.GetSubstructMatches(query) ((0, 1),) Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] mmpdb crowdfunding project has started
Hi all, In August I sent a pre-announcement email about my mmpdb crowdfunding project. The project is now live, at http://mmpdb.dalkescientific.com/ . The basic idea is that I can commit to developing a few features for mmpdb. • Postgres support, as an alternative to the existing SQLite support; • A new ‘mmpdb proprulecat’ command to export the property rules in the database (transformation details and property statistics of the associated pairs) in CSV form, along with a fragment SMILES for the given environment. Anyone who joins the consortium (at least EUR 1 000 for academics, EUR 5 000 for industry) will get these features under the existing open source license. As more people join, the additional funding will be used to improve mmpdb. All consortium members will get the new improvements. These new features will only be implemented if a certain funding level is reached, for example, with EUR 16 000 in funding I'll add support for storing and using SMILES strings for the environment fragments. To incentivize people to join, I don't plan to release the code to the public until receiving at least EUR 23 000 in funding, in which case I'll release it by 1 October 2020. (Since it's open source, consortium members are also free to distribute the new code.) With EUR 50 000 in funding, I'll release the software to the public as soon as I can. The full list of goals and funding levels is at http://mmpdb.dalkescientific.com/goals.html . I'll be at the RDKit UGM in Hanburg in a couple of days, so if you are there and have questions about it, or about funding open source software in chemformatics (or about chemfp, or SMILES, or many other topics), come talk with me! Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Tanimoto and fingerprint representation
Hi Jan, The GetMorganFingerprint() returns count fingerprints, and the Tanimoto calculation does the full Jaccard similarity, including the counts. The GetMorganFingerprintAsBitVect() version only uses the keys (that is, it treats all non-zero values as being 1) when computing the Tanimoto. > On Sep 14, 2019, at 11:07, Jan Halborg Jensen wrote: > > When using GetMorganFingerprintAsBitVect I get the “expected” Tanimoto score > > mol1 = Chem.MolFromSmiles('CCC') > mol2 = Chem.MolFromSmiles('CNC') > > fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1,2,nBits=1024) > fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2,2,nBits=1024) >>> list(fp1.GetOnBits()) [33, 80, 294, 320] >>> list(fp2.GetOnBits()) [33, 128, 406, 539] You can see the intersection is 1 and the union is 7, giving 1/7 = 0.142... as the Tanimoto, which is what you demonstrated was the result. > However, when using GetMorganFingerprint I get a difference score. > > fp1 = AllChem.GetMorganFingerprint(mol1,2) > fp2 = AllChem.GetMorganFingerprint(mol2,2) >>> fp1.GetNonzeroElements() {2068133184: 1, 2245384272: 1, 2246728737: 2, 3542456614: 2} >>> fp2.GetNonzeroElements() {847961216: 1, 869080603: 1, 2246728737: 2, 3824063894: 2} Note that there is one shared key (2246728737) while the other 7 are unique. The binary Tanimoto - treating all counts as 1 - gives 1/7, matching the BitVect version. On the other hand, the common value 2246728737 is present 2 times in each fingerprint, and 3542456614 and 3824063894 are each present twice in one fingerprint, so the Jaccard, or count Tanimoto, is 2 / ((1+1+2+2)+(1+1+2+2)-2) = 2/10 = 0.2 matching the value you computed. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] aromatic bonds and graph edit distance
Hi Jameed, I don't think your approach will work, which means I likely didn't explain myself well enough. Let's say I start with: Cc1cc2c2c(=O)o1 - https://cactus.nci.nih.gov/chemical/structure/Cc1cc2c2c(=O)o1/image I want to break the aromatic bond between the aromatic 'c' (also double-bonded to the O) and the aromatic 'o'. This is possible because a Kekule interpretation of that molecule is: >>> from rdkit import Chem >>> mol = Chem.MolFromSmiles("Cc1cc2c2c(=O)o1") >>> Chem.Kekulize(mol) >>> Chem.MolToSmiles(mol) 'Cc1cc2c2c(=O)o1' >>> Chem.MolToSmiles(mol, kekuleSmiles=True) 'CC1=CC2=CC=CC=C2C(=O)O1' and I could break it by simply inserting a "." in the right spot, like this: CC1=CC2=CC=CC=C2C(=O).O1 >>> mol2 = Chem.MolFromSmiles("CC1=CC2=CC=CC=C2C(=O).O1") >>> Chem.MolToSmiles(mol2) 'CC(O)=Cc1c1C=O' The resulting structure still has an aromatic ring, but not all of the atoms which were flagged as aromatic are still aromatic. With the approach you outlined, I don't know how to restored the modified structure so I can generate the final SMILES I expect. The Kekulize() approach finds a valid Kekule interpretation, but there may be multiple interpretations, allowing both a single and double bond for a given bond position. I think I want to generate both possibilities. So far the only solution I've come up with - and it's a real brute-force one - is to do complete enumeration, along the lines of the code after my signature. % python get_kekule_bonds.py 'Cc1cc2c2c(=O)o1' bondfragatoms allowed 1 c:c 1,2 DOUBLE 2 c:c 2,3 SINGLE 3 c:c 3,4 SINGLE DOUBLE 4 c:c 4,5 SINGLE DOUBLE 5 c:c 5,6 SINGLE DOUBLE 6 c:c 6,7 SINGLE DOUBLE 7 c:c 7,8 SINGLE DOUBLE 8 c:c 8,9 SINGLE 10 c:o 9,11SINGLE 11 c:o 11,1SINGLE 12 c:c 8,3 SINGLE DOUBLE (To be usable for what I want, I also need to figure out where those bond terms are in the Kekeule SMILES, so I can either replace it with a "." or remove the ring closure, then re-perceive.) This can work for what I'm looking for, since I have <=14 heavies, but I'm hoping for something rather less crude. Andrew da...@dalkescientific.com # List possible Kekule bond assignments for aromatic bonds in a SMILES from rdkit import Chem import itertools import sys def get_kekule_bonds(mol): """mol -> possible bond assignment table Returns a table mapping bond index (for aromatic bonds) to possible Kekule bond assignments for that bond. """ # Figure out which bonds are aromatic aromatic_bond_indices = {} for bond in mol.GetBonds(): if bond.GetIsAromatic(): a1, a2 = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx() if a1 > a2: a1, a2 = a2, a1 aromatic_bond_indices[(a1, a2)] = bond.GetIdx() if not aromatic_bond_indices: return {} # Look for assignments which result in the same Kekeule SMILES # (with aromatic bonds indicated as ':') expected_kekule_smiles = Chem.MolToSmiles(mol, kekuleSmiles=True) # Initialize the table of bond assignments possible_bond_assignments = dict((idx, set()) for idx in aromatic_bond_indices.values()) # Figure out how to get from the canonical Kekule SMILES output order # back to the moleule order. s = mol.GetProp("_smilesAtomOutputOrder") assert s[0] == "[" and s[-2:] == ",]", s invert_order = dict(enumerate(map(int, s[1:-2].split("," # Figure out the places where I can replace a ":" with a "-" or "=" parts = expected_kekule_smiles.split(":") N = len(parts)-1 assert N == len(aromatic_bond_indices) terms = [None] * (2*N+1) terms[::2] = parts # Generate all possible bond assignments (generates 2**N assignments) for bond_assignment in itertools.product( *(("-=",)*N) ): # Replace the places where the ":" used to be (or rather, the previous assignment) # and interpret as a new SMILES terms[1::2] = bond_assignment new_kekule_smiles = "".join(terms) new_mol = Chem.MolFromSmiles(new_kekule_smiles) if new_mol is None: continue # Was there a match? canonical_kekule_smiles = Chem.MolToSmiles(new_mol, kekuleSmiles=True) if canonical_kekule_smiles != expected_kekule_smiles: continue # Yes! That means we can figure out what the actual bond assignments # were. Re-process the Kekule SMILES, but don't do aromaticity perception. raw_mol = Chem.MolFromSmiles(new_kekule_smiles, sanitize=False) for bond in raw_mol.GetBonds(): # Need to map back to the original bonds original_a1 = invert_order[bond.GetBeginAtomIdx()] original_a2 =
Re: [Rdkit-discuss] aromatic bonds and graph edit distance
On Aug 21, 2019, at 03:42, Francois Berenger wrote: > Unless rdkit has something, I think graph edit distance is the kind > of things for which you have to rely on a good graph library. Do you know of any (non-chemical) graph library which can handle edits involving the breaking of aromatic bonds in a chemically correct way? I do not. > Also, maybe the string edit distance between the two canonical smiles is a > good enough proxy. This attempt of mine now, to experiment with graph edit distance, came out of a conversation I had last week with someone using string edit distance. I expressed doubt on how "good" the "good enough" was, but was unable to give any concrete details. I earlier wrote: >> For chain bonds, and non-aromatic bonds, it's easy to delete the bond >> and add the correct number of hydrogens to either side. Similarly, for many chain edits, the string edit distance is a decent proxy, as you say. However, has the goodness ever been characterized? Along with a description of how to minimize the problems with string edit distance? Some of the obvious ones are: 1) Chirality and stereochemistry L-alanine and D-alanine have a graph edit distance to alanine with unspecified chirality are 4 and 5, respectively. N[C@H](C)C(=O)O N[C@@H](C)C(=O)O NC(C)C(=O)O This does not seem reasonable. A similar issue occurs with double bond sterochemistry, like F/C=C/F vs. FC=CF. 2) Isotopes Same issue: CN vs. [14CH3]N. 3) Overlapping element symbols c1c1C and c1c1Cl have an edit distance of 1 c1c1C and c1c1Br have an edit distance of 2 There is no chemical sense for those to have different distances. I can think of ways to mitigate some of the effects of #1-3. In particularly, a substitution matrix (or conversion to pharmacophore reduced graphs) can improve #3. 4) Sensitivity to canonicalization order Depending on the canonicalization method, the following two structures either have a string edit distance of 1 or 4, while the graph edit distance is 1. >>> Chem.CanonSmiles("PCCN") 'NCCP' >>> Chem.CanonSmiles("CCN") 'CCN' 5) difficulty in handling ring formation in a meaningful way >>> Chem.CanonSmiles("C1=CC=CC=C1") 'c1c1' >>> Chem.CanonSmiles("C=CC=CC=C") 'C=CC=CC=C' There are no shared string synbols, so the string edit distance is 9, yet the bond edit distance is only 1. It is this last issue that I am particularly concerned with, leading me to ask about how to handle aromatic bonds when computing the graph edit distance. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] aromatic bonds and graph edit distance
Hi all, Someone asked me recently about finding the graph edit distance of two small (<= 14 atom) fragments. I figured this was something that could be brute forced. Following SmallWorld's example at https://cisrg.shef.ac.uk/shef2016/talks/oral13.pdf , given a fragment, incrementally delete terminals (except the "*" connection point atom), and ring bonds. For chain bonds, and non-aromatic bonds, it's easy to delete the bond and add the correct number of hydrogens to either side. But, what should I do when I cut an aromatic bond? For something like the first "co" in "c1cocn1", I want the result to be C=CN=CO. That's because the "o" can only be "-O-" in Kekule form. For something like "c1cnncn1", breaking on the "nn", I think I would like to get both 'N=CC=NC=N' and 'NC=CN=CN' because the "nn" can be a single or a double bond, depending on the Kekule representation, as in: >>> Chem.CanonSmiles("C-1=N-N=C-C=N-1") 'c1cnncn1' >>> Chem.CanonSmiles("C-1=N.N=C-C=N-1") 'N=CC=NC=N' >>> Chem.CanonSmiles("C=1-N=N-C=C-N=1") 'c1cnncn1' >>> Chem.CanonSmiles("C=1-N-[HH].[HH]N-C=C-N=1") 'NC=CN=CN' Problem is, I don't know how to figure out if a given aromatic bond must be a "-" or "=", or can be both. (Well, I could brute-force enumerae all 2**n possible aromatic bond assignments, then canonicalize, and see if both assignments are possible for a given bond.) As a non-chemist, I also ask if I'm even on a chemically meaningful track. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] GetSubstructMatches() as smiles
On Aug 7, 2019, at 13:08, Paolo Tosco wrote: > You can use > > Chem.MolFragmentToSmiles(mol, match) > > where match is a tuple of atom indices returned by GetSubstructMatch(). Note however that if only the atom indices are given then Chem.MolFragmentToSmiles() will include all bonds which connect those atoms, even if the original SMARTS does not match those bonds. For example: >>> from rdkit import Chem >>> pat = Chem.MolFromSmarts("*~*~*~*") # match 4 linear atoms >>> mol = Chem.MolFromSmiles("C1CCC1") # ring of size 4 >>> atom_indices = mol.GetSubstructMatch(pat) >>> atom_indices (0, 1, 2, 3) >>> Chem.MolFragmentToSmiles(mol, atom_indices) # returns the ring 'C1CCC1' If this is important, then you need to pass the correct bond indices to MolFragmentToSmiles(). This can be done by using the bonds in the query graph to get the bond indices in the molecule graph. I believe the following is correct: def get_match_bond_indices(query, mol, match_atom_indices): bond_indices = [] for query_bond in query.GetBonds(): atom_index1 = match_atom_indices[query_bond.GetBeginAtomIdx()] atom_index2 = match_atom_indices[query_bond.GetEndAtomIdx()] bond_indices.append(mol.GetBondBetweenAtoms( atom_index1, atom_index2).GetIdx()) return bond_indices (Does a function like this already exist in RDKit?) I'll use it to get the bond indices for the *~*~*~* match: >>> bond_indices = get_match_bond_indices(pat, mol, atom_indices) >>> bond_indices [0, 1, 2] Passing the atom and bond indices gives the expected match SMILES: >>> Chem.MolFragmentToSmiles(mol, atom_indices, bond_indices) '' Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] planning a crowdsourcing project for mmpdb development
ent. I can understand, because I know I'm weirded out by the restaurants which allow you to pay what you think the food was worth! Instead, here are the funding levels: Academic/Individual - EUR 1 000 Research group - EUR 3 000 Organization - EUR 5 000 All crowdsourcing consortium members will receive the mmpdb source code under the existing 3-clause BSD license. Those wishing to pay an additional EUR 3 000 will receive 1 year of support. That EUR 3 000 will added to the sum used to determine if a funding goal is met. The funding status will be made public. Funders can be left anonymous or agree to have their names used. I generally use purchase orders, paid to my company's Swedish bank account. I can accept credit card payments through PayPal, though will charge 10% more because of the overhead. If enough people commit to the project, I will send out a formal quote to the backers, and start the purchase order process. Timeline As a rough timeline: Today - Pre-announcement on the RDKit list for planning purposes 1 September - Announcement on the RDKit list 28 September - Lightning talk at the RDKit UGM 1 October - Second announcement on the RDKit list 1 November - Call off the trial if there has not been enough interest 1 December - End of fundraising / see which funding goals are met Note: I define "interest" as people sending me an email saying they/their company wants to join, at a given funding level. It does not mean that I have a purchase order from everyone. It should be seen as a verbal agreement, not a written or financial commitment. Project work will likely start shortly after the interest level meets the basic level goal. If this trial proves very successful (over EUR 30 000 or so), then one likely future crowdsourcing project will be to develop a web interface for mmpdb. Questions & Answers === 1) Is it really "open source development" if it isn't available to everyone for free? Most well-known open source software is also available at no cost. However, open source, and free software, describe the rights and freedoms available to the person who receive the software. They do not require that the software be made available for no cost. Think if you modify a GPL project like Open Babel, and distribute the changes to a friend. There's nothing about the GPL, or the principles of open source, which require you to also distribute that code to anyone else. I am influenced by Stallman's essay at https://www.gnu.org/philosophy/selling.html . He states "if you are redistributing copies of free software, you might as well charge a substantial fee and make some money. Redistributing free software is a good and legitimate activity; if you do it, you might as well make a profit from it. ... Distributing free software is an opportunity to raise funds for development. Don't waste it!" 2) What happens to the code if the 'Delayed upstream level' goal of EUR 20 000 isn't reached? That's a tough one. At the very least, any of the backers who receive a copy of the code could contribute the code to the main mmpdb repository in the RDKit project. If there is little interest, I will likely release it after 9 months or a year, and also reduce my involvement in developing open source cheminformatics software, as I cannot continue to do things for free or low wages. 3) What about Oracle support, or feature X? Then let me know. Part of my goal now is to figure out which to include as the features to implement at the basic level. If there's more interest, I might have two feature levels. In any case, I am also available for consulting work. 3) Where should I send questions and suggestions? Right now, private email to me is the best. I'll set up a mailing list and project web page if I get preliminary feedback that it's worth my time to go further with this trial. Thanks for reading to the end! Andrew Dalke da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Open-source business models and the RDKit (Greg Landrum)
On Mar 27, 2019, at 13:26, Chris Swain via Rdkit-discuss wrote: > This is an interesting discussion and suspect this does not only apply to > open-source software developers, there are similar challenges for small > independent software companies. My points were focused on the disadvantages of a pure open source software development strategy with respect to proprietary software development. There are, of course, many other problems which are shared between small independent software vendors of both free and proprietary software. On the other hand, there are also many existing resources on developing commercial propriety software as a small ISV. > On CICAG (http://www.rsccicag.org) we have been discussing the possibility of > organising a (probably one day) meeting around the topic of open > data/software/publishing and sustainability. At this point I cannot recommend that any ISV consider an open source product as a viable strategy, not even for marketing purposes as name recognition for future consulting work. My current belief is that people hire a consultant to solve problems. They hear from talks or from other people that person X can help solve problems, and hire X. If person X writes tool Y to solve certain problems, and sells that as a proprietary product, then people will contact X in order to purchase it to solve their problem. If tool Y is distributed as open source, then they don't really have a problem, because they can solve their problem themselves by installing a package, and it works. That only takes a few minutes. They may not even know who X is. This is especially true if it's distributed through the usual channels like (Bio)conda, PyPI, and DebianScience. As I've pointed out before, I've seen innumerable posters where the poster author thanks a commercial vendor for a no-cost academic license for a product, but does not thank the authors of the free software packages also used. Which is perfectly acceptable under the respective licenses, but I think also an indicator that people feel more obligation for someone who actively helps solves their problems than for something like "pip install Y". Regards, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Open-source business models and the RDKit
On Mar 27, 2019, at 16:44, Bennion, Brian via Rdkit-discuss wrote: > One of the goals of ATOM is to fund work that will be open sourced. I think > any of the partners can choose to hire consultants for the work. > > https://atomscience.org/ > Atom > atomscience.org I think there are only three successful external funding strategies for open source in cheminformatics: 1) grant funding, 2) cost reduction, and 3) R budget for new, and typically pre-competitive, software. It appears that ATOM gets it funding from #1 and #3. And I have no problem with that. I've been funded as a consultant from grant funding. Most of RDKit was funded (if I understand it correctly) by #2 and #3, but the funding is smaller and less direct now that Greg has left his previous employer. My interest is in successful models for self-funded open source developer. Who will pay to maintain and develop "Our Digital Infrastructure" (to use the term from https://www.fordfoundation.org/about/library/reports-and-studies/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure/ )? I'll be more specific. How can we get EUR 250K/year to fund RDKit development? This would cover a full-time developer at commercial wages, plus overhead, and have money available to pay people for special work - new features, performance, documentation, and so on. It also makes project continuity easier. Not only is it easier for Greg to maintain and develop the project if there's a clear revenue source, but it also increases the pool of people who might be able to replace him, whenever he decides to step down. In principle the money is there, given the number of commercial users with extra money. The questions I ask again are "why would they contribute money?", "why haven't they contributed money?", and "how do we effectively encourage them to contribute money in the future?". By comparison, we know that it's possible to make a living selling self-funded proprietary software in this field. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Open-source business models and the RDKit
On Mar 27, 2019, at 08:24, Francois Berenger wrote: > As an open-source project, I feel rdkit is quite successful. > So, the user community is not so small. > Some people who cannot contribute time could contribute money to the project > (especially if it is tax-deductible, I guess). I think the questions are "why would they contribute money?" and "why haven't they contributed money?". If those questions cannot be answered well, then there's little reason to go further down this path to the next question, which is "how do we effectively encourage them to contribute money in the future?". To be clear, Novartis contributed a lot of money for the RDKit development. Roche also funded me to develop and contribute the MCS package now part of the RDKit core, and the mmpdb project which was contributed to RDKit. These are also financial contributions and must not be ignored, and these are not the only two organizations which have done that. But I honestly thought that there would be more interest in hiring my services as a consultant, to work on further development of open source software. I feel like there are clear economic benefits for companies to fund open source packages. Instead, it feels like the more open source software packages I write and release, the fewer leads I get for new consulting work, compared to when I gave "I wrote this in-house application for company X that no one else will ever use" talks. Perhaps what's easily available for no cost is seen as having no value, while that which is hidden, no matter how hacky, is treasured? My optimism started 20 years ago, when I was still involved with the Biopython project. My company offered commercial support for Biopython, and I had NDAs in place with several of the other Biopython developers so we could easily be funded to work on specific improvements that an organization might need. I never found someone interested in providing that sort of funding for Biopython, and it still looks like that's the case in cheminformatics. See also 'Roads and Bridges: The Unseen Labor Behind Our Digital Infrastructure' (ref. 53 in my paper) for further examples of the difficulties in funding open source work. https://www.fordfoundation.org/about/library/reports-and-studies/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure/ > On Mar 27, 2019, at 10:06, Greg Landrum wrote: > If rdkit was accepted at the software freedom conservancy, I understand > the management fee would be 10%: There's also Software in the Public Interest, which "serves the free software and open source community by facilitating the administrative and financial needs of its associated projects", including the Open Bioinformatics (ex-)Foundation. When the OBF was created, it was common for many groups to start their own foundations. Since most of the administrative needs are the same for the different projects, it makes sense to consolidate. > A question since I genuinely don't know: is it important to anyone that this > go through a not-for-profit entity? The OBF became a not-for-profit to make it easier to organize the BOSC (Bioinformatics Open Source Conference) meetings. Some of the early BOSC meetings were run out of someone's personal bank account, and he was personally financially liable in case of problems. Working through a non-profit makes it easier to set up things like summer internships (a la Google Summer of Code) and travel support, because the payment is less likely to be viewed as a way to get around employment laws. Open Bioinformatics has a Travel Fellowship program. I don't know the details. Looking at the report for 2018 at http://spi-inc.org/corporate/annual-reports/2018.pdf , Open Bioinformatics spends about $5,000/year for IT and meet ups, an "ordinary income" of $5,400, and an equity of $85K. There's overhead to running a non-profit, like filing paperwork, and that requires specialized knowledge. For revenues that small, it really helps to be affiliated with an existing umbrella organization. The OBF gave up their incorporation in 2012 to be an SPI-associated project. For what RDKit does now, I see no need to set up/join a foundation. T5 Informatics can organize an RDKit UGM the same way that any vendor can organize a UGM, and company acts as the firewall to your personal finances. T5 (or Dalke Scientific :) can also act as an intermediary if, for some reason, a company does not wish to fund someone directly. Though you'll have your own overhead in that case, because of the additional tax requirements in dealing with subcontractors. No matter what, that's going to be easier than arranging things through a university, even Paul's. It's only if you start getting multiple organizations interested in contributing funding, or really want the transparency that T5 and RDKit funding are not intermingled, where I would suggest looking at that. Another reason is if you want RDKit-the-organization
Re: [Rdkit-discuss] chemfp preprint
On Mar 25, 2019, at 04:05, Francois Berenger wrote: > Sometimes, I wish there was a rdkit consortium/NPO (so that donations are tax > deductible), so that rdkit could be massively funded by all its commercial > users, and even accepting individual donations. Setting up such an organization is not difficult. It does take time, money, and effort, which add overhead to the funding process. It also requires people who are willing to do that sort of work. I was on the board of the Open Bioinformatics Foundation, and involved with the Python Software Foundation. I know that I am *not* that sort of person. In any case, as Geoff Hutchinson points out, there are umbrella organizations like the Open Source Collective which can handle most of that overhead. My question is, why would users - commercial or otherwise - be willing to fund such work in the first place? As far as I can tell, nearly everyone uses free and open source software because they are available for no cost. Users are rarely willing to pay for software freedom, or for economic benefits like avoiding vendor lock-in. And it seems like commercial users are often willing to use an internal fork of a project than to work with upstream to develop new features. This might be because it's easier to work with existing staff or existing contracting arrangements than figuring out how to get upstream to do the work and setting up a new contractual relationship, and take the risk that it isn't done to schedule. My conjecture is that there are several issues at play. 1) Most end users don't realize there is a funding problem for many FOSS projects. Package managers like pip/PyPI, Conda, Homebrew and apt make it *really easy* to install a large number of packages without knowing anything about the funding or staffing status of each underlying project. Consider that one of the early business models for PyMol was the idea that people would be willing to pay for pre-compiled packages from the main developer, even though the source code was available for free as open source. That business model somewhat worked then. It would not work now. 2) The proponents of "open source" in the late 1990s emphasized the volunteer nature of open source, going so much as to argue that there was a "gift culture" (using E. Raymond's term). The implication is that there was a sort of social contract, where donations of source code would be met with other sorts of payment, including job/consulting offers and non-trivial amounts of reciprocal code contributions. This has not turned out to be true, with rare exceptions. Instead, I think the association with volunteerism and gifts has caused people to avoid talking about fund raising. This should be particularly odd as many volunteer organizations outside of computing have funding drives. 3) FOSS developers who distribute at no cost are ignoring any capital value in the software. They can only make income on gifts (which are rare) or through labor (e.g., consulting). This places them at a funding disadvantage compared to proprietary software vendors who can amortize labor costs across multiple sales. To be clear, I am only talking about self-funded FOSS projects. My paper mentions a few other funding models, like research grants at universities, or in-house projects funded by the ability to reduce costs. In the latter case, the minor additional costs for releasing the project as FOSS can be justified by even small benefits. 4) The pricing of per-unit sales of FOSS software, either institutional sales like what I tried with chemfp, or end-user sales like PyMol, should factor in the likelihood that customers will redistribute the software further, and by doing so reduce the market size. This factor is hard to estimate, and higher in general for universities than pharmaceutical companies, which makes it harder to give a significant discount to universities like what proprietary vendors can do. 5) In my paper I bring up "free rider problem" as a way to think of the issues. To be clear, this is only a *problem* if people expect anything back from releasing and/or maintaining an open source software project. (Or don't expect people to insist on support, like I have received for the no-cost/open source version of chemfp.) Suppose I want to add a new feature to mmpdb, the matched molecular pair program which I helped develop and has been contributed to the RDKit project. I might go around to various users and ask for development funding as a consortium. 20 organizations might be interested, and each one willing to pay 50% of the development cost, which means in principle I could get 10x the cost of labor, which provides the extra profit that could go towards documentation, testing, and general support. However, it's also easy for each of those 20 organizations to think that someone else ("Let George do it", as the Stanford Encyclopedia of Philosophy article explains) is going
[Rdkit-discuss] chemfp preprint
Hi RDKit users, This week I submitted a paper about chemfp for publication. I also submitted a preprint on ChemRxiv, which was just accepted. For those interested, it's at https://chemrxiv.org/articles/The_Chemfp_Project/7877846 . It's a rather long paper as it covers many aspects about the chemfp project, including the FPS and FPB formats, search algorithms, details about the different ways to compute a popcount, and memory bandwidth and latency bottlenecks. On a non-technical level I also describe some of the difficulties I ran into trying to run chemfp as "commercial free software." Let me know of any corrections or improvements, or any other feedback you might have. Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] contrib code not compiling
On Nov 19, 2018, at 04:17, Rajarshi Guha wrote: > Hi, I check out the latest RDKit sources from master and I'm trying to > compile the PBF. However, the compilation fails reporting that > RDGeneral/export.h is missing: While this doesn't answer the question, it seems to be coupled to https://github.com/rdkit/rdkit/issues/1903 > (I haven't compiled RDKit as I already have it installed via conda) It appears that 'export.h' is created during the "cmake" step. That file is an 'auto-generated __declspec definition header'. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Butina clustering with additional output
On Sep 26, 2018, at 20:26, Peter S. Shenkin wrote: > Ah, David, but how do you define a "real" singleton? There can be many different definitions of what a '"real" singleton' might be, but we are specifically talking about Butina clustering. The Butina paper defines the term "false singleton", which Dave quoted. The relevant text from DOI: 10.1021/ci9803381 is: """The molecules that have not been flagged by the end of the clustering process, either as a cluster centroid or as a cluster member, become singletons. It is important to emphasize at this stage that one of the consequences of this approach is that some molecules defined as singletons may have neighbors at the given Tanimoto similarity index, but those neighbors have been excluded by a ‘stronger’ cluster centroid, i.e., one with more neighbors in its list. the problem with the creation of a number of false singletons that do in fact have similar compounds within the set is easily offset by the final quality of the clusters that this approach generates.""" As you can see, there are two types of singletons, and one is called "false singleton". No specific name is used for the other type of singleton, but it's easy to how they can be called "real" singletons, without confusion or misunderstanding. (FWIW, my implementation, mentioned in an earlier email, uses the term "true singleton" as the singleton which is not a "false singleton", but the difference is only in the label.) To confirm that this is what Dave means, I'll quote from his paper Blomberg, N., Cosgrove, D. A., Kenny, P. W., & Kolmodin, K. (2009). Design of compound libraries for fragment screening. Journal of Computer-Aided Molecular Design, 23(8), 513–525. doi:10.1007/s10822-009-9264-5 """The clustering program flush_clus is an implementation of the sphere-exclusion algorithm of Taylor [41], which has also been reported independently by Butina ... One consequence of the algorithm is the production of ‘false singleton clusters.’ The final clusters in the output are invariably singleton clusters, where the only member is the seed. Some of these will be true singletons, i.e. molecules lacking neighbors within the clustering threshold, but others (the false singletons) will be singletons by virtue of the fact that their neighbors were placed in other larger clusters in a previous iteration of the algorithm. The flush_clus program offers the opportunity of performing a final sweep through the clusters using a larger similarity threshold and placing the singleton molecules within the cluster for which it has the greatest similarity with the seed, so long as this is within the threshold.""" Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Butina clustering with additional output
On Sep 25, 2018, at 17:13, Peter S. Shenkin wrote: > FWIW, in work on conformational clustering, I used the “most representative” > molecule; that is, the real molecule closest to the mathematical centroid. > This would probably be the best way of displaying a single molecule that > typifies what is in the cluster. In some sense I'm rephrasing Chris Earnshaw's earlier question - how does one do that with Taylor-Butina clustering? And does it make sense? The algorithm starts by picking a centroid based on the fingerprints with the highest number of neighbors, so none of the other cluster members should have more neighbors within that cutoff. I am far from an expert on this topic, but with any alternative I can think of makes me think I should have started with something other than Taylor-Butina. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Butina clustering with additional output
On Sep 21, 2018, at 14:53, Philipp Thiel wrote: > you probably read about the Tanimoto being a proper metric in case of having > binary data > in Leach and Gillet 'Introduction to Chemoinformatics' chapter 5.3.1 in the > revised edition. What we call Tanimoto is more broadly known as the Jaccard. Various sites demonstrate that the Jaccard distance = 1-Jaccard = 1-Tanimoto is a metric, such as https://mathoverflow.net/questions/18084/is-the-jaccard-distance-a-distance and https://arxiv.org/abs/1612.02696 . Going back to James T. Metz's original question, one alternative might be to use chemfp and the Taylor-Butina clustering implementation available at: http://dalkescientific.com/writings/taylor_butina.py Following Dave Cosgrove's advice: > I expect James means what we used to call the cluster seed, i.e. the molecule > the cluster was based on, rather than the mathematical centroid. Calculating > distances from each cluster member to that would be quite straightforward as > a post-processing step although that would roughly double the time taken. it's possible to change the reporting code from: for centroid_idx, members in clusters: print(arena.ids[centroid_idx], "has", len(members), "other members", file=outfile) print("=>", " ".join(arena.ids[idx] for idx in members), file=outfile) so it does the post-processing: print(len(clusters), "clusters", file=outfile) for centroid_idx, members in clusters: print(arena.ids[centroid_idx], "has", len(members), "other members", file=outfile) subarena = arena.copy(indices=members) centroid_fp = arena.get_fingerprint(centroid_idx) result = subarena.threshold_tanimoto_search_fp(centroid_fp, threshold=0.0) result.reorder() # sort so the highest scores come first for id, score in result.get_ids_and_scores(): print("=>", id, "score:", score) Cheers, Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Equivalent atom neighbours
On Sep 7, 2018, at 22:22, Alexey Orlov wrote: > I'm trying to calculate the number of equivalent/nonequivalent neighbor > heteroatoms for each atom i of molecule m. > > For examples, the third carbon atom of molecule CC(OH)CC has two > nonequivalent neighbors: one carbon atom connected to OH and CH3 and another > carbon atom that is connected only to H. In the case of molecule CC(OH)C(C)C > the third carbon atom has two equivalent neighbor heteroatoms (methyl groups) > and another neighbor distinct from these two -- atom connected to OH and CH3. I think you are looking for Chem.CanonicalRankAtoms() described at http://rdkit.org/Python_Docs/rdkit.Chem.rdmolfiles-module.html#CanonicalRankAtoms . >>> mol = Chem.MolFromSmiles("CC(O)CC") >>> list(Chem.CanonicalRankAtoms(mol, breakTies=False)) [1, 4, 2, 3, 0] As you wrote, there are no equivalent atoms. >>> mol = Chem.MolFromSmiles("CC(O)C(C)C") >>> list(Chem.CanonicalRankAtoms(mol, breakTies=False)) [2, 5, 3, 4, 0, 0] this identifies the two identical methyl groups, both with rank 0. You might try something like: from collections import Counter def get_atoms_neighbor_rank_counts(mol): indices = [] ranks = list(Chem.CanonicalRankAtoms(mol, breakTies=False)) for atom in mol.GetAtoms(): equivalents = Counter() equivalents.update(ranks[a.GetIdx()] for a in atom.GetNeighbors()) counts = [count for rank, count in equivalents.most_common()] print("Neighbor symmetry counts for", atom.GetIdx(), "are", counts) For your first test case it gives: >>> mol = Chem.MolFromSmiles("CC(O)CC") >>> get_atoms_neighbor_rank_counts(mol) Neighbor symmetry counts for 0 are [1] Neighbor symmetry counts for 1 are [1, 1, 1] Neighbor symmetry counts for 2 are [1] Neighbor symmetry counts for 3 are [1, 1] Neighbor symmetry counts for 4 are [1] That is, all of the atoms have unique neighbors. For the second test case it gives: >>> mol = Chem.MolFromSmiles("CC(O)C(C)C") >>> get_atoms_neighbor_rank_counts(mol) Neighbor symmetry counts for 0 are [1] Neighbor symmetry counts for 1 are [1, 1, 1] Neighbor symmetry counts for 2 are [1] Neighbor symmetry counts for 3 are [2, 1] Neighbor symmetry counts for 4 are [1] Neighbor symmetry counts for 5 are [1] That is, the atom with index 3 (the 4th atom) has 3 neighbors, two of which are identical. Finally, >>> mol = Chem.MolFromSmiles("CC(C)(C)OC(C)(C)C") >>> get_atoms_neighbor_rank_counts(mol) Neighbor symmetry counts for 0 are [1] Neighbor symmetry counts for 1 are [3, 1] Neighbor symmetry counts for 2 are [1] Neighbor symmetry counts for 3 are [1] Neighbor symmetry counts for 4 are [2] Neighbor symmetry counts for 5 are [3, 1] Neighbor symmetry counts for 6 are [1] Neighbor symmetry counts for 7 are [1] Neighbor symmetry counts for 8 are [1] This says that the atoms with indices 1 and 5 each have 4 neighbor, 3 of which are identical, and that the atom with index 4 has 2 neighbors, both of which are identical. Andrew da...@dalkescientific.com ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] available Python/RDKit training slots for the 2018 UGM
Hi all, As you may know, I will be offering a free Python/RDKit training session before the UGM in a couple of weeks. This is a beginner level course for people with some programming experience. It will cover the basics of Python, RDKit, JupyterLab, Pandas, Scikit-Learn and more. The goal is to orient people, not provide a deep training. Enough people were on the waiting list that we've decided to open up the training course, and have it once on Monday the 17th (9am to 4pm) and again on Tuesday (9am to 4pm). We've been able to reschedule some people from the Tuesday slot to the Monday, so there is space on both days; roughly 6 slots on Monday and 4 on Tuesday. If you are interested in participating in this free workshop, let me know. Please specify which day you want to attend, or if you have no preference. You will need to bring your own laptop. I'll come back next week to let you know what software you need to have pre-installed. Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] no structure depiction in Jupyter notebook
On Aug 31, 2018, at 15:27, Axel Pahl wrote: > on Linux, using Anaconda, RDKit and Python 3.6, I always need to additionally > install cairocffi via pip: Thanks Axel. When I tried it out, I figured out the more likely problem - I was using jupyter from a non-conda virtualenv. My problem disappeared once I switched to a new terminal window (without the virtual environment) and started a new notebook server. Your suggestion was helpful because it made me realize that the "pip install" went to the wrong Python installation. Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] no structure depiction in Jupyter notebook
On Aug 31, 2018, at 11:58, Andrew Dalke wrote: > I am unable to see an inline structure depiction in the Jupyter notebook, nor > in the JupyterLab notebook, tested with both the Python 2 and Python 3 > kernels, and rdkit.__version__ '2018.03.1'. I've narrowed it down to the Cairo code. I ran Draw._moltoimg manually, and the d2d.GetDrawingText() call returns b"". If I do del Draw.rdMolDraw2D.MolDraw2DCairo then the image generation code will fall back to MolToImage, and give me an inline depiction in the notebook. Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] no structure depiction in Jupyter notebook
Hi all, I am unable to see an inline structure depiction in the Jupyter notebook, nor in the JupyterLab notebook, tested with both the Python 2 and Python 3 kernels, and rdkit.__version__ '2018.03.1'. I installed miniconda and RDKit on my Mac using: curl -O 'https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh' bash Miniconda3-latest-MacOSX-x86_64.sh conda install conda-build conda install -c conda-forge jupyterlab conda install -c rdkit rdkit My test code is: from rdkit import Chem from rdkit.Chem.Draw import IPythonConsole mol = Chem.MolFromSmiles("c1c1O") mol It gives a broken image link, the contents of which are: data:image/png;base64, There is no message in the console log reporting a problem. My assumption is that I am missing a library which is needed for the PNG generation step. I have put the output of "conda list" at the end of this email. Can someone suggest what to do next? I did not see a discussion of this problem on the recent mailing list nor in the issue tracker. As a backup, I saw that IPythonConsole supports SVG mode. I enabled it using: IPythonConsole.ipython_useSVG = True This gave me another broken image icon. I tracked it down to incorrect handling of the "svg" namespace. A fix was for this was made on 14 May 2018, and will be part of the fall release. (See https://github.com/rdkit/rdkit/pull/1866 ). For now I can monkey-patch the run-time using: === from rdkit.Chem import Draw original_moltoSVG = Draw._moltoSVG def new_moltoSVG(*args, **kwargs): s = original_moltoSVG(*args, **kwargs) return s.replace("xmlns:svg='http://www.w3.org/2000/svg'", "xmlns='http://www.w3.org/2000/svg'") Draw._moltoSVG = new_moltoSVG === However, I would like to get the regular PNG output to work. Cheers, Andrew da...@dalkescientific.com # packages in environment at /Users/dalke/miniconda3: # # NameVersion Build Channel appdirs 1.4.3 py_1conda-forge appnope 0.1.0py36_0conda-forge asn1crypto0.24.0 py36_0 attrs 18.1.0 py_1conda-forge automat 0.7.0 py_1conda-forge backcall 0.1.0 py_0conda-forge beautifulsoup44.6.3py36_0 blas 1.0 mkl bleach2.1.4 py_1conda-forge bzip2 1.0.6h1de35cc_5 ca-certificates 2018.8.24ha4d7672_0conda-forge cairo 1.14.12 hc4e6be7_4 certifi 2018.8.24py36_1 cffi 1.11.5 py36h342bebf_0 chardet 3.0.4py36h96c241c_1 conda 4.5.11 py36_0conda-forge conda-build 3.13.0 py36_0 conda-env 2.6.0h36134e3_0 constantly15.1.0 py_0conda-forge cryptography 2.2.2py36h1de35cc_0 decorator 4.3.0 py_0conda-forge entrypoints 0.2.3py36_2conda-forge filelock 3.0.6py36_0 fontconfig2.13.0 h5d5b041_1 freetype 2.9.1hb4e5f40_0 gettext 0.19.8.1 h15daf44_3 glib 2.56.2 hd9629dc_0 glob2 0.6 py36_0 html5lib 1.0.1 py_0conda-forge hyperlink 17.3.1 py_0conda-forge icu 58.2 h4b95b61_1 idna 2.6 py36h8628d0a_1 incremental 17.5.0 py_0conda-forge intel-openmp 2018.0.3 0 ipykernel 4.9.0py36_0conda-forge ipython 6.5.0py36_0conda-forge ipython_genutils 0.2.0 py_1conda-forge jedi 0.12.1 py36_0conda-forge jinja22.10 py36_0 jpeg 9b he5867d9_2 jsonschema2.6.0py36_2conda-forge jupyter_client5.2.3 py_1conda-forge jupyter_core 4.4.0 py_0conda-forge jupyterlab0.34.6 py36_0conda-forge jupyterlab_launcher 0.13.1
Re: [Rdkit-discuss] descriptors beyond rotatable bond count and possible correlations with entropy
On Aug 31, 2018, at 07:41, Paolo Tosco wrote: > this gist should do what you need: Unless I misinterpreted what Jim is looking for, I don't think that returns the contiguous rotatable bonds in a small molecule. In the following there are only two rotatable bonds: >>> mol = Chem.MolFromSmiles("NCC1CCC1C(=O)O") >>> mol.GetSubstructMatches(RotatableBondSmarts) ((1, 2), (5, 6)) The code from the gist identifies three bonds: >>> find_bond_groups(mol) ((, , ),) >>> [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in _[0]] [(5, 2), (5, 6), (1, 2)] In this case, (2,5) is part of a ring, and not rotatable. I think the problem is that: nbrs = [nbr.GetIdx() for nbr in a.GetNeighbors() if ( (nbr.GetIdx() in rot_atom_set_tmp) and (not (nbr.GetIdx() in connected_atom_set)))] finds that both atoms of the bond are in connected bond groups, while the bond itself is not part of the match. I have put an alternative implementation of this function at the bottom of this email. For my test case it returns: >>> find_bond_groups2(mol) ((,), (,)) >>> [[(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in x] for x in _] [[(5, 6)], [(1, 2)]] Cheers, Andrew da...@dalkescientific.com from collections import defaultdict def find_bond_groups_dalke(mol): rot_atom_pairs = mol.GetSubstructMatches(RotatableBondSmarts) bond_groups = defaultdict(set) for (left, right) in rot_atom_pairs: # Ensure they are in order (I'm not sure if this is required) if right < left: left, right = right, left pair = (left, right) # Add the pair to the group associated with the left atom bond_groups[left].add(pair) # Merge the left atom group with the right atom group bond_groups[left].update(bond_groups[right]) bond_groups[right] = bond_groups[left] # Get just the unique bond groups, in order from largest to smallest unique_bond_groups = set(frozenset(v) for v in bond_groups.values()) sorted_bond_groups = sorted(unique_bond_groups, reverse=True, key=len) # Return as bond objects, to match Paolo's code return tuple( tuple(mol.GetBondBetweenAtoms(left, right) for (left, right) in bond_group) for bond_group in sorted_bond_groups) -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] want advice for good teaching data set
Thanks for the responses. I'll merge them into one reply: On Aug 29, 2018, at 16:56, Eloy Félix wrote: > If you want to build model I guess that what you want is to get experimental > logp values. > > This should give you something to start with: > > select ACTIVITY_ID, MOLREGNO, STANDARD_VALUE, STANDARD_TYPE from ACTIVITIES > where STANDARD_TYPE = 'LogP' and STANDARD_VALUE is not null and > data_validity_comment is null and POTENTIAL_DUPLICATE = 0; Yes, that's what I was looking for, including the pointers for validity and if it might be a duplicate. Thanks! On Aug 29, 2018, at 15:51, TJ O'Donnell wrote: > ChEMBL 24 has compound properties in the table compound_properties. I think > the alogp > is computed using (Crippen) atom types and the acd_logp is uses ACD labs > methods. I can see I wasn't clear. I was looking for experimental data. The ChEMBL blog post at https://chembl.blogspot.com/2018/05/chembl-24-released.html says that they switched to using RDKit for alogp; acd_logp is still from ACD. On Aug 29, 2018, at 18:07, JW Feng via Rdkit-discuss wrote: > What about building QSAR models to predict activity for a particular ChEMBL > assay? This would allow you to discuss strength and limitations of QSAR > models. I am, primarily, a software developer working in computational chemistry. Do you want fast similarity search? I can do that. Do you want a maximum common structure algorithm, or matched molecular pair algorithm? I can do that. Do you want to tell me which parameters and learning algorithm you want to use? I can make the pieces go together. What I don't have is the expertise to build a chemically relevant model on my own, and discuss its strength and weaknesses. When I build a model, I do it to predict molecular weight. :) Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] want advice for good teaching data set
Hi all, I am starting to put together materials for the Python/RDKit training course I'm giving just before the RDKit UGM next month. I would like to structure part of it around the SQLite release of the ChEMBL data set. More specifically, I plan to include examples of machine learning with scikit-learn, using RDKit descriptors and values from ChEMBL 24 (and making sure to use the new schema). Two problems. First, I'm not a computational chemist and I don't know what would constitute a good example to use. "Good" in this case means one whose outlines are well-known to likely students. Second, I don't have much experience with the ChEMBL data. My thought is to make a logP model. The easiest would be to based it on atom types. For this option, can anyone suggest where I can find logP data from ChEMBL? Another possibility is to use a pre-existing model, like the notebook George Papadatos did for Ligand-based Target Prediction at http://nbviewer.jupyter.org/gist/madgpap/10457778 . Perhaps someone here could point me to other existing resources along similar lines? Best regards, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] cartridge license?
On Aug 23, 2018, at 07:18, Roman Bolzern wrote: > Dear RDKittens, I would prefer to not be called a 'kitten'. > https://www.rdkit.org/docs/Cartridge.html#license, and at the bottom it says > “This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 > License”, ... > Is “this work” only the code in this docs page? Or what does this mean for a > business use case? What exactly do we have to share, what not? Before that sentence is the line "This document is copyright ..." At the top of the page it says: "This document is a tutorial and reference guide for the RDKit PostgreSQL cartridge." I regard that as saying that "this work" is a synonym for 'this document' and meant to refer to the tutorial and reference guide document. To verify that, I looked at the source code, available online at https://github.com/rdkit/rdkit/tree/master/Code/PgSQL/rdkit . The files all appear to have the standard BSD license, and they do not mention the CC license. The FAQ at https://creativecommons.org/faq/ suggests "While we recommend against using a CC license on software itself, CC licenses may be used for software documentation, as well as for separate artistic elements such as game art or music." This supports the hypothesis that the Attribution-ShareAlike 4.0 License is only meant to apply to the documentation. Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] elimination of small fragments
On Jun 29, 2018, at 02:43, 藤秀義 wrote: > Although not strictly based on the number of atoms, but on the length of > SMILES string, the simplest way is using Python built-in functions as follows: > > smiles = 'CCC.CC' > fragment = max(smiles.split('.'), key=len) > print (fragment) The mmpdb package I helped develop includes some functions which work directly on the SMILES string. One of them uses a regular expression to count the number of heavy atoms, assuming the SMILES is valid and is written in 'normal' form, where the '.' is only used to distinguish between disconnected atoms (that is, things like "C1.C1" are not supported). >>> smiles = "CCC.BrCl.[238U]" >>> fragment = max(smiles.split('.'), key=len) >>> fragment '[238U]' >>> from mmpdblib.fragment_algorithm import get_num_heavies_from_smiles >>> fragment = max(smiles.split('.'), key=get_num_heavies_from_smiles) >>> fragment 'CCC' I keep meaning to put together a package with the various SMILES tricks that are possible without full chemistry perception. On Jun 29, 2018, at 10:56, Ed Griffen wrote: > Using the string length to find the number of atoms in a molecule is OK - but > you need to take account of the additional characters in SMILES that are not > just atoms, ... > Here’s a worked example: > > >>> SMILES = 'C[S@@+]([O-])c1ccc(cc1)[Si](C)(C)C' > >>> print(len(SMILES)) > 34 > >>> heavies = [char for char in SMILES if char not in > >>> '''()[]1234567890#:;,.?%-=+\/Hherlabdgfikmputvy@'''] > >>> print(len(heavies)) > 13 That's neat! But it doesn't always give the correct count. >>> def count(smiles): ... return sum(1 for c in smiles if c not in '''()[]1234567890#:;,.?%-=+\/Hherlabdgfikmputvy@''') ... >>> count("[Hg]") 0 >>> count("[Zn]") 2 >>> count("[Tc]") 2 >>> count("[As]") 2 as well as for aromatic boron, as in: >>> count("Cc1b[n+](C[n+]2cc(C)cc(C)c2)cc(C)c1") 16 >>> count("Cc1B[n+](C[n+]2cc(C)cc(C)c2)cc(C)c1") 17 and aromatic tellurium. These came up in a cross-comparison I did using ChEMBL as a test set. I excluded records with [2H] and and [3H] because RDKit considers those to be heavy atoms while Ed's method does not. The error rate is impressively low for such a simple approach, with only 467 mismatches out of 1,726,695 cases. Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] elimination of small fragments
On Jun 28, 2018, at 22:08, Paolo Tosco wrote: > if you wish to keep only the largest disconnected fragment you may try the > following: > > mols = list(rdmolops.GetMolFrags(mol, asMols = True)) > if (mols): > mols.sort(reverse = True, key = lambda m: m.GetNumAtoms()) > mol = mols[0] A somewhat simpler .. or at least shorter ... version is: mols = rdmolops.GetMolFrags(mol, asMols = True) mol = max(mols, default=mol, key=lambda m: m.GetNumAtoms()) The max() function goes through the molecules that GetMolFrag returns. If the list is empty, it returns the 'default' value, which is the original molecule. (This is what Paolo's code does. Another option is to use None as the default value.) Otherwise, since 'key' is specified, its value is used as a function to determine a value for each molecule. That is, for each term 'm' in the list of 'mols', it computes m.GetNumAtoms(), and uses that return value to select an object with the maximum value. In this case, it selects a molfrag output molecule with the most atoms. I think I've just added a topic to cover for the upcoming Python/RDKit training session in September! :) For those interested, remember to sign up soon. Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] (Morgan) fingerprints for specific atom?
Hi Андрей, The GetMorganFingerprint function takes additional parameters. From http://rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetMorganFingerprint GetMorganFingerprint( (Mol)mol, (int)radius [, (AtomPairsParameters)invariants=[] [, (AtomPairsParameters)fromAtoms=[] [, (bool)useChirality=False [, (bool)useBondTypes=True [, (bool)useFeatures=False [, (bool)useCounts=True [, (AtomPairsParameters)bitInfo=None]]]) -> UIntSparseIntVect : Returns a Morgan fingerprint for a molecule Specify the radius with the 'radius' parameter. The 'fromAtoms' takes a list of atom indices. If you want to get the circular fingerprint around atom 8 then pass in [8]. GetMorganFingerprintAsBitVect(...) is a similar function which returns a bitstring fingerprint instead of a count fingerprint. Cheers, Andrew da...@dalkescientific.com > On Jun 18, 2018, at 21:33, Андрей Парамонов wrote: > > Hello! > > It is possible to get Morgan (circular) fingerprints for the molecule: > GetMorganFingerprint(mol) > but what is the best way to get fingerprints for a particular molecule > atom and radius? I'm using Python RDKit bindings. > > Best wishes, > Andrey Paramonov -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] converting from ToBitString() to SMILES
> On Jun 17, 2018, at 21:04, Raghuram Srinivas > wrote: > Is there a way to convert a bit string of 2048 bits back to the SMILES / > BitVector representation of the molecule? Any help /pointers in this > direction will be much appreciated . That topic came up on this list in April of this year in a thread titled "Any known papers on reverse engineering fingerprints into structures?" I posted a rough draft of reversing a 2048-bit path fingerprint, in https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg07861.html . Others gave additional references. Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] boron atom/element support in RDkit
On Jun 12, 2018, at 18:00, Bennion, Brian via Rdkit-discuss wrote: > Does RDkit support boron in SMILES strings? We have a number of compounds > for which rdkit parsing is not successful. The commonality is that there is > a B or b listed in the string. RDKit supports boron, including aromatic boron. >>> from rdkit import Chem >>> mol = Chem.MolFromSmiles("[B]") >>> Chem.MolToSmiles(mol) '[B]' mol = Chem.MolFromSmiles("B") >>> [a.GetAtomicNum() for a in mol.GetAtoms()] [5] >>> mol = Chem.MolFromSmiles("[b]1c1") >>> Chem.MolToSmiles(mol) 'b1c1' >>> mol = Chem.MolFromSmiles("b1c1") >>> Chem.MolToSmiles(mol) 'b1c1' >>> [a.GetAtomicNum() for a in mol.GetAtoms()] [5, 6, 6, 6, 6, 6] What is the error message? Can you post a reproducible? Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Calculating MorganFingerprint Counts for large number of Molecules
On May 18, 2018, at 17:48, Jennifer Hemmerichwrote: > I really liked the idea and I implemented it as follows: > df = pd.DataFrame(columns=counts.keys()) > for i,fp in enumerate(allfps): > logger.debug('appending %s', str(i)) > df.append(fp, ignore_index=True) While this saves a bit over creating an (N=100,000 x 2**32) data frame, it still creates a dataframe with (N x #unique keys). This is probably where your memory is going. Your next operation selects only those with a given number of fingerprints in it, and throws away the rest of the data. > df = df[df.columns[df.sum(axis=0) >= cut]] What you can do instead is to select only those keys with at least cut elements, before making the DataFrame. One way to do that is to use the collections.Counter class, which is like a collections.defaultdict(int) with the addition that it implements a "most_common()" method. When called by with no arguments it returns the list of (key, value) pairs, sorted from largest value to smallest. That means you can use a loop like: selected_keys = [] for idx, count in counts.most_common(): if count < cut: break selected_keys.append(idx) to select only the keys which are at or above the cut value. Once you have those keys, you can create the rows which are then passed to the DataFrame. The following seems to do what you are looking for. At the very least, it gets you more of the way there. from rdkit.Chem import AllChem from collections import Counter import pandas as pd def get_ecfp(mols, n=3, cut=10): allfps = [] counts = Counter() for i, mol in enumerate(mols): fp = AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements() allfps.append(fp) for idx, count in fp.items(): # use "+= count" for the most frequent feature" counts[idx] += count # use "+= 1" for the feature in the most fingerprints #counts[idx] += 1 selected_keys = [] for idx, count in counts.most_common(): if count < cut: break selected_keys.append(idx) rows = [] for fp in allfps: rows.append([fp.get(key, 0) for key in selected_keys]) df = pd.DataFrame( rows, columns=selected_keys) return df if __name__ == "__main__": mols = [] for smi in ("c1c1O", "CNO", "C#N"): mol = AllChem.MolFromSmiles(smi) mols.append(mol) print(get_ecfp(mols, cut=2)) Note that I construct a list of rows which I then pass in to the DataFrame at once, rather then append() one row at a time as your code does. This is because the Pandas documentation says, at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html : Iteratively appending rows to a DataFrame can be more computationally intensive than a single concatenate. A better solution is to append those rows to a list and then concatenate the list with the original DataFrame all at once. Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] RDKIT 2018.3 and MMPDB problem
And I have uploaded a source tar.gz and a binary wheel to PyPI. That means you can do "pip install mmpdb" to install this most recent version. Andrew da...@dalkescientific.com > On May 9, 2018, at 18:04, Kramer, Christian> wrote: > > Dear all, > > Andrew's fix for the wildcard atom representation in RDKit 2018.3 has just > been incorporated into the main mmpdb branch. If you download the latest > version of mmpdb, it should now work with with the current version and also > with previous RDKit versions. > > Note that you have to rebuild your mmp database if you switch from a pre 2018 > version of RDKit to a 2018+ version. > > Bests, > Christian > > Dr. Christian Kramer > -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] RDKIT 2018.3 and MMPDB problem
Dear Marco, > On May 7, 2018, at 23:59, Marco Stentawrote: > I had some time to set an environment for it and test it: it works fine, as > far as my tests go. I will switch to this version and to the latest RDKIT now. Thanks for the feedback. Someone else sent me a private email also saying it worked. I'll put together the final changes for a 2.1 release, and see about making it accessible from PyPI so "pip install mmpdb" will work. > some questions: > Is there any plan to: > include MCS as a fragmentation method? > extend to matched series? > include "fuzzy" environment definitions based on pharmacophores (as BI people > did)? I know of no plans for that. As a consultant, my answer is more along the lines of "are you willing to pay for it?" :) It won't be cheap. Speaking of which, many thanks to Roche because they funded this work, just like that previously funded me to develop the MCS code that is now in RDKit. > I am currently using the database file to explore the rules mainly via visual > inspection through spotfire by means of a series of joins to generate > suitable tables: would anybody be interested in this (also helping improving > it)? I hope you find others and are able to get this out there. One of the reasons for developing what eventually became mmpdb 2.0 is to make this sort of viewer possible. That is, the unreleased mmpdb 1.0 didn't have canonical attachment point assignment, resulting in up to 6 different ways to represent a 3-cut fragment. This made it technically challenging to the simple sorts of analyses that mmpdb 2.0 does now, and impossible to visualize meaningfully. Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] RDKIT 2018.3 and MMPDB problem
On Apr 27, 2018, at 00:20, Andrew Dalke <da...@dalkescientific.com> wrote: > Please try out: > http://dalkescientific.com/mmpdb-2.1b1.tar.gz > > or my fork at: > https://github.com/adalke/mmpdb > > and let me know of any problems. Has anyone downloaded and tested this code? I would like to push it to the main repository and start the process of doing a new release. Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] RDKIT 2018.3 and MMPDB problem
On Apr 27, 2018, at 00:20, Andrew Dalke <da...@dalkescientific.com> wrote: > It does not appear that the .fragment files also need to be redone, so > rebuilding the .mmpdb file is mostly a matter of re-running the index step. I no longer think that is correct. While indexing will work, the resulting MMP database can no longer be used for making predictions. When you upgrade to RDKit 2018, you'll need to rebuild your mmpdb data sets from scratch. The same will be true any time the RDKit canonicalization algorithm changes. Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] RDKIT 2018.3 and MMPDB problem
On Apr 26, 2018, at 12:38, Andrew Dalke <da...@dalkescientific.com> wrote: > The automated mmpdb test suite isn't that good, so I still need to do some > manual testing. I won't be able to get to this until (hopefully) this evening. I did that, and tracked down one more bug. Please try out: http://dalkescientific.com/mmpdb-2.1b1.tar.gz or my fork at: https://github.com/adalke/mmpdb and let me know of any problems. NOTE! Because of this change, anyone who switches from a pre-2018 RDKit to the most recent RDKit must rebuild the .mmpdb files. It does not appear that the .fragment files also need to be redone, so rebuilding the .mmpdb file is mostly a matter of re-running the index step. Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] RDKIT 2018.3 and MMPDB problem
On Apr 26, 2018, at 10:09, Marco Stentawrote: > > Dear Colleagues, > I just installed on conda env the new rdkit version > and wanted to try mmpdb but upon testing I got the error below > reverting back to rdkit=2017.09.3.0 it works fine (I still get some errors > but it goes thrugh) > > It is a bug, am I doing anything obviously wrong? You did nothing wrong. The latest version of RDKit changed the SMILES output for wildcard atoms from "*" to "[*]". I missed this change in the release notes. It's further documented at https://github.com/rdkit/rdkit/pull/1788 The mmpdb code does a lot of manipulation of the SMILES at the syntax level, and parts of the code expected that wildcard atoms would only be in the form "[*]". I've put together a first go at a fix, which should work for both older and the newest versions of RDKit. The automated mmpdb test suite isn't that good, so I still need to do some manual testing. I won't be able to get to this until (hopefully) this evening. Until then, if you could try out http://dalkescientific.com/mmpdb-2.1b1.tar.gz or my fork of the repository at https://github.com/adalke/mmpdb and let me know if there are failures, that would help me with the testing. Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] RDKIT 2018.3 and MMPDB problem
On Apr 26, 2018, at 10:09, Marco Stentawrote: > > Dear Colleagues, > I just installed on conda env the new rdkit version > and wanted to try mmpdb but upon testing I got the error below > reverting back to rdkit=2017.09.3.0 it works fine (I still get some errors > but it goes thrugh) > > It is a bug, am I doing anything obviously wrong? You did nothing wrong. The latest version of RDKit changed the SMILES output for wildcard atoms from "*" to "[*]". I missed this change in the release notes. It's further documented at https://github.com/rdkit/rdkit/pull/1788 The mmpdb code does a lot of manipulation of the SMILES at the syntax level, and parts of the code expected that wildcard atoms would only be in the form "[*]". I've put together a first go at a fix, which should work for both older and the newest versions of RDKit. The automated mmpdb test suite isn't that good, so I still need to do some manual testing. I won't be able to get to this until (hopefully) this evening. Until then, if you could try out http://dalkescientific.com/mmpdb-2.1b1.tar.gz or my fork of the repository at https://github.com/adalke/mmpdb and let me know if there are failures, that would help me with the testing. Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] 2018.03.1 RDKit release
On Apr 25, 2018, at 01:31, David Hallwrote: > You need to turn off RDK_INSTALL_INTREE Thanks! I've put that my build notes for the next time I compile RDKit. BTW, a quick benchmark of the new release shows that it's almost 15% faster at parsing SMILES strings than 2016.09.3 That is, parsing the SMILES strings from ChEBI went from 17.0 to 14.6 seconds. Nice work! Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] 2018.03.1 RDKit release
> On Apr 23, 2018, at 10:43, Greg Landrumwrote: > > I'm pleased to announce that the next version of the RDKit - 2018.03 - is > released. The release notes are below. ... > Please let me know if you find any problems with the release or have > suggestions for the next one, which is scheduled for Octobera 2018. I'm likely one of the few who builds RDKit manually. I also build on a Mac but I don't use the system/homebrew Python or Boost but rather have my own installations under a virtual environment. Which makes updates "fun". The build step produced errors like: "links to target 'Boost::serialization' but the target was not found." "links to target 'Boost::python3' but the target was not found." The solution was to upgrade my cmake from 3.7.2 to 3.11.1. Note however that the Install.md says: cmake. You need version 3.1 (or more recent) This should be updated. I don't know which cmake version is the minimum required. By the way, how do I install RDKit into a specified location? I usually expect something like --prefix /usr/local, and there's a CMAKE_INSTALL_PREFIX which defaults to "/usr/local" but "make install" puts it in the build directory, like /Users/dalke/ftps/rdkit-Release_2018_03_1 My solution is to install it myself, like: cd ~/venvs/py36-2018-4/lib/python3.6/site-packages/ cp -rp ~/ftps/rdkit-Release_2018_03_1/rdkit . python -m compileall rdkit cd ~/venvs/py36-2018-4/lib cp -p ~/ftps/rdkit-Release_2018_03_1/lib/lib* . but the shared libraries uses a relative path: File "/Users/dalke/venvs/py36-2018-4/lib/python3.6/site-packages/rdkit/__init__.py", line 2, in from .rdBase import rdkitVersion as __version__ ImportError: dlopen(/Users/dalke/venvs/py36-2018-4/lib/python3.6/site-packages/rdkit/rdBase.so, 2): Library not loaded: @rpath/libboost_python36.dylib Referenced from: /Users/dalke/venvs/py36-2018-4/lib/python3.6/site-packages/rdkit/rdBase.so Reason: image not found % otool -L /Users/dalke/venvs/py36-2018-4/lib/python3.6/site-packages/rdkit/rdBase.so /Users/dalke/venvs/py36-2018-4/lib/python3.6/site-packages/rdkit/rdBase.so: @rpath/libRDKitRDBoost.1.dylib (compatibility version 1.0.0, current version 2018.3.1) @rpath/libboost_python36.dylib (compatibility version 0.0.0, current version 0.0.0) @rpath/libboost_serialization.dylib (compatibility version 0.0.0, current version 0.0.0) @rpath/libRDKitRDGeneral.1.dylib (compatibility version 1.0.0, current version 2018.3.1) /usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 307.4.0) /usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current version 1238.0.0) which means I've always had to tweak my virtualenv 'activate' command to set/unset DYLD_LIBRARY_PATH. (It looks like there are ways to tweak the @rpath but I haven't figured it out.) I would prefer to have something like cmake . --prefix ~/venvs/py36-2018-4 make install and skip all of this by-hand tweaking. Does it exist already and I just missed it? Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?
On Apr 23, 2018, at 14:54, Brian Colewrote: > Unfortunately it doesn't work on circular/ECFP-like fingerprints. To be fair, you didn't mention that was a requirement. ;) > It has the requirement that the fingerprint be a substructure fingerprint as > you described. Could you elaborate on your goal? I used RDKitFingerprint because it was the easiest. It was something I could do in a day to demonstrate that it is possible to reverse engineer some fingerprints. I think it's possible to do something similar for circular fingerprints. It would mean generating all possible subgraphs of a given radius, which is doable for r=2 or r=3, and probably r=4. RDKit has a way to look at the circular environment around a a specific atom rather than the entire fingerprint, so that can be used to generate a seed point. Once that's found, it can be expanded to one of its neighbor atoms. Another problem is that the Morgan fingerprint algorithm really wants sanitized structures, which I didn't need to worry about for the hash fingerprints. Instead of a day of work, it's going to take a couple of weeks of work, which requires time and money. My advice though is that it's surely possible to determine some structure information from the circular fingerprint. If your use case says there should be no information leak (other than what's possible by full brute-force-enumeration) then don't exchange fingerprints. But leaking information is not really what I thought of by "reverse engineer". For example, if I want to check if any of the Morgan fingerprints with r=2 contain a phenol, I can ask RDKit to generate the fingerprint for r=2 using just the c(O)c as the fromAtoms. This gives: % echo '*c1ccc(O)cc1 phenol' | rdkit2fps --from-atoms 3,4,5,6 --morgan #FPS1 #num_bits=2048 #type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0 useBondTypes=1 fromAtoms=3,4,5,6 #software=RDKit/2016.09.3 chemfp/3.2.1 #date=2018-04-23T14:03:24 00028200140044000200 phenol I can then screen using that fingerprint to see which fingerprint match. Of the first 100,000 structures in ChEMBL, 2216 contain phenol, all of the are detected by this screen, and there are no false positives. Poof - structural information leakage. The code is at the bottom of this email. It depends on the commercial version of chemfp. > It seems the evolutionary/genetic algorithm approach is the current > state-of-the-art for decoding circular/ECFP-like fingerprints. Dave Cosgrove mentioned Dave Weininger's GA work, which means it was with Daylight hash fingerprints. I don't think we know that GAs have ever been used to reverse engineer circular fingerprints. > Historical question for you since you're the closest we have to a > chem-informatician historian. :-) Why did these circular/ECFP fingerprints > come into existence? I believe you are asking for https://pubs.acs.org/doi/abs/10.1021/ci100050t . Extended-connectivity fingerprints (ECFPs) are a novel class of topological fingerprints for molecular characterization. Historically, topological fingerprints were developed for substructure and similarity searching. ECFPs were developed specifically for structure−activity modeling. > my reading of the current literature is that tree/dendritic are statistically > just as good at virtual screening as circular/ECFP: Yeah, I don't go there. I leave concepts like "just as good" or "better" to people who have experimental data they can use for the comparison. Andrew da...@dalkescientific.com == Code to find which Morgan fingerprints contain a phenol substructure == import chemfp from chemfp import bitops, search arena = chemfp.load_fingerprints("chembl_23_morgan.fps", reorder=False) print("Fingerprint type:", arena.metadata.type) # Want to find structures containing phenol # Adjust the fingerprint type to limit it to the given atoms fptype = chemfp.get_fingerprint_type(arena.metadata.type + " fromAtoms=3,4,5,6") query_fp = fptype.parse_molecule_fingerprint("*c1ccc(O)cc1", "smi") print("Query fingerprint:") print(bitops.hex_encode(query_fp)) print() # Find the matching fingerprints result = search.contains_fp(query_fp, arena) circular_ids = set(result.get_ids()) # Search the first 100,000 structures from rdkit import Chem from chemfp import rdkit_toolkit as T pat = Chem.MolFromSmarts("*c1ccc(O)cc1") all_ids = set() exact_ids = set()
Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?
On Apr 22, 2018, at 20:22, Nils Weskampwrote: > Actually, I *was* also thinking about your use cases 2 and 3 since you > also need some form of hash function to map substructures to bit > numbers. This is normally a rather simple function / pseudo random > generator, Strictly speaking, this is not a requirement. The term "fingerprint" has taken on quite an encompassing meaning since 1990. The molecular formula is a count fingerprint with 118 keys, based on the atomic number. There's no need for hash function there. "CCO" might be: [0, 0, 0, 0, 0, 2, 0, 1, ...] Or it can be written in more compact form like {"C": 2, "O": 1}. As an alternative, I could use a mapping from canonical substructures to counts, so "CCO" becomes: {"C": 2, "O": 1, "CC": 1, "CO": 1, "CCO": 1} This doesn't require a hash. (While I represent that as a Python dictionary, which uses a hash table underneath, it could be implemented using a red-black tree or B-tree, or with a simple linear search.) It's only if I want to convert this into fixed length representation where I have to figure out some sort of encoding scheme. Even then, I don't need a PRNG or hash seed. Suppose I use a bit vector. I could have a table which maps all canonical substructures to its bit pattern. If I have an unknown fragment, I could use RANDOM.ORG to get the bits. Downsides include potentially unbounded table growth and the need for a centralized table. This is the approach that Zatocoding used, and I see Chemical Zatocoding as the only precursor to Daylight hash fingerprints. > which could of course also be changed to something expensive to calculate. Yes, that could be possible. Abstractly, let the first 20 bytes of each fingerprint be a salt, and use something like bcrypt so each fingerprint test requires that the query structure be re-fingerprinted for the per-fingerprint hash function. It would, however, take an absurdly long time to do a similarity search. And in any case, before going further along that path, we would need to figure out the risk model. Brian started by saying that he wanted to obfuscate molecules for security, but didn't say what he want to use them for, and if he want to secure them against nation-states, or simply against me. ;) Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?
On Apr 22, 2018, at 08:42, Nils Weskampwrote: > Nice work. If brute-force approaches like this (or methods based on > genetic algorithms etc.) are the only way to reverse a fingerprint, one > could probably come up with a fingerprint that allows for pretty secure > structure sharing by using many iterations of a strong cryptographic > hash function that is really slow to calculate. I usually think of "brute force" as something more like "enumerate all possible structures, generate the fingerprint for each one, and compare." I think of what I did here as a bit more elegant than that. ;) I think of fingerprints as being applied to three uses: 1) identity test if mol1 == mol2 then fp(mol1) == fp(mol2) => if fp(mol1) != fp(mol2) then mol1 != mol2 A good identity fingerprint has the properties that the false positive rate of fp(mol1) == fp(mol2) but mol1 != mol2 is low, and that working with fingerprints gives advantages over a graph isomorphism check. 2) Substructure screening if mol1 is a subgraph of mol2 then fp(mol1) is a subset of fp(mol2). (There are different definitions of 'subset' for different fingerprint types.) An effective substructure screen fingerprint has the properties that: fp(mol1) is a subset of fp(mol2) => a higher likelihood that mol1 is a subgraph of mol2 -and- fingerprint subset test is faster than subgraph isomorphism testing 3) Similarity comparison Use similarity of fp(mol1) and fp(mol2) as a proxy/estimate of the similarity of mol1 and mol2. Usually we also assume that computing the similarity of two fingerprints is fast. A cheminformatics fingerprint usually supports #1 and one or both of #2 or #3. If we only need #1, which is the use case Nils brought up, then we could use a SHA256 of the canonical SMILES string, or use the InChI Key. These are fixed-length binary fingerprints which can only be used for identity testing, and would give a low false positive rate. The structure leakage comes from needing support for #2 and/or #3. I don't see any reasonable way to make a fingerprint that can do #2 and/or #3 without being open to some sort of enumeration scheme that is more clever than brute force. Possibly some scheme related to homomorphic encryption might work? As I understand it, this would be unreasonable slow for what most people expect from fingerprints. Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?
On Apr 21, 2018, at 01:55, Andrew Dalke <da...@dalkescientific.com> wrote: > Hand-waving sketch: start with a carbon. Generate fingerprint. It should pass > the screening test. If not, the structure contains no carbons, so repeat with > other elements until you find an atom which passes. Successively either add > an atom+bond or connect two existing atoms with a bond, fingerprint the > result, and do the screening test. If it does not pass then that modification > was not permitted. Use a breadth-first search which prioritizes branching and > rings to avoid chains longer than the maximum enumeration size. Here's an implementation of that sketch, applied to the RDKit hash fingerprint: http://dalkescientific.com/rev_eng_fp.py It works well for small structures: % python rev_env_fp.py No SMILES given. Using caffeine. Current best guess is C=C with 2 bits of 759 Current best guess is Cc=O with 6 bits of 759 Found! Cn1c(=O)c2c(ncn2C)n(C)c1=O Here's aspirin: % python rev_env_fp.py 'O=C(C)Oc1c1C(=O)O' Found! CC(=O)Oc1c1C(=O)O Capsicum is close, only missing a methyl in the tail. % python rev_env_fp.py 'O=C(NCc1cc(OC)c(O)cc1)C=CC(C)C' Current best guess is CNC(=O)C=CC(C)C with 100 bits of 384 Current best guess is CC=CC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384 Best guess is CC=CC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384 For omeprazole it only finds half of the structure: % python rev_env_fp.py 'COc1ccc2nc([nH]c2c1)S(=O)Cc1ncc(C)c(OC)c1C' Current best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863 Best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863 For estradiol it gets stuck finding another cyclohexane instead of the cyclopentane: % python rev_env_fp.py 'C[C@]12CC[C@@H]3c4ccc(cc4CC[C@H]3[C@@H]1CC[C@@H]2O)O' Current best guess is CC12CCC(O)C21C with 163 bits of 583 Current best guess is CC12(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583 Best guess is CC12(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583 Note: it's currently set up to only consider the elements ["C", "c", "O", "o", "N", "n", "S", "s", "F", "Cl", "Br"] Edit the 'elements' list of you want to include more possibilities. This is more likely to run into a dead-end. The current code assumes that when I grow by one atom, if fp(mol + new atom) is a subset of the target fingerprint, then mol + new_atom is a subgraph of the target structure. This can be resolved by setting up a search tree, but then it needs to be more careful about backtracking and pruning, and that's too much work for an evening of programming. Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?
On Apr 20, 2018, at 19:03, jeff goddenwrote: > > Long ago molecular fingerprints were referred to in the literature as > molecular hash functions. (y'know, those crazy mathematical algorithms which > permitted rapid lookup of some string in a lookup table) Do you have a reference or any more detail for that? As far as I can tell, effectively everyone used fragment-based screens from the punched-card era of the 1940s up until the late 1980s. It wasn't until the early 1990s when the word "fingerprint" appears in the cheminformatics literature, in a paper by John Barnard referencing the Daylight fingerprints. I haven't come across any use of molecular hash functions for fingerprint-like descriptors before the Daylight work, and I looked pretty hard for one. Instead, I get the feeling that there was some attempt in the 1990s to distinguish between these two approaches, and then over time the term "fingerprint" took on a much broader meaning than "Daylight's binary fingerprint based on hash-encoding of enumerated linear subgraphs." That is, I think that "molecular hash functions" post-date "fingerprint". FWIW, a pubs.acs.org search for "molecular hash" found only three papers; one from 2005 and the other two from 2015. The closest exception to to an earlier Daylight-like fingerprint is the superimposed coding created by Mooers in the 1940s, mentioned in Ray and Kirsch (1957), put to use in Feldman and Hodes (1975), and further used at a couple of places since then. These *are* connected; Mooers in his "Codes and Coding" entry from "Encyclopedia of Library and Information Science" remarks: A rudimentary form of use of random patterns was discovered by computer people about 10 years later for fast look-up in tables. This simpler form, which effectively uses only a single descriptor, is known in the computer industry as "hash coding."" Weininger and Mooers both drew from the same information theory concepts to develop fingerprints and superimposed coding, respectively. But they aren't the same. I found hashing used for other molecular-related topics, like WLN and IR spectra lookups, but these didn't seem to be *molecular* hash functions. > As such, we expected for their to be the associated hash collisions > (https://en.wikipedia.org/wiki/Hash_table#Collision_resolution ). As Peter Shenkin pointed out, this isn't a given. In the MMP code I helped develop (mmpdb - https://github.com/rdkit/mmpdb ), one of the novel features is the ability to match not just the pairs but the local attachment environment, based on the circular environment of r=1 up to r=5 from the attachment point. I created a fingerprint for that based on the fingerprints for the individual circular environments, concatenated together, and then SHA256'ed to get a unique characteristic. Unlike the Daylight approach, this fingerprint can only be used to check for identity. The requirement that a fingerprint be used for similarity and substructure screens makes them much larger than needed for a simple identity check. And as Dave Cosgrove rightly pointed out, this extra information makes it possible to reverse engineer a (Daylight-style hash fingerprint) to find a molecular graph which is at least isospectral to the original structure. Hand-waving sketch: start with a carbon. Generate fingerprint. It should pass the screening test. If not, the structure contains no carbons, so repeat with other elements until you find an atom which passes. Successively either add an atom+bond or connect two existing atoms with a bond, fingerprint the result, and do the screening test. If it does not pass then that modification was not permitted. Use a breadth-first search which prioritizes branching and rings to avoid chains longer than the maximum enumeration size. You'll also need to allow aromatic atoms in a non-ring so you can do the growth correctly. ECFP-style circular fingerprints are not designed for substructure screens so cannot be reverse-engineered this easily. It would be interested to try the GA method that Dave Cosgrove suggested. I know of no papers concerning this topic, and I doubt that Dave Weininger ever published anything about it. He wasn't much into publishing in the scientific literature. Going back to the mmpdb environment fingerprint, it was also designed so that organizations can feel a bit more comfortable sharing MMP data with other organizations, since (like the InChI-Key) it's not possible to guess what an mmpdb environment fingerprint describes unless you 1) have it already, or 2) are willing to brute-force reasonable chemistry space to find it. Interestingly, this use of "fingerprint" is more closely aligned with Rabin's 1981 work on fingerprints - cryptographic hashes used for identify checking; see http://www.xmailserver.org/rabin.pdf - than with Daylight fingerprints. When I asked a few years ago, Dave Weininger did not recall how they
Re: [Rdkit-discuss] issue during parsing a smile
On Apr 16, 2018, at 16:29, Guillaume GODINwrote: > And for this one C[C@@]12CC[C@@](C)(CC1)O2O any idea > > Cause your tool failed too. It's true that smiview failed, in the sense that it shouldn't have tried to do further analysis with a molecule that RDKit rejected. However, RDKit does report the failure as "Explicit valence for atom # 8 O, 3, is greater than permitted": % smiview 'C[C@@]12CC[C@@](C)(CC1)O2O' [16:45:37] Explicit valence for atom # 8 O, 3, is greater than permitted Unable to generate a molecule: RDKit cannot parse the SMILES Traceback (most recent call last): File "/Users/dalke/venvs/py36-all/bin/smiview", line 11, in load_entry_point('smiview', 'console_scripts', 'smiview')() failure report that smiview should have prevented from happening ... A hypothetical future version of smiview will capture that RDKit error message so it can pinpoint the error location more visually. Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] issue during parsing a smile
If you try this out with my smiview package, available from https://bitbucket.org/dalke/smiview/downloads/ , it reports: % smiview 'C\(C(C)C)=N/O' Cannot parse --smiles: Unexpected term C\(C(C)C)=N/O ^ Tokenizing stopped here A bond must be followed by an atom, closure. That is, the bond symbol '\' may not be followed by a branch. (And I should improve that error message so it's more explicit.) I think you are looking for the SMILES "C(\C(C)C)=N/O". Andrew da...@dalkescientific.com > On Apr 16, 2018, at 16:15, Guillaume GODIN> wrote: > > Dear All, > > Anyone know why this smile is not parse correctly ? > > C\(C(C)C)=N/O > > Best regards, > > Guillaume > *** > DISCLAIMER > This email and any files transmitted with it, including replies and forwarded > copies (which may contain alterations) subsequently transmitted from > Firmenich, are confidential and solely for the use of the intended recipient. > The contents do not represent the opinion of Firmenich except to the extent > that it relates to their official business. > *** > -- > Check out the vibrant tech community on one of the world's most > engaging tech sites, Slashdot.org! > http://sdm.link/slashdot___ > Rdkit-discuss mailing list > Rdkit-discuss@lists.sourceforge.net > https://lists.sourceforge.net/lists/listinfo/rdkit-discuss -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] reassembling a molecule from R-groups
On Apr 16, 2018, at 05:37, Patrick Walterswrote: > > Thanks Andrew, the SMILES approach seemed to have quite a few edge cases so I > wrote something to work directly on a molecule. That's the approach I started with, until I figured out that it doesn't preserve chirality. If I change the end of your code to: == from mmpdblib import smiles_syntax def weld_dalke(core, r_groups): s1 = smiles_syntax.convert_labeled_wildcards_to_closures(core) s2 = smiles_syntax.convert_labeled_wildcards_to_closures(r_groups) return Chem.CanonSmiles(s1+"."+s2) if __name__ == "__main__": mol_to_weld = Chem.MolFromSmiles( "[*:1][C@](F)(Cl)O.N[*:1]") welded_mol = weld_r_groups(mol_to_weld) print("Expected :", Chem.CanonSmiles("N[C@](F)(Cl)O")) print("Direct:", Chem.MolToSmiles(welded_mol, isomericSmiles=True)) print("Via SMILES:", weld_dalke("[*:1][C@](F)(Cl)O", "N[*:1]")) == These should print identical SMILES strings, but instead give: Expected : N[C@](O)(F)Cl Direct: N[C@@](O)(F)Cl Via SMILES: N[C@](O)(F)Cl If chirality preservation isn't a concern, then there's no problem. BTW, your current code assumes there will only be one attachment point on an atom. For example, the input [*:1][C@]([*:2])(Cl)O.N[*:1].F[*:2] create the output N.O[C](F)Cl It's not hard to fix, and I think more of a d'oh! issue. In a quick benchmark I put together just now, I found that my SMILES syntax manipulation approach was about twice as fast to turn the two core/R-group SMILES strings into a molecule. Cheers, Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] reassembling a molecule from R-groups
Hi Pat, I wrote something like this for mmpdb, which is the MMPA code I helped develop, at https://github.com/rdkit/mmpdb . It has one restriction, which I'll get to in a moment. The general idea is to convert the attachment points to closures, join them with a ".", and canonicalize: >>> from mmpdblib import smiles_syntax >>> s1 = >>> smiles_syntax.convert_labeled_wildcards_to_closures("CN(C)CC(Br)c1cc([*:2])c([*:1])cn1") >>> s1 'CN(C)CC(Br)c1cc%92c%91cn1' >>> s2 = >>> smiles_syntax.convert_labeled_wildcards_to_closures("[H]C([*:1])([H])[H].[H][*:2]") >>> s2 '[H]C%91([H])[H].[H]%92' >>> from rdkit import Chem >>> Chem.CanonSmiles(s1+"."+s2) 'Cc1ccc(C(Br)CN(C)C)nc1' The smiles_syntax.py file does not use any of the rest of the code. The restriction is that the code as-is assumes the wild card atoms like [*:1] are either immediately before or after the attachment point. Otherwise it will give you (using the R-groups you actually posted): >>> s2 = >>> smiles_syntax.convert_labeled_wildcards_to_closures("[H]C([H])([H])[*:1].[H][*:2]") Traceback (most recent call last): File "", line 1, in File "/Users/dalke/cvses/mmpdb/mmpdblib/smiles_syntax.py", line 130, in convert_labeled_wildcards_to_closures return convert_wildcards_to_closures(new_smiles, offsets) File "/Users/dalke/cvses/mmpdb/mmpdblib/smiles_syntax.py", line 98, in convert_wildcards_to_closures new_smiles, new_smiles[wildcard_start-1:]) NotImplementedError: ('intermediate groups not supported', '[H]C([H])([H])[*].[H][*]', ')[*].[H][*]') All this means is I didn't write the code to count the number of intermediate branches/matched parentheses between the attachment point a and the wildcard atom. ("Count" because I would need to invert any chirality on the base atom if there were an odd number of intermediate groups.) Such code wouldn't be hard to add. It's not there because my experience is that RDKit only placed the "*" atoms in one of those two locations. However, as I just learned, if you leave the hydrogens in then the [H] atoms have priority: >>> Chem.SanitizeMol(mol,Chem.SANITIZE_ALL^Chem.SANITIZE_CLEANUP^Chem.SANITIZE_PROPERTIES) rdkit.Chem.rdmolops.SanitizeFlags.SANITIZE_NONE >>> Chem.MolToSmiles(mol) '[H]C([H])([H])[*:1]' Then again, explicit [H] atoms aren't important for your end goal, so you could just recanonicalize all of your R-groups first, to ensure they are in the RDKit form, then use the SMILES rewriter. For what it's worth, I coined the term "welding" to describe this technique of converting the labeled R-groups into ring-closures, then "." (dis)connected them to parse them as a single odd-looking SMILES. Andrew da...@dalkescientific.com > On Apr 15, 2018, at 21:16, Patrick Walterswrote: > > Hi All, > > I was about to write a function to reassemble a molecule from a core + > R-groups, but I thought I'd check and see if such a function already exists. > This is work with the output of rdRGroupDecomposition > > Gvien a core: > CN(C)CC(Br)c1cc([*:2])c([*:1])cn1 > > Plus a set of R-groups > [H]C([H])([H])[*:1] > [H][*:2] > > Reconnect the pieces to generate a molecule > CN(C)CC(Br)c1ccc(C)cn1 > > Thanks, > > Pat -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] [Rdkit-announce] [Announcement] 7th RDKit UGM in Cambridge UK
On Apr 7, 2018, at 07:13, Greg Landrum <greg.land...@gmail.com> wrote: > Andrew Dalke (Dalke Scientific) will offer a course on Python and the RDKit I need to finalize what I'm going to cover. I've been going between two approaches. 1) Python programming for cheminformatics This is meant for someone who has some programming experience but has not worked with Python before. It will cover the basics of the language (variables, for loops, defining functions), with examples based on the RDKit. The students will work in an IPython notebook, and the exercises will also touch on pandas and a few other important tools. 2) Introduction to the RDKit and its ecosystem This is meant for people who have some Python programming experience and want to learn the basics of RDKit and how it works with other tools like IPython, pandas, and Postgres. I don't have a good sense of which approach is more helpful to the sorts of new(ish) RDKit users who would go to an RDKit UGM. I lean towards the first because my experience (when I taught similar courses 5-10 years ago) was that many students had no experience with Python, and certainly it wasn't something they learned in any formal sense. But times may have changed. If you have feedback or commentary, please let me know by private email. Thanks! Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot ___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] smiview 1.2
About 10 days ago I posted a prototype program called 'smiview', which displays information about the structure of a SMILES string. Thanks to feedback from a couple of users, and a deep urge to explore the idea, I've just released smiview 1.2, available from https://bitbucket.org/dalke/smiview/downloads/smiview-1.2.tar.gz . For details about what it can do, see the README at https://bitbucket.org/dalke/smiview . Some of the changes are: - lots of bug fixes - the SMILES tokenizer will now try to parse the contents of an atom in []s - the atom indicators now point to the first character of the element symbol(s) rather than the first character of the atom token (e.g., the "C" in "[35Cl]" and not the "[") - the 'closures' track now highlights the atoms involved in a minimal cycle for a closure, rather than the SMILES string between the two closure points - more control over some of the styles - there is now code to generate the molecular graph, which means smiview can also report errors like C11 (closure to itself) and C1C1 (two bonds between atoms) - new tracks, like "hcounts" to show the number of implicit hydrogens on each atom, and "symclasses" to show each atom's symmetry class - support for both RDKit and OEChem, or no toolkit, albeit with reduced functionality - options to modify the input SMILES so all atoms have explicit hydrogen counts, and to set the isotope and atom class fields base on the atom index, symmetry class, or element number. - cleaned up and re-organized the internals. It now uses an experimental property calculation dependency system, and has a "track manager" to organize the tracks. Here's what it looks like with most of the tracks enabled (which is rather overwhelming): % smiview 'Cn1c(=O)c2c(ncn2C)n(C)c1=O' --fancy ┌ 1 1 1 1 atoms│ 01 2 3 4 5 678 9 0 1 2 3 └ || | | | | ||| | | | | | byte offsets┌ 1122 └ 050505 token types[ AA%A(BA)A%A(AAA%A)A(A)A%BA SMILES[ Cn1c(=O)c2c(ncn2C)n(C)c1=O hcounts[ 30 0 0 0 0 010 3 0 3 0 0 branches┌*(..) *(.) └ *(.) closures┌ *1** * * *1 └ *2*.***2 . fragments[ 00 symclasses┌ 01 7 3 9 1 651 1 1 2 8 4 └ 10 2 3 I'll focus on just the closures, and give more emphasis to the element symbols which make up either end of the closure (marked with a "*") while the other atoms in the closure ring are marked with an "x": % smiview 'Cn1c(=O)c2c(ncn2C)n(C)c1=O' -b closures --closure-atom-style end-elements ┌ 1 1 1 1 atoms│ 01 2 3 4 5 678 9 0 1 2 3 └ || | | | | ||| | | | | | SMILES[ Cn1c(=O)c2c(ncn2C)n(C)c1=O closures┌ *1xx x x *1 └ *2x.xx*2 . With a bit of counting of *'s and x's you can see there's a ring of size 6 and another of size 5. Here's an example of the input syntax processing; I'll convert all of the atoms to use the bracket form, by adding the correct hydrogen count to each non-bracket atom: % smiview 'Cn1c(=O)c2c(ncn2C)n(C)c1=O' --use-brackets -a input-smiles -b none --width 80 input smiles[ Cn 1 c (= O ) c 2 c ( n c n 2 C ) n ( C ) c 1= O SMILES[ [CH3][n]1[c](=[O])[c]2[c]([n][cH][n]2[CH3])[n]([CH3])[c]1=[O] If you want the modified SMILES string from another program, or to copy it, then turn off the legend and use a large enough width. Here I'll also set the isotope to the atom index+1, which might be used as a way to tag the atoms: % smiview 'Cn1c(=O)c2c(ncn2C)n(C)c1=O' --set-isotope index+1 -a none -b none --width 10 --legend off [1CH3][2n]1[3c](=[4O])[5c]2[6c]([7n][8cH][9n]2[10CH3])[11n]([12CH3])[13c]1=[14O] Let me know what you think. Andrew da...@dalkescientific.com -- Check out the vibrant tech community on one of the world's most engaging tech sites, Slashdot.org! http://sdm.link/slashdot___ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
[Rdkit-discuss] smiview 1.1 - a console tool to view SMILES strings
Over the last few days I've developed a command-line tool that I call "smiview". It's a SMILES viewer. It isn't a depiction tool where the input is in SMILES but rather a tool to highlight different aspects of the SMILES string. I'll put some examples at the end. If you want to try it out you can download it from https://bitbucket.org/dalke/smiview/downloads/smiview-1.1.tar.gz or see the README from the project page at https://bitbucket.org/dalke/smiview As the README says, this was mostly built for fun. I would like to know if you use it for serious work. I have no long terms plan for this project. I think it would be cool (perhaps for pedagogical reasons) to have a GUI version in an IPython notebook, tied to a graphical depiction so you can see the connection between the SMILES terms and the depiction. I don't have those skills, so if it's something you want to do, you might look to this code as a starting point. It was developed under Python 3 and it seems to work under Python 2.7. Enjoy! Andrew da...@dalkescientific.com The default view highlights the atom locations (and numbers them), shows the branches and the atom that the branches are attached to, shows the closures (highlighting the closure atom and the closure indicator for both side of the closure), and shows you which part of the SMILES string correspond to which fragments. % smiview 'C#CCC[N+](C)(C)[N+](C)(C)CCC#C.Cc1ccc(S(=O)(=O)[O-])cc1.Cc1ccc(S(=O)(=O)[O-])cc1' ┌ 112 2 222 2 22 223 3 3 3 3 atoms│ 0 12345 6 78901234567890 1 234 5 67 890 1 2 3 4 └ | | | || | ||| | || ||| | | | | SMILES[ C#CCC[N+](C)(C)[N+](C)(C)CCC#C.Cc1ccc(S(=O)(=O)[O branches┌ *---(.)(.)*---(.)(.) *(... └ *(..)(..) closures┌1*↑ 1 └ ┌ 00 fragments│11 └ ┌33 33 344 4 4 4 444 atoms│56 78 901 2 3 4 567 └|| || ||| | | | ||| SMILES[ -])cc1.Cc1ccc(S(=O)(=O)[O-])cc1 branches┌ ..) *(.) └ *(..)(..) closures┌ *↑1 └1*↑ 1 *↑1 ┌ fragments│ 11 └ (Thanks to Greg for feedback on the 'branches' visualization, and for suggesting the 'fragments' track.) If you give it a SMARTS pattern it will show where the corresponding atoms are: % smiview 'CC1CC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O)C(=O)CO' --smarts '*=O' ┌ 1 1 11 1 1 1 1 1 12 2 2 2 2 2 22 atoms│ 01 23 4 567 89 0 1 23 4 5 6 7 8 90 1 2 3 4 5 67 └ || || | ||| || | | || | | | | | || | | | | | || SMILES[ CC1CC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O)C(=O)CO match 1[ * * match 2[* * If you give it an atom index, it shows the neighbors around that index, along with a summary of how the atom is attached to those neighbors: % smiview 'NC(=O)c1nonc1CNC(c1c[nH]cn1)C1CCNCC1' --atom-index 10 ┌1 1 11 11 1 11122 atoms│ 01 2 3 4567 890 1 23 45 6 78901 └ || | | ||| | || || | | SMILES[ NC(=O)c1nonc1CNC(c1c[nH]cn1)C1CCNCC1 neighbors┌ ^X ^ ^ └C(-N9)(-c11)(-C16) You can modify how things look. In the following I'll have it show the offset to each character in the SMILES string on the top, and the atom indices on the bottom. % smiview 'NC(=O)c1nonc1CNC(c1c[nH]cn1)C1CCNCC1' -a offsets -b atoms -b closures byte offsets┌ 112233 └ 05050505 SMILES[ NC(=O)c1nonc1CNC(c1c[nH]cn1)C1CCNCC1 ┌ || | | ||| | || || | | atoms│ 01 2 3 4567 891 1 11 11 1 11122 └0 1 23 45 6 78901 Here's the output for buckminsterfullerene: % smiview 'c12c3c4c5c1c6c7c8c2c9c1c3c2c3c4c4c%10c5c5c6c6c7c7c%11c8c9c8c9c1c2c1c2c3c4c3c4c%10c5c5c6c6c7c7c%11c8c8c9c1c1c2c3c2c4c5c6c3c7c8c1c23' --legend once ┌ 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 atoms│ 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6 7 └ | | | | | | | | | | | | | | | | | | | | | | | | | | | | SMILES[ c12c3c4c5c1c6c7c8c2c9c1c3c2c3c4c4c%10c5c5c6c6c7c7c%11c8c9c8c branches[ ┌ 1*↑---*↑1 8*↑-- 8 --- 8 --- 8 ---*↑8 9* │ 2*-↑- 2 --*↑22*↑-- 2 --- 2 --- 2 │3*↑--- 3 *↑34*↑--- 4 --- 4 --- │