[Rdkit-discuss] mmpdb 3.1

2024-01-10 Thread Andrew Dalke
Hi everyone,

  We have released mmpdb 3.1, which you can get from 
https://github.com/rdkit/mmpdb .

mmpdb 3.0, released May 2023, merged three development tracks:

- create and query 1-cut med chem transformations as described in Awale et al., 
The Playbooks of Medicinal Chemistry Design Moves, J. Chem. Inf. Model. 2021, 
61, 2, 729–742

- support indexing large datasets on a distributed cluster

- replace the hash-based fingerprint environment with a SMARTS/pseudo-SMILES 
description

Version 3.1 adds support for the 2- and 3-cut med chem transformations 
described by Awale et al.

There are also a few feature improvements, some performance tuning, and bug 
fixes. See the CHANGELOG for details.


Andrew
da...@dalkescientific.com



___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] SDMolSupplier warning 2023.9.2

2023-12-12 Thread Andrew Dalke
Hi Mandar,

> On Dec 13, 2023, at 03:39, Mandar Kulkarni  
> wrote:
> I could not figure out how Rdkit is guessing it as 2D structure, as there is 
> no such information in SDF.

Line 2 of the SDF record looks something like:

 RDKit  2D

This line has the format (quoting from the documentation):

IIMMDDYYHHmmddSSssRR 
A2<--A8--><---A10-->A2I2<--F10.5-><---F12.5--><-I6-> )

where "User's first and last initials (l), program name (P), date/time 
(M/D/Y,H:m), dimensional codes (d) such as 2D or 3D, scaling factors (S, s), 
energy (E) if modeling program input, internal registry number (R)"

In the example I gave, most of these fields are blank, except for the program 
name ("RDKit") and the dimension code "2D" in columns 21-22 (if I counted 
right). The dimension code indicates the structure is expected to be in 2D.

> Is there any more information need to provide SDMolSupplier to make it 
> understand it's a 3D molecule?

It is only a warning. RDKit interprets the molecule as 3D despite the warning.

The file format documentation also says: "The “dimensional code” is maintained 
explicitly. Thus “3D” really means 3D, although “2D” will be interpreted as 3D 
if any non-zero Z-coordinates are found.

Best regards,


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] mol properties in SDWriter

2023-09-29 Thread Andrew Dalke
On Sep 26, 2023, at 01:17, Ling Chan  wrote:
> >(1) 
> 4.099
  ..
> Just wonder what was the rationale behind this extra "(1)" on the property 
> field lines (pKa and logP in the above example)?
> 
> And is there a way to get rid of these? I am not sure if this extra "(1)" is 
> part of the standard sd format.

RDKit uses the increasing value as a sort of per-file registry number.

This is follows the part of the standard which says "External registry numbers 
must be enclosed in parentheses."

The relevant code is in Code/GraphMol/FileParsers/SDWriter.cpp :

  if (d_molid >= 0) {
(*dp_ostream) << "(" << d_molid + 1 << ") ";
  }

There is no way to suppress this output. No only is there no direct way to 
change the d_molid, but d_molid cannot be negative as 
Code/GraphMol/FileParsers/MolWriters.h declares it as:

  unsigned int d_molid;  // the number of the molecules we wrote so far


Wim suggested a post-processing approach. Another is to write the SD data items 
yourself, that is, use MolToMolBlock() to generate the connection table/molfile 
as a string, then iterate through the properties and generate the data items.


import sys
from rdkit import Chem

def MolToSDFRecord(
mol,
includeStereo: bool = True,
confId: int = -1,
kekulize: bool = True,
forceV3000: bool = False):
mol_block = Chem.MolToMolBlock(mol, includeStereo, confId, kekulize, 
forceV3000)

lines = []
for prop_name in mol.GetPropNames():
if "\n" in prop_name or ">" in prop_name or "<" in prop_name:
sys.stderr.write(f"WARNING: Skipping property {prop_name!r} because 
the "
 "name includes an unsupported character.\n")
continue

prop_value = mol.GetProp(prop_name)
if "\n" in prop_value:
if "\n\n" in prop_value or "\r\n\r\n" in prop_value:
sys.stderr.write(f"WARNING: Skipping property {prop_name!r} 
because the "
 "value includes an embedded newline.\n")
continue
if prop_value.endswith("\r\n"):
prop_value = prop_value[:-2]
elif prop_value.endswith("\n"):
prop_value = prop_value[:-1]

lines.append(f"> <{prop_name}>\n{prop_value}\n\n")

lines.append("\n")

return mol_block + "".join(lines)

mol = Chem.MolFromSmiles("CCO")
mol.SetProp("pKa","3.3\r\n")
print(MolToSDFRecord(mol))


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] question on complexity of cannonization

2023-06-16 Thread Andrew Dalke
On Jun 16, 2023, at 03:15, S Joshua Swamidass  wrote:
> In graph theory, a planar graph is a graph that can be embedded in the plane, 
> i.e., it can be drawn on the plane in such a way that its edges intersect 
> only at their endpoints. In other words, it can be drawn in such a way that 
> no edges cross each other.

Years ago at 
http://www.dalkescientific.com/writings/diary/archive/2012/05/18/nonplanar_compounds.html
 I did a search of a subset of 28.5 million PubChem structures and found 224 
topologically non-planar examples, like 
https://pubchem.ncbi.nlm.nih.gov/compound/50919058 .

I also gave some literature citations, like "Synthesis of the first 
topologically non-planar molecule" (1981) at 
https://www.sciencedirect.com/science/article/abs/pii/0040403981800779 .

> On Jun 15, 2023, at 22:50, S Joshua Swamidass  wrote:
> 
> Have any other libraries adopted your approach? It's clever. 

It isn't my approach.

It depends on which part you consider clever. I've been told some of the ideas 
can be traced back to Bernhard Rohde's PhD thesis, citation 20 in the paper 
("who employed a stable numbering for equivalence classes instead of a 
sequential index"). Rohde is also thanked in the acknowledgements. And Rohde 
has/had his own in-house library at Novartis which included canonicalization.

I wouldn't be surprised if NextMove has their own implementation, given how 
Roger Sayle is a co-author of the paper.


> the covalent bonds of proteins of arbitrary size are a planar graph too, even 
> though most (all?) proteins have a 3D structure.

FWIW, due to disulfide bonds in cystines, proteins can be topologically 
non-planar. https://academic.oup.com/nar/article/47/D1/D367/5223942?login=false 
give PDB entry 1AOC as an example, with their database entry at 
https://knotprot.cent.uw.edu.pl/view/1aoc/A/ .

Best regards,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] question on complexity of cannonization

2023-06-15 Thread Andrew Dalke
On Jun 15, 2023, at 20:49, S Joshua Swamidass  wrote:
> 
> And what (generally speaking) is the algorithm used by rdkit? Do we know it's 
> complexity?

https://pubs.acs.org/doi/abs/10.1021/acs.jcim.5b00543

"Get Your Atoms in Order—An Open-Source Implementation of a Novel and Robust 
Molecular Canonicalization Algorithm"


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] question on complexity of cannonization

2023-06-15 Thread Andrew Dalke
On Jun 15, 2023, at 18:20, S Joshua Swamidass  wrote:
> It's well known that the graph-isomorphism problem is NP

While P is contained in NP, I don't think that's the NP you mean.

I suspect you may be thinking of subgraph isomorphism, which is NP-hard. Graph 
isomorphism may be quasi-polynomial time, if Babai's (unpublished) claim is 
correct.

Also, "Isomorphism of graphs of bounded valence can be tested in polynomial 
time" - https://www.sciencedirect.com/science/article/pii/002282900095 .


> So here is my question. What are the cases that are very difficult to 
> canonize a graph?

As I recall, handling chirality and other non-local properties is difficult. I 
have not worked on this problem.

Cheers,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] ANN: chemfp 4.1

2023-05-17 Thread Andrew Dalke
Hi everyone,

 I've just released chemfp 4.1. To install the pre-compiled package for 
Linux-based OSes do:

  python -m pip install chemfp -i https://chemp.com/packages/

For a detailed description of what's new, see:

  https://chemfp.readthedocs.io/en/latest/whats_new_in_41.html

As a summary, the new features in this release include:

- Supports RDKit 2023.03.1 and Python 3.8 through 3.11

- Interprets input SMILES as CXSMILES by default, with an option to turn that 
off

- Can save/load similarity search results to a NumPy file in a form compatible 
with SciPy compressed sparse matrices

- Implements Butina clustering, with several variations.

While building the similarity matrix may take an hour, the result can be saved 
to an npz file that the Butina implementation can use as input. This can be 
useful when tuning the Butina parameters because the NxN matrix can be 
constructed once, at the lowest reasonable threshold, while the Butina 
clustering can use a higher threshold. It takes only a few seconds to cluster 
ChEMBL at a threshold of 0.6.

- Sphere exclusion ("spherex") has been parallelized, with new options for 
specifying directed sphere exclusion ranking and a new output format compatible 
with the Butina output

- The new "chemfp csv2fps" tool for generating fingerprints from CSV files 
containing identifiers and molecules.

- The new "chemfp translate" tool for structure file format conversion.

These are available for no cost under the Chemfp Base License Agreement at 
https://chemfp.com/BaseLicense.txt .

For other licensing options, including no-cost license key for academic use, 
see https://chemfp.com/license/ .

Best regards,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Can a bond index be associated with order in explicit SMILES?

2023-05-17 Thread Andrew Dalke
On May 17, 2023, at 02:31, Vincent Scalfani  wrote:
> I thought that this might also be the case for bond indices, but that does 
> not appear to be correct (see example below). Is it possible to get a bond 
> index in the order of the SMILES? 

This may help you understand why that's a difficult question.

What does the bond index mean in something like

 C12.OC23.C3.C1

? Does the bond for closure 1 come first in the bond list, because that's where 
it start, or is it last, because that's where it ends? It looks like you think 
it should be the closure position.

Here's your SMILES labelled by atom index:

┌1 1 1  1  1 1  1   11 1
   atoms│ 0 1 2  3 4  5   6   7 8 9  0 1 2  3  4 5  6   78 9
└ | | |  | |  |   |   | | |  | | |  |  | |  |   || |
  SMILES[ C-C-c1:c:c:[nH]:c:1-C-C-C1-C-C-C(-c2:c:c:[nH]:c:2)-C-C-1

I used the program at the end of this email to print the information in bond 
list order:

In bondlist order
i Bnd# a1 ~ a2   frag
0   0   0 -  1   C-C
1   1   1 -  2   C-c
2   2   2 :  3   c:c
3   3   3 :  4   c:c
4   4   4 :  5  c:[nH]
5   5   5 :  6  [nH]:c
6   6   6 -  7   c-C
7   7   7 -  8   C-C
8   8   8 -  9   C-C
9   9   9 - 10   C-C
10  10  10 - 11   C-C
11  11  11 - 12   C-C
12  12  12 - 13   C-c
13  13  13 : 14   c:c
14  14  14 : 15   c:c
15  15  15 : 16  c:[nH]
16  16  16 : 17  [nH]:c
17  17  12 - 18   C-C
18  18  18 - 19   C-C
19  19   6 :  2   c:c
20  20  19 -  9   C-C
21  21  17 : 13   c:c


If you step through them you'll see that the closure atoms (2-6, 9-19, and 
13-17) are added to the bond list at the end, after processing the atoms which 
make up the spanning tree.

It appears the closure bond have the begin and end atom indices with the 
largest first, which makes it possible to tell that a given bond is a closure 
bond.

In principle then it should be possible to reorder the bonds to get the order 
you want.

This proved trickier than I could manage in the time I have.

Perhaps the better question is, why do you need the bond indices in a specific 
order?

Cheers,


Andrew
da...@dalkescientific.com


from rdkit import Chem

bond_symbols = {
   Chem.BondType.SINGLE: "-",
   Chem.BondType.DOUBLE: "=",
   Chem.BondType.TRIPLE: "#",
   Chem.BondType.AROMATIC: ":",
}

smi = "CCc1cc[nH]c1CCC1CCC(CC1)c1cc[nH]c1"
#smi = "[C@@](F)(Cl)(Br)O"
mol1 = Chem.MolFromSmiles(smi)
smi_explicit = Chem.MolToSmiles(mol1, allBondsExplicit=True)
mol2 = Chem.MolFromSmiles(smi_explicit)

def show(bonds):
   print(" i Bnd# a1 ~ a2   frag")
   for i, b in enumerate(bonds):
   a1, a2 = b.GetBeginAtomIdx(), b.GetEndAtomIdx()
   symbol = bond_symbols[b.GetBondType()]
   s = Chem.MolFragmentToSmiles(mol2, atomsToUse=[a1, a2], rootedAtAtom=a1, 
allBondsExplicit=True)
   print(f"{i:2d}  {b.GetIdx():2d}  {a1:2d} {symbol} {a2:2d} {s.center(8)}")

print(smi_explicit)
print("In bondlist order")
show(mol2.GetBonds())



___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] how to get indexes and atoms with H from smiles

2023-05-09 Thread Andrew Dalke
On May 9, 2023, at 07:55, Haijun Feng  wrote:
> Can anyone help me figure out how to get each atom with H from the smiles as 
> above. Thanks so much!

Try using Chem.MolFragmentToSmiles to get the SMILES for each atom, with all 
hydrogens explicit, then strip off the leading and trailing []s.

from rdkit import Chem
mol=Chem.MolFromSmiles('c1c(C(N)=O)1')
for i, atom in enumerate(mol.GetAtoms()):
  atom_smi = Chem.MolFragmentToSmiles(mol, allHsExplicit=True, 
atomsToUse=[atom.GetIdx()])
  print(i, atom_smi.strip("[]"))

This prints

0 cH
1 cH
2 cH
3 cH
4 cH
5 c
6 C
7 NH2
8 O

Your code showed you using

   atom.SetProp('molAtomMapNumber',str(i))

In the following, I'll set that property *after* getting the atom SMILES, so 
the map is not included as part of the output:

from rdkit import Chem
mol=Chem.MolFromSmiles('c1c(C(N)=O)1')
for i, atom in enumerate(mol.GetAtoms()):
  atom_smi = Chem.MolFragmentToSmiles(mol, allHsExplicit=True, 
atomsToUse=[atom.GetIdx()])
  print(i, atom_smi.strip("[]"))
  atom.SetIntProp("molAtomMapNumber", i)

print(Chem.MolToSmiles(mol))

which gives the output

0 cH
1 cH
2 cH
3 cH
4 cH
5 c
6 C
7 NH2
8 O
[cH:0]1[cH:1][cH:2][cH:3][cH:4][c:5]1[C:6]([NH2:7])=[O:8]



> the output is: [cH:0]1[cH:1][cH:2][cH:3][cH:4][c:5]1C:6=[O:8]

For what it's worth, I get the slightly different:

   [cH:0]1[cH:1][cH:2][cH:3][cH:4][c:5]1[C:6]([NH2:7])=[O:8] 

You should be aware that the input order and the output SMILES order might be 
different.

Because of the simpler structure of your preferred output SMILES format, you 
can alternatively extract the atom terms from the output string by looking for 
the substrings inside of the []s, as in the following:

import re
>>> re.compile(r'\[[^]]+\]').findall("[cH:0]1[cH:1][cH:2][cH:3][cH:4][c:5]1[C:6]([NH2:7])=[O:8]")
['[cH:0]', '[cH:1]', '[cH:2]', '[cH:3]', '[cH:4]', '[c:5]', '[C:6]', '[NH2:7]', 
'[O:8]']

This list will exactly match the output SMILES atom order.

Cheers,


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Permutation of multiple enumeration

2022-07-06 Thread Andrew Dalke
Hi Carsten,

  How are the fragments expressed? With attachment points marked with "[*:1]", 
"[*:2]" and "[*:3]" atoms?

One technique is to rewrite the SMILES to use closures. (See 
https://onlinelibrary.wiley.com/doi/10.1002/qsar.200310008 or 
http://www.dalkescientific.com/writings/diary/archive/2005/05/07/attachment_points.html
 ).

For example, if your core SMILES are:

[*:1]c1ncc([*:2])cn1
CC([*:2])O[*:1]

and your R1 contains

*F
Cl*
Br*

and your R2 contains

*CCO
CO*

then you could rewrite these to use "%91" to connect the [*:1] with the R1 "*" 
and use "%92" to connect the [*:2] with the R2 "*", using dot-disconnected 
terms.

For example:

  [*:1]c1ncc([*:2])cn1 + *F + *CCO

can be rewritten as

  c%911ncc%92cn1.F%91.C%92CO

which is parsed and canonicalized to:

  OCCc1cnc(F)nc1

Rewriting the SMILES this way is a bit tricky. I've attached a program which 
does it for you.


Running it on the above gives:

% cat core.smi
[*:1]c1ncc([*:2])cn1
CC([*:2])N[*:1]

% cat r1.smi
*F
Cl*
Br*

% cat r2.smi
*CCO
CO*

% python enumerate.py --R1 r1.smi --R2 r2.smi core.smi
c1%91ncc%92cn1.F%91.C%92CO -> OCCc1cnc(F)nc1
c1%91ncc%92cn1.F%91.CO%92 -> COc1cnc(F)nc1
c1%91ncc%92cn1.Cl%91.C%92CO -> OCCc1cnc(Cl)nc1
c1%91ncc%92cn1.Cl%91.CO%92 -> COc1cnc(Cl)nc1
c1%91ncc%92cn1.Br%91.C%92CO -> OCCc1cnc(Br)nc1
c1%91ncc%92cn1.Br%91.CO%92 -> COc1cnc(Br)nc1
CC(O%91)%92.F%91.C%92CO -> CC(CCO)OF
CC(O%91)%92.F%91.CO%92 -> COC(C)OF
CC(O%91)%92.Cl%91.C%92CO -> CC(CCO)OCl
CC(O%91)%92.Cl%91.CO%92 -> COC(C)OCl
CC(O%91)%92.Br%91.C%92CO -> CC(CCO)OBr
CC(O%91)%92.Br%91.CO%92 -> COC(C)OBr

It also supports --R3 if your core has 3 R-groups, with the third core point 
labeled [*:3].

Best regards


Andrew
da...@dalkescientific.com


"""Enumerate a library with a core and up to 3 sets of R-groups

The core and R-group files contain one SMILES per line.

The core SMILES must use labeled "*" wildcards, like:

  [*:1]c1ncc([*:2])cn1

where [*:1] is the attachment point for R1, [*:2] is the attachment
point for R3, and [*:3] is the attachment point for R3.

The R-group SMILES must have a single unlabled "*" wildcard, like:

  CO*
-or-
  C(*)CO

The program is used like this:

   python enumerate.py --R1 r1.smi --R2 r2.smi --R3 r3.smi core.smi

"""

# Written by Andrew Dalke 
# 6 July 2022

import argparse
from rdkit import Chem
import itertools

# Generate a SMILES string such that the wildcard is NOT the first atom.
# This makes it easier to do an in-place string substitution of the
# wildcard term with a ring closure.
#
# The approach finds a non-wildcard atom and uses that as the root
# to generate the SMILES.
#
# Also verify the number of wildcard atoms is correct, and that
# they have one (and only one) single bond, no isotope, and no charge.
#
# Also verify that if it's supposed to be labeled (as for the core
# structures) then it uses [*:1], [*:2], or [*:3], and if it's
# not labeled (as for the R-groups) then it does not use labels

def get_reordered_smiles(mol, num_wildcards_required, is_labeled):
num_wildcards = 0
non_wildcard_atom = None
for atom in mol.GetAtoms():
# Do some validation
if atom.GetAtomicNum() == 0:
bonds = atom.GetBonds()
if len(bonds) != 1:
raise ValueError("'*' atom must have only one bond")
if bonds[0].GetBondType() != Chem.BondType.SINGLE:
raise ValueError("'*' atom must have only one single bond")
if atom.GetFormalCharge() != 0:
raise ValueError("'*' atom must be uncharged")
if atom.GetIsotope() != 0:
raise ValueError("'*' atom must have no isotope")
num_wildcards += 1

elif non_wildcard_atom is None:
# found a non-wildcard
non_wildcard_atom = atom

if num_wildcards != num_wildcards_required:
raise ValueError(
f"expecting {num_wildcards_required} '*' atoms, found {num_wildcards}")

if non_wildcard_atom is None:
raise ValueError(
"R-group must have at least one non-'*' atom")

# The generated SMILES does not begin with a wildcard.
smiles = Chem.MolToSmiles(mol, rootedAtAtom=non_wildcard_atom.GetIdx())

# Double check some assertion
if is_labeled:
# Ensure enough labeled wildcards are present
for substr in ("[*:1]", "[*:2]", "[*:3]")[:num_wildcards_required]:
if substr not in smiles:
raise ValueError(f"Expecting {substr}")
else:
# Ensure it isn't using labeled wildcards
if "[*:" in smiles:
raise ValueError(f"Must not be labeled wildcards")
 

[Rdkit-discuss] chemfp 4.0

2022-07-04 Thread Andrew Dalke
Hi all,

  I've recently released chemfp 4.0, with support for several diversity 
selection algorithms, and an improved API for interactive use in a notebook 
environment.

Chemfp is an analytics package for cheminformatics fingerprints. It contains 
command-line tools and an extensive Python library for fingerprint generation, 
high-performance similarity search, diversity selection, and exploratory 
research.

The new diversity selection algorithms are MaxMin, sphere exclusion (both 
random and directed), and HeapSweep.

Of special note, The MaxMin algorithm has improved support for selecting 
diverse fingerprints from a set of candidates (eg, a vendor catalog) which must 
also be diverse from a set of references (eg, a corporate collection), which is 
over an order of magnitude faster than the standard MaxMin algorithm.

People who live in the Jupyter notebook will likely enjoy the new chemfp user 
experience. Most long-term actions support progress bars, chemfp's Python 
objects have more informative repr()s, search results added Pandas integration, 
and there are new high-level APIs that let you express a lot of functionality 
compactly.

See https://chemfp.readthedocs.io/en/latest/whatsnew.html for more details.

To install the pre-compiled package for Linux at no cost, use:

  python -m pip install chemfp -i https://chemfp.com/packages/

The Base License covers most in-house use of chemfp, though a few features are 
either limited or disabled and require a license key to unlock. For alternative 
licenses, including source code and no-cost academic licensing, see 
https://chemfp.com/license/ -- or try one of the re-formatted ChEMBL datasets 
at https://chemfp.com/datasets/ which include an embedded authorization key.

Chemfp is not a cheminformatics toolkit. Instead, it knows how to use RDKit (or 
Open Babel, OpenEye's OEChem/OEGraphSim, or the CDK) for molecule I/O and 
fingerprint generation.

For more information see the chemfp home page at https://chemfp.com/ or contact 
me directly.

Best regards,
  
Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] how to report SDF records for which Chem.ForwardSDMolSupplier returns None?

2022-04-14 Thread Andrew Dalke
On Apr 14, 2022, at 12:57, Ivan Tubert-Brohman 
 wrote:
> How about splitting the file on lines consisting of "", and then parsing 
> each record? If the parsing fails, you can write out the bad record for 
> future inspection. (This addresses the basic use case, but not the "even 
> better" one.)

Yes, if you know your data is "clean", then you can do that.

I wrote an essay at
  
http://www.dalkescientific.com/writings/diary/archive/2020/09/18/handling_the_sdf_record_delimiter.html
about some of the ways that can cause problems.

They do occur in real-world data sets. And they do cause problems in some 
processing pipelines.

Public data sets like PubChem, ChEMBL, etc. don't have these problems. They are 
mostly in in-house data sets. Though it's not common to have a problem.

> def read_record(fh):
> lines = []
> for line in fh:
> lines.append(line)
> if line.rstrip() == '':
> return ''.join(lines)

See also 
https://baoilleach.blogspot.com/2020/05/python-patterns-for-processing-large.html
 .

The reasons I think there should be a low-level library for this sort of work 
are:

1) the edge cases are tricky to handle,

2) the simple readers like this are slow

3) I believe good error reporting needs things like the starting line number 
and/or starting byte position for the record. Implementing that is a bit tricky 
(and boring), and tracking that information in a compiled extension has a much 
lower overhead than doing it in Python.


Cheers,


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] how to report SDF records for which Chem.ForwardSDMolSupplier returns None?

2022-04-14 Thread Andrew Dalke
On Apr 14, 2022, at 09:16, Gyro Funch  wrote:
> I don't know the sdf format well, so please excuse my ignorance, but instead 
> of a custom parser, would it be possible to write a preprocessor to eliminate 
> the offending information? Perhaps something using regular expressions in 
> python, perl, sed, or awk?

The SDF format is too complicated to be parsed with a regular expression[1], 
and the failure modes often cannot be detected at the syntax level[2]. I 
suggest people may consider using chemfp for this [3].

[1] For example, in a V2000-formatted record, the number of atom records and 
the number of bond records are given by a repeat count. A traditional/formal 
regular expression does not support counts where the count from the pattern 
matching. 

Most regular expression engines have more powerful capabilities than formal 
regular expression, such as matches to back-reference captured groups. However, 
few support using a backreference as a repeat count.

I wrote one that did, which would let you specify

(?P...)(?P...) and so on)
(?P(?P.{10})(?P.{10}) and so on){atom_count}
(?P(?P.{3})(?P.{3}) and so on){bond_count}

but in practice, defining the grammar through a regular expression grammar was 
decidedly not easy!

I've wanted to experiment with using WUFFS to make a low-level SDF parser 
library, see
  https://github.com/google/wuffs


[2] For example, RDKit by default rejects atoms where the valence is too high. 
Detecting this in filter code calls for reverse-engineering what RDKit already 
does.

[3] Chemfp is best known as a fingerprint generation and search program. 
However, there are a few use cases where I wanted to have access to the input 
record (eg, to detect toolkit failures, or to add fingerprint data to the input 
record rather than round-tripping the SDF through a toolkit.) I did this by 
writing my own SDF record reader (in the "text_toolkit"), and writing a wrapper 
to the RDKit toolkit (in the "rdkit_toolkit"), and using a error handler which 
can decide how to handle errors (ignore, report, raise an exception, log, 
etc.). That error handler has access to location information, which includes 
the record number, the record text, the line number of the start of the record, 
and more.

Here's what it looks like for Giovanni's use case:


from chemfp import rdkit_toolkit as T
from chemfp import text_toolkit

filename = "/Users/dalke/databases/ChEBI_complete_3star.sdf.gz"


class ErrorHandler:
def __init__(self):
self.error_ids = []

def error(self, msg, location, extra=None):
record = location.record
chebi_id = text_toolkit.get_sdf_tag(location.record, "ChEBI ID")
print(f"!!! Error reading record {location.recno} with ID: 
{chebi_id!r}")
print(f"at {location.where()}")
self.error_ids.append(chebi_id)

errors = ErrorHandler()
count = 0
num_atoms = 0
for mol in T.read_molecules(filename, errors=errors):
count += 1
num_atoms += mol.GetNumAtoms()  # This is a RDMol.

print(f"Parsed {count} records ({num_atoms} atoms), skipped 
{len(errors.error_ids)}.")


This functionality is available in the pre-compiled version of chemfp for 
Linux-base OSes, available from https://chemfp.com/download/ . The default 
license agreement (that is, you can use it without a license key) lets you use 
it for any internal purpose.

If anyone is interested in working on a stand-alone SDF parsing library under a 
free software license, I can provide some pointers and feedback, and will 
contribute chemfp's SDF parser under the MIT license.


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] mmpdb 3.0b1

2022-02-09 Thread Andrew Dalke
Hi all,

  The combination of crowd-funding and contract work for me, and methods + 
software development by Mahendra Awale, has resulted in a new version of mmpdb.

More specifically, version 3.0 beta 1 is available on GitHub at:

  https://github.com/adalke/mmpdb/tree/v3-dev

The CHANGELOG summary is at the bottom of the email. For many people the 
biggest improvement is probably the support for large data sets, and the switch 
to a more human-understandable SMARTS/"pseudo-SMILES" for the environment 
fingerprints.

The documentation is available through the program, starting with 'mmpdb 
--help'.

Try it out, kick the tires, and let me know what fell off!

Cheers,

Andrew
da...@dalkescientific.com


A large number of changes to merge three different development tracks
and add new features.

The "fragments" file format has been replaced with a SQLite-based
"fragdb" file format. This makes it much easier to develop tools to
work on fragment data sets instead of processing a JSON-Lines file.

New functionality to create an MMP data set in a distributed compute
environment. Some of the features are:

- split a SMILES file into a set of smaller SMILES files
- the default "fragment" file output is now based on the input name
- fragment files can be re-partitioned by constant fragments:
- the "fragdb_constants" file generates fragment information
- the "fragdb_partition" create re-partitioned fragdb files
- the default "index" file output is now based on the input name
- there are tools to merge fragdb and mmpdb files into one

As a result, mmpdb can now handle significantly larger data sets.

Added support for Postgres for direct index database creation. (The
new distributed compute tools require SQLite.)

Added a new "generate" command to apply 1-cut transforms to a
structure, using MMP rules as a playbook.

Replaced the SHA256-based Morgan fingerprint signature with a
canonical SMARTS representing the Morgan fingerprint environment. This
is difficult to understand or depict, so also include a "pseudo"
SMILES that can be parsed by RDKit (if sanitize is disabled) and
drawn. The new environment fingerprint also include the SMARTS of its
parent, that is, the SMARTS with a smaller radius.

Switched to 'click' for command-line parsing, removed the vendered
version of the peewee ORM, and switched to a modern "pyproject.toml"
project configuration with a setup.cfg which declares its dependencies.

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] generating smiles using RDKit

2021-12-08 Thread Andrew Dalke
Hi Gyro,

> On Dec 8, 2021, at 11:02, Gyro Funch  wrote:
> 
> My work is in the area of toxicology and I am interested in generating SMILES 
> for molecules referred to as 'short chain chlorinated paraffins' (SCCP).
> 
> A general definition that is sometimes used is that an SCCP is given by the 
> molecular formula
> 
> C_{x} H_{2x-y+2} Cl_{y}
> 
> where
> 
> x = 10-13
> y = 3-12
> 
> and the average chlorine content ranges from 40-70% by mass.
> 
> -
> 
> Can anyone provide guidance on how to generate the list of SMILES 
> corresponding to the above rules?

Here's an alternate approach to the ones presented so far.

  https://gist.github.com/adalke/e62df8774032560fef750fa9c88b6516

Like Wim's version, it also generates the SMILES as the syntax level, though by 
default it use RDKit to generate canonical SMILES output. (use --no-canonical 
to disable the canonicalization step, which is also faster.)

Here it is with 4 carbons and 3 chlorines. 

% python sccp_smiles.py --C 4 --Cl 3
Content range not specified. Using --min-content 0.4 and --max-content 0.7.
(Cl)(Cl)Cl
CCC(Cl)C(Cl)Cl
CCC(Cl)(Cl)CCl
CC(Cl)CC(Cl)Cl
CC(Cl)C(Cl)CCl
CC(Cl)C(C)(Cl)Cl
CC(Cl)(Cl)CCCl
Cl(Cl)Cl
ClCCC(Cl)CCl

The "--C" and "--Cl" are aliases for "--min-C" and "--min-Cl"; if the maximums 
are not specified then the maximum is set to the minimum.

Here's a range using all the bells and whistles:

% time python sccp_smiles.py --min-C 10 --max-C 13 --min-Cl 3 --max-Cl 12 
--max-Cl-per-atom 2 --min-content 0.4 --max-content 0.7 --no-canonicalize > 
example.smi
2.030u 0.156s 0:02.44 89.3% 0+0k 0+0io 0pf+0w
% wc -l example.smi
  440334 example.smi

Wim reported 437001 for the same configuration. I haven't figured out if the 
difference is due simply to differences in the molecular weight values.

I couldn't canonicalize and pin down the differences in part because Wim's 
output generates SMILES strings that RDKit cannot parse:

% grep '^[(]' CSSP.smi | head -4
(Cl)C(Cl)(Cl)C(Cl)(Cl)
(Cl)C(Cl)(Cl)(Cl)C(Cl)(Cl)
(Cl)C(Cl)(Cl)(Cl)(Cl)C(Cl)(Cl)
(Cl)C(Cl)(Cl)CCC(Cl)CC(Cl)(Cl)

>>> from rdkit import Chem
>>> Chem.CanonSmiles("(Cl)C(Cl)(Cl)CCC(Cl)CC(Cl)(Cl)")
[14:31:12] SMILES Parse Error: syntax error while parsing: 
(Cl)C(Cl)(Cl)CCC(Cl)CC(Cl)(Cl)

My code isn't well tested, but perhaps enough to get you on the way.

Cheers,


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Reading text records from SDF from gzipped files

2021-11-04 Thread Andrew Dalke
Hi Tim,

  You might also consider using chemfp, which has this sort of functionality 
available through its toolkit wrapper API:

from chemfp import rdkit_toolkit as T
import itertools

with T.read_ids_and_molecules("chembl_28.sdf.gz") as reader:
  loc = reader.location
  for id, mol in itertools.islice(reader, 5):
print(f"Record: {loc.recno} ({id}) line: {loc.lineno} offsets: 
{loc.offsets}")
counts_line = loc.record.splitlines()[3]
num_atoms, num_bonds = int(counts_line[:3]), int(counts_line[3:6])
print(f"  counts line #atoms: {num_atoms} #bonds: {num_bonds}")
print(f"RDKit #atoms: {mol.GetNumAtoms()} #bonds: 
{mol.GetNumBonds()}")

The output in this case is:

Record: 1 (CHEMBL153534) line: 1 offsets: (0, 1458)
  counts line #atoms: 16 #bonds: 17
RDKit #atoms: 16 #bonds: 17
Record: 2 (CHEMBL440060) line: 43 offsets: (1458, 18699)
  counts line #atoms: 206 #bonds: 208
RDKit #atoms: 202 #bonds: 204
Record: 3 (CHEMBL440245) line: 466 offsets: (18699, 39688)
  counts line #atoms: 251 #bonds: 254
RDKit #atoms: 251 #bonds: 254
Record: 4 (CHEMBL440249) line: 980 offsets: (39688, 56050)
  counts line #atoms: 194 #bonds: 205
RDKit #atoms: 185 #bonds: 196
Record: 5 (CHEMBL405398) line: 1388 offsets: (56050, 58447)
  counts line #atoms: 27 #bonds: 30
RDKit #atoms: 27 #bonds: 30

You can also work more directly to the record tokenization level, and pass each 
record to the rdkit_toolkit wrapper:


from chemfp import text_toolkit

with text_toolkit.read_sdf_records("chembl_28.sdf.gz") as reader:
  for rec in itertools.islice(reader, 5):
mol = T.parse_molecule(rec, "sdf")
print(mol.GetProp("chembl_id"), "has", len(rec), "bytes")

which prints

CHEMBL153534 has 1458 bytes
CHEMBL440060 has 17241 bytes
CHEMBL440245 has 20989 bytes
CHEMBL440249 has 16362 bytes
CHEMBL405398 has 2397 bytes



Andrew
da...@dalkescientific.com

> On Nov 4, 2021, at 17:55, Tim Dudgeon  wrote:
> 
> Thanks Paolo, that's fantastic.
> The first option was what I needed.
> Tim
> 
> On Thu, Nov 4, 2021 at 4:36 PM Paolo Tosco  wrote:
> Hi Tim,
> 
> if you need access to the original text, you'll have to do the chunking 
> yourself, e.g.:
> 
> import gzip
> 
> def molgen(hnd):
> mol_text_tmp = ""
> while 1:
> line = hnd.readline()
> if not line:
> return
> line = line.decode("utf-8")
> mol_text_tmp += line
> if line.startswith(""):
> mol_text = mol_text_tmp
> mol_text_tmp = ""
> yield mol_text






___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] MolToSmiles atom ordering

2021-11-02 Thread Andrew Dalke
Hi Ling,

  If there are symmetries then a substructure search like will only give you 
one mapping, and that might not be the canonical mapping.

What you're looking for is the special property _smilesAtomOutputOrder


>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("O=C(NCc1cc(OC)c(O)cc1)/C=C/C(C)C")
>>> Chem.MolToSmiles(mol)
'COc1cc(CNC(=O)/C=C/C(C)C)ccc1O'
>>> mol.GetProp("_smilesAtomOutputOrder")
'[8,7,6,5,4,3,2,1,0,13,14,15,16,17,18,19,20,21,12,11,9,10,]'

Here are the atom indices of the original SMILES:

 ┌ 1 11   1 1 1 2 2
atoms│ 0 1 234 56 78 9 0 12  3456 7 8 9 0 1
 └ | | ||| || || | | ||   | | | | |
   SMILES[ O=C(NCc1cc(OC)c(O)cc1)/C=C/C(C)C


You can see the first atom of the output is a "C", which is mapped to position 
8 in the _smilesAtomOutputOrder, which is the "...C)..." in the original 
SMILES, etc.


Cheers,


Andrew
da...@dalkescientific.com


> On Nov 3, 2021, at 00:18, Ling Chan  wrote:
> 
> O.K. Problem solved. Sorry about the spam, folks.
> 
> I can use GetSubstructMatch, as follows.
> 
> # sinput is the input smiles
> # scanon is the output smiles
> 
> minput = Chem.MolFromSmiles(sinput)
> scanon=Chem.MolToSmiles(minput)
> mcanon=Chem.MolFromSmiles(scanon)
> map_forward = minput.GetSubstructMatch(mcanon)
> map_backward = mcanon.GetSubstructMatch(minput)
> 
> 
> 
> 
> Ling Chan  於 2021年11月2日週二 下午3:55寫道:
> Dear colleagues,
> 
> Just wonder if I can obtain a mapping of the atom indices upon 
> canonicalization by MolToSmiles ? I am aware that canonicalization (and hence 
> atom reordering) can be suppressed in MolToSmiles, but I do want to 
> canonicalize the output smiles.
> 
> If you are interested, here is a bit more details of my problem. For each 
> molecule, I want to delete one or two side chains, and obtain a smiles of 
> what is left. Just that I want to know what are the atoms that bonded to the 
> deleted side chains. I know, by suppressing canonicalization things will 
> work. But I would like to canonicalize the smiles so that I can know if there 
> are duplicates.
> 
> I tried marking the atoms. But I believe that properties that got carried 
> over to the output smiles, e.g. Isotope, affect the canonicalization, while 
> properties that do not affect canonicalization, e.g, IntProp, are lost upon 
> the conversion to smiles.
> 
> Thank you for your insight.
> 
> Ling
> 



___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] MolToSmiles

2021-10-21 Thread Andrew Dalke



> On Oct 21, 2021, at 04:50, Ling Chan  wrote:
> 
> I got the attached sdf. When I did a MolToSmiles, it gives me the following.
> 
> >>> for m in Chem.SDMolSupplier("pdb_structures/1q6k_ligand.sdf"):
> ...   print (Chem.MolToSmiles(m))
> ... 
> [CH3:0][C:0]([CH3:0])([CH3:0])[O:0][C:0](=[O:0])[NH:0][CH:0]([CH:0]=[O:0])[CH:0]1[CH2:0][CH2:0][CH2:0][CH2:0][CH2:0]1
> 
> Just wonder why does it not give something like
> O=C(OC(C)(C)C)NC(C=O)C1C1

The terms after the atom symbol in your atom block lines are center-justified 
(or left-justified, in the 2-digit mass difference term 'dd') instead of 
right-justified.

Here's a comparison of your first atom line, compared with the ctfile spec, and 
then compared with the round-trip through RDKit:

   74.0060   -9.5770  134.8660 N  0  0  0  0  0  0  0  0  0  0  0  0<-- 
yours
x.y.z. aaaddcccssshhhbbbvvvHHHrrriiimmmnnneee   <-- spec
   74.0060   -9.5770  134.8660 N   0  0  0  0  0  0  0  0  0  0  0  0   <-- 
RDKit

Add a space after the atom symbol field ("aaa") and everything works.

What happened?

The ":0" in the SMILES string derives from the atom-atom mapping number, "mmm", 
in the SDF.

The relevant code from 
Code/GraphMol/FileParsers/MolFileParser.cpp::ParseMolFileAtomLine() is:

  if (text.size() >= 63 && text.substr(60, 3) != "  0") {
int atomMapNumber = 0;
try {
  atomMapNumber = FileParserUtils::toInt(text.substr(60, 3), true);
} catch (boost::bad_lexical_cast &) {
  std::ostringstream errout;
  errout << "Cannot convert '" << text.substr(60, 3) << "' to int on line "
 << line;
  delete res;
  throw FileParseException(errout.str());
}
res->setProp(common_properties::molAtomMapNumber, atomMapNumber);
  }

This says that if the field isn't exactly "  0" then parse it as an integer and 
store it in the atom's molAtomMapNumber.

Since your " 0 " field isn't exactly "  0", it gets converted into the atom map 
value of 0.

I don't see an explicit statement in the spec about alignment in fields. It's 
clear the spec comes from a Fortran background, so these should be interpreted 
as "I2" and "I3", and right-justified.


By the way, if you pass your file through CDK you get:

org.openscience.cdk.io.MDLV2000Reader ERROR: Error while parsing line 5:
74.0060   -9.5770  134.8660 N  0  0  0  0  0  0  0  0  0  0  0  0   -> invalid 
line length, 68:74.0060   -9.5770  134.8660 N  0  0  0  0  0  0  0  0  0  0 
 0  0
org.openscience.cdk.io.iterator.IteratingSDFReader ERROR: Error while reading 
next molecule: invalid line length, 68:74.0060   -9.5770  134.8660 N  0  0  
0  0  0  0  0  0  0  0  0  0

CDK's 
storage/ctab/src/main/java/org/openscience/cdk/io/MDLV2000Reader.java::readAtomFast()
 requires that either all characters of a field be present, or the end of line. 
Your line is 68 characters long because your last field is " 0" instead of the 
" 0 " needed to match the exact charge flag "eee".

Best regards,


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Maximum Common Substructure using SMARTS

2021-07-23 Thread Andrew Dalke
On Jul 23, 2021, at 06:42, Andrew Dalke  wrote:
> 
> No, there's no way to do that.
> 
> The best I can suggest is to go back to the original Python implementation 
> and change the code leading up to

Alternatively, since your template is small, you can brute-force enumerate all 
possible matching SMARTS patterns, and test them from largest to smallest.

I believe the following patterns are correct for your template. These are 
ordered by number of bonds, then number of atoms, then ASCII-betically. (Note: 
these many contain duplicates because Chem.MolToSmarts doesn't produce 
canonical SMARTS.)

[n,c,o]1(-S(-*)(=O)=O):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:1
[n,c,o](-S(-*)(=O)=O):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]
[n,c,o](-S(-*)(=O)=O)(:[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]):[n,c,o]
[n,c,o](-S(-*)(=O)=O)(:[n,c,o]:[n,c,o]:[n,c,o]):[n,c,o]:[n,c,o]
[n,c,o](-S(-*)(=O)=O)(:[n,c,o]:[n,c,o]):[n,c,o]:[n,c,o]:[n,c,o]
[n,c,o](-S(-*)(=O)=O)(:[n,c,o]):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]
[n,c,o]1(-S(=O)=O):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:1
[n,c,o](-S(=O)=O):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]
[n,c,o](-S(=O)=O)(:[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]):[n,c,o]
[n,c,o](-S(=O)=O)(:[n,c,o]:[n,c,o]:[n,c,o]):[n,c,o]:[n,c,o]
[n,c,o](-S(=O)=O)(:[n,c,o]:[n,c,o]):[n,c,o]:[n,c,o]:[n,c,o]
[n,c,o](-S(=O)=O)(:[n,c,o]):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]
[n,c,o](-S(-*)(=O)=O):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]
[n,c,o](-S(-*)(=O)=O)(:[n,c,o]:[n,c,o]:[n,c,o]):[n,c,o]
[n,c,o](-S(-*)(=O)=O)(:[n,c,o]:[n,c,o]):[n,c,o]:[n,c,o]
[n,c,o](-S(-*)(=O)=O)(:[n,c,o]):[n,c,o]:[n,c,o]:[n,c,o]
[n,c,o](-S(=O)=O):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]
[n,c,o](-S(=O)=O)(:[n,c,o]:[n,c,o]:[n,c,o]):[n,c,o]
[n,c,o](-S(=O)=O)(:[n,c,o]:[n,c,o]):[n,c,o]:[n,c,o]
[n,c,o](-S(=O)=O)(:[n,c,o]):[n,c,o]:[n,c,o]:[n,c,o]
[n,c,o](-S(-*)(=O)=O):[n,c,o]:[n,c,o]:[n,c,o]
[n,c,o](-S(-*)(=O)=O)(:[n,c,o]:[n,c,o]):[n,c,o]
[n,c,o](-S(-*)(=O)=O)(:[n,c,o]):[n,c,o]:[n,c,o]
[n,c,o](-S(=O)=O):[n,c,o]:[n,c,o]:[n,c,o]
[n,c,o](-S(=O)=O)(:[n,c,o]:[n,c,o]):[n,c,o]
[n,c,o](-S(=O)=O)(:[n,c,o]):[n,c,o]:[n,c,o]
[n,c,o](-S(-*)(=O)=O):[n,c,o]:[n,c,o]
[n,c,o](-S(-*)(=O)=O)(:[n,c,o]):[n,c,o]
[n,c,o](-S(=O)=O):[n,c,o]:[n,c,o]
[n,c,o](-S(=O)=O)(:[n,c,o]):[n,c,o]
[n,c,o](-S(-*)(=O)=O):[n,c,o]
[n,c,o]-S(-*)(=O)=O
[n,c,o](-S(=O)=O):[n,c,o]
[n,c,o]-S(=O)=O
S(-*)(=O)=O
S(=O)=O

I generated it with the following:

===
from rdkit import Chem
import itertools

# Must have the atoms marked with an atom map (the atom map value is ignored).
template = 
'[n,c,o]1(-[S:1](-*)(=[O:1])=[O:1]):[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:[n,c,o]:1'

mol = Chem.MolFromSmarts(template)

# Figure out which bonds to keep
bond_atom_indices = []
for bond in mol.GetBonds():
if all(atom.HasProp("molAtomMapNumber") for atom in (bond.GetBeginAtom(), 
bond.GetEndAtom())):
continue
bond_atom_indices.append((bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()))

# Remove the atom maps
for atom in mol.GetAtoms():
if atom.HasProp("molAtomMapNumber"):
atom.ClearProp("molAtomMapNumber")

seen = set()

# Enumerate all possible bonds to delete (should be 2**n)
for r in range(0, len(bond_atom_indices)+1):
for delete_indices in itertools.combinations(bond_atom_indices, r):
tmp_mol = Chem.RWMol(mol)

# Remove the selected bonds
for atom1_idx, atom2_idx in delete_indices:
tmp_mol.RemoveBond(atom1_idx, atom2_idx)

# Remove any singletons. Start from the end so the indices are stable.
for atom in list(tmp_mol.GetAtoms())[::-1]:
if not atom.GetBonds():
tmp_mol.RemoveAtom(atom.GetIdx())

# Get the corresponding SMARTS
tmp_smarts = Chem.MolToSmarts(tmp_mol)

# Ensure it's singly connected
if "." in tmp_smarts:
continue

# Ensure it's unique; track the number of bonds and atoms for later 
sorting
key = (tmp_mol.GetNumBonds(), tmp_mol.GetNumAtoms(), tmp_smarts)
seen.add(key)

for num_bonds, num_atoms, smarts in sorted(seen, reverse=True):
print(smarts)
===




Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Maximum Common Substructure using SMARTS

2021-07-22 Thread Andrew Dalke
On Jul 23, 2021, at 01:01, Gustavo Seabra  wrote:
> I actually want the sulfone to be found, if it is there. My problem is that I 
> also want flexibility to change the ring atoms and still find the ring as a 
> match, while considering a match on the sulfone only if it really is there. 
> (e.g., CF3 should *not* match.) Does it make sense?

Ahh, I see.

No, there's no way to do that.

The best I can suggest is to go back to the original Python implementation and 
change the code leading up to

   https://hg.sr.ht/~dalke/fmcs/browse/fmcs.py?rev=tip#L1929

so the initial seed is the sulfone instead of an (atom, bond, atom).

Then use that to the the MCS with the sulfone, and if that fails, use RDKit's 
existing method.

I point to my repository only because that's in Python and I know it better. If 
your C++ skills are better than mine, you might change the corresponding 
implementation in RDKit.

Cheers,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Maximum Common Substructure using SMARTS

2021-07-22 Thread Andrew Dalke
Hi Gustavo,


> template = 
> Chem.MolFromSmarts('[a]1(-[S](-*)(=[O])=[O]):[a]:[a]:[a]:[a]:[a]:1')

Unless things have changed since I last looked at the algorithm, you can't 
meaningfully pass a SMARTS-based query molecule into the MCS program, outside 
of a few simple cases.

It generates a SMARTS pattern based on the properties of the molecule. You 
asked it to CompareElements, but those [a] terms all have an atomic number of 0.

  >>> template = 
Chem.MolFromSmarts('[a#1]1(-[S](-*)(=[O])=[O]):[a#1]:[a#1]:[a#1]:[a#1]:[a#1]:1')
  >>> [a.GetAtomicNum() for a in template.GetAtoms()]
  [0, 16, 0, 8, 8, 0, 0, 0, 0, 0]

That's why your CompareAny search returns the #0 terms, like:

  
'[#16,#6](-[#0,#6])(=,-[#8,#9])(=,-[#8,#9])-[#0,#6]1:[#0,#6]:[#0,#6]:[#0,#6]:[#0,#6]:[#0,#7]:1'

> I would appreciate some pointers on how it would be possible to find the 
> maximum common substructure of 2 molecules, where in the template structure 
> some atoms may be *any*, but some other atoms must be fixed.

Perhaps with isotope labelling?

That is, label the "any" atoms as isotope 1, and label your -[S](=[O])(=[O])- 
as -[2S](=[3O])(=[3O])-

Then use rdFMCS.AtomCompare.CompareIsotopes .

If there's anything you don't want to match at all, give each atom a unique 
isotope value.

Best regards,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Shape Tanimoto distance question

2021-06-30 Thread Andrew Dalke


> On Jun 30, 2021, at 04:20, Francois Berenger  wrote:
> 
> On 29/06/2021 12:26, Greg Landrum wrote:
>> Hi Leon,
>> You can convert the tanimoto distance to similarity, but the formula
>> is:
>> Similarity = 1 - Distance
> 
> In other words:
> 
> Tanimoto_distance = 1.0 - Tanimoto_score

As a side note, Rogers and Tanimoto (1960) define their distance as

 distance =  - log_2 (similarity)

The Jaccard distance is 1 - similarity

https://en.wikipedia.org/wiki/Jaccard_index#Other_definitions_of_Tanimoto_distance

Yes, this confusing set of terms is an annoying nuisance.


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] off_coverage, Z3, and test set reduction

2021-06-08 Thread Andrew Dalke
Hi all,

  I'm excited about a tool I developed for the Open Force Field Initiative and 
thought to share a bit about it here.

It's called "off_coverage", currently in a pull-request at 
https://github.com/openforcefield/cheminformatics-toolkit-equivalence and also 
available from my Sourcehut repo at https://hg.sr.ht/~dalke/off_coverage .

The overall goal was to improve test case selection. Quite a few of the tests 
use an SDF test set with 371 records. These were created to test atom type 
assignment, and re-purposed as general-purpose test cases. At the beginning 
this was fine, but now the unittests take a long time to run.

It's also likely that some of the records do not add additional testing 
strength. For example, several of them have valences that RDKit does not 
accept, and there doesn't appear useful to have multiple test cases for that.

My solution casts this into a feature space problem. There are any number of 
features: can RDKit parse the structure? Can the OpenFF toolkit convert the 
structure into its internal molecule representation? Can it convert the 
structure back to an RDKit molecule and generate a SMILES? Does the resulting 
SMILES match the original?

These might be expressed as a mapping from id to a list of features:

id123 parse_ok to_openff_ok from_openff_ok
id456 parse_ok to_openff_ok from_openff_err
id789 parse_err
id999 parse_ok to_openff_ok from_openff_ok

Assuming all useful features can be detected, test set minimization reduces to 
the set cover problem - https://en.wikipedia.org/wiki/Set_cover_problem .

These can be solved with off-the-shelf tools. I used Z3, 
https://github.com/Z3Prover/z3 , which solves it by expressing the problem as 
funding a solution to a Boolean-and of a set of Boolean-ors:

  (and
 (or id123 id456 id999); parse_ok
 id789 ; parse_err
 (or id123 id456 id999); to_openff_ok
 (or id123 id999)  ; from_openff_ok
 id456 ; from_openff_err 
  )

while minimizing sum(id123 + id456 + id789 + id999). In this case id789 and 
id456 must be selected, as well as one id123 or id999.

This general approach could be used for all sorts of things, like finding a 
reduced set of fingerprints whose union still sets all of the bits set by a 
larger set of fingerprints, selecting a subset of structures which contain all 
of the atom types in the larger structure, etc. (It could also be modified to 
require at least 2 representatives, etc.)

The novel part (to me) is that code coverage - that is, the lines of code that 
Python executes for a given test - can also be used to generate features. This 
information is available via Python's sys.settrace() hook.

For example, you might track the sequence of [(module name, line number), ] 
for every function call. If two functions have the same sequence, then they 
execute exactly the same code. This is a very strict definition of equivalence.

Or, you might track the set of lines executed for each module, and compare 
those sets. This is more like comparing the overall code coverage for a module.

Or, you might track things at the function call level, rather than the module 
level, in case a test calls a function multiple times, each time possibly 
executing a different code path.

Or you might look at the pairs of (previous line number, current line number), 
which captures some of the branching behavior.


I implemented these variations. Assuming I did it correctly, the minimal subset 
can be as small as 27 records, out of 371!

• mod-cover selected: 72/371
• mod-sequence selected: 342/371
• mod-pairs selected: 72/371
• func-cover selected: 31/371
• func-sequence selected: 336/371
• func-pairs selected: 27/371

In addition, the tool lets you specify an alternative weighting scheme. The 
default is that each entry has a weight of 1, but you might prefer to minimize 
the number of atoms, or minimize the number of features in each entry, or to 
prefer entries from one data set over another.

I designed the tool to usable even outside of the OpenFF installation. If 
you're only interested in the selection part, the "minimize" subcommand takes a 
line-oriented text file with feature information (id, feature labels, and 
feature weights) as input, and a list of selected ids as output. It depends 
only on the "z3-solver" Python package from PyPI.

I hope people find set cover tool and/or the idea of coverage-directed test set 
minimization to be interesting. And perhaps someone is interested enough in the 
latter to see if I implemented my settrace handlers correctly? ;)

Best regards,

Andrew 
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Are the path-based fingerprints formally described in the scientific literature?

2021-05-21 Thread Andrew Dalke

On May 20, 2021, at 03:17, Francois Berenger  wrote:
> Weren't the path-based FPs formally described somewhere?

What does "formally" mean?

Daylight was rarely participated in the academic literature tradition.

They instead preferred to publish their information directly, as Pat mentions:

On May 21, 2021, at 02:26, Patrick Walters  wrote:
> There's also some information on path fingerprints in the Daylight Theory 
> Manual 
> https://www.daylight.com/dayhtml/doc/theory/theory.finger.html

If you're looking for a citation, you can try something like:

 Daylight Theory Manual 4.9, Daylight Chemical Information Systems, Inc.,
 Laguna Niguel, CA.
 https://www.daylight.com/dayhtml/doc/theory/theory.finger.html


You can see this page is the source for the general understanding. Compare 
Daylight's:

The fingerprinting algorithm examines the molecule and generates
the following:

• a pattern for each atom
• a pattern representing each atom and its nearest neighbors (plus the 
bonds that join them)
• a pattern representing each group of atoms and bonds connected by 
paths up to 2 bonds long
• ... atoms and bonds connected by paths up to 3 bonds long
• ... continuing, with paths up to 4, 5, 6, and 7 bonds long.

 ...
the output of which is a set of bits (typically 4 or 5 bits
per pattern); the set of bits thus produced is added (with a
logical OR) to the fingerprint.


with, for example, the textbook "An Introduction to Chemoinformatics" by Leach 
and Gillet:

   It is produced by generating all possible linear paths of connected
   atoms through the molecule containing between one and a defined number
   of atoms (typically seven).

...

   Each of these paths in turn serves as the input to a second program
   that uses a hashing procedure to set a small number of bits (typically four
   or five) to “1” in the fingerprint bitstring.

The "typically" in the latter is because that's Daylight's algorithm's default.


What I've learned, in researching this detail, is that Daylight's "typically 4 
or 5 bits per pattern" used a size 'n' which was a function of the path length. 
That detail was known to a few Daylight users that I've talked to, and I've 
been told it was guided by information theoretical reasoning.

In RDKit and Open Babel, 'n' is fixed for the given fingerprint type.

In Indigo, which includes non-linear subgraphs, 'n' is a function of both the 
size and the internal symmetry. The Indigo authors said that was more effective 
in their experiments. (I would have to dig up old emails to confirm that 
memory.)

I know of no papers which explore this detail. I always figured it would be a 
good Master's paper for someone interested in old-school cheminformatics.

Cheers,


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] How to prevent a SMILES from starting with a specific atom?

2021-05-11 Thread Andrew Dalke
On May 12, 2021, at 05:08, Francois Berenger  wrote:
> Or, more generally, flag a given atom in a molecule
> and ask rdkit to not start the corresponding SMILES with
> this atom, any unflagged atom being fine.

Perhaps do the opposite and use rootedAtAtom to have RDKit start with a 
specific atom that you know is fine?

https://www.rdkit.org/docs/source/rdkit.Chem.rdmolfiles.html#rdkit.Chem.rdmolfiles.MolToSmiles

   • rootedAtAtom: (optional) if non-negative, this forces the SMILES to start
   at a particular atom. Defaults to -1.


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] rejoining pairs of fragments after fragmenting a molecule

2021-04-02 Thread Andrew Dalke
Hi Ling,

> On Apr 2, 2021, at 16:23, Ling Chan  wrote:
> 
> Thank you Francois, I took a look at your code and borrowed parts of it to 
> rejoin two molecules. It seems like my problem is solved. I eventually 
> arrived at something like example 4 in 
> https://www.programcreek.com/python/example/123334/rdkit.Chem.CombineMols
> (which I discovered a bit late).
> 
> Still, I am not sure if the code is safe. In particular, I wonder if the 
> following conditions are always valid.
>   • Chem.CombineMols simply concatenates the atomic indices from the 
> input molecules.
>   • The Chem.EditableMol constructor preserves atom ordering from the 
> input.
>   • RemoveAtom in EditableMol results in all indices above the deleted to 
> decrease by one, i.e. atom ordering is preserved.

I've found that it's very hard to work with molecular graphs and preserve 
stereochemistry.

Consider F/C=C/Cl breaking on the first bond, and the code I pointed you to.

FragmentOnBonds() using '9' as the labels gives:  [9*]/C=C/Cl.[9*]F

My "smiles_weld" code converts that to: CC\%99=C/Cl.F%99 which can be 
re-canonicalized to the original: F/C=C/Cl .

Or, with F[C@H](Cl)Br again, breaking on the first bond.

FragmentOnBonds() gives [9*]F.[9*][C@H](Cl)Br

smiles_weld converts that to F%99.[C@@H]%99(Cl)Br which is re-canonicalized as  
F[C@H](Cl)Br

Handling this correctly in the molecule API requires paying careful attention 
to the bond direction, and bond attachment order around the atom, which changes 
with RemoveAtom() calls. I didn't see stereochemistry support in Francois's 
"bind_molecules()" nor in the connect_mols() at 
https://github.com/molecularsets/moses/blob/master/moses/baselines/combinatorial.py
 (one of the examples from the programcreek.com link you gave).

If you don't need to support or preserve stereochemistry, then of course 
there's no problem.

Cheers,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] rejoining pairs of fragments after fragmenting a molecule

2021-04-02 Thread Andrew Dalke
On Mar 31, 2021, at 21:55, Ling Chan  wrote:
> I am trying to do something that I think is quite simple, but I have not 
> figured out a simple way. Don't know if I am missing something. I am sure 
> that ultimately I can figure it out, but I wonder if there is a good way.

If you can work in SMILES space rather than molecule space, then try:

   http://dalkescientific.com/smiles_weld.py

It's derived from a technique I developed for the mmpdb package. I called it 
'welding' the SMILES strings.

What I do is convert the wildcards into closures, then let RDKit merge the 
closures. (There are a few tricky parts, like support for double-bond stereo 
chemistry.)

Here's an example, where I use a dictionary to tell the program that [1*] 
should be bonded to [2*].

>>> from rdkit import Chem
>>> smi = "N#Cc1ccncc1"
>>> mol = Chem.MolFromSmiles(smi)
>>> frag_mol = Chem.FragmentOnBonds(mol, [1])
>>> frag_smi = Chem.MolToSmiles(frag_mol)
>>> frag_smi
'[1*]c1ccncc1.[2*]C#N'
>>> import smiles_weld
>>> smiles_weld.convert_wildcards_to_closures(frag_smi, {1: 1, 2: 1})
'c%991ccncc1.C%99#N'
>>> Chem.CanonSmiles('c%991ccncc1.C%99#N')
'N#Cc1ccncc1'

If you use matching dummy labels then you can omit the conversion table:

>>> frag_mol = Chem.FragmentOnBonds(mol, [1], dummyLabels=((4,4),))
>>> frag_smi = Chem.MolToSmiles(frag_mol)
>>> frag_smi
'[4*]C#N.[4*]c1ccncc1'
>>> smiles_weld.convert_wildcards_to_closures(frag_smi)
'C%99#N.c%991ccncc1'
>>> Chem.CanonSmiles('C%99#N.c%991ccncc1')
'N#Cc1ccncc1'

Note: while the mmpdb code is well-tested, I modified it this morning to handle 
what I think you want, and I haven't fully tested the new code.

The program assumes the SMILES is a canonical SMILES generated by RDKit, and 
that the wildcard labels don't have a charge, hydrogen count, or other 
attribute.


Cheers,
Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] inter-classes Tanimoto similarity

2021-03-14 Thread Andrew Dalke
> On Mar 13, 2021, at 20:29, Marawan Hussien via Rdkit-discuss 
>  wrote:
> my question is if this is the valid approach of comparison, particularly if 
> the class sizes vary widely and the average similarity will be inevitably 
> affected by the size of each item in each pair. As a check, it looks that the 
> diagonal is having the highest inter-classes similarity overall, which is 
> anyway expected.
> 
> I am also wondering if a size-weighted normalization approach could handle 
> this situation?

What about a Z-score? That is:

zscore = (score - background_score) / background_standard_deviation

rather than using the mean score.

I worked out something like that a few years ago, using chemfp, at 
http://www.dalkescientific.com/writings/diary/archive/2017/03/27/chembl_target_sets_association_network.html
 .

If that's a reasonable approach, then it could all be done in RDKit, if you 
don't want to use chemfp.

Best regards,


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] chemfp 3.5.1 and ChEMBL 27 FPB distributions

2021-02-05 Thread Andrew Dalke
Hi all,

  I've just released chemfp 3.5.1 with support for "licensed
FPB files". These are fingerprint datasets which can be used
under the terms of chemfp's base license agreement even without
a chemfp license key or source code distribution.

As the first (and so far only) data set, I've converted the
RDKit Morgan fingerprints from the ChEMBL 27 release into FPB
format and made it available from https://chemfp.com/datasets/ .

If you are on a Linux-based OS and RDKit is already installed
then here are the steps to get started:

1) Install a pre-compiled version of chemfp for Linux:

  python -m pip install chemfp -i https://chemfp.com/packages/

2) Download the ChEMBL data set in FPB format using one of
the following:

  wget https://chemfp.com/datasets/chembl_27.fpb.gz
-or-
  curl -O https://chemfp.com/datasets/chembl_27.fpb.gz

3) (Optional but recommended) Uncompress it:

gunzip chembl_27.fpb.gz

4) Do a similarity search, for example, with a query SMILES
or query file:

simsearch --query c1c1O chembl_27.fpb
simsearch --queries queryfile.sdf chembl_27.fpb

5) View the ChEMBL license agreement and legal notices
included with the dataset.
 
python -m chemfp fpb_text chembl_27.fpb


Chemfp is a Python package for cheminformatics fingerprints.

It can be used to:
  - generate fingerprints using the  RDKit, Open Babel,
  CDK, and OEChem/OEGraphSim toolkits;
  - extract pre-computed fingerprints from SDF tag data;
  - do high-performance Tanimoto and Tversky similarity search;
  - integrate fingerprints and search results with NumPy/SciPy.
  - ... and much more!

It includes an extensive and well-documented Python API for
working with fingerprints and a set of command-line tools
for fingerprint generation, conversion, and similarity search.

Chemfp natively supports two fingerprint file formats. The
FPS format is a text format which is easy to read and write.
The FPB format is a binary format which is quick to load.

For more information see https://chemfp.com/ .

License keys for academic use are available at no cost.

Best regards,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Partial substructure match?

2020-11-23 Thread Andrew Dalke
0.409	0.333	0.265	0.200
 CHEBI:1148	7	6	6	5	0.273	0.208	0.261	0.200
 CHEBI:1734	19	21	16	15	0.727	0.625	0.640	0.500
 CHEBI:1895	9	9	9	8	0.409	0.333	0.409	0.320
 ...

Finally, nearly all of the MCS parameters can be configured on the command-line.

This program was written by Andrew Dalke  .
"""

import sys
import argparse

from rdkit import Chem
from rdkit.Chem import rdFMCS

def get_chemfp_rdkit_toolkit(option):
try:
import chemfp
except ImportError:
sys.stderr.write("chemfp must be installed to use the %s argument.\n" % (option,))
sys.stderr.write("Install it with:\n")
sys.stderr.write("  python -m pip install chemfp -i https://chemfp.com/packages\n;)
raise SystemExit(10)
if not hasattr(chemfp, "__version_info__") or chemfp.__version_info__[0] < 3:
sys.stderr.write("A newer version of chemfp must be installed to use the %s argument." % (option,))
sys.stderr.write("Upgrade your version with:\n")
sys.stderr.write("  python -m pip install chemfp -i https://chemfp.com/packages --upgrade\n")
raise SystemExit(10)
from chemfp import rdkit_toolkit
return rdkit_toolkit



parser = argparse.ArgumentParser(
description="Compute MCS similarity scores between a query and target structure or structures",
epilog = EPILOG,
formatter_class=argparse.RawDescriptionHelpFormatter,
)

# Query / target
parser.add_argument("--query", metavar="SMILES",
help="default query SMILES (default: naphthalene)")
target_group = parser.add_mutually_exclusive_group()
target_group.add_argument("--target", metavar="SMILES",
help="default target SMILES (default: phenol)")
target_group.add_argument("--targets", metavar="FILENAME",
help="read target structures from FILENAME; use '-' for stdin")

# MCS arguments
parser.add_argument("--maximize-atoms", action="store_true",
help="maximize the number of atoms (the default maximizes the number of atoms")
parser.add_argument("--match-valences", action="store_true",
help="atom matches must have the same valence")
parser.add_argument("--ring-matches-ring-only", action="store_true",
help="ring atoms must only match ring atoms")
parser.add_argument("--complete-rings-only", action="store_true",
help="only allow complete rings to match")
parser.add_argument("--match-chiral-tag", action="store_true",
help="atom matches must have the same chiral tag")
parser.add_argument("--atom-compare", choices=("Any", "AnyHeavyAtom", "Elements", "Isotopes"),
default="Elements",
help="method for comparing two atoms (default: Elements)")
parser.add_argument("--bond-compare", choices=("Any", "Order", "OrderExact"),
default="Order",
help="method for comparing two bonds (default: Order")
parser.add_argument("--ring-compare", choices=("IgnoreRingFusion", "PermissiveRingFusion", "StrictRingFusion"),
default="IgnoreRingFusion",
help="method for handling ring fusion (default: 'IgnoreRingFusion'")

_atom_comparisons = {
"Any": rdFMCS.AtomCompare.CompareAny,
"AnyHeavyAtom": rdFMCS.AtomCompare.CompareAnyHeavyAtom,
"Elements": rdFMCS.AtomCompare.CompareElements,
"Isotopes": rdFMCS.AtomCompare.CompareIsotopes,
}
_bond_comparisons = {
"Any": rdFMCS.BondCompare.CompareAny,
"Order": rdFMCS.BondCompare.CompareOrder,
"OrderExact": rdFMCS.BondCompare.CompareOrderExact,
}
_ring_comparisons = {
"IgnoreRingFusion": rdFMCS.RingCompare.IgnoreRingFusion,
"PermissiveRingFusion": rdFMCS.RingCompare.PermissiveRingFusion,
"StrictRingFusion": rdFMCS.RingCompare.StrictRingFusion,
}

# Configure how to parse the query and target(s)
parser.add_argument("--query-format", metavar="FORMAT",
help="query format (default is 'smi' for SMILES)")
parser.add_argument("--target-format", metavar="FORMAT",
help="target format (default: based on filename extension, defaults to SMILES")
parser.add_argument("--id-tag", metavar="STR",
h

Re: [Rdkit-discuss] Morgan FP atom numbering

2020-10-27 Thread Andrew Dalke
On Oct 26, 2020, at 17:41, Cyrus Maher  wrote:
> I’m wondering if there is an easy way to retrieve the atom numbers that the 
> morgan fingerprints algorithm assigns as its first step.

Many of the fingerprint function support an optional "bitInfo" parameter. If 
it's a dictionary then the keys are the bit that was set, and the value is at 
tuple of the (atom index, radius) which set it.

Here's an example with theobromine using r=0, which lets you see the initial 
invariants:

>>> from rdkit import Chem
>>> from rdkit.Chem import rdMolDescriptors
>>> mol = Chem.MolFromSmiles("Cn1cnc2c1c(=O)[nH]c(=O)n2C")
>>> bitInfo = {}
>>> fp = rdMolDescriptors.GetMorganFingerprintAsBitVect(mol, radius=0, 
>>> useFeatures=1, bitInfo=bitInfo)
>>> for bitno, pairs in sorted(bitInfo.items()):
...   print(f"Bitno: {bitno}")
...   for atom_idx, r in pairs:
... print(f"  atom {atom_idx} ({mol.GetAtomWithIdx(atom_idx).GetSymbol()}) 
with radius {r}")
...
Bitno: 0
  atom 0 (C) with radius 0
  atom 12 (C) with radius 0
Bitno: 2
  atom 7 (O) with radius 0
  atom 10 (O) with radius 0
Bitno: 4
  atom 2 (C) with radius 0
  atom 4 (C) with radius 0
  atom 5 (C) with radius 0
  atom 6 (C) with radius 0
  atom 9 (C) with radius 0
Bitno: 5
  atom 8 (N) with radius 0
Bitno: 6
  atom 1 (N) with radius 0
  atom 3 (N) with radius 0
  atom 11 (N) with radius 0

If I follow the code correctly, when useFeatures == 1 then the intial 
invariants are set by getFeatureInvariants() in 
./Code/GraphMol/Fingerprints/FingerprintUtil.cpp , available at:

https://github.com/rdkit/rdkit/blob/d594998dda2803a15aa0066e06f2477b71ed3be6/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L221

A few lines up, at 
https://github.com/rdkit/rdkit/blob/d594998dda2803a15aa0066e06f2477b71ed3be6/Code/GraphMol/Fingerprints/FingerprintUtil.cpp#L182
 , you'll see the features patterns defined in smartsPatterns

They appear to be identical to the list you gave.

I reimplemented the initialization function (copied at the end of this email). 
Running the program shows that it produces the same invariants which are used 
as the bit numbers in the Morgan feature fingerprint:

Invariant: 0
  atom 0 (C)
  atom 12 (C)
Invariant: 2
  atom 7 (O)
  atom 10 (O)
Invariant: 4
  atom 2 (C)
  atom 4 (C)
  atom 5 (C)
  atom 6 (C)
  atom 9 (C)
Invariant: 5
  atom 8 (N)
Invariant: 6
  atom 1 (N)
  atom 3 (N)
  atom 11 (N)


I believe that gives you two ways to get the information you want!

Best regards,

Andrew
da...@dalkescientific.com




# Python re-implementation of RDKit's getFeatureInvariants() from
# ./Code/GraphMol/Fingerprints/FingerprintUtil.cpp

from rdkit import Chem

smartsPatterns = [
"[$([N;!H0;v3,v4&+1]),\
$([O,S;H1;+0]),\
n&+0]",  # Donor
"[$([O,S;H1;v2;!$(*-*=[O,N,P,S])]),\
$([O,S;H0;v2]),\
$([O,S;-]),\
$([N;v3;!$(N-*=[O,N,P,S])]),\
n&+0,\
$([o,s;+0;!$([o,s]:n);!$([o,s]:c:n)])]",# Acceptor
"[a]",  # Aromatic
"[F,Cl,Br,I]",  # Halogen
"[#7;+,\
$([N;H2&+0][$([C,a]);!$([C,a](=O))]),\
$([N;H1&+0]([$([C,a]);!$([C,a](=O))])[$([C,a]);!$([C,a](=O))]),\
$([N;H0&+0]([C;!$(C(=O))])([C;!$(C(=O))])[C;!$(C(=O))])]",  # Basic
"[$([C,S](=[O,S,P])-[O;H1,-1])]"# Acidic
]

mol = Chem.MolFromSmiles("Cn1cnc2c1c(=O)[nH]c(=O)n2C")
invariants = [0] * mol.GetNumAtoms()
for pattern_idx, smartsPattern in enumerate(smartsPatterns):
pat = Chem.MolFromSmarts(smartsPattern)
for (atom_idx,) in mol.GetSubstructMatches(pat):
invariants[atom_idx] |= (1

Re: [Rdkit-discuss] sd file format question

2020-10-02 Thread Andrew Dalke
ou can also use the --verify flag to generate and compare the SMILES strings 
before and after the conversion.

Best regards,


Andrew
da...@dalkescientific.com
# Copy charges from the "M  CHG" data lines to the atom block
# Written by Andrew Dalke, 2 October 2020
import argparse
import sys
import gzip

# This requires chemfp. See https://chemfp.com/license/ for licenses.
# The default license is the "chemfp Base License"
from chemfp import text_toolkit # For Linux: python -m pip install chemfp -i https://chemfp.com/packages/

# From the documentation:
#   0 = uncharged or value other than these, 1 = +3, 2 = +2, 3 = +1,
#   4 = doublet radical, 5 = -1, 6 = -2, 7 = -3
_chg_to_ccc_table = {
+3: 1,
+2: 2,
+1: 3,
-1: 5,
-2: 6,
-3: 7,
}

class ChargeError(Exception):
pass

def set_atom_block_charges(record, verify=False):
lines = record.splitlines(True)
num_atoms = int(lines[3][:3])
num_bonds = int(lines[3][3:6])

for line in lines[3 + num_atoms + num_bonds:]:
if line.startswith(b"M  END"):
break

if not line.startswith(b"M  CHG"):
continue

# Process each charge assignment on this line
try:
num_fields = int(line[6:9])

for field_index in range(num_fields):
offset = 10 + 8*field_index
atom_index = int(line[offset:offset+3])
chg = int(line[offset+4:offset+7])

# 0 means uncharged or unsupported charge
ccc = _chg_to_ccc_table.get(chg, 0)

# Update the correct atom line to have the charge
atom_line = lines[3 + atom_index]
atom_line = b"%s%3d%s" % (atom_line[:36], ccc, atom_line[39:])
lines[3 + atom_index] = atom_line

except (ValueError, IndexError) as err:
# This shouldn't happen unless I did something wrong.
raise ChargeError("Unable to process charges in %r" % (line,))

if verify:
# Remove the "M  CHG" line
i = lines.index(line)
lines[i] = b""

return b"".join(lines)

### Handle command-line processing

parser = argparse.ArgumentParser(
description="copy charge information from the 'M  CHG' data line to the atom block")
parser.add_argument("--output", "-o", metavar="FILENAME",
help="output file name (default: stdout)")
parser.add_argument("--roundtrip", action="store_true",
help="use RDKit to parse the record and regenerate the SDF record")
parser.add_argument("--verify", action="store_true",
help="ensure the input and output SMILES match")
parser.add_argument("--no-set", action="store_true",
help="don't set the charges (useful if you want to see the round-trip output)")
parser.add_argument("input", nargs="?", metavar="FILENAME",
help="input filename (default: stdin)")


def open_binary_output(filename):
if filename is None:
return getattr(sys.stdout, "buffer", sys.stdout) # support Python 2 and 3
elif filename.endswith(".gz"):
return gzip.open(filename, "wb")
else:
return open(filename, "wb")

def main():
args = parser.parse_args()

verify = args.verify
roundtrip = args.roundtrip
no_set = args.no_set

if verify or roundtrip:
from chemfp import rdkit_toolkit as T


with text_toolkit.read_sdf_records(args.input) as record_reader:
outfile = open_binary_output(args.output)

for record in record_reader:
if roundtrip:
mol = T.parse_molecule(record, "sdf", errors="ignore")
if mol is None:
sys.stderr.write("Warning: Could not process molecule record: %s -- skipping\n" % (
record_reader.location.where(),))
continue
record = T.create_bytes(mol, "sdf")

if verify:
# Get a reference SMILES string
smi1 = T.create_string(T.parse_molecule(record, "sdf"), "smistring")

if no_set:
output = record
else:
try:
output = set_atom_block_charges(record, verify)
except ChargeError as err:
sys.stderr.write("ERROR: %s: %s\n" % (err, record_reader.location.where()))
raise SystemExit("

Re: [Rdkit-discuss] Smallest possible size of 100*1e6 morgan fingerprints for storage and memory

2020-09-08 Thread Andrew Dalke
On Sep 9, 2020, at 04:00, Lewis Martin  wrote:
> I'd like to keep it FOSS since its for academic publication and hopefully to 
> be re-used. Chemfp is amazing but brute-forcing 100million by 100million 
> would surely still take a long time compared with an approximate nearest 
> neighbor approach. 

As a rough guideline, the chemfp paper Fig. 3 says that searching 100M PubChem 
fingerprints for the k=10 neighbors averages about 100 ms on a Haswell machine 
(See https://jcheminf.biomedcentral.com/articles/10.1186/s13321-019-0398-8 ).

That was for chemfp 3.3 using 881 bits. You have 512 bits, which is faster 
(0.6x time, according to my timings). Chemfp 1.6 is about the same performance 
as chemfp 3.4 for this case.

So that's 60ms per query, on average, or 70 days for a single-processor query.

(It's hard to compare directly though as the BitBound algorithm is sublinear; I 
found the PubChem fingerprint kNN search scales at O(n** ~0.6) while Morgan 
2048-bit scales at O(n** ~0.8). Then again, the Haswell machine is several 
years old, so I'm leaving the numbers as-is for an estimate.

This is easily parallelized yourself, or you can use chemfp's built-in 
parallelized NxN kNN searches. (The amount of scaleup on a single multi-core 
machine depends on your memory architecture.)

This means that it shouldn't be hard to get a <1 week performance out of chemfp.

Performance-wise, chemfp 3.4.1 (the latest version of the commercial track) is 
the same as chemfp 1.6.1 (the latest *no-cost*/open-source version).

One caveat - chemfp 1.x has a limit of 2**31 bytes for the fingerprints, or 
33.6M 512-bit fingerprints. (chemfp 2.x or later use 64-bit indexing and don't 
have this limitation.) To get around that, with 1.6.1, you'll need split your 
data into 4 chucks, and search each independently. This also reduces the effect 
of sublinear scaling.

Or, there's no-cost academic licensing for chemfp 3.x using pre-compiled 
binaries which work on most Linux-based OSes.

Still, I think an exact search is entirely doable using FOSS software.

Switching to a smaller fingerprint size is a form of approximate nearest 
neighbors.

If you take Greg's observation and drop the fingerprint size from 512 bits to 
128 bits then you can get about a factor of 3 faster, and have everything fit 
into chemfp 1.6's 2GB limit.

Best regards,

Andrew
da...@dalkescientific.com


P.S.

On Wed, Sep 9, 2020 at 11:29 AM Francois Berenger  wrote:
> There are many published methods, some open-source software (like 
> Dalke's chemfp) and even some commercial ones
> which claim they are lightning fast (even reaching real-time search 
> speed!).

Dalke's chemfp also claims real-time search speed. ;)

It's always a claim of what size that means. 

Also, at least a couple of those systems use multiple threads to speed up 
single-query search, which interferes with multi-threaded batch search, as with 
NxN search on a single machine.




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Rdkit-discuss] MACCS keys - revisited

2020-09-08 Thread Andrew Dalke
On Sep 8, 2020, at 14:30, Mike Mazanetz  wrote:

> Does anyone know whether it’s possible to obtain not just a fingerprint keys 
> for MACCS (binary values) but the number of occurrences of the keys, 
> particularly these details:

The SMARTS patterns for most of the MACCS keys is available by:

>>> from rdkit.Chem import MACCSkeys
>>> for key, smarts in MACCSkeys.smartsPatts.items():
...   print("[%s] %s" % (key, smarts))
...
[1] ('?', 0)
[2] ('[#104]', 0)
[3] ('[#32,#33,#34,#50,#51,#52,#82,#83,#84]', 0)
[4] ('[Ac,Th,Pa,U,Np,Pu,Am,Cm,Bk,Cf,Es,Fm,Md,No,Lr]', 0)
[5] ('[Sc,Ti,Y,Zr,Hf]', 0)
[6] ('[La,Ce,Pr,Nd,Pm,Sm,Eu,Gd,Tb,Dy,Ho,Er,Tm,Yb,Lu]', 0)
 ...

There are two parts to the right-hand-side: SMARTS pattern and count.

If the SMARTS pattern is a "?", that means the pattern is not defined at the 
SMARTS level.

There must be at least count+1 matches. That is, if the count is 0 then there 
must be at least one match.

You write "the number of occurrences of the keys".

I don't know how that makes sense for all the keys. You have things like:

140: (key(164)-3 if key(164)>3; else 0)
141: (key(160)-2 if key(160)>2; else 0)
142: (key(161)-2 if key(161)>1; else 0)

These correspond to RDKit's definitions:

[140] ('[#8]', 3)
[141] ('[CH3]', 2)
[142] ('[#7]', 1)

How do you count those number of occurrences?


> On Sep 8, 2020, at 21:56, Mike Mazanetz  wrote:
>  The KNIME node does a lot of double counting for the RDKit Substructure 
> Counter, so it’s not a useful tool for counting MACCS keys.

Something like [11] ('*1~*~*~*~1', 0) has many matches due to symmetry.

You have to decide if you think this should be counted once, or if all 8 
matches should be counted. 

The molecule method 'GetSubstructMatches()' has a uniquify option; by default 
it only returns unique counts. ("Unique" is based on unique atoms, not unique 
atoms and bonds. I don't think that distinction affect the MACCS patterns.)

Regards,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] chemfp 1.6 and 3.4 releases

2020-06-25 Thread Andrew Dalke
On Jun 25, 2020, at 16:27, Andrew Dalke  wrote:
> 
> See https://chemfp.com/license/ for details, or to get started:
> 
>   python -m pip install chemfp -i https://chemp.com/packages/ --upgrade

That should be

  python -m pip install chemfp -i https://chemfp.com/packages/ --upgrade

Oops!


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] chemfp 1.6 and 3.4 releases

2020-06-25 Thread Andrew Dalke
Hi RDKit'ers,

  I've just released new versions of chemfp. Version 1.6 is the no-cost/open 
source version, and 3.4 is the commercial version. 

The goal of chemfp 1.6 is to provide a good performance baseline for evaluating 
new Tanimoto search programs. This release is about 10-20% faster than chemfp 
1.5. I believe it is the fastest no-cost/open source search program available 
for CPUs. It can be installed with:

   python -m pip install chemfp --upgrade

However, it only supports Python 2.7, which means it's only useful for legacy 
RDKit installations.

On the other hand, chemfp 3.4 supports Python 2.7 and Python 3.6+, is faster, 
and has many more features than chemfp 1.6, including a "toolkit" I/O 
abstraction to handle reading/writing structures to/from a file or a string. 
The comprehensive documentation is at:

   https://chemfp.readthedocs.io/en/latest/

See the "What's New" section for a list of changes over time.

Chemfp 3.4 is available with academic and commercial (proprietary) licenses, 
both for pre-compiled binary packages which work on most Linux-based OSes, and 
for an option also get source code access. Open source licensing is still 
available, as the most expensive option.

With this release I am making those pre-compiled packages available for anyone 
to download under a base license agreement for in-house use, which limits or 
restricts access to some features:

• fingerprint arenas may not be larger than 50,000 fingerprints;
• in-memory arena searches may not have more than 50,000 queries or 
targets;
• FPS searches may not have more than 20 queries;
• Tversky search is disabled;
• writing FPB files is disabled.

These can be enabled with a valid license key.

There's a special waiver which lets you use this version of chemfp to generate 
FPS files for any purpose, including commercial use. (This also gives a way for 
chemfp 1.6 to be a useful benchmark baseline even with new data sets and 
fingerprint types.)

See https://chemfp.com/license/ for details, or to get started:

   python -m pip install chemfp -i https://chemp.com/packages/ --upgrade

A one-year no-cost license key for the pre-compiled packages is also now 
available for academics.

Contact me if you have questions or are interested in evaluating or licensing 
chemfp.
 

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Number of sp3 atoms

2020-05-31 Thread Andrew Dalke
On May 31, 2020, at 15:23, Chris Swain via Rdkit-discuss 
 wrote:
> I’d like to include the number of sp3 atoms, is there an easy way to do this?

I don't easily see a function for that. There's 
rdMolDescriptors.CalcFractionCSP3() which "returns the fraction of C atoms that 
are SP3 hybridized".

You can do it yourself by looking at the atom's hybridization:


>>> mol = Chem.MolFromSmiles("CN1C=NC2=C1C(=O)N(C(=O)N2C)C")
>>> sum((a.GetHybridization() == Chem.HybridizationType.SP3) for a in 
>>> mol.GetAtoms())
3

Cheers,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] SMILES/SMARTS codes that match multiple atoms

2020-02-08 Thread Andrew Dalke
On Feb 8, 2020, at 17:55, Janusz Petkowski  wrote:
> 
> If not how can I match cases where in a given position there can be C or H 
> with rdkit?

I believe you should use #1 instead of H.


>>> from rdkit import Chem
>>> mols = [Chem.MolFromSmiles(s) for s in ["C(=O)OC", "C(=O)OCC", "C(=O)OCCC"]]
>>> hmols = [Chem.AddHs(mol) for mol in mols]


  Your pattern:

>>> pat1 = Chem.MolFromSmarts("[H]C(=O)OC([C,H])([H])[H]")
>>> [mol.HasSubstructMatch(pat1) for mol in hmols]
[False, True, True]

  Using #1 instead of H:

>>> pat2 = Chem.MolFromSmarts("[H]C(=O)OC([C,#1])([#1])[#1]")
>>> [mol.HasSubstructMatch(pat2) for mol in hmols]
[True, True, True]


"H" has an odd interpretation. 
https://www.daylight.com/dayhtml/doc/theory/theory.smarts.html says:

Note that atomic primitive H can have two meanings,
implying a property or the element itself. [H] means
hydrogen atom. [*H2] means any atom with exactly
two hydrogens attached

I believe the goal of having [H] match a hydrogen atom is to allow a SMILES, 
when interpreted as a SMARTS, to be able to match the SMILES when interpreted 
as a molecule. I'm not sure about that though.

Cheers,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] last call for mmpdb funding

2020-01-22 Thread Andrew Dalke
Hi all,

  This is the last email I'll send asking for people and organizations to join 
the current mmpdb crowdsourcing effort.

I've discussed it several times before here. In summary, I'm looking for 
crowdfunding for the matched molecular pair program 'mmpdb'. This is part of a 
test to find alternative ways to raise money for open source development in 
cheminformatics. See http://mmpdb.dalkescientific.com/ for details.

Currently I've pleased to say the effort has raised EUR 17 500. This is enough 
to fully finish off Postgres support and the new 'proprulecat', and pay back 
for the time taken to organize this funding effort.

In addition, it passed the EUR 16 000 goal, which means that next month I'll 
work change mmpdb so it stores the environment as a more easily interpretable 
fragment SMILES, rather than a hashed version of the circular environments.

The next funding goal is EUR 23 000, which is EUR 5 500 away. If I reach that 
funding goal, I will commit make a public/no-cost release of the new code in 
Oct. 2020.
 
I've extended the deadline to join to 15 February because some additional 
marketing will go out at the end of this month, and I want to give recipients a 
chance to participate.

People can still purchase mmpdb after the deadline is reached. Those goals 
merely represent specific commitments from me to work on mmpdb.

For those interested in budget, or who think that EUR 17 500 is already a large 
amount of money.

EUR 17 500 is corporate income, not salary income. About 50% of that goes to 
payroll taxes. The average software developer salary here in Sweden is about 
EUR 50 000, so EUR 9 000 is about 2 months of time. (I would be making above 
average salary should I decide to go corporate.) I spent several weeks working 
on the web site - which is essential marketing for crowdsourcing - and paid a 
web designer to help. I've also spent a few weeks working on improvements 
already delivered to customers. Even invoicing takes time.

But the goal of this effort isn't for me to be rich - though that would be 
nice. It's to see if an open-source project like mmpdb can be economically 
self-sustaining though funding by users interested in paying for specific new 
features and support.

Best regards,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] acepentalene aromaticity perception

2020-01-22 Thread Andrew Dalke
On Jan 22, 2020, at 14:12, Greg Landrum  wrote:
> As an aside: it's not particularly relevant to this discussion, but I don't 
> understand why the wikipedia page says that the compound is anti-aromatic. I 
> think the standard definition of anti-aromaticity (agrees with the one linked 
> to from the acepentalene page) requires the ring system to have 4n electrons. 
> That definitely doesn't apply here to either the individual rings or the 
> system as a whole. The system as a whole has 10 electrons (4n+2), the 
> individual rings each have 5 (neither aromatic nor anti-aromatic), and the 
> outer envelope has 9 (again, neither aromatic nor anti-aromatic).

Because I didn't know either, I looked into it.

I think that's because (to quote "Towards experimental determination of conical 
intersection properties:a twin state based comparison with bound excited 
states", Phys. Chem. Chem. Phys., 2011,13, 11872–11877 [*] )

> A Hückel MO analysis[21] leads to the conclusion that the ground state of the 
> conjugated tricyclic acepentalene I is a triplet state. DFT calculations 
> corrected this picture and showed a singlet global minimum distorted to C_s 
> symmetry with alternated single and double bonds,[22] which are well 
> described by the Lewis structures A(B,C). According to a B3LYP/6-31G* 
> calculation the lowest triplet state has also a high symmetric C_3v 
> configuration and lies 3.9 kcal/mol above the singlet ground state minimum. 
> Acepentalene I was characterized as an antiaromatic system [23] despite being 
> formally an aromatic 10 electron system: the resonance between each pair of 
> Kekule structures in this case involves only 4 electron pairs of the 
> pentalene fragments and it averts the resonance with the additional fifth 
> electron pair common for both the structures. Such a resonance is described 
> as an anti-combination of two Kekule structures: (A–B), (C–B) and (C–A).

Just need to add B3LYP/6-31G* calculations to RDKit's aromaticity perception 
algorithm and everything will be fine. :)

The "characterized as an antiaromatic system[23]" is "T. K. Zywietz, H. Jiao, 
P. v. R. Schleyer and A. de Meijere, J. Org.Chem., 1998, 63, 3417" at 
https://pubs.acs.org/doi/abs/10.1021/jo980089f .


Cheers,

Andrew
da...@dalkescientific.com

[*] 
https://www.researchgate.net/profile/Shmuel_Zilberg/publication/51175586_Towards_experimental_determination_of_conical_intersection_properties_A_twin_state_based_comparison_with_bound_excited_states/links/561bb5bc08ae6d17308b037f/Towards-experimental-determination-of-conical-intersection-properties-A-twin-state-based-comparison-with-bound-excited-states.pdf



___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] acepentalene aromaticity perception

2020-01-09 Thread Andrew Dalke
Hi all,

Could someone explain the following, which uses the SMILES from 
https://en.wikipedia.org/wiki/Acepentalene :

>>> from rdkit import Chem
>>> Chem.CanonSmiles("C1=CC2=CC=C3C2=C1C=C3")
'c1cc2ccc3ccc1-c=3-2'
>>> import rdkit
>>> rdkit.__version__
'2019.09.1'

I don't understand the aromatic "c" in the fused center of the 3 5-membered 
rings. It's connected by non-aromatic bonds to the rest of the system.

This broke some code of mine which expects that every aromatic atom must have 
at least two aromatic bonds. I thought that all aromatic atoms had to be in 
aromatic rings, and that all aromatic rings had to have aromatic bond.

(I'm ignoring RDKit's support for aromatic triple bonds in this description.)

I searched for "acepentalene" and "antiaromatic" in the issue tracker and the 
mailing list but found nothing relevant.

Cheers,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Clearing isotope info

2019-12-12 Thread Andrew Dalke
On Dec 12, 2019, at 17:39, Rafal Roszak  wrote:
> I also had situation when I need to generate smiles with either
> isotopes or stereochemistry but not both. Maybe it is worth to add two
> options to ChemMolToSmiles function:
> 
> dontIncludeStereochemistry=True/False
> dontIncludeIsotopes=True/False
> 
> Right now it is not straightforward to generate smiles w/o isotopes
> (but with stereochemistry) - one need to remove isotope, export to
> smiles and restore isotopes.

Bear in mind a few complications. I believe the following correctly implements 
what you describe:

from rdkit import Chem

def MolWithoutIsotopesToSmiles(mol):
  atom_data = [(atom, atom.GetIsotope()) for atom in mol.GetAtoms()]
  for atom, isotope in atom_data:
  if isotope:
  atom.SetIsotope(0)
  smiles = Chem.MolToSmiles(mol)
  for atom, isotope in atom_data:
  if isotope:
  atom.SetIsotope(isotope)
  return smiles


>>> mol = Chem.MolFromSmiles("[19F][13C@H]([16OH])[35Cl]")
>>> MolWithoutIsotopesToSmiles(mol)
'O[C@@H](F)Cl'

Testing reveals two problems with my implementation:

1) isotopic hydrogens 

Consider the same structure with tritium instead of fluorine:

>>> mol = Chem.MolFromSmiles("[3H][13C@H]([16OH])[35Cl]")
>>> MolWithoutIsotopesToSmiles(mol)
'[H][C@H](O)Cl'

That output should be 'OCCl'.

2) stereochemistry assignment needs to be recalculated after the isotopes have 
been removed:

>>> mol = Chem.MolFromSmiles("C[C@H]([13CH3])CI")
>>> MolWithoutIsotopesToSmiles(mol)
'C[C@@H](C)CI'
>>> Chem.CanonSmiles("C[C@H]([CH3])CI")
'CC(C)CI'

2b) This includes directional bonds

>>> mol = Chem.MolFromSmiles("C/C(=C/CO)/[11CH3]")
>>> MolWithoutIsotopesToSmiles(mol)
'C/C(C)=C/CO'
>>> Chem.CanonSmiles("C/C(=C/CO)/[CH3]")
'CC(C)=CCO'



Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] assign all bond directions in SMILES

2019-11-19 Thread Andrew Dalke
Hi all,

  Is there any way to assign all bond directions (E/Z stereochemistry) to the 
output SMILES string?

For example, here's a structure:

>>> mol = Chem.MolFromSmiles(r"F/C(Cl)=C(O)/N")
>>> Chem.MolToSmiles(mol)
'N/C(O)=C(/F)Cl'

It's a minimal definition, in that I could have specified the directions for 
all of the bonds:

>>> mol = Chem.MolFromSmiles(r"F/C(/Cl)=C(\O)/N")
>>> Chem.MolToSmiles(mol)
'N/C(O)=C(/F)Cl'

Note that RDKit figured out which bond directions were minimal.

The underlying code checks for conflicting assignments: 

>>> mol = Chem.MolFromSmiles(r"F/C(/Cl)=C(/O)/N")
[18:25:25] Conflicting single bond directions around double bond at index 2.
[18:25:25]   BondStereo set to STEREONONE and single bond directions set to 
NONE.
>>> Chem.MolToSmiles(mol)
'NC(O)=C(F)Cl'

What I want is some way to go from

  N/C(O)=C(/F)Cl

to a fully specified

  F/C(/Cl)=C(\O)/N


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] MolToSmiles preserve atom order

2019-11-18 Thread Andrew Dalke
On Nov 18, 2019, at 17:40, David Cosgrove  wrote:
> 
> Point taken. I don’t think you’d be able to get RDKit to spit such SMILES 
> strings out unless you tortured it pretty hard, however. 

Did someone mention one of my favorite things to do? :)  See:

  
http://dalkescientific.com/writings/diary/archive/2010/12/28/reordering_smiles.html

Note that that code does not preserve stereochemistry.

It's for Python 2, so change the:

  available_closures = range(100)

to

 available_closures = list(range(100))

to make it work under Python 3.

Here's what it looks like:

>>> from x import reordered_smiles
>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("OCCl")
>>> atoms = list(mol.GetAtoms())
>>> reordered_smiles(mol, [atoms[1], atoms[0], atoms[2]])
'[CH2]12.[OH1]1.[Cl]2'



Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] Second call for mmpdb funding

2019-11-15 Thread Andrew Dalke
Hi all,

  The end of the year is coming up. Perhaps there's extra money in your budget 
which can go to support open source development in cheminformatics? 

As many of you know, I started a crowdfunding effort to fund improvements to 
the matched molecular pair program "mmpdb". I want to see if crowdfunding could 
help pay for long-term project sustainability in our field.

At the institutional level, this effort is very much like a standard 
consortium. If you join the crowdfunding effort, you get new features which are 
not available to the general public, such as a command-line tool to export the 
rules and Postgres support. These are available right now.

As more people join the consortium, I will commit to adding new features. One 
which interests many people is to replace the existing opaque hash environment 
fingerprint with a fragment SMILES. That will be funded if two additional 
corporate members join.

Everyone who joins gets the new features under a permissive open source 
license. If enough people join, then I will commit to releasing the software to 
the public in the fall of next year.

If you want mmpdb to be supported and improved, then you should get your 
organization to join the funding effort.

Full details at http://mmpdb.dalkescientific.com/ .

Best regards,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] missing MolFromSmiles error output in Jupyter

2019-10-16 Thread Andrew Dalke
Dear Stéphane,


> On Oct 16, 2019, at 19:39, Téletchéa Stéphane 
>  wrote:
> Did you 'by chance' transmit your presentation in PDF?

Yes, I exported my Keynote.app presentation to PDF.

However, I also sent the specific commands in email as plain text, as part of 
the process of trying to diagnose the problem.

> This may come from a weird bug in quote management from a PDF file when you 
> copy/paste double quotes,

The student sent me screen grabs. I can see that quotes are not the issue.

If they were, I would expect to see a "SyntaxError: invalid character in 
identifier" as Python doesn't recognize left-handed or right-handed double 
quotes in a way that doesn't cause an error. Instead, the MolFromSmiles() 
appears to be evaluated.

For what it's worth, I've long since disabled smart quotes in Keynote.app, and 
haven't had this problem with other students, which includes students working 
on Windows.

The only thing I can see that's different is that my previous students used a 
2018 release of RDKit, not 2019. And of course an older version of Jupyter.


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] missing MolFromSmiles error output in Jupyter

2019-10-16 Thread Andrew Dalke
Hi all,

  I wasn't able to give my RDKit training session at the last UGM, so I passed 
out the presentation materials to the students who signed up. One of them wrote 
to me asking why the following didn't display an error message in the notebook.

  from rdkit import Chem
  from rdkit.Chem.Draw import IPythonConsole
  mol = Chem.MolFromSmiles("Q")

On my machine I get a red message from stderr:

  RDKit ERROR: [15:41:35] SMILES Parse Error: syntax error for input: 'Q'

so I don't know why it doesn't work for the student (who sent me screenshots 
demonstrating the lack).

sys.version says:

  3.7.4 (default, Aug  9 2019, 18:34:13) [MSC v.1915 64 bit (AMD64)]

(which means it's an Anaconda build for MS Windows)

and rdkit.__version__ is

  2019.03.4

(I have an older version of RDKit on my laptop, which I used to verify that I 
get the message.)

Any idea on what's going on, or what I can suggest trying?


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] SubstructMatch of identical Mols returns different results

2019-10-03 Thread Andrew Dalke


On Oct 3, 2019, at 20:34, Ondrej Gutten via Rdkit-discuss 
 wrote:
> # MCS is a benzene
> my_mcs = Chem.MolFromSmiles(res.smarts)

The res.smarts (or res.smartsString if you use the rdFMCS module) returns a 
SMARTS string, not a SMILES string. You should be using Chem.MolFromSmarts() in 
the code I quoted.

More specifically, the MCS SMARTS pattern is:  [#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1

This can be interpreted as a SMILES, because RDKit supports "#6" as part of its 
extensions to the SMILES grammar. The "#6" means an element with atomic number 
6.

  >>> Chem.CanonSmiles("[#6]1:[#6]:[#6]:[#6]:[#6]:[#6]:1")
  '[c]1[c][c][c][c][c]1'

However, benzene has a different SMILES:

  >>> Chem.CanonSmiles("c1c1")
  'c1c1'
  >>> Chem.MolToSmiles(Chem.MolFromSmiles("c1c1"), allBondsExplicit=True, 
allHsExplicit=True)
  '[cH]1:[cH]:[cH]:[cH]:[cH]:[cH]:1'

Each of the benzene atoms has a hydrogen on it.

The difference appears when you call:

   mol1.HasSubstructMatch(my_mcs)

There is a difference if my_mcs is a molecule from a SMILES vs. from a SMARTS. 
They have different definitions of what it means to match. One of the 
differences is that a SMILES-made molecule considers the hydrogen counts:

>>> mol = Chem.MolFromSmiles("OCc1c1")
>>> query = Chem.MolFromSmiles("OC");mol.GetSubstructMatches(query)
((0, 1),)
>>> query = Chem.MolFromSmiles("[O]C");mol.GetSubstructMatches(query)
()
>>> query = Chem.MolFromSmiles("[OH]C");mol.GetSubstructMatches(query)
((0, 1),)

while if I made the query from a SMARTS:

>>> query = Chem.MolFromSmarts("[O]C");mol.GetSubstructMatches(query)
((0, 1),)


Cheers,


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] mmpdb crowdfunding project has started

2019-09-23 Thread Andrew Dalke
Hi all,

  In August I sent a pre-announcement email about my mmpdb crowdfunding project.

The project is now live, at http://mmpdb.dalkescientific.com/ .

The basic idea is that I can commit to developing a few features for mmpdb.

  • Postgres support, as an alternative to the existing SQLite support;
  • A new ‘mmpdb proprulecat’ command to export the property rules in the 
database (transformation details and property statistics of the associated 
pairs) in CSV form, along with a fragment SMILES for the given environment.

Anyone who joins the consortium (at least EUR 1 000 for academics, EUR 5 000 
for industry) will get these features under the existing open source license.

As more people join, the additional funding will be used to improve mmpdb. All 
consortium members will get the new improvements. These new features will only 
be implemented if a certain funding level is reached, for example, with EUR 16 
000 in funding I'll add support for storing and using SMILES strings for the 
environment fragments.

To incentivize people to join, I don't plan to release the code to the public 
until receiving at least EUR 23 000 in funding, in which case I'll release it 
by 1 October 2020. (Since it's open source, consortium members are also free to 
distribute the new code.) With EUR 50 000 in funding, I'll release the software 
to the public as soon as I can.

The full list of goals and funding levels is at 
http://mmpdb.dalkescientific.com/goals.html .

I'll be at the RDKit UGM in Hanburg in a couple of days, so if you are there 
and have questions about it, or about funding open source software in 
chemformatics (or about chemfp, or SMILES, or many other topics), come talk 
with me!

Cheers,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Tanimoto and fingerprint representation

2019-09-14 Thread Andrew Dalke
Hi Jan,

  The GetMorganFingerprint() returns count fingerprints, and the Tanimoto 
calculation does the full Jaccard similarity, including the counts. 

The GetMorganFingerprintAsBitVect() version only uses the keys (that is, it 
treats all non-zero values as being 1) when computing the Tanimoto.

> On Sep 14, 2019, at 11:07, Jan Halborg Jensen  wrote:
> 
> When using GetMorganFingerprintAsBitVect I get the “expected” Tanimoto score
> 
> mol1 = Chem.MolFromSmiles('CCC')
> mol2 = Chem.MolFromSmiles('CNC')
> 
> fp1 = AllChem.GetMorganFingerprintAsBitVect(mol1,2,nBits=1024)
> fp2 = AllChem.GetMorganFingerprintAsBitVect(mol2,2,nBits=1024)

>>> list(fp1.GetOnBits())
[33, 80, 294, 320]
>>> list(fp2.GetOnBits())
[33, 128, 406, 539]

You can see the intersection is 1 and the union is 7, giving 1/7 = 0.142... as 
the Tanimoto, which is what you demonstrated was the result.

> However, when using GetMorganFingerprint I get a difference score.
> 
> fp1 = AllChem.GetMorganFingerprint(mol1,2)
> fp2 = AllChem.GetMorganFingerprint(mol2,2)

>>> fp1.GetNonzeroElements()
{2068133184: 1, 2245384272: 1, 2246728737: 2, 3542456614: 2}
>>> fp2.GetNonzeroElements()
{847961216: 1, 869080603: 1, 2246728737: 2, 3824063894: 2}

Note that there is one shared key (2246728737) while the other 7 are unique. 
The binary Tanimoto - treating all counts as 1 - gives 1/7, matching the 
BitVect version.

On the other hand, the common value 2246728737 is present 2 times in each 
fingerprint, and 3542456614 and 3824063894 are each present twice in one 
fingerprint, so the Jaccard, or count Tanimoto, is

   2 / ((1+1+2+2)+(1+1+2+2)-2) = 2/10 = 0.2

matching the value you computed.


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] aromatic bonds and graph edit distance

2019-08-21 Thread Andrew Dalke
Hi Jameed,

  I don't think your approach will work, which means I likely didn't explain 
myself well enough.

Let's say I start with:

   Cc1cc2c2c(=O)o1 - 
https://cactus.nci.nih.gov/chemical/structure/Cc1cc2c2c(=O)o1/image

I want to break the aromatic bond between the aromatic 'c' (also double-bonded 
to the O) and the aromatic 'o'.

This is possible because a Kekule interpretation of that molecule is: 

>>> from rdkit import Chem
>>> mol = Chem.MolFromSmiles("Cc1cc2c2c(=O)o1")
>>> Chem.Kekulize(mol)
>>> Chem.MolToSmiles(mol)
'Cc1cc2c2c(=O)o1'
>>> Chem.MolToSmiles(mol, kekuleSmiles=True)
'CC1=CC2=CC=CC=C2C(=O)O1'

and I could break it by simply inserting a "." in the right spot, like this:

  CC1=CC2=CC=CC=C2C(=O).O1

>>> mol2 = Chem.MolFromSmiles("CC1=CC2=CC=CC=C2C(=O).O1")
>>> Chem.MolToSmiles(mol2)
'CC(O)=Cc1c1C=O'

The resulting structure still has an aromatic ring, but not all of the atoms 
which were flagged as aromatic are still aromatic.

With the approach you outlined, I don't know how to restored the modified 
structure so I can generate the final SMILES I expect.

The Kekulize() approach finds a valid Kekule interpretation, but there may be 
multiple interpretations, allowing both a single and double bond for a given 
bond position. I think I want to generate both possibilities.

So far the only solution I've come up with - and it's a real brute-force one - 
is to do complete enumeration, along the lines of the code after my signature.


% python get_kekule_bonds.py 'Cc1cc2c2c(=O)o1'
bondfragatoms   allowed
1   c:c 1,2 DOUBLE
2   c:c 2,3 SINGLE
3   c:c 3,4 SINGLE DOUBLE
4   c:c 4,5 SINGLE DOUBLE
5   c:c 5,6 SINGLE DOUBLE
6   c:c 6,7 SINGLE DOUBLE
7   c:c 7,8 SINGLE DOUBLE
8   c:c 8,9 SINGLE
10  c:o 9,11SINGLE
11  c:o 11,1SINGLE
12  c:c 8,3 SINGLE DOUBLE

(To be usable for what I want, I also need to figure out where those bond terms 
are in the Kekeule SMILES, so I can either replace it with a "." or remove the 
ring closure, then re-perceive.)

This can work for what I'm looking for, since I have <=14 heavies, but I'm 
hoping for something rather less crude.

Andrew
da...@dalkescientific.com


# List possible Kekule bond assignments for aromatic bonds in a SMILES
from rdkit import Chem
import itertools
import sys

def get_kekule_bonds(mol):
"""mol -> possible bond assignment table

Returns a table mapping bond index (for aromatic bonds) to
possible Kekule bond assignments for that bond.
"""
# Figure out which bonds are aromatic
aromatic_bond_indices = {}
for bond in mol.GetBonds():
if bond.GetIsAromatic():
a1, a2 = bond.GetBeginAtomIdx(), bond.GetEndAtomIdx()
if a1 > a2:
a1, a2 = a2, a1
aromatic_bond_indices[(a1, a2)] = bond.GetIdx()

if not aromatic_bond_indices:
return {}

# Look for assignments which result in the same Kekeule SMILES
# (with aromatic bonds indicated as ':')
expected_kekule_smiles = Chem.MolToSmiles(mol, kekuleSmiles=True)

# Initialize the table of bond assignments
possible_bond_assignments = dict((idx, set()) for idx in 
aromatic_bond_indices.values())

# Figure out how to get from the canonical Kekule SMILES output order
# back to the moleule order.
s = mol.GetProp("_smilesAtomOutputOrder")
assert s[0] == "[" and s[-2:] == ",]", s
invert_order = dict(enumerate(map(int, s[1:-2].split(","

# Figure out the places where I can replace a ":" with a "-" or "="
parts = expected_kekule_smiles.split(":")
N = len(parts)-1
assert N == len(aromatic_bond_indices)
terms = [None] * (2*N+1)
terms[::2] = parts

# Generate all possible bond assignments (generates 2**N assignments)
for bond_assignment in itertools.product( *(("-=",)*N) ):
# Replace the places where the ":" used to be (or rather, the previous 
assignment)
# and interpret as a new SMILES
terms[1::2] = bond_assignment
new_kekule_smiles = "".join(terms)
new_mol = Chem.MolFromSmiles(new_kekule_smiles)
if new_mol is None:
continue

# Was there a match?
canonical_kekule_smiles = Chem.MolToSmiles(new_mol, kekuleSmiles=True)
if canonical_kekule_smiles != expected_kekule_smiles:
continue

# Yes! That means we can figure out what the actual bond assignments
# were. Re-process the Kekule SMILES, but don't do aromaticity 
perception.
raw_mol = Chem.MolFromSmiles(new_kekule_smiles, sanitize=False)
for bond in raw_mol.GetBonds():
# Need to map back to the original bonds
original_a1 = invert_order[bond.GetBeginAtomIdx()]
original_a2 = 

Re: [Rdkit-discuss] aromatic bonds and graph edit distance

2019-08-21 Thread Andrew Dalke


On Aug 21, 2019, at 03:42, Francois Berenger  wrote:
> Unless rdkit has something, I think graph edit distance is the kind
> of things for which you have to rely on a good graph library.

Do you know of any (non-chemical) graph library which can handle edits 
involving the breaking of aromatic bonds in a chemically correct way? I do not.

> Also, maybe the string edit distance between the two canonical smiles is a 
> good enough proxy.

This attempt of mine now, to experiment with graph edit distance, came out of a 
conversation I had last week with someone using string edit distance. I 
expressed doubt on how "good" the "good enough" was, but was unable to give any 
concrete details.

I earlier wrote:
>> For chain bonds, and non-aromatic bonds, it's easy to delete the bond
>> and add the correct number of hydrogens to either side.

Similarly, for many chain edits, the string edit distance is a decent proxy, as 
you say.

However, has the goodness ever been characterized? Along with a description of 
how to minimize the problems with string edit distance? Some of the obvious 
ones are:

1) Chirality and stereochemistry

L-alanine and D-alanine have a graph edit distance to alanine with unspecified 
chirality are 4 and 5, respectively. 

  N[C@H](C)C(=O)O
  N[C@@H](C)C(=O)O
  NC(C)C(=O)O

This does not seem reasonable. A similar issue occurs with double bond 
sterochemistry, like F/C=C/F vs. FC=CF.

2) Isotopes

Same issue: CN vs. [14CH3]N.

3) Overlapping element symbols

c1c1C and c1c1Cl have an edit distance of 1
c1c1C and c1c1Br have an edit distance of 2

There is no chemical sense for those to have different distances.

I can think of ways to mitigate some of the effects of #1-3. In particularly, a 
substitution matrix (or conversion to pharmacophore reduced graphs) can improve 
#3.

4) Sensitivity to canonicalization order

Depending on the canonicalization method, the following two structures either 
have a string edit distance of 1 or 4, while the graph edit distance is 1.

>>> Chem.CanonSmiles("PCCN")
'NCCP'
>>> Chem.CanonSmiles("CCN")
'CCN'


5) difficulty in handling ring formation in a meaningful way

>>> Chem.CanonSmiles("C1=CC=CC=C1")
'c1c1'
>>> Chem.CanonSmiles("C=CC=CC=C")
'C=CC=CC=C'

There are no shared string synbols, so the string edit distance is 9, yet the 
bond edit distance is only 1.

It is this last issue that I am particularly concerned with, leading me to ask 
about how to handle aromatic bonds when computing the graph edit distance.


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] aromatic bonds and graph edit distance

2019-08-20 Thread Andrew Dalke
Hi all,

  Someone asked me recently about finding the graph edit distance of two small 
(<= 14 atom) fragments.

I figured this was something that could be brute forced. Following SmallWorld's 
example at https://cisrg.shef.ac.uk/shef2016/talks/oral13.pdf , given a 
fragment, incrementally delete terminals (except the "*" connection point 
atom), and ring bonds.

For chain bonds, and non-aromatic bonds, it's easy to delete the bond and add 
the correct number of hydrogens to either side.

But, what should I do when I cut an aromatic bond?

For something like the first "co" in "c1cocn1", I want the result to be 
C=CN=CO. That's because the "o" can only be "-O-" in Kekule form.

For something like "c1cnncn1", breaking on the "nn", I think I would like to 
get both 'N=CC=NC=N' and 'NC=CN=CN' because the "nn" can be a single or a 
double bond, depending on the Kekule representation, as in:

>>> Chem.CanonSmiles("C-1=N-N=C-C=N-1")
'c1cnncn1'
>>> Chem.CanonSmiles("C-1=N.N=C-C=N-1")
'N=CC=NC=N'

>>> Chem.CanonSmiles("C=1-N=N-C=C-N=1")
'c1cnncn1'
>>> Chem.CanonSmiles("C=1-N-[HH].[HH]N-C=C-N=1")
'NC=CN=CN'

Problem is, I don't know how to figure out if a given aromatic bond must be a 
"-" or "=", or can be both.

(Well, I could brute-force enumerae all 2**n possible aromatic bond 
assignments, then canonicalize, and see if both assignments are possible for a 
given bond.)

As a non-chemist, I also ask if I'm even on a chemically meaningful track.


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] GetSubstructMatches() as smiles

2019-08-07 Thread Andrew Dalke
On Aug 7, 2019, at 13:08, Paolo Tosco  wrote:
> You can use
> 
> Chem.MolFragmentToSmiles(mol, match)
> 
> where match is a tuple of atom indices returned by GetSubstructMatch().

Note however that if only the atom indices are given then 
Chem.MolFragmentToSmiles() will include all bonds which connect those atoms, 
even if the original SMARTS does not match those bonds. For example:

>>> from rdkit import Chem
>>> pat = Chem.MolFromSmarts("*~*~*~*") # match 4 linear atoms
>>> mol = Chem.MolFromSmiles("C1CCC1") # ring of size 4
>>> atom_indices = mol.GetSubstructMatch(pat)
>>> atom_indices
(0, 1, 2, 3)
>>> Chem.MolFragmentToSmiles(mol, atom_indices)  # returns the ring
'C1CCC1'


If this is important, then you need to pass the correct bond indices to 
MolFragmentToSmiles(). This can be done by using the bonds in the query graph 
to get the bond indices in the molecule graph. I believe the following is 
correct:

def get_match_bond_indices(query, mol, match_atom_indices):
bond_indices = []
for query_bond in query.GetBonds():
atom_index1 = match_atom_indices[query_bond.GetBeginAtomIdx()]
atom_index2 = match_atom_indices[query_bond.GetEndAtomIdx()]
bond_indices.append(mol.GetBondBetweenAtoms(
 atom_index1, atom_index2).GetIdx())
return bond_indices

(Does a function like this already exist in RDKit?)

I'll use it to get the bond indices for the *~*~*~* match:

>>> bond_indices = get_match_bond_indices(pat, mol, atom_indices)
>>> bond_indices
[0, 1, 2]

Passing the atom and bond indices gives the expected match SMILES: 

>>> Chem.MolFragmentToSmiles(mol, atom_indices, bond_indices)
''

Cheers,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] planning a crowdsourcing project for mmpdb development

2019-07-31 Thread Andrew Dalke
ent. I can understand, because I 
know I'm weirded out by the restaurants which allow you to pay what you think 
the food was worth!

Instead, here are the funding levels:

  Academic/Individual - EUR 1 000
  Research group - EUR 3 000
  Organization - EUR 5 000

All crowdsourcing consortium members will receive the mmpdb source code under 
the existing 3-clause BSD license.

Those wishing to pay an additional EUR 3 000 will receive 1 year of support. 
That EUR 3 000 will added to the sum used to determine if a funding goal is met.

The funding status will be made public. Funders can be left anonymous or agree 
to have their names used.

I generally use purchase orders, paid to my company's Swedish bank account. I 
can accept credit card payments through PayPal, though will charge 10% more 
because of the overhead.

If enough people commit to the project, I will send out a formal quote to the 
backers, and start the purchase order process.


Timeline


As a rough timeline:

 Today - Pre-announcement on the RDKit list for planning purposes
 1 September - Announcement on the RDKit list
 28 September - Lightning talk at the RDKit UGM
 1 October - Second announcement on the RDKit list
 1 November - Call off the trial if there has not been enough interest
 1 December - End of fundraising / see which funding goals are met

Note: I define "interest" as people sending me an email saying they/their 
company wants to join, at a given funding level. It does not mean that I have a 
purchase order from everyone. It should be seen as a verbal agreement, not a 
written or financial commitment.

Project work will likely start shortly after the interest level meets the basic 
level goal.

If this trial proves very successful (over EUR 30 000 or so), then one likely 
future crowdsourcing project will be to develop a web interface for mmpdb.


Questions & Answers
===

1) Is it really "open source development" if it isn't available to everyone for 
free?

Most well-known open source software is also available at no cost. However, 
open source, and free software, describe the rights and freedoms available to 
the person who receive the software. They do not require that the software be 
made available for no cost.

Think if you modify a GPL project like Open Babel, and distribute the changes 
to a friend. There's nothing about the GPL, or the principles of open source, 
which require you to also distribute that code to anyone else.

I am influenced by Stallman's essay at 
https://www.gnu.org/philosophy/selling.html . He states "if you are 
redistributing copies of free software, you might as well charge a substantial 
fee and make some money. Redistributing free software is a good and legitimate 
activity; if you do it, you might as well make a profit from it. ... 
Distributing free software is an opportunity to raise funds for development. 
Don't waste it!"


2) What happens to the code if the 'Delayed upstream level' goal of EUR 20 000 
isn't reached?

That's a tough one. At the very least, any of the backers who receive a copy of 
the code could contribute the code to the main mmpdb repository in the RDKit 
project.

If there is little interest, I will likely release it after 9 months or a year, 
and also reduce my involvement in developing open source cheminformatics 
software, as I cannot continue to do things for free or low wages.


3) What about Oracle support, or feature X?

Then let me know. Part of my goal now is to figure out which to include as the 
features to implement at the basic level.

If there's more interest, I might have two feature levels.

In any case, I am also available for consulting work.


3) Where should I send questions and suggestions?

Right now, private email to me is the best. I'll set up a mailing list and 
project web page if I get preliminary feedback that it's worth my time to go 
further with this trial.

Thanks for reading to the end!

Andrew Dalke
da...@dalkescientific.com

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Open-source business models and the RDKit (Greg Landrum)

2019-04-01 Thread Andrew Dalke
On Mar 27, 2019, at 13:26, Chris Swain via Rdkit-discuss 
 wrote:
> This is an interesting discussion and suspect this does not only apply to 
> open-source software developers, there are similar challenges for small 
> independent software companies.

My points were focused on the disadvantages of a pure open source software 
development strategy with respect to proprietary software development.

There are, of course, many other problems which are shared between small 
independent software vendors of both free and proprietary software. On the 
other hand, there are also many existing resources on developing commercial 
propriety software as a small ISV.

> On CICAG (http://www.rsccicag.org) we have been discussing the possibility of 
> organising a (probably one day) meeting around the topic of open 
> data/software/publishing and sustainability.

At this point I cannot recommend that any ISV consider an open source product 
as a viable strategy, not even for marketing purposes as name recognition for 
future consulting work.

My current belief is that people hire a consultant to solve problems. They hear 
from talks or from other people that person X can help solve problems, and hire 
X.

If person X writes tool Y to solve certain problems, and sells that as a 
proprietary product, then people will contact X in order to purchase it to 
solve their problem.

If tool Y is distributed as open source, then they don't really have a problem, 
because they can solve their problem themselves by installing a package, and it 
works. That only takes a few minutes. They may not even know who X is.

This is especially true if it's distributed through the usual channels like 
(Bio)conda, PyPI, and DebianScience.

As I've pointed out before, I've seen innumerable posters where the poster 
author thanks a commercial vendor for a no-cost academic license for a product, 
but does not thank the authors of the free software packages also used. Which 
is perfectly acceptable under the respective licenses, but I think also an 
indicator that people feel more obligation for someone who actively helps 
solves their problems than for something like  "pip install Y".

Regards,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Open-source business models and the RDKit

2019-04-01 Thread Andrew Dalke
On Mar 27, 2019, at 16:44, Bennion, Brian via Rdkit-discuss 
 wrote:
> One of the goals of ATOM is to fund work that will be open sourced.  I think 
> any of the partners can choose to hire consultants for the work.
> 
> https://atomscience.org/
> Atom
> atomscience.org

I think there are only three successful external funding strategies for open 
source in cheminformatics: 1) grant funding, 2) cost reduction, and 3) R 
budget for new, and typically pre-competitive, software. 

It appears that ATOM gets it funding from #1 and #3. And I have no problem with 
that. I've been funded as a consultant from grant funding.

Most of RDKit was funded (if I understand it correctly) by #2 and #3, but the 
funding is smaller and less direct now that Greg has left his previous employer.

My interest is in successful models for self-funded open source developer. Who 
will pay to maintain and develop "Our Digital Infrastructure" (to use the term 
from 
https://www.fordfoundation.org/about/library/reports-and-studies/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure/
 )?

I'll be more specific. How can we get EUR 250K/year to fund RDKit development? 
This would cover a full-time developer at commercial wages, plus overhead, and 
have money available to pay people for special work - new features, 
performance, documentation, and so on.

It also makes project continuity easier. Not only is it easier for Greg to 
maintain and develop the project if there's a clear revenue source, but it also 
increases the pool of people who might be able to replace him, whenever he 
decides to step down.

In principle the money is there, given the number of commercial users with 
extra money.

The questions I ask again are "why would they contribute money?", "why haven't 
they contributed money?", and "how do we effectively encourage them to 
contribute money in the future?".

By comparison, we know that it's possible to make a living selling self-funded 
proprietary software in this field.


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Open-source business models and the RDKit

2019-03-27 Thread Andrew Dalke
On Mar 27, 2019, at 08:24, Francois Berenger  wrote:
> As an open-source project, I feel rdkit is quite successful.
> So, the user community is not so small.
> Some people who cannot contribute time could contribute money to the project
> (especially if it is tax-deductible, I guess).

I think the questions are "why would they contribute money?" and "why haven't 
they contributed money?".

If those questions cannot be answered well, then there's little reason to go 
further down this path to the next question, which is "how do we effectively 
encourage them to contribute money in the future?".

To be clear, Novartis contributed a lot of money for the RDKit development. 
Roche also funded me to develop and contribute the MCS package now part of the 
RDKit core, and the mmpdb project which was contributed to RDKit. These are 
also financial contributions and must not be ignored, and these are not the 
only two organizations which have done that.

But I honestly thought that there would be more interest in hiring my services 
as a consultant, to work on further development of open source software. I feel 
like there are clear economic benefits for companies to fund open source 
packages.

Instead, it feels like the more open source software packages I write and 
release, the fewer leads I get for new consulting work, compared to when I gave 
"I wrote this in-house application for company X that no one else will ever 
use" talks. Perhaps what's easily available for no cost is seen as having no 
value, while that which is hidden, no matter how hacky, is treasured?

My optimism started 20 years ago, when I was still involved with the Biopython 
project. My company offered commercial support for Biopython, and I had NDAs in 
place with several of the other Biopython developers so we could easily be 
funded to work on specific improvements that an organization might need.

I never found someone interested in providing that sort of funding for 
Biopython, and it still looks like that's the case in cheminformatics.

See also 'Roads and Bridges: The Unseen Labor Behind Our Digital 
Infrastructure' (ref. 53 in my paper) for further examples of the difficulties 
in funding open source work. 
https://www.fordfoundation.org/about/library/reports-and-studies/roads-and-bridges-the-unseen-labor-behind-our-digital-infrastructure/


> On Mar 27, 2019, at 10:06, Greg Landrum  wrote:
> If rdkit was accepted at the software freedom conservancy, I understand
> the management fee would be 10%:

There's also Software in the Public Interest, which "serves the free software 
and open source community by facilitating the administrative and financial 
needs of its associated projects", including the Open Bioinformatics 
(ex-)Foundation.

When the OBF was created, it was common for many groups to start their own 
foundations. Since most of the administrative needs are the same for the 
different projects, it makes sense to consolidate.

> A question since I genuinely don't know: is it important to anyone that this 
> go through a not-for-profit entity?

The OBF became a not-for-profit to make it easier to organize the BOSC 
(Bioinformatics Open Source Conference) meetings. Some of the early BOSC 
meetings were run out of someone's personal bank account, and he was personally 
financially liable in case of problems.

Working through a non-profit makes it easier to set up things like summer 
internships (a la Google Summer of Code) and travel support, because the 
payment is less likely to be viewed as a way to get around employment laws. 
Open Bioinformatics has a Travel Fellowship program. I don't know the details.

Looking at the report for 2018 at 
http://spi-inc.org/corporate/annual-reports/2018.pdf , Open Bioinformatics 
spends about $5,000/year for IT and meet ups, an "ordinary income" of $5,400, 
and an equity of $85K.

There's overhead to running a non-profit, like filing paperwork, and that 
requires specialized knowledge. For revenues that small, it really helps to be 
affiliated with an existing umbrella organization. The OBF gave up their 
incorporation in 2012 to be an SPI-associated project.

For what RDKit does now, I see no need to set up/join a foundation. T5 
Informatics can organize an RDKit UGM the same way that any vendor can organize 
a UGM, and company acts as the firewall to your personal finances. T5 (or Dalke 
Scientific :) can also act as an intermediary if, for some reason, a company 
does not wish to fund someone directly. Though you'll have your own overhead in 
that case, because of the additional tax requirements in dealing with 
subcontractors.

No matter what, that's going to be easier than arranging things through a 
university, even Paul's.

It's only if you start getting multiple organizations interested in 
contributing funding, or really want the transparency that T5 and RDKit funding 
are not intermingled, where I would suggest looking at that.

Another reason is if you want RDKit-the-organization 

Re: [Rdkit-discuss] chemfp preprint

2019-03-26 Thread Andrew Dalke
On Mar 25, 2019, at 04:05, Francois Berenger  wrote:
> Sometimes, I wish there was a rdkit consortium/NPO (so that donations are tax 
> deductible), so that rdkit could be massively funded by all its commercial 
> users, and even accepting individual donations.

Setting up such an organization is not difficult. It does take time, money, and 
effort, which add overhead to the funding process. It also requires people who 
are willing to do that sort of work. I was on the board of the Open 
Bioinformatics Foundation, and involved with the Python Software Foundation. I 
know that I am *not* that sort of person.

In any case, as Geoff Hutchinson points out, there are umbrella organizations 
like the Open Source Collective which can handle most of that overhead.

My question is, why would users - commercial or otherwise - be willing to fund 
such work in the first place?

As far as I can tell, nearly everyone uses free and open source software 
because they are available for no cost. Users are rarely willing to pay for 
software freedom, or for economic benefits like avoiding vendor lock-in.

And it seems like commercial users are often willing to use an internal fork of 
a project than to work with upstream to develop new features. This might be 
because it's easier to work with existing staff or existing contracting 
arrangements than figuring out how to get upstream to do the work and setting 
up a new contractual relationship, and take the risk that it isn't done to 
schedule.

My conjecture is that there are several issues at play.

1) Most end users don't realize there is a funding problem for many FOSS 
projects. Package managers like pip/PyPI, Conda, Homebrew and apt make it 
*really easy* to install a large number of packages without knowing anything 
about the funding or staffing status of each underlying project.

Consider that one of the early business models for PyMol was the idea that 
people would be willing to pay for pre-compiled packages from the main 
developer, even though the source code was available for free as open source. 
That business model somewhat worked then. It would not work now.

2) The proponents of "open source" in the late 1990s emphasized the volunteer 
nature of open source, going so much as to argue that there was a "gift 
culture" (using E. Raymond's term). The implication is that there was a sort of 
social contract, where donations of source code would be met with other sorts 
of payment, including job/consulting offers and non-trivial amounts of 
reciprocal code contributions. 

This has not turned out to be true, with rare exceptions. Instead, I think the 
association with volunteerism and gifts has caused people to avoid talking 
about fund raising. This should be particularly odd as many volunteer 
organizations outside of computing have funding drives.

3) FOSS developers who distribute at no cost are ignoring any capital value in 
the software. They can only make income on gifts (which are rare) or through 
labor (e.g., consulting). This places them at a funding disadvantage compared 
to proprietary software vendors who can amortize labor costs across multiple 
sales.

To be clear, I am only talking about self-funded FOSS projects. My paper 
mentions a few other funding models, like research grants at universities, or 
in-house projects funded by the ability to reduce costs. In the latter case, 
the minor additional costs for releasing the project as FOSS can be justified 
by even small benefits.

4) The pricing of per-unit sales of FOSS software, either institutional sales 
like what I tried with chemfp,  or end-user sales like PyMol, should factor in 
the likelihood that customers will redistribute the software further, and by 
doing so reduce the market size. This factor is hard to estimate, and higher in 
general for universities than pharmaceutical companies, which makes it harder 
to give a significant discount to universities like what proprietary vendors 
can do.

5) In my paper I bring up "free rider problem" as a way to think of the issues. 
To be clear, this is only a *problem* if people expect anything back from 
releasing and/or maintaining an open source software project. (Or don't expect 
people to insist on support, like I have received for the no-cost/open source 
version of chemfp.)

Suppose I want to add a new feature to mmpdb, the matched molecular pair 
program which I helped develop and has been contributed to the RDKit project. I 
might go around to various users and ask for development funding as a 
consortium. 20 organizations might be interested, and each one willing to pay 
50% of the development cost, which means in principle I could get 10x the cost 
of labor, which provides the extra profit that could go towards documentation, 
testing, and general support.

However, it's also easy for each of those 20 organizations to think that 
someone else ("Let George do it", as the Stanford Encyclopedia of Philosophy 
article explains) is going 

[Rdkit-discuss] chemfp preprint

2019-03-22 Thread Andrew Dalke
Hi RDKit users,

  This week I submitted a paper about chemfp for publication. I also submitted 
a preprint on ChemRxiv, which was just accepted.

For those interested, it's at 
https://chemrxiv.org/articles/The_Chemfp_Project/7877846 .

It's a rather long paper as it covers many aspects about the chemfp project, 
including the FPS and FPB formats, search algorithms, details about the 
different ways to compute a popcount, and memory bandwidth and latency 
bottlenecks. On a non-technical level I also describe some of the difficulties 
I ran into trying to run chemfp as "commercial free software."

Let me know of any corrections or improvements, or any other feedback you might 
have.

Cheers,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] contrib code not compiling

2018-11-20 Thread Andrew Dalke
On Nov 19, 2018, at 04:17, Rajarshi Guha  wrote:
> Hi, I check out the latest RDKit sources from master and I'm trying to 
> compile the PBF.  However, the compilation fails reporting that 
> RDGeneral/export.h is missing:

While this doesn't answer the question, it seems to be coupled to

  https://github.com/rdkit/rdkit/issues/1903

> (I haven't compiled RDKit as I already have it installed via conda)

It appears that 'export.h' is created during the "cmake" step. That file is an 
'auto-generated __declspec definition header'.

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-26 Thread Andrew Dalke
On Sep 26, 2018, at 20:26, Peter S. Shenkin  wrote:
> Ah, David, but how do you define a "real" singleton?

There can be many different definitions of what a '"real" singleton' might be, 
but we are specifically talking about Butina clustering.

The Butina paper defines the term "false singleton", which Dave quoted. The 
relevant text from DOI: 10.1021/ci9803381 is:

"""The molecules that have not been flagged by the end of the clustering 
process, either as a cluster centroid or as a cluster member, become 
singletons. It is important to emphasize at this stage that one of the 
consequences of this approach is that some molecules defined as singletons may 
have neighbors at the given Tanimoto similarity index, but those neighbors have 
been excluded by a ‘stronger’ cluster centroid, i.e., one with more neighbors 
in its list.  the problem with the creation of a number of false singletons 
that do in fact have similar compounds within the set is easily offset by the 
final quality of the clusters that this approach generates."""

As you can see, there are two types of singletons, and one is called "false 
singleton". No specific name is used for the other type of singleton, but it's 
easy to how they can be called "real" singletons, without confusion or 
misunderstanding.

(FWIW, my implementation, mentioned in an earlier email, uses the term "true 
singleton" as the singleton which is not a "false singleton", but the 
difference is only in the label.)

To confirm that this is what Dave means, I'll quote from his paper 

Blomberg, N., Cosgrove, D. A., Kenny, P. W., & Kolmodin, K. (2009). Design of 
compound libraries for fragment screening. Journal of Computer-Aided Molecular 
Design, 23(8), 513–525. doi:10.1007/s10822-009-9264-5

"""The clustering program flush_clus is an implementation of the 
sphere-exclusion algorithm of Taylor [41], which has also been reported 
independently by Butina ... One consequence of the algorithm is the production 
of ‘false singleton clusters.’ The final clusters in the output are invariably 
singleton clusters, where the only member is the seed. Some of these will be 
true singletons, i.e. molecules lacking neighbors within the clustering 
threshold, but others (the false singletons) will be singletons by virtue of 
the fact that their neighbors were placed in other larger clusters in a 
previous iteration of the algorithm. The flush_clus program offers the 
opportunity of performing a final sweep through the clusters using a larger 
similarity threshold and placing the singleton molecules within the cluster for 
which it has the greatest similarity with the seed, so long as this is within 
the threshold."""

Cheers,


Andrew
da...@dalkescientific.com



___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-25 Thread Andrew Dalke
On Sep 25, 2018, at 17:13, Peter S. Shenkin  wrote:
> FWIW, in work on conformational clustering, I used the “most representative” 
> molecule; that is, the real molecule closest to the mathematical centroid. 
> This would probably be the best way of displaying a single molecule that 
> typifies what is in the cluster. 

In some sense I'm rephrasing Chris Earnshaw's earlier question - how does one 
do that with Taylor-Butina clustering? And does it make sense?

The algorithm starts by picking a centroid based on the fingerprints with the 
highest number of neighbors, so none of the other cluster members should have 
more neighbors within that cutoff.

I am far from an expert on this topic, but with any alternative I can think of 
makes me think I should have started with something other than Taylor-Butina.



Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Butina clustering with additional output

2018-09-25 Thread Andrew Dalke
On Sep 21, 2018, at 14:53, Philipp Thiel  
wrote:
> you probably read about the Tanimoto being a proper metric in case of having 
> binary data
> in Leach and Gillet 'Introduction to Chemoinformatics' chapter 5.3.1 in the 
> revised edition.

What we call Tanimoto is more broadly known as the Jaccard. Various sites 
demonstrate that the Jaccard distance = 1-Jaccard = 1-Tanimoto is a metric, 
such as 
https://mathoverflow.net/questions/18084/is-the-jaccard-distance-a-distance and 
https://arxiv.org/abs/1612.02696 .

Going back to James T. Metz's original question, one alternative might be to 
use chemfp and the Taylor-Butina clustering implementation available at: 

  http://dalkescientific.com/writings/taylor_butina.py

Following Dave Cosgrove's advice: 

> I expect James means what we used to call the cluster seed, i.e. the molecule 
> the cluster was based on, rather than the mathematical centroid. Calculating 
> distances from each cluster member to that would be quite straightforward as 
> a post-processing step although that would roughly double the time taken. 

it's possible to change the reporting code from:

for centroid_idx, members in clusters:
print(arena.ids[centroid_idx], "has", len(members), "other members", 
file=outfile)
print("=>", " ".join(arena.ids[idx] for idx in members), file=outfile)

so it does the post-processing:

print(len(clusters), "clusters", file=outfile)
for centroid_idx, members in clusters:
print(arena.ids[centroid_idx], "has", len(members), "other members", 
file=outfile)
subarena = arena.copy(indices=members)
centroid_fp = arena.get_fingerprint(centroid_idx)
result = subarena.threshold_tanimoto_search_fp(centroid_fp, 
threshold=0.0)
result.reorder()  # sort so the highest scores come first
for id, score in result.get_ids_and_scores():
print("=>", id, "score:", score)


Cheers,

Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Equivalent atom neighbours

2018-09-07 Thread Andrew Dalke
On Sep 7, 2018, at 22:22, Alexey Orlov  wrote:
> I'm trying to calculate the number of equivalent/nonequivalent neighbor 
> heteroatoms for each atom i of molecule m.
> 
> For examples, the third carbon atom of molecule CC(OH)CC has two 
> nonequivalent neighbors: one carbon atom connected to OH and CH3 and another 
> carbon atom that is connected only to H.  In the case of molecule CC(OH)C(C)C 
> the third carbon atom has two equivalent neighbor heteroatoms (methyl groups) 
> and another neighbor distinct from these two -- atom connected to OH and CH3.

I think you are looking for Chem.CanonicalRankAtoms() described at 
http://rdkit.org/Python_Docs/rdkit.Chem.rdmolfiles-module.html#CanonicalRankAtoms
 .

>>> mol = Chem.MolFromSmiles("CC(O)CC")
>>> list(Chem.CanonicalRankAtoms(mol, breakTies=False))
[1, 4, 2, 3, 0]

As you wrote, there are no equivalent atoms.

>>> mol = Chem.MolFromSmiles("CC(O)C(C)C")
>>> list(Chem.CanonicalRankAtoms(mol, breakTies=False))
[2, 5, 3, 4, 0, 0]

this identifies the two identical methyl groups, both with rank 0.

You might try something like:

from collections import Counter
def get_atoms_neighbor_rank_counts(mol):
indices = []
ranks = list(Chem.CanonicalRankAtoms(mol, breakTies=False))
for atom in mol.GetAtoms():
  equivalents = Counter()
  equivalents.update(ranks[a.GetIdx()] for a in atom.GetNeighbors())
  counts = [count for rank, count in equivalents.most_common()]
  print("Neighbor symmetry counts for", atom.GetIdx(), "are", counts)

For your first test case it gives:

>>> mol = Chem.MolFromSmiles("CC(O)CC")
>>> get_atoms_neighbor_rank_counts(mol)
Neighbor symmetry counts for 0 are [1]
Neighbor symmetry counts for 1 are [1, 1, 1]
Neighbor symmetry counts for 2 are [1]
Neighbor symmetry counts for 3 are [1, 1]
Neighbor symmetry counts for 4 are [1]


That is, all of the atoms have unique neighbors.

For the second test case it gives:


>>> mol = Chem.MolFromSmiles("CC(O)C(C)C")
>>> get_atoms_neighbor_rank_counts(mol)
Neighbor symmetry counts for 0 are [1]
Neighbor symmetry counts for 1 are [1, 1, 1]
Neighbor symmetry counts for 2 are [1]
Neighbor symmetry counts for 3 are [2, 1]
Neighbor symmetry counts for 4 are [1]
Neighbor symmetry counts for 5 are [1]

That is, the atom with index 3 (the 4th atom) has 3 neighbors, two of which are 
identical.

Finally,


>>> mol = Chem.MolFromSmiles("CC(C)(C)OC(C)(C)C")
>>> get_atoms_neighbor_rank_counts(mol)
Neighbor symmetry counts for 0 are [1]
Neighbor symmetry counts for 1 are [3, 1]
Neighbor symmetry counts for 2 are [1]
Neighbor symmetry counts for 3 are [1]
Neighbor symmetry counts for 4 are [2]
Neighbor symmetry counts for 5 are [3, 1]
Neighbor symmetry counts for 6 are [1]
Neighbor symmetry counts for 7 are [1]
Neighbor symmetry counts for 8 are [1]

This says that the atoms with indices 1 and 5 each have 4 neighbor, 3 of which 
are identical, and that the atom with index 4 has 2 neighbors, both of which 
are identical.


Andrew
da...@dalkescientific.com




___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] available Python/RDKit training slots for the 2018 UGM

2018-09-04 Thread Andrew Dalke
Hi all,

  As you may know, I will be offering a free Python/RDKit training session 
before the UGM in a couple of weeks. This is a beginner level course for people 
with some programming experience. It will cover the basics of Python, RDKit, 
JupyterLab, Pandas, Scikit-Learn and more. The goal is to orient people, not 
provide a deep training.

Enough people were on the waiting list that we've decided to open up the 
training course, and have it once on Monday the 17th (9am to 4pm) and again on 
Tuesday (9am to 4pm).

We've been able to reschedule some people from the Tuesday slot to the Monday, 
so there is space on both days; roughly 6 slots on Monday and 4 on Tuesday.

If you are interested in participating in this free workshop, let me know. 
Please specify which day you want to attend, or if you have no preference.

You will need to bring your own laptop. I'll come back next week to let you 
know what software you need to have pre-installed.


Andrew
da...@dalkescientific.com


--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] no structure depiction in Jupyter notebook

2018-08-31 Thread Andrew Dalke
On Aug 31, 2018, at 15:27, Axel Pahl  wrote:
> on Linux, using Anaconda, RDKit and Python 3.6, I always need to additionally 
> install cairocffi via pip:

Thanks Axel.

When I tried it out, I figured out the more likely problem - I was using 
jupyter from a non-conda virtualenv.

My problem disappeared once I switched to a new terminal window (without the 
virtual environment) and started a new notebook server.

Your suggestion was helpful because it made me realize that the "pip install" 
went to the wrong Python installation.

Cheers,


Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] no structure depiction in Jupyter notebook

2018-08-31 Thread Andrew Dalke
On Aug 31, 2018, at 11:58, Andrew Dalke  wrote:
> I am unable to see an inline structure depiction in the Jupyter notebook, nor 
> in the JupyterLab notebook, tested with both the Python 2 and Python 3 
> kernels, and rdkit.__version__ '2018.03.1'.

I've narrowed it down to the Cairo code. 

I ran Draw._moltoimg manually, and the d2d.GetDrawingText() call returns b"".

If I do

  del Draw.rdMolDraw2D.MolDraw2DCairo 

then the image generation code will fall back to MolToImage, and give me an 
inline depiction in the notebook.



Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] no structure depiction in Jupyter notebook

2018-08-31 Thread Andrew Dalke
Hi all,

  I am unable to see an inline structure depiction in the Jupyter notebook, nor 
in the JupyterLab notebook, tested with both the Python 2 and Python 3 kernels, 
and rdkit.__version__ '2018.03.1'.

I installed miniconda and RDKit on my Mac using:

curl -O 'https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh'
bash Miniconda3-latest-MacOSX-x86_64.sh
conda install conda-build
conda install -c conda-forge jupyterlab
conda install -c rdkit rdkit

My test code is:

from rdkit import Chem
from rdkit.Chem.Draw import IPythonConsole
mol = Chem.MolFromSmiles("c1c1O")
mol

It gives a broken image link, the contents of which are:

data:image/png;base64,

There is no message in the console log reporting a problem.

My assumption is that I am missing a library which is needed for the PNG 
generation step. I have put the output of "conda list" at the end of this email.

Can someone suggest what to do next? I did not see a discussion of this problem 
on the recent mailing list nor in the issue tracker.

As a backup, I saw that IPythonConsole supports SVG mode. I enabled it using:

IPythonConsole.ipython_useSVG = True

This gave me another broken image icon. I tracked it down to incorrect handling 
of the "svg" namespace. A fix was for this was made on 14 May 2018, and will be 
part of the fall release. (See https://github.com/rdkit/rdkit/pull/1866 ).

For now I can monkey-patch the run-time using:

===
from rdkit.Chem import Draw 
original_moltoSVG = Draw._moltoSVG

def new_moltoSVG(*args, **kwargs):
s = original_moltoSVG(*args, **kwargs)
return s.replace("xmlns:svg='http://www.w3.org/2000/svg'",
 "xmlns='http://www.w3.org/2000/svg'")

Draw._moltoSVG = new_moltoSVG
===

However, I would like to get the regular PNG output to work.

Cheers,


Andrew
da...@dalkescientific.com

# packages in environment at /Users/dalke/miniconda3:
#
# NameVersion   Build  Channel
appdirs   1.4.3  py_1conda-forge
appnope   0.1.0py36_0conda-forge
asn1crypto0.24.0   py36_0  
attrs 18.1.0 py_1conda-forge
automat   0.7.0  py_1conda-forge
backcall  0.1.0  py_0conda-forge
beautifulsoup44.6.3py36_0  
blas  1.0 mkl  
bleach2.1.4  py_1conda-forge
bzip2 1.0.6h1de35cc_5  
ca-certificates   2018.8.24ha4d7672_0conda-forge
cairo 1.14.12  hc4e6be7_4  
certifi   2018.8.24py36_1  
cffi  1.11.5   py36h342bebf_0  
chardet   3.0.4py36h96c241c_1  
conda 4.5.11   py36_0conda-forge
conda-build   3.13.0   py36_0  
conda-env 2.6.0h36134e3_0  
constantly15.1.0 py_0conda-forge
cryptography  2.2.2py36h1de35cc_0  
decorator 4.3.0  py_0conda-forge
entrypoints   0.2.3py36_2conda-forge
filelock  3.0.6py36_0  
fontconfig2.13.0   h5d5b041_1  
freetype  2.9.1hb4e5f40_0  
gettext   0.19.8.1 h15daf44_3  
glib  2.56.2   hd9629dc_0  
glob2 0.6  py36_0  
html5lib  1.0.1  py_0conda-forge
hyperlink 17.3.1 py_0conda-forge
icu   58.2 h4b95b61_1  
idna  2.6  py36h8628d0a_1  
incremental   17.5.0 py_0conda-forge
intel-openmp  2018.0.3  0  
ipykernel 4.9.0py36_0conda-forge
ipython   6.5.0py36_0conda-forge
ipython_genutils  0.2.0  py_1conda-forge
jedi  0.12.1   py36_0conda-forge
jinja22.10 py36_0  
jpeg  9b   he5867d9_2  
jsonschema2.6.0py36_2conda-forge
jupyter_client5.2.3  py_1conda-forge
jupyter_core  4.4.0  py_0conda-forge
jupyterlab0.34.6   py36_0conda-forge
jupyterlab_launcher   0.13.1 

Re: [Rdkit-discuss] descriptors beyond rotatable bond count and possible correlations with entropy

2018-08-31 Thread Andrew Dalke
On Aug 31, 2018, at 07:41, Paolo Tosco  wrote:
> this gist should do what you need:

Unless I misinterpreted what Jim is looking for, I don't think that returns the 
contiguous rotatable bonds in a small molecule.

In the following there are only two rotatable bonds:

>>> mol = Chem.MolFromSmiles("NCC1CCC1C(=O)O")
>>> mol.GetSubstructMatches(RotatableBondSmarts)
((1, 2), (5, 6))

The code from the gist identifies three bonds:

>>> find_bond_groups(mol)
((, , ),)
>>> [(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in _[0]]
[(5, 2), (5, 6), (1, 2)]

In this case, (2,5) is part of a ring, and not rotatable.

I think the problem is that:

   nbrs = [nbr.GetIdx() for nbr in a.GetNeighbors() if (
  (nbr.GetIdx() in rot_atom_set_tmp) and (not (nbr.GetIdx() in 
connected_atom_set)))]

finds that both atoms of the bond are in connected bond groups, while the bond 
itself is not part of the match.

I have put an alternative implementation of this function at the bottom of this 
email. For my test case it returns:

>>> find_bond_groups2(mol)
((,), (,))
>>> [[(b.GetBeginAtomIdx(), b.GetEndAtomIdx()) for b in x] for x in _]
[[(5, 6)], [(1, 2)]]

Cheers,

Andrew
da...@dalkescientific.com


from collections import defaultdict
def find_bond_groups_dalke(mol):
rot_atom_pairs = mol.GetSubstructMatches(RotatableBondSmarts)

bond_groups = defaultdict(set)
for (left, right) in rot_atom_pairs:
# Ensure they are in order (I'm not sure if this is required)
if right < left:
left, right = right, left
pair = (left, right)
# Add the pair to the group associated with the left atom
bond_groups[left].add(pair)
# Merge the left atom group with the right atom group
bond_groups[left].update(bond_groups[right])
bond_groups[right] = bond_groups[left]

# Get just the unique bond groups, in order from largest to smallest
unique_bond_groups = set(frozenset(v) for v in bond_groups.values())
sorted_bond_groups = sorted(unique_bond_groups, reverse=True, key=len)

# Return as bond objects, to match Paolo's code
return tuple(
tuple(mol.GetBondBetweenAtoms(left, right) for (left, right) in 
bond_group)
for bond_group in sorted_bond_groups)


--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] want advice for good teaching data set

2018-08-30 Thread Andrew Dalke
Thanks for the responses. I'll merge them into one reply:


On Aug 29, 2018, at 16:56, Eloy Félix  wrote:
> If you want to build model I guess that what you want is to get experimental 
> logp values.
> 
> This should give you something to start with:
> 
> select ACTIVITY_ID, MOLREGNO, STANDARD_VALUE, STANDARD_TYPE from ACTIVITIES 
> where STANDARD_TYPE = 'LogP' and STANDARD_VALUE is not null and 
> data_validity_comment is null and POTENTIAL_DUPLICATE = 0;

Yes, that's what I was looking for, including the pointers for validity and if 
it might be a duplicate. Thanks!


On Aug 29, 2018, at 15:51, TJ O'Donnell  wrote:
> ChEMBL 24 has compound properties in the table compound_properties.  I think 
> the alogp
> is computed using (Crippen) atom types and the acd_logp is uses ACD labs 
> methods.

I can see I wasn't clear. I was looking for experimental data.

The ChEMBL blog post at 
https://chembl.blogspot.com/2018/05/chembl-24-released.html says that they 
switched to using RDKit for alogp; acd_logp is still from ACD.


On Aug 29, 2018, at 18:07, JW Feng via Rdkit-discuss 
 wrote:
> What about building QSAR models to predict activity for a particular ChEMBL 
> assay?  This would allow you to discuss strength and limitations of QSAR 
> models.


I am, primarily, a software developer working in computational chemistry. Do 
you want fast similarity search? I can do that. Do you want a maximum common 
structure algorithm, or matched molecular pair algorithm? I can do that. Do you 
want to tell me which parameters and learning algorithm you want to use? I can 
make the pieces go together.

What I don't have is the expertise to build a chemically relevant model on my 
own, and discuss its strength and weaknesses.

When I build a model, I do it to predict molecular weight. :)

Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] want advice for good teaching data set

2018-08-29 Thread Andrew Dalke
Hi all,

  I am starting to put together materials for the Python/RDKit training course 
I'm giving just before the RDKit UGM next month.

I would like to structure part of it around the SQLite release of the ChEMBL 
data set. More specifically, I plan to include examples of machine learning 
with scikit-learn, using RDKit descriptors and values from ChEMBL 24 (and 
making sure to use the new schema).

Two problems. First, I'm not a computational chemist and I don't know what 
would constitute a good example to use. "Good" in this case means one whose 
outlines are well-known to likely students. Second, I don't have much 
experience with the ChEMBL data.

My thought is to make a logP model. The easiest would be to based it on atom 
types. For this option, can anyone suggest where I can find logP data from 
ChEMBL?

Another possibility is to use a pre-existing model, like the notebook George 
Papadatos did for Ligand-based Target Prediction at 
http://nbviewer.jupyter.org/gist/madgpap/10457778 .

Perhaps someone here could point me to other existing resources along similar 
lines?

Best regards,

Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] cartridge license?

2018-08-23 Thread Andrew Dalke
On Aug 23, 2018, at 07:18, Roman Bolzern  wrote:
> Dear RDKittens,

I would prefer to not be called a 'kitten'.

> https://www.rdkit.org/docs/Cartridge.html#license, and at the bottom it says 
> “This work is licensed under the Creative Commons Attribution-ShareAlike 4.0 
> License”,
  ...
> Is “this work” only the code in this docs page? Or what does this mean for a 
> business use case? What exactly do we have to share, what not?

Before that sentence is the line "This document is copyright ..." At the top of 
the page it says: "This document is a tutorial and reference guide for the 
RDKit PostgreSQL cartridge." I regard that as saying that "this work" is a 
synonym for 'this document' and meant to refer to the tutorial and reference 
guide document.

To verify that, I looked at the source code, available online at 
https://github.com/rdkit/rdkit/tree/master/Code/PgSQL/rdkit . The files all 
appear to have the standard BSD license, and they do not mention the CC 
license. 

The FAQ at https://creativecommons.org/faq/ suggests "While we recommend 
against using a CC license on software itself, CC licenses may be used for 
software documentation, as well as for separate artistic elements such as game 
art or music." This supports the hypothesis that the Attribution-ShareAlike 4.0 
License is only meant to apply to the documentation.

Cheers,

Andrew
da...@dalkescientific.com


--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] elimination of small fragments

2018-06-29 Thread Andrew Dalke
On Jun 29, 2018, at 02:43, 藤秀義  wrote:
> Although not strictly based on the number of atoms, but on the length of 
> SMILES string, the simplest way is using Python built-in functions as follows:
> 
> smiles = 'CCC.CC'
> fragment = max(smiles.split('.'), key=len)
> print (fragment)

The mmpdb package I helped develop includes some functions which work directly 
on the SMILES string. One of them uses a regular expression to count the number 
of heavy atoms, assuming the SMILES is valid and is written in 'normal' form, 
where the '.' is only used to distinguish between disconnected atoms (that is, 
things like "C1.C1" are not supported).

>>> smiles = "CCC.BrCl.[238U]"
>>> fragment = max(smiles.split('.'), key=len)
>>> fragment
'[238U]'

>>> from mmpdblib.fragment_algorithm import get_num_heavies_from_smiles
>>> fragment = max(smiles.split('.'), key=get_num_heavies_from_smiles)
>>> fragment
'CCC'

I keep meaning to put together a package with the various SMILES tricks that 
are possible without full chemistry perception.

On Jun 29, 2018, at 10:56, Ed Griffen  wrote:
> Using the string length to find the number of atoms in a molecule is OK - but 
> you need to take account of the additional characters in SMILES that are not 
> just atoms,
   ...
> Here’s a worked example:
> 
> >>> SMILES = 'C[S@@+]([O-])c1ccc(cc1)[Si](C)(C)C'
> >>> print(len(SMILES))
> 34
> >>> heavies = [char for char in SMILES if char not in 
> >>> '''()[]1234567890#:;,.?%-=+\/Hherlabdgfikmputvy@''']
> >>> print(len(heavies))
> 13

That's neat! But it doesn't always give the correct count.

>>> def count(smiles):
...   return sum(1 for c in smiles if c not in 
'''()[]1234567890#:;,.?%-=+\/Hherlabdgfikmputvy@''')
...
>>> count("[Hg]")
0
>>> count("[Zn]")
2
>>> count("[Tc]")
2
>>> count("[As]")
2

as well as for aromatic boron, as in:

>>> count("Cc1b[n+](C[n+]2cc(C)cc(C)c2)cc(C)c1")
16
>>> count("Cc1B[n+](C[n+]2cc(C)cc(C)c2)cc(C)c1")
17

and aromatic tellurium.

These came up in a cross-comparison I did using ChEMBL as a test set. I 
excluded records with [2H] and and [3H] because RDKit considers those to be 
heavy atoms while Ed's method does not.

The error rate is impressively low for such a simple approach, with only 467 
mismatches out of 1,726,695 cases.


Cheers,


Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] elimination of small fragments

2018-06-29 Thread Andrew Dalke
On Jun 28, 2018, at 22:08, Paolo Tosco  wrote:
> if you wish to keep only the largest disconnected fragment you may try the 
> following:
> 
> mols = list(rdmolops.GetMolFrags(mol, asMols = True))
> if (mols):
> mols.sort(reverse = True, key = lambda m: m.GetNumAtoms())
> mol = mols[0]

A somewhat simpler .. or at least shorter ... version is:

mols = rdmolops.GetMolFrags(mol, asMols = True)
mol = max(mols, default=mol, key=lambda m: m.GetNumAtoms())

The max() function goes through the molecules that GetMolFrag returns.

If the list is empty, it returns the 'default' value, which is the original 
molecule. (This is what Paolo's code does. Another option is to use None as the 
default value.)

Otherwise, since 'key' is specified, its value is used as a function to 
determine a value for each molecule. That is, for each term 'm' in the list of 
'mols', it computes m.GetNumAtoms(), and uses that return value to select an 
object with the maximum value.

In this case, it selects a molfrag output molecule with the most atoms.

I think I've just added a topic to cover for the upcoming Python/RDKit training 
session in September! :)

For those interested, remember to sign up soon.

Cheers,

Andrew
da...@dalkescientific.com


--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] (Morgan) fingerprints for specific atom?

2018-06-18 Thread Andrew Dalke
Hi Андрей,

  The GetMorganFingerprint function takes additional parameters. From
http://rdkit.org/Python_Docs/rdkit.Chem.rdMolDescriptors-module.html#GetMorganFingerprint

GetMorganFingerprint( (Mol)mol, (int)radius
   [, (AtomPairsParameters)invariants=[]
[, (AtomPairsParameters)fromAtoms=[]
 [, (bool)useChirality=False
  [, (bool)useBondTypes=True
   [, (bool)useFeatures=False
[, (bool)useCounts=True
 [, (AtomPairsParameters)bitInfo=None]]])
-> UIntSparseIntVect : 
Returns a Morgan fingerprint for a molecule

Specify the radius with the 'radius' parameter.

The 'fromAtoms' takes a list of atom indices. If you want to get the circular 
fingerprint around atom 8 then pass in [8].

GetMorganFingerprintAsBitVect(...) is a similar function which returns a 
bitstring fingerprint instead of a count fingerprint.

Cheers,

Andrew
da...@dalkescientific.com



> On Jun 18, 2018, at 21:33, Андрей Парамонов  wrote:
> 
> Hello!
> 
> It is possible to get Morgan (circular) fingerprints for the molecule:
> GetMorganFingerprint(mol)
> but what is the best way to get fingerprints for a particular molecule
> atom and radius? I'm using Python RDKit bindings.
> 
> Best wishes,
> Andrey Paramonov



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] converting from ToBitString() to SMILES

2018-06-17 Thread Andrew Dalke
> On Jun 17, 2018, at 21:04, Raghuram Srinivas  
> wrote:
> Is there a way to convert a bit string of 2048 bits back to the SMILES / 
> BitVector representation of the molecule?  Any help /pointers in this 
> direction will be much appreciated . 

That topic came up on this list in April of this year in a thread titled "Any 
known papers on reverse engineering fingerprints into structures?"

I posted a rough draft of reversing a 2048-bit path fingerprint, in 
https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg07861.html .

Others gave additional references.


Cheers,

Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] boron atom/element support in RDkit

2018-06-12 Thread Andrew Dalke
On Jun 12, 2018, at 18:00, Bennion, Brian via Rdkit-discuss 
 wrote:
> Does RDkit support boron in SMILES strings?  We have a number of compounds 
> for which rdkit parsing is not successful.  The commonality is that there is 
> a B or b listed in the string.  

RDKit supports boron, including aromatic boron.

>>> from rdkit import  Chem
>>> mol = Chem.MolFromSmiles("[B]")
>>> Chem.MolToSmiles(mol)
'[B]'
mol = Chem.MolFromSmiles("B")
>>> [a.GetAtomicNum() for a in mol.GetAtoms()]
[5]
>>> mol = Chem.MolFromSmiles("[b]1c1")
>>> Chem.MolToSmiles(mol)
'b1c1'
>>> mol = Chem.MolFromSmiles("b1c1")
>>> Chem.MolToSmiles(mol)
'b1c1'
>>> [a.GetAtomicNum() for a in mol.GetAtoms()]
[5, 6, 6, 6, 6, 6]

What is the error message? Can you post a reproducible?

Cheers,


Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Calculating MorganFingerprint Counts for large number of Molecules

2018-05-18 Thread Andrew Dalke
On May 18, 2018, at 17:48, Jennifer Hemmerich  
wrote:
> I really liked the idea and I implemented it as follows: 


> df = pd.DataFrame(columns=counts.keys())
> for i,fp in enumerate(allfps):
> logger.debug('appending %s', str(i))
> df.append(fp, ignore_index=True)


While this saves a bit over creating an (N=100,000 x 2**32) data frame, it 
still creates a dataframe with (N x #unique keys).

This is probably where your memory is going.

Your next operation selects only those with a given number of fingerprints in 
it, and throws away the rest of the data.

> df = df[df.columns[df.sum(axis=0) >= cut]]

What you can do instead is to select only those keys with at least cut 
elements, before making the DataFrame.

One way to do that is to use the collections.Counter class, which is like a 
collections.defaultdict(int) with the addition that it implements a 
"most_common()" method.

When called by with no arguments it returns the list of (key, value) pairs, 
sorted from largest value to smallest.

That means you can use a loop like:

selected_keys = []
for idx, count in counts.most_common():
if count < cut:
break
selected_keys.append(idx)

to select only the keys which are at or above the cut value.

Once you have those keys, you can create the rows which are then passed to the 
DataFrame.

The following seems to do what you are looking for. At the very least, it gets 
you more of the way there.




from rdkit.Chem import AllChem

from collections import Counter
import pandas as pd

def get_ecfp(mols, n=3, cut=10):
allfps = []
counts = Counter()
for i, mol in enumerate(mols):
fp = AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements()
allfps.append(fp)
for idx, count in fp.items():
# use "+= count" for the most frequent feature"
counts[idx] += count
# use "+= 1" for the feature in the most fingerprints
#counts[idx] += 1

selected_keys = []
for idx, count in counts.most_common():
if count < cut:
break
selected_keys.append(idx)

rows = []
for fp in allfps:
rows.append([fp.get(key, 0) for key in selected_keys])

df = pd.DataFrame(
rows,
columns=selected_keys)

return df

if __name__ == "__main__":
mols = []
for smi in ("c1c1O", "CNO", "C#N"):
mol = AllChem.MolFromSmiles(smi)
mols.append(mol)

print(get_ecfp(mols, cut=2))



Note that I construct a list of rows which I then pass in to the DataFrame at 
once, rather then append() one row at a time as your code does.

This is because the Pandas documentation says, at 
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
 : Iteratively appending rows to a DataFrame can be more computationally 
intensive than a single concatenate. A better solution is to append those rows 
to a list and then concatenate the list with the original DataFrame all at once.


Cheers,


Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDKIT 2018.3 and MMPDB problem

2018-05-09 Thread Andrew Dalke
And I have uploaded a source tar.gz and a binary wheel to PyPI.

That means you can do "pip install mmpdb" to install this most recent version.

Andrew
da...@dalkescientific.com



> On May 9, 2018, at 18:04, Kramer, Christian  
> wrote:
> 
> Dear all,
> 
> Andrew's fix for the wildcard atom representation in RDKit 2018.3 has just 
> been incorporated into the main mmpdb branch. If you download the latest 
> version of mmpdb, it should now work with with the current version and also 
> with previous RDKit versions.
> 
> Note that you have to rebuild your mmp database if you switch from a pre 2018 
> version of RDKit to a 2018+ version.
> 
> Bests,
> Christian
> 
> Dr. Christian Kramer
> 



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDKIT 2018.3 and MMPDB problem

2018-05-08 Thread Andrew Dalke
Dear Marco,

> On May 7, 2018, at 23:59, Marco Stenta  wrote:
> I had some time to set an environment for it and test it: it works fine, as 
> far as my tests go. I will switch to this version and to the latest RDKIT now.

Thanks for the feedback. Someone else sent me a private email also saying it 
worked. I'll put together the final changes for a 2.1 release, and see about 
making it accessible from PyPI so "pip install mmpdb" will work.

> some questions:
> Is there any plan to:
> include MCS as a fragmentation method?
> extend to matched series?
> include "fuzzy" environment definitions based on pharmacophores (as BI people 
> did)?

I know of no plans for that.

As a consultant, my answer is more along the lines of "are you willing to pay 
for it?" :)

It won't be cheap.

Speaking of which, many thanks to Roche because they funded this work, just 
like that previously funded me to develop the MCS code that is now in RDKit.


> I am currently using the database file to explore the rules mainly via visual 
> inspection through spotfire by means of a series of joins to generate 
> suitable tables: would anybody be interested in this (also helping improving 
> it)?

I hope you find others and are able to get this out there.

One of the reasons for developing what eventually became mmpdb 2.0 is to make 
this sort of viewer possible.

That is, the unreleased mmpdb 1.0 didn't have canonical attachment point 
assignment, resulting in up to 6 different ways to represent a 3-cut fragment. 
This made it technically challenging to the simple sorts of analyses that mmpdb 
2.0 does now, and impossible to visualize meaningfully.

Cheers,

Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDKIT 2018.3 and MMPDB problem

2018-05-06 Thread Andrew Dalke
On Apr 27, 2018, at 00:20, Andrew Dalke <da...@dalkescientific.com> wrote:
> Please try out:
>  http://dalkescientific.com/mmpdb-2.1b1.tar.gz
> 
> or my fork at:
>  https://github.com/adalke/mmpdb
> 
> and let me know of any problems.

Has anyone downloaded and tested this code? I would like to push it to the main 
repository and start the process of doing a new release.

Cheers,

Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDKIT 2018.3 and MMPDB problem

2018-04-27 Thread Andrew Dalke
On Apr 27, 2018, at 00:20, Andrew Dalke <da...@dalkescientific.com> wrote:
> It does not appear that the .fragment files also need to be redone, so 
> rebuilding the .mmpdb file is mostly a matter of re-running the index step.

I no longer think that is correct. While indexing will work, the resulting MMP 
database can no longer be used for making predictions.

When you upgrade to RDKit 2018, you'll need to rebuild your mmpdb data sets 
from scratch.

The same will be true any time the RDKit canonicalization algorithm changes.


Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDKIT 2018.3 and MMPDB problem

2018-04-26 Thread Andrew Dalke
On Apr 26, 2018, at 12:38, Andrew Dalke <da...@dalkescientific.com> wrote:
> The automated mmpdb test suite isn't that good, so I still need to do some 
> manual testing. I won't be able to get to this until (hopefully) this evening.

I did that, and tracked down one more bug.

Please try out:
  http://dalkescientific.com/mmpdb-2.1b1.tar.gz

or my fork at:
  https://github.com/adalke/mmpdb

and let me know of any problems.

NOTE! Because of this change, anyone who switches from a pre-2018 RDKit to the 
most recent RDKit must rebuild the .mmpdb files.

It does not appear that the .fragment files also need to be redone, so 
rebuilding the .mmpdb file is mostly a matter of re-running the index step.




Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDKIT 2018.3 and MMPDB problem

2018-04-26 Thread Andrew Dalke
On Apr 26, 2018, at 10:09, Marco Stenta  wrote:
> 
> Dear Colleagues,
> I just installed on conda env the new rdkit version
>  and wanted to try mmpdb but upon testing I got the error below
> reverting back to rdkit=2017.09.3.0 it works fine (I still get some errors 
> but it goes thrugh)
> 
> It is a bug, am I doing anything obviously wrong?

You did nothing wrong.

The latest version of RDKit changed the SMILES output for wildcard atoms from 
"*" to "[*]".

I missed this change in the release notes. It's further documented at 
  https://github.com/rdkit/rdkit/pull/1788

The mmpdb code does a lot of manipulation of the SMILES at the syntax level, 
and parts of the code expected that wildcard atoms would only be in the form 
"[*]".

I've put together a first go at a fix, which should work for both older and the 
newest versions of RDKit.

The automated mmpdb test suite isn't that good, so I still need to do some 
manual testing. I won't be able to get to this until (hopefully) this evening.

Until then, if you could try out

  http://dalkescientific.com/mmpdb-2.1b1.tar.gz
or my fork of the repository at
  https://github.com/adalke/mmpdb

and let me know if there are failures, that would help me with the testing.


Cheers,


Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] RDKIT 2018.3 and MMPDB problem

2018-04-26 Thread Andrew Dalke
On Apr 26, 2018, at 10:09, Marco Stenta  wrote:
> 
> Dear Colleagues,
> I just installed on conda env the new rdkit version
> and wanted to try mmpdb but upon testing I got the error below
> reverting back to rdkit=2017.09.3.0 it works fine (I still get some errors 
> but it goes thrugh)
> 
> It is a bug, am I doing anything obviously wrong?

You did nothing wrong.

The latest version of RDKit changed the SMILES output for wildcard atoms from 
"*" to "[*]".

I missed this change in the release notes. It's further documented at 
 https://github.com/rdkit/rdkit/pull/1788

The mmpdb code does a lot of manipulation of the SMILES at the syntax level, 
and parts of the code expected that wildcard atoms would only be in the form 
"[*]".

I've put together a first go at a fix, which should work for both older and the 
newest versions of RDKit.

The automated mmpdb test suite isn't that good, so I still need to do some 
manual testing. I won't be able to get to this until (hopefully) this evening.

Until then, if you could try out

 http://dalkescientific.com/mmpdb-2.1b1.tar.gz
or my fork of the repository at
 https://github.com/adalke/mmpdb

and let me know if there are failures, that would help me with the testing.


Cheers,


Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] 2018.03.1 RDKit release

2018-04-25 Thread Andrew Dalke
On Apr 25, 2018, at 01:31, David Hall  wrote:
> You need to turn off RDK_INSTALL_INTREE

Thanks! I've put that my build notes for the next time I compile RDKit.

BTW, a quick benchmark of the new release shows that it's almost 15% faster at 
parsing SMILES strings than 2016.09.3

That is, parsing the SMILES strings from ChEBI went from 17.0 to 14.6 seconds.

Nice work!

Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] 2018.03.1 RDKit release

2018-04-24 Thread Andrew Dalke

> On Apr 23, 2018, at 10:43, Greg Landrum  wrote:
> 
> I'm pleased to announce that the next version of the RDKit - 2018.03 - is 
> released. The release notes are below.
  ...
> Please let me know if you find any problems with the release or have 
> suggestions for the next one, which is scheduled for Octobera 2018.

I'm likely one of the few who builds RDKit manually. I also build on a Mac but 
I don't use the system/homebrew Python or Boost but rather have my own 
installations under a virtual environment. Which makes updates "fun".

The build step produced errors like:

  "links to target 'Boost::serialization' but the target was not found."
  "links to target 'Boost::python3' but the target was not found."

The solution was to upgrade my cmake from 3.7.2 to 3.11.1.

Note however that the Install.md says:

   cmake. You need version 3.1 (or more recent)

This should be updated. I don't know which cmake version is the minimum 
required.



By the way, how do I install RDKit into a specified location? I usually expect 
something like --prefix /usr/local, and there's a CMAKE_INSTALL_PREFIX which 
defaults to "/usr/local" but "make install" puts it in the build directory, like

  /Users/dalke/ftps/rdkit-Release_2018_03_1


My solution is to install it myself, like:


cd ~/venvs/py36-2018-4/lib/python3.6/site-packages/
cp -rp ~/ftps/rdkit-Release_2018_03_1/rdkit .
python -m compileall rdkit
cd ~/venvs/py36-2018-4/lib
cp -p ~/ftps/rdkit-Release_2018_03_1/lib/lib* .


but the shared libraries uses a relative path:

  File 
"/Users/dalke/venvs/py36-2018-4/lib/python3.6/site-packages/rdkit/__init__.py", 
line 2, in 
from .rdBase import rdkitVersion as __version__
ImportError: 
dlopen(/Users/dalke/venvs/py36-2018-4/lib/python3.6/site-packages/rdkit/rdBase.so,
 2): Library not loaded: @rpath/libboost_python36.dylib
  Referenced from: 
/Users/dalke/venvs/py36-2018-4/lib/python3.6/site-packages/rdkit/rdBase.so
  Reason: image not found


% otool -L 
/Users/dalke/venvs/py36-2018-4/lib/python3.6/site-packages/rdkit/rdBase.so
/Users/dalke/venvs/py36-2018-4/lib/python3.6/site-packages/rdkit/rdBase.so:
@rpath/libRDKitRDBoost.1.dylib (compatibility version 1.0.0, current 
version 2018.3.1)
@rpath/libboost_python36.dylib (compatibility version 0.0.0, current 
version 0.0.0)
@rpath/libboost_serialization.dylib (compatibility version 0.0.0, 
current version 0.0.0)
@rpath/libRDKitRDGeneral.1.dylib (compatibility version 1.0.0, current 
version 2018.3.1)
/usr/lib/libc++.1.dylib (compatibility version 1.0.0, current version 
307.4.0)
/usr/lib/libSystem.B.dylib (compatibility version 1.0.0, current 
version 1238.0.0)


which means I've always had to tweak my virtualenv 'activate' command to 
set/unset DYLD_LIBRARY_PATH.

(It looks like there are ways to tweak the @rpath but I haven't figured it out.)

I would prefer to have something like

  cmake . --prefix ~/venvs/py36-2018-4
  make install

and skip all of this by-hand tweaking.

Does it exist already and I just missed it?


Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-23 Thread Andrew Dalke
On Apr 23, 2018, at 14:54, Brian Cole  wrote:
> Unfortunately it doesn't work on circular/ECFP-like fingerprints.

To be fair, you didn't mention that was a requirement. ;)

> It has the requirement that the fingerprint be a substructure fingerprint as 
> you described.

Could you elaborate on your goal?

I used RDKitFingerprint because it was the easiest. It was something I could do 
in a day to demonstrate that it is possible to reverse engineer some 
fingerprints.

I think it's possible to do something similar for circular fingerprints. It 
would mean generating all possible subgraphs of a given radius, which is doable 
for r=2 or r=3, and probably r=4. RDKit has a way to look at the circular 
environment around a a specific atom rather than the entire fingerprint, so 
that can be used to generate a seed point. Once that's found, it can be 
expanded to one of its neighbor atoms.

Another problem is that the Morgan fingerprint algorithm really wants sanitized 
structures, which I didn't need to worry about for the hash fingerprints.

Instead of a day of work, it's going to take a couple of weeks of work, which 
requires time and money.

My advice though is that it's surely possible to determine some structure 
information from the circular fingerprint. If your use case says there should 
be no information leak (other than what's possible by full 
brute-force-enumeration) then don't exchange fingerprints.

But leaking information is not really what I thought of by "reverse engineer".

For example, if I want to check if any of the Morgan fingerprints with r=2 
contain a phenol, I can ask RDKit to generate the fingerprint for r=2 using 
just the c(O)c as the fromAtoms. This gives:


% echo '*c1ccc(O)cc1 phenol' | rdkit2fps --from-atoms 3,4,5,6 --morgan
#FPS1
#num_bits=2048
#type=RDKit-Morgan/1 radius=2 fpSize=2048 useFeatures=0 useChirality=0 
useBondTypes=1 fromAtoms=3,4,5,6
#software=RDKit/2016.09.3 chemfp/3.2.1
#date=2018-04-23T14:03:24
00028200140044000200
phenol


I can then screen using that fingerprint to see which fingerprint match.

Of the first 100,000 structures in ChEMBL, 2216 contain phenol, all of the are 
detected by this screen, and there are no false positives.

Poof - structural information leakage.

The code is at the bottom of this email. It depends on the commercial version 
of chemfp.


> It seems the evolutionary/genetic algorithm approach is the current 
> state-of-the-art for decoding circular/ECFP-like fingerprints.

Dave Cosgrove mentioned Dave Weininger's GA work, which means it was with 
Daylight hash fingerprints. I don't think we know that GAs have ever been used 
to reverse engineer circular fingerprints.


> Historical question for you since you're the closest we have to a 
> chem-informatician historian. :-) Why did these circular/ECFP fingerprints 
> come into existence?

I believe you are asking for https://pubs.acs.org/doi/abs/10.1021/ci100050t .

  Extended-connectivity fingerprints (ECFPs) are a novel class of topological
  fingerprints for molecular characterization. Historically, topological
  fingerprints were developed for substructure and similarity searching.
  ECFPs were developed specifically for structure−activity modeling.

> my reading of the current literature is that tree/dendritic are statistically 
> just as good at virtual screening as circular/ECFP: 

Yeah, I don't go there. I leave concepts like "just as good" or "better" to 
people who have experimental data they can use for the comparison.


Andrew
da...@dalkescientific.com

== Code to find which Morgan fingerprints contain a phenol substructure ==

import chemfp
from chemfp import bitops, search

arena = chemfp.load_fingerprints("chembl_23_morgan.fps", reorder=False)
print("Fingerprint type:", arena.metadata.type)

# Want to find structures containing phenol

# Adjust the fingerprint type to limit it to the given atoms
fptype = chemfp.get_fingerprint_type(arena.metadata.type + " fromAtoms=3,4,5,6")
query_fp = fptype.parse_molecule_fingerprint("*c1ccc(O)cc1", "smi")

print("Query fingerprint:")
print(bitops.hex_encode(query_fp))
print()

# Find the matching fingerprints
result = search.contains_fp(query_fp, arena)

circular_ids = set(result.get_ids())

# Search the first 100,000 structures
from rdkit import Chem
from chemfp import rdkit_toolkit as T

pat = Chem.MolFromSmarts("*c1ccc(O)cc1")
all_ids = set()
exact_ids = set()

Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-22 Thread Andrew Dalke
On Apr 22, 2018, at 20:22, Nils Weskamp  wrote:
> Actually, I *was* also thinking about your use cases 2 and 3 since you
> also need some form of hash function to map substructures to bit
> numbers. This is normally a rather simple function / pseudo random
> generator, 

Strictly speaking, this is not a requirement.

The term "fingerprint" has taken on quite an encompassing meaning since 1990.

The molecular formula is a count fingerprint with 118 keys, based on the atomic 
number. There's no need for hash function there. "CCO" might be:
  [0, 0, 0, 0, 0, 2, 0, 1, ...]

Or it can be written in more compact form like {"C": 2, "O": 1}.

As an alternative, I could use a mapping from canonical substructures to 
counts, so "CCO" becomes:

  {"C": 2, "O": 1, "CC": 1, "CO": 1, "CCO": 1}

This doesn't require a hash. (While I represent that as a Python dictionary, 
which uses a hash table underneath, it could be implemented using a red-black 
tree or B-tree, or with a simple linear search.)

It's only if I want to convert this into fixed length representation where I 
have to figure out some sort of encoding scheme.

Even then, I don't need a PRNG or hash seed. Suppose I use a bit vector. I 
could have a table which maps all canonical substructures to its bit pattern. 
If I have an unknown fragment, I could use RANDOM.ORG to get the bits.

Downsides include potentially unbounded table growth and the need for a 
centralized table.

This is the approach that Zatocoding used, and I see Chemical Zatocoding as the 
only precursor to Daylight hash fingerprints.

>  which could of course also be changed to something expensive to calculate.


Yes, that could be possible. Abstractly, let the first 20 bytes of each 
fingerprint be a salt, and use something like bcrypt so each fingerprint test 
requires that the query structure be re-fingerprinted for the per-fingerprint 
hash function.

It would, however, take an absurdly long time to do a similarity search.

And in any case, before going further along that path, we would need to figure 
out the risk model. Brian started by saying that he wanted to obfuscate 
molecules for security, but didn't say what he want to use them for, and if he 
want to secure them against nation-states, or simply against me. ;)



Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-22 Thread Andrew Dalke
On Apr 22, 2018, at 08:42, Nils Weskamp  wrote:
> Nice work. If brute-force approaches like this (or methods based on
> genetic algorithms etc.) are the only way to reverse a fingerprint, one
> could probably come up with a fingerprint that allows for pretty secure
> structure sharing by using many iterations of a strong cryptographic
> hash function that is really slow to calculate.

I usually think of "brute force" as something more like "enumerate all possible 
structures, generate the fingerprint for each one, and compare." I think of 
what I did here as a bit more elegant than that. ;)

I think of fingerprints as being applied to three uses:

1) identity test
if mol1 == mol2 then fp(mol1) == fp(mol2)
=> if fp(mol1) != fp(mol2) then mol1 != mol2

A good identity fingerprint has the properties that the false positive rate of
 fp(mol1) == fp(mol2) but mol1 != mol2
is low, and that working with fingerprints gives advantages over a graph 
isomorphism check.

2) Substructure screening
  if mol1 is a subgraph of mol2 then fp(mol1) is a subset of fp(mol2).
(There are different definitions of 'subset' for different fingerprint 
types.)

  An effective substructure screen fingerprint has the properties that:
  fp(mol1) is a subset of fp(mol2) => a higher likelihood that mol1 is a 
subgraph of mol2
   -and-
  fingerprint subset test is faster than subgraph isomorphism testing

3) Similarity comparison
  Use similarity of fp(mol1) and fp(mol2) as a proxy/estimate of the similarity 
of mol1 and mol2.

Usually we also assume that computing the similarity of two fingerprints is 
fast.


A cheminformatics fingerprint usually supports #1 and one or both of #2 or #3. 

If we only need #1, which is the use case Nils brought up, then we could use a 
SHA256 of the canonical SMILES string, or use the InChI Key. These are 
fixed-length binary fingerprints which can only be used for identity testing, 
and would give a low false positive rate.

The structure leakage comes from needing support for #2 and/or #3.

I don't see any reasonable way to make a fingerprint that can do #2 and/or #3 
without being open to some sort of enumeration scheme that is more clever than 
brute force.

Possibly some scheme related to homomorphic encryption might work? As I 
understand it, this would be unreasonable slow for what most people expect from 
fingerprints.

Cheers,

Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-21 Thread Andrew Dalke
On Apr 21, 2018, at 01:55, Andrew Dalke <da...@dalkescientific.com> wrote:
> Hand-waving sketch: start with a carbon. Generate fingerprint. It should pass 
> the screening test. If not, the structure contains no carbons, so repeat with 
> other elements until you find an atom which passes. Successively either add 
> an atom+bond or connect two existing atoms with a bond, fingerprint the 
> result, and do the screening test. If it does not pass then that modification 
> was not permitted. Use a breadth-first search which prioritizes branching and 
> rings to avoid chains longer than the maximum enumeration size.

Here's an implementation of that sketch, applied to the RDKit hash fingerprint:

  http://dalkescientific.com/rev_eng_fp.py

It works well for small structures:

% python rev_env_fp.py
No SMILES given. Using caffeine.
Current best guess is C=C with 2 bits of 759
Current best guess is Cc=O with 6 bits of 759
Found! Cn1c(=O)c2c(ncn2C)n(C)c1=O

Here's aspirin:

% python rev_env_fp.py 'O=C(C)Oc1c1C(=O)O'
Found! CC(=O)Oc1c1C(=O)O

Capsicum is close, only missing a methyl in the tail.

% python rev_env_fp.py 'O=C(NCc1cc(OC)c(O)cc1)C=CC(C)C'
Current best guess is CNC(=O)C=CC(C)C with 100 bits of 384
Current best guess is CC=CC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384
Best guess is CC=CC(=O)NCc1ccc(O)c(c1)OC with 376 bits of 384


For omeprazole it only finds half of the structure:

% python rev_env_fp.py 'COc1ccc2nc([nH]c2c1)S(=O)Cc1ncc(C)c(OC)c1C'
Current best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863
Best guess is Cc1c(C[SH]=O)ncc(C)c1OC with 469 bits of 863

For estradiol it gets stuck finding another cyclohexane instead of the 
cyclopentane:

% python rev_env_fp.py 'C[C@]12CC[C@@H]3c4ccc(cc4CC[C@H]3[C@@H]1CC[C@@H]2O)O'
Current best guess is CC12CCC(O)C21C with 163 bits of 583
Current best guess is CC12(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583
Best guess is CC12(C1)C1c3ccc(O)cc3CCC1C2 with 477 bits of 583


Note: it's currently set up to only consider the elements
  ["C", "c", "O", "o", "N", "n", "S", "s", "F", "Cl", "Br"]

Edit the 'elements' list of you want to include more possibilities. This is 
more likely to run into a dead-end.


The current code assumes that when I grow by one atom, if fp(mol + new atom) is 
a subset of the target fingerprint, then mol + new_atom is a subgraph of the 
target structure.

This can be resolved by setting up a search tree, but then it needs to be more 
careful about backtracking and pruning, and that's too much work for an evening 
of programming.

Cheers,


Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Any known papers on reverse engineering fingerprints into structures?

2018-04-20 Thread Andrew Dalke
On Apr 20, 2018, at 19:03, jeff godden  wrote:
> 
> Long ago molecular fingerprints were referred to in the literature as 
> molecular hash functions. (y'know, those crazy mathematical algorithms which 
> permitted rapid lookup of some string in a lookup table)

Do you have a reference or any more detail for that?

As far as I can tell, effectively everyone used fragment-based screens from the 
punched-card era of the 1940s up until the late 1980s. It wasn't until the 
early 1990s when the word "fingerprint" appears in the cheminformatics 
literature, in a paper by John Barnard referencing the Daylight fingerprints.

I haven't come across any use of molecular hash functions for fingerprint-like 
descriptors before the Daylight work, and I looked pretty hard for one.

Instead, I get the feeling that there was some attempt in the 1990s to 
distinguish between these two approaches, and then over time the term 
"fingerprint" took on a much broader meaning than "Daylight's binary 
fingerprint based on hash-encoding of enumerated linear subgraphs." That is, I 
think that "molecular hash functions" post-date "fingerprint".

FWIW, a pubs.acs.org search for "molecular hash" found only three papers; one 
from 2005 and the other two from 2015.

The closest exception to to an earlier Daylight-like fingerprint is the 
superimposed coding created by Mooers in the 1940s, mentioned in Ray and Kirsch 
(1957), put to use in Feldman and Hodes (1975), and further used at a couple of 
places since then.

These *are* connected; Mooers in his "Codes and Coding" entry from 
"Encyclopedia of Library and Information Science" remarks: A rudimentary form 
of use of random patterns was discovered by computer people about 10 years 
later for fast look-up in tables. This simpler form, which effectively uses 
only a single descriptor, is known in the computer industry as "hash coding."" 
Weininger and Mooers both drew from the same information theory concepts to 
develop fingerprints and superimposed coding, respectively.

But they aren't the same.

I found hashing used for other molecular-related topics, like WLN and IR 
spectra lookups, but these didn't seem to be *molecular* hash functions.


>  As such, we expected for their to be the associated hash collisions  
> (https://en.wikipedia.org/wiki/Hash_table#Collision_resolution ).

As Peter Shenkin pointed out, this isn't a given.

In the MMP code I helped develop (mmpdb - https://github.com/rdkit/mmpdb ), one 
of the novel features is the ability to match not just the pairs but the local 
attachment environment, based on the circular environment of r=1 up to r=5 from 
the attachment point.

I created a fingerprint for that based on the fingerprints for the individual 
circular environments, concatenated together, and then SHA256'ed to get a 
unique characteristic.

Unlike the Daylight approach, this fingerprint can only be used to check for 
identity. The requirement that a fingerprint be used for similarity and 
substructure screens makes them much larger than needed for a simple identity 
check.

And as Dave Cosgrove rightly pointed out, this extra information makes it 
possible to reverse engineer a (Daylight-style hash fingerprint) to find a 
molecular graph which is at least isospectral to the original structure.

Hand-waving sketch: start with a carbon. Generate fingerprint. It should pass 
the screening test. If not, the structure contains no carbons, so repeat with 
other elements until you find an atom which passes. Successively either add an 
atom+bond or connect two existing atoms with a bond, fingerprint the result, 
and do the screening test. If it does not pass then that modification was not 
permitted. Use a breadth-first search which prioritizes branching and rings to 
avoid chains longer than the maximum enumeration size.

You'll also need to allow aromatic atoms in a non-ring so you can do the growth 
correctly. 

ECFP-style circular fingerprints are not designed for substructure screens so 
cannot be reverse-engineered this easily. It would be interested to try the GA 
method that Dave Cosgrove suggested.

I know of no papers concerning this topic, and I doubt that Dave Weininger ever 
published anything about it. He wasn't much into publishing in the scientific 
literature. 


Going back to the mmpdb environment fingerprint, it was also designed so that 
organizations can feel a bit more comfortable sharing MMP data with other 
organizations, since (like the InChI-Key) it's not possible to guess what an 
mmpdb environment fingerprint describes unless you 1) have it already, or 2) 
are willing to brute-force reasonable chemistry space to find it.

Interestingly, this use of "fingerprint" is more closely aligned with Rabin's 
1981 work on fingerprints - cryptographic hashes used for identify checking; 
see http://www.xmailserver.org/rabin.pdf - than with Daylight fingerprints.

When I asked a few years ago, Dave Weininger did not recall how they 

Re: [Rdkit-discuss] issue during parsing a smile

2018-04-16 Thread Andrew Dalke
On Apr 16, 2018, at 16:29, Guillaume GODIN  
wrote:
> And for this one C[C@@]12CC[C@@](C)(CC1)O2O any idea
> 
> Cause your tool failed too.

It's true that smiview failed, in the sense that it shouldn't have tried to do 
further analysis with a molecule that RDKit rejected.

However, RDKit does report the failure as "Explicit valence for atom # 8 O, 3, 
is greater than permitted":

% smiview 'C[C@@]12CC[C@@](C)(CC1)O2O'
[16:45:37] Explicit valence for atom # 8 O, 3, is greater than permitted
Unable to generate a molecule: RDKit cannot parse the SMILES
Traceback (most recent call last):
  File "/Users/dalke/venvs/py36-all/bin/smiview", line 11, in 
load_entry_point('smiview', 'console_scripts', 'smiview')()
   failure report that smiview should have prevented from happening ...


A hypothetical future version of smiview will capture that RDKit error message 
so it can pinpoint the error location more visually.

Cheers,


Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] issue during parsing a smile

2018-04-16 Thread Andrew Dalke
If you try this out with my smiview package, available from 
https://bitbucket.org/dalke/smiview/downloads/ , it reports:


% smiview 'C\(C(C)C)=N/O'
Cannot parse --smiles: Unexpected term
  C\(C(C)C)=N/O
^ Tokenizing stopped here
A bond must be followed by an atom, closure.

That is, the bond symbol '\' may not be followed by a branch. (And I should 
improve that error message so it's more explicit.)

I think you are looking for the SMILES "C(\C(C)C)=N/O".

Andrew
da...@dalkescientific.com





> On Apr 16, 2018, at 16:15, Guillaume GODIN  
> wrote:
> 
> Dear All,
>  
> Anyone know why this smile is not parse correctly ?
>  
> C\(C(C)C)=N/O
>  
> Best regards,
>  
> Guillaume
> ***
> DISCLAIMER  
> This email and any files transmitted with it, including replies and forwarded 
> copies (which may contain alterations) subsequently transmitted from 
> Firmenich, are confidential and solely for the use of the intended recipient. 
> The contents do not represent the opinion of Firmenich except to the extent 
> that it relates to their official business.  
> ***
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! 
> http://sdm.link/slashdot___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] reassembling a molecule from R-groups

2018-04-16 Thread Andrew Dalke
On Apr 16, 2018, at 05:37, Patrick Walters  wrote:
> 
> Thanks Andrew, the SMILES approach seemed to have quite a few edge cases so I 
> wrote something to work directly on a molecule. 

That's the approach I started with, until I figured out that it doesn't 
preserve chirality.

If I change the end of your code to:

==
from mmpdblib import smiles_syntax

def weld_dalke(core, r_groups):
s1 = smiles_syntax.convert_labeled_wildcards_to_closures(core)
s2 = smiles_syntax.convert_labeled_wildcards_to_closures(r_groups)
return Chem.CanonSmiles(s1+"."+s2)

if __name__ == "__main__":
mol_to_weld = Chem.MolFromSmiles(
"[*:1][C@](F)(Cl)O.N[*:1]")
welded_mol = weld_r_groups(mol_to_weld)
print("Expected  :", Chem.CanonSmiles("N[C@](F)(Cl)O"))
print("Direct:", Chem.MolToSmiles(welded_mol, isomericSmiles=True))
print("Via SMILES:", weld_dalke("[*:1][C@](F)(Cl)O", "N[*:1]"))
==

These should print identical SMILES strings, but instead give:

Expected  : N[C@](O)(F)Cl
Direct: N[C@@](O)(F)Cl
Via SMILES: N[C@](O)(F)Cl


If chirality preservation isn't a concern, then there's no problem.

BTW, your current code assumes there will only be one attachment point on an 
atom. For example, the input
 [*:1][C@]([*:2])(Cl)O.N[*:1].F[*:2]
create the output
 N.O[C](F)Cl

It's not hard to fix, and I think more of a d'oh! issue.


In a quick benchmark I put together just now, I found that my SMILES syntax 
manipulation approach was about twice as fast to turn the two core/R-group 
SMILES strings into a molecule.

Cheers,


Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] reassembling a molecule from R-groups

2018-04-15 Thread Andrew Dalke
Hi Pat,

  I wrote something like this for mmpdb, which is the MMPA code I helped 
develop, at https://github.com/rdkit/mmpdb .

It has one restriction, which I'll get to in a moment.

The general idea is to convert the attachment points to closures, join them 
with a ".", and canonicalize:

>>> from mmpdblib import smiles_syntax
>>> s1 = 
>>> smiles_syntax.convert_labeled_wildcards_to_closures("CN(C)CC(Br)c1cc([*:2])c([*:1])cn1")
>>> s1
'CN(C)CC(Br)c1cc%92c%91cn1'
>>> s2 = 
>>> smiles_syntax.convert_labeled_wildcards_to_closures("[H]C([*:1])([H])[H].[H][*:2]")
>>> s2
'[H]C%91([H])[H].[H]%92'
>>> from rdkit import Chem
>>> Chem.CanonSmiles(s1+"."+s2)
'Cc1ccc(C(Br)CN(C)C)nc1'

The smiles_syntax.py file does not use any of the rest of the code.


The restriction is that the code as-is assumes the wild card atoms like [*:1] 
are either immediately before or after the attachment point. Otherwise it will 
give you (using the R-groups you actually posted):

>>> s2 = 
>>> smiles_syntax.convert_labeled_wildcards_to_closures("[H]C([H])([H])[*:1].[H][*:2]")
Traceback (most recent call last):
  File "", line 1, in 
  File "/Users/dalke/cvses/mmpdb/mmpdblib/smiles_syntax.py", line 130, in 
convert_labeled_wildcards_to_closures
return convert_wildcards_to_closures(new_smiles, offsets)
  File "/Users/dalke/cvses/mmpdb/mmpdblib/smiles_syntax.py", line 98, in 
convert_wildcards_to_closures
new_smiles, new_smiles[wildcard_start-1:])
NotImplementedError: ('intermediate groups not supported', 
'[H]C([H])([H])[*].[H][*]', ')[*].[H][*]')

All this means is I didn't write the code to count the number of intermediate 
branches/matched parentheses between the attachment point a and the wildcard 
atom. ("Count" because I would need to invert any chirality on the base atom if 
there were an odd number of intermediate groups.) Such code wouldn't be hard to 
add.

It's not there because my experience is that RDKit only placed the "*" atoms in 
one of those two locations. However, as I just learned, if you leave the 
hydrogens in then the [H] atoms have priority:

>>> Chem.SanitizeMol(mol,Chem.SANITIZE_ALL^Chem.SANITIZE_CLEANUP^Chem.SANITIZE_PROPERTIES)
rdkit.Chem.rdmolops.SanitizeFlags.SANITIZE_NONE
>>> Chem.MolToSmiles(mol)
'[H]C([H])([H])[*:1]'

Then again, explicit [H] atoms aren't important for your end goal, so you could 
just recanonicalize all of your R-groups first, to ensure they are in the RDKit 
form, then use the SMILES rewriter.

For what it's worth, I coined the term "welding" to describe this technique of 
converting the labeled R-groups into ring-closures, then "." (dis)connected 
them to parse them as a single odd-looking SMILES.

Andrew
da...@dalkescientific.com




> On Apr 15, 2018, at 21:16, Patrick Walters  wrote:
> 
> Hi All,
> 
> I was about to write a function to reassemble a molecule from a core + 
> R-groups, but I thought I'd check and see if such a function already exists.  
> This is work with the output of rdRGroupDecomposition
> 
> Gvien a core:
> CN(C)CC(Br)c1cc([*:2])c([*:1])cn1
> 
> Plus a set of R-groups
> [H]C([H])([H])[*:1]
> [H][*:2]
> 
> Reconnect the pieces to generate a molecule
> CN(C)CC(Br)c1ccc(C)cn1
> 
> Thanks,
> 
> Pat


--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] [Rdkit-announce] [Announcement] 7th RDKit UGM in Cambridge UK

2018-04-11 Thread Andrew Dalke
On Apr 7, 2018, at 07:13, Greg Landrum <greg.land...@gmail.com> wrote:
> Andrew Dalke (Dalke Scientific) will offer a course on Python and the RDKit

I need to finalize what I'm going to cover. I've been going between two 
approaches.

  1) Python programming for cheminformatics

This is meant for someone who has some programming experience but has not
worked with Python before. It will cover the basics of the language (variables,
for loops, defining functions), with examples based on the RDKit. The students
will work in an IPython notebook, and the exercises will also touch on pandas
and a few other important tools.

  2) Introduction to the RDKit and its ecosystem

This is meant for people who have some Python programming experience and
want to learn the basics of RDKit and how it works with other tools like
IPython, pandas, and Postgres.


I don't have a good sense of which approach is more helpful to the sorts of
new(ish) RDKit users who would go to an RDKit UGM.

I lean towards the first because my experience (when I taught similar courses
5-10 years ago) was that many students had no experience with Python, and
certainly it wasn't something they learned in any formal sense.

But times may have changed.

If you have feedback or commentary, please let me know by private email.

Thanks!


Andrew
da...@dalkescientific.com



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] smiview 1.2

2018-04-03 Thread Andrew Dalke
About 10 days ago I posted a prototype program called 'smiview', which displays 
information about the structure of a SMILES string.

Thanks to feedback from a couple of users, and a deep urge to explore the idea, 
I've just released smiview 1.2, available from 
https://bitbucket.org/dalke/smiview/downloads/smiview-1.2.tar.gz .

For details about what it can do, see the README at 
https://bitbucket.org/dalke/smiview .

Some of the changes are:
 - lots of bug fixes
 - the SMILES tokenizer will now try to parse the contents of an atom in []s
 - the atom indicators now point to the first character of the element symbol(s)
 rather than the first character of the atom token (e.g., the "C" in 
"[35Cl]"
 and not the "[")
 - the 'closures' track now highlights the atoms involved in a minimal cycle
 for a closure, rather than the SMILES string between the two closure points
 - more control over some of the styles
 - there is now code to generate the molecular graph, which means smiview can 
also
report errors like C11 (closure to itself) and C1C1 (two bonds between 
atoms)
 - new tracks, like "hcounts" to show the number of implicit hydrogens on each 
atom,
and "symclasses" to show each atom's symmetry class
 - support for both RDKit and OEChem, or no toolkit, albeit with reduced 
functionality
 - options to modify the input SMILES so all atoms have explicit hydrogen 
counts,
 and to set the isotope and atom class fields base on the atom index, 
symmetry
 class, or element number.
 - cleaned up and re-organized the internals. It now uses an experimental 
property
calculation dependency system, and has a "track manager" to organize the 
tracks.

Here's what it looks like with most of the tracks enabled (which is rather 
overwhelming):

% smiview 'Cn1c(=O)c2c(ncn2C)n(C)c1=O' --fancy
┌   1 1 1  1
   atoms│ 01 2  3 4 5 678 9 0 1 2  3
└ || |  | | | ||| | | | |  |
byte offsets┌   1122
└ 050505
 token types[ AA%A(BA)A%A(AAA%A)A(A)A%BA
  SMILES[ Cn1c(=O)c2c(ncn2C)n(C)c1=O
 hcounts[ 30 0  0 0 0 010 3 0 3 0  0
branches┌*(..)  *(.)
└   *(.)
closures┌  *1** *   *   *1
└ *2*.***2 .
   fragments[ 00
  symclasses┌ 01 7  3 9 1 651 1 1 2 8  4
└  10   2   3 

I'll focus on just the closures, and give more emphasis to the element symbols 
which make up either end of the closure (marked with a "*") while the other 
atoms in the closure ring are marked with an "x":

% smiview 'Cn1c(=O)c2c(ncn2C)n(C)c1=O' -b closures --closure-atom-style 
end-elements
┌   1 1 1  1
   atoms│ 01 2  3 4 5 678 9 0 1 2  3
└ || |  | | | ||| | | | |  |
  SMILES[ Cn1c(=O)c2c(ncn2C)n(C)c1=O
closures┌  *1xx x   x   *1
└ *2x.xx*2 .

With a bit of counting of *'s and x's you can see there's a ring of size 6 and 
another of size 5.

Here's an example of the input syntax processing; I'll convert all of the atoms 
to use the bracket form, by adding the correct hydrogen count to each 
non-bracket atom:

% smiview 'Cn1c(=O)c2c(ncn2C)n(C)c1=O' --use-brackets -a input-smiles -b none 
--width 80
input smiles[  Cn 1 c (= O ) c 2 c ( n  c   n 2 C   ) n ( C   ) c 1= O
  SMILES[ [CH3][n]1[c](=[O])[c]2[c]([n][cH][n]2[CH3])[n]([CH3])[c]1=[O]

If you want the modified SMILES string from another program, or to copy 
it, then turn off the legend and use a large enough width. Here I'll also set 
the isotope to the atom index+1, which might be used as a way to tag the atoms:

% smiview 'Cn1c(=O)c2c(ncn2C)n(C)c1=O' --set-isotope index+1 -a none -b none 
--width 10 --legend off
[1CH3][2n]1[3c](=[4O])[5c]2[6c]([7n][8cH][9n]2[10CH3])[11n]([12CH3])[13c]1=[14O]

Let me know what you think.


Andrew
da...@dalkescientific.com


--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


[Rdkit-discuss] smiview 1.1 - a console tool to view SMILES strings

2018-03-24 Thread Andrew Dalke
Over the last few days I've developed a command-line tool that I call "smiview".

It's a SMILES viewer. It isn't a depiction tool where the input is in SMILES 
but rather a tool to highlight different aspects of the SMILES string.

I'll put some examples at the end. If you want to try it out you can download 
it from

  https://bitbucket.org/dalke/smiview/downloads/smiview-1.1.tar.gz

or see the README from the project page at

  https://bitbucket.org/dalke/smiview

As the README says, this was mostly built for fun. I would like to know if you 
use it for serious work.

I have no long terms plan for this project.

I think it would be cool (perhaps for pedagogical reasons) to have a GUI 
version in an IPython notebook, tied to a graphical depiction so you can see 
the connection between the SMILES terms and the depiction. I don't have those 
skills, so if it's something you want to do, you might look to this code as a 
starting point.

It was developed under Python 3 and it seems to work under Python 2.7.

Enjoy!


Andrew
da...@dalkescientific.com


The default view highlights the atom locations (and numbers them), shows the 
branches and the atom that the branches are attached to, shows the closures 
(highlighting the closure atom and the closure indicator for both side of the 
closure), and shows you which part of the SMILES string correspond to which 
fragments.

% smiview 
'C#CCC[N+](C)(C)[N+](C)(C)CCC#C.Cc1ccc(S(=O)(=O)[O-])cc1.Cc1ccc(S(=O)(=O)[O-])cc1'
 ┌   112  2 222 2 22 223 3  3   3 3
atoms│ 0 12345  6 78901234567890  1 234 5 67 890 1  2   3 4
 └ | |  | ||  | ||| | || ||| |  |   | |
   SMILES[ C#CCC[N+](C)(C)[N+](C)(C)CCC#C.Cc1ccc(S(=O)(=O)[O
 branches┌  *---(.)(.)*---(.)(.)   *(...
 └   *(..)(..)
 closures┌1*↑ 1 
 └
 ┌ 00
fragments│11
 └

 ┌33  33 344 4  4   4 444
atoms│56  78 901 2  3   4 567
 └||  || ||| |  |   | |||
   SMILES[ -])cc1.Cc1ccc(S(=O)(=O)[O-])cc1
 branches┌ ..) *(.)
 └   *(..)(..)
 closures┌ *↑1
 └1*↑ 1 *↑1
 ┌
fragments│ 11
 └


(Thanks to Greg for feedback on the 'branches' visualization, and for 
suggesting the 'fragments' track.)


If you give it a SMARTS pattern it will show where the corresponding atoms are:

% smiview 'CC1CC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O)C(=O)CO' --smarts 
'*=O'
   ┌  1 1 11  1 1  1 1 1 12  2 2  2 2  2 22
  atoms│ 01 23 4 567  89  0 1 23  4 5  6 7 8 90  1 2  3 4  5 67
   └ || || | |||  ||  | | ||  | |  | | | ||  | |  | |  | ||
 SMILES[ CC1CC2C3CCC4=CC(=O)C=CC4(C)C3(F)C(O)CC2(C)C1(O)C(=O)CO
match 1[   *  *
match 2[*  *


If you give it an atom index, it shows the neighbors around that index, along 
with a summary of how the atom is attached to those neighbors:


% smiview 'NC(=O)c1nonc1CNC(c1c[nH]cn1)C1CCNCC1' --atom-index 10
 ┌1 1 11   11  1 11122
atoms│ 01  2 3 4567 890 1 23   45  6 78901
 └ ||  | |  ||| | ||   ||  | |
   SMILES[ NC(=O)c1nonc1CNC(c1c[nH]cn1)C1CCNCC1
neighbors┌   ^X ^  ^
 └C(-N9)(-c11)(-C16)


You can modify how things look. In the following I'll have it show the offset 
to each character in the SMILES string on the top, and the atom indices on the 
bottom.


% smiview 'NC(=O)c1nonc1CNC(c1c[nH]cn1)C1CCNCC1' -a offsets -b atoms -b closures
byte offsets┌   112233
└ 05050505
  SMILES[ NC(=O)c1nonc1CNC(c1c[nH]cn1)C1CCNCC1
┌ ||  | |  ||| | ||   ||  | |
   atoms│ 01  2 3 4567 891 1 11   11  1 11122
└0 1 23   45  6 78901


Here's the output for buckminsterfullerene:

% smiview 
'c12c3c4c5c1c6c7c8c2c9c1c3c2c3c4c4c%10c5c5c6c6c7c7c%11c8c9c8c9c1c2c1c2c3c4c3c4c%10c5c5c6c6c7c7c%11c8c8c9c1c1c2c3c2c4c5c6c3c7c8c1c23'
 --legend once
 ┌   1 1 1 1 1 1 1   1 1 1 2 2 2 2   2 2 2 2
atoms│  0  1 2 3 4 5 6 7 8 9 0 1 2 3 4 5 6   7 8 9 0 1 2 3   4 5 6 7
 └  |  | | | | | | | | | | | | | | | |   | | | | | | |   | | | |
   SMILES[  c12c3c4c5c1c6c7c8c2c9c1c3c2c3c4c4c%10c5c5c6c6c7c7c%11c8c9c8c
 branches[
 ┌ 1*↑---*↑1  8*↑-- 8 --- 8 --- 8 ---*↑8  9*
 │ 2*-↑- 2 --*↑22*↑-- 2 --- 2 --- 2 
 │3*↑--- 3 *↑34*↑--- 4 --- 4 ---
 │  

  1   2   3   >