Hi all,

  I have added experimental support for CDK in chemfp. I'm looking for friendly 
users to try it out and provide feedback, using a pre-compiled package for 
Python 3.7 or later, on Linux-based OSes.

Chemfp is a Python package for fingerprint generation and high-performance 
similarity search. It depends on a third-party package to handle chemistry, 
which traditionally were Open Babel, RDKit, and Open Eye's OEChem/OEGraphSim.

The new CDK support in principle includes:

 fingerprint types: CDK-AtomPairs2D, CDK-Daylight, CDK-ECFP0, CDK-ECFP2, 
CDK-ECFP4, CDK-ECFP6, CDK-EState, CDK-Extended, CDK-FCFP0, CDK-FCFP2, 
CDK-FCFP4, CDK-FCFP6, CDK-GraphOnly, CDK-Hybridization, CDK-MACCS, CDK-Pubchem, 
CDK-ShortestPath, CDK-Substructure

 input formats: smi, can, usm, sdf, smistring, canstring, usmstring, molfile, 
inchi, inchistring, pdb, cif, cml, ctx, crystclust, gamess, xyz

 output formats: smi, can, usm, sdf, smistring, canstring, usmstring, molfile, 
inchi, inchikey, inchistring, inchikeystring, pdb, cdkjava, cml, crystclust, xyz

(In chemfp's format nomenclature, "smi", "can", and "usm" read a SMILES with 
optional id as input, but "smi" generates a canonical isomeric SMILES as 
output, "can" generates a canonical non-isomeric SMILES, and "usm" generate a 
non-canonical, isomeric SMILES. The "*string" variants ignore any possible id.)

NOTE: The CDK support is only lightly-tested. You'll need to double-check 
everything before you can trust the results, and there are likely many ways to 
make it crash. My goals with this experiment are to see:

   1) if there's further interest in it (especially from those willing to pay 
for a license);

   2) did I get the chemistry handling correct (ShortestPath fingerprint with 
input from the SMILES "c1ccccc1O" results in "java.lang.NullPointerException: 
atom has unset atom type");

   3) which structure formats should I support / should I ignore? ;

   4) how to handle Java packages from Python.

   5) how CDK works, and more importantly, how it fails.
 

Chemfp is NOT a no-cost/open-source package. The base license agreement is 
https://chemfp.com/BaseLicense.txt . There is also a no-cost academic license, 
the option to get the source code with a proprietary license, or a more 
expensive option to get the source code under the MIT license. If that's okay 
with you then you can install it with:

  python -m pip install JPype1
  python -m pip install --upgrade chemfp -i https://chemfp.com/experimental_cdk

The JPype1 package installs the adapter that chemfp uses to work with the CDK 
jar file, which must be on your CLASSPATH.

The https://chemfp.com/experimental_cdk/ site has pre-built packages for Python 
3.7, 3.8, and 3.9 on Linux-based OSes.

In case the license is an issue, or you want to jump right to the relevant 
source code, I placed a copy of those CDK-specific files at

   https://chemfp.com/experimental_cdk/chemfp/chemfp-cdk-toolkit.tar.gz

These are distributed under the MIT license.

Acknowledgements
=================

Many thanks to Noel O'Boyle for cinfony, which greatly helped me understand how 
to work with CDK from Python via Jpype, and thanks to John Mayfield for his 
pointers and feedback.


Examples
========

I figure people here might not have used chemfp before, so I'll give some 
examples of using it from the command-line and from Python.

The default reads a SMILES file from stdin and generates "CDK-Daylight" 
fingerprints in FPS format to stdout. The "Daylight" fingerprint is a a 
cdk.fingerprint.Fingerprinter() instance. I followed Noel's/cinfony's lead in 
calling it "Daylight" instead of "Fingerprint". Let me know if that's an issue.

% echo "c1ccccc1 phenol" | cdk2fps
#FPS1
#num_bits=1024
#type=CDK-Daylight/2.0 size=1024 searchDepth=7 pathLimit=42000 hashPseudoAtoms=0
#software=CDK/2.3 chemfp/3.5a1
#date=2020-11-20T10:50:58
0000000000000000000000800000000000000000020000000000200000000020000000000000000000000000000000000000000000000000000000000000000000000000000000400000000000000000000000000000000000000000400000000000000000000000000000000000000000000000000000000000000000000000
        phenol


I'll generate MACCS keys for ChEBI:


% cdk2fps --MACCS ChEBI_complete.sdf.gz
ERROR: Missing title in SD record, file 'ChEBI_complete.sdf.gz'. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_complete.sdf.gz'. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_complete.sdf.gz'. Skipping.
ERROR: Missing title in SD record, file 'ChEBI_complete.sdf.gz'. Skipping.
  ...

Oh, that's right - ChEBI doesn't use the title line of the SDF record, so I 
need to get it from the one of the tag fields:

% cdk2fps --MACCS ChEBI_complete.sdf.gz --id-tag "ChEBI ID"
#FPS1
#num_bits=166
#type=CDK-MACCS/2.0
#software=CDK/2.3 chemfp/3.5a1
#source=ChEBI_complete.sdf.gz
#date=2020-11-20T10:56:12
000000000000302180000003828101500ccda3501e      CHEBI:90
00000000000000000202008000890c400050b4821c      CHEBI:165
0000000000080000800002032012002c080e825708      CHEBI:598
00000000000000008200008480892dc00dc4a7d21e      CHEBI:776
000000000001000000004000400400002080200017      CHEBI:943
000000000000200080000002800002040c0482d608      CHEBI:1148
  ...


This fails with:

000000000008000102000481028381100081a1501e      CHEBI:26377
Traceback (most recent call last):
  ...
  File 
"/Users/dalke/venvs/py39-2020-3/lib/python3.9/site-packages/chemfp/types.py", 
line 845, in _compute_molecule_fingerprints
    yield id, fingerprinter(mol)
  File 
"/Users/dalke/venvs/py39-2020-3/lib/python3.9/site-packages/chemfp/cdk_types.py",
 line 226, in cdk_maccs_fingerprinter
    fp = fingerprinter.getBitFingerprint(mol)
java.lang.NullPointerException: java.lang.NullPointerException: Aromaticity 
model requires implicit hydrogen count is set.

because the current chemfp design expects that a generating a fingerprint would 
never fail.


The "cdk2fps" program supports all of the CDK fingerprint types which generate 
binary strings, except that JPype can't find the KlekotaRoth fingerprinter, and 
ShortestPath gives me the exception "java.lang.NullPointerException: atom has 
unset atom type" even if I read "c1ccccc1O" as SMILES input.

Once you have FPS files, you can search them with "simsearch". To learn more 
about the chemfp command-line programs, see 
https://chemfp.readthedocs.io/en/latest/using-tools.html .

Chemfp also has a toolkit I/O API, which is a wrapper layer around the 
underlying toolkit that gives a consistent way to read and write structure 
files, and string containing structures. Here's an example of converting phenol 
into "sdf" format:

>>> from chemfp import cdk_toolkit as T
>>> mol = T.parse_molecule("c1ccccc1O", "smi")
>>> print(T.create_string(mol, "sdf"))

  CDK     1120201035

  7  7  0  0  0  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  2  3  1  0  0  0  0
  3  4  2  0  0  0  0
  4  5  1  0  0  0  0
  5  6  2  0  0  0  0
  1  6  1  0  0  0  0
  6  7  1  0  0  0  0
M  END
$$$$

The CDK SDF writer supports many options, which can be specified in an optional 
"writer_args" dictionary. Chemfp support introspection for the different 
formats, so here's how to get the default writer arguments:

>>> T.get_format("sdf").get_default_writer_args()
{'WriteAromaticBondTypes': False, 'WriteMajorIsotopes': True, 
'writeProperties': True, 'WriteQueryFormatValencies': False, 'ProgramName': 
'CDK', 'ForceWriteAs2DCoordinates': False, 'WriteDefaultProperties': True, 
'writeV3000': False}

The "writeV3000" looks interesting, so I'll switch it to True:

>>> print(T.create_string(mol, "sdf", writer_args={"writeV3000": True}))

  CDK     1120201042

  0  0  0     0  0            999 V3000
M  V30 BEGIN CTAB
M  V30 COUNTS 7 7 0 0 0
M  V30 BEGIN ATOM
M  V30 1 C 0 0 0 0
M  V30 2 C 0 0 0 0
M  V30 3 C 0 0 0 0
M  V30 4 C 0 0 0 0
M  V30 5 C 0 0 0 0
M  V30 6 C 0 0 0 0
M  V30 7 O 0 0 0 0
M  V30 END ATOM
M  V30 BEGIN BOND
M  V30 1 2 1 2
M  V30 2 1 2 3
M  V30 3 2 3 4
M  V30 4 1 4 5
M  V30 5 2 5 6
M  V30 6 1 1 6
M  V30 7 1 6 7
M  V30 END BOND
M  V30 END CTAB
M  END
$$$$

Yup, that's a V3000 record.

Chemfp support introspection of the different fingerprinters. In chemfp, a 
"fingerprint family" is the class of possible fingerprint types, where a 
"fingerprint type" is a specific instantiation of that family. There's only one 
MACCS fingerprint type in the MACCS fingerprint family, while the "Daylight" 
family has a few possible parameters.

>>> import chemfp
>>> chemfp.get_fingerprint_families("cdk")
[FingerprintFamily(<CDK-AtomPairs2D/2.0>), 
FingerprintFamily(<CDK-Daylight/2.0>),
FingerprintFamily(<CDK-ECFP0/2.0>), FingerprintFamily(<CDK-ECFP2/2.0>),
FingerprintFamily(<CDK-ECFP4/2.0>), FingerprintFamily(<CDK-ECFP6/2.0>),
FingerprintFamily(<CDK-EState/2.0>), FingerprintFamily(<CDK-Extended/2.0>),
FingerprintFamily(<CDK-FCFP0/2.0>), FingerprintFamily(<CDK-FCFP2/2.0>),
FingerprintFamily(<CDK-FCFP4/2.0>), FingerprintFamily(<CDK-FCFP6/2.0>),
FingerprintFamily(<CDK-GraphOnly/2.0>), 
FingerprintFamily(<CDK-Hybridization/2.0>),
FingerprintFamily(<CDK-MACCS/2.0>), FingerprintFamily(<CDK-Pubchem/2.0>),
FingerprintFamily(<CDK-ShortestPath/2.0>), 
FingerprintFamily(<CDK-Substructure/2.0>)]

>>> family = chemfp.get_fingerprint_family("CDK-Daylight")
>>> family.get_defaults()
{'size': 1024, 'searchDepth': 7, 'pathLimit': 42000, 'hashPseudoAtoms': 0}

Fingerprint types can be created in one of several ways. Here's the most common:

>>> import chemfp
>>> fptype = chemfp.get_fingerprint_type("CDK-MACCS")

This has a number of methods, like:

>>> fptype.parse_id_and_molecule_fingerprint("c1ccccc1O phenol", "smi")
('phenol', 
b'\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x01@\x00D\x80\x10\x1e')

Or, if you have a molecule already, you can compute its fingerprint:

>>> from chemfp import cdk_toolkit as T
>>> mol = T.parse_molecule("c1ccccc1P", "smistring")
>>> fptype.compute_fingerprint(mol)
b'\x00\x00\x00\x10\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x00\x16'

The fingerprint type has a reference to relevant: 

>>> fptype.toolkit
<module 'chemfp.cdk_toolkit' from 
'/Users/dalke/venvs/py39-2020-3/lib/python3.9/site-packages/chemfp/cdk_toolkit.py'>


Which means if you want to convert an input SMILES file into an SD file, and 
include a tag based on the toolkit's MACCS generation code, then you can do:

import chemfp
# fptype = chemfp.get_fingerprint_type("RDKit-MACCS166") # for RDKit
# fptype = chemfp.get_fingerprint_type("OpenBabel-MACCS") # for Open Babel
# fptype = chemfp.get_fingerprint_type("CDK-MACCS") # for OEChem/OEGraphSim
fptype = chemfp.get_fingerprint_type("CDK-MACCS")
T = fptype.toolkit
with T.read_molecules("ZINC_20_structures.smi") as reader:
  with T.open_molecule_writer("ZINC_20_structures.sdf") as writer:
    for mol in reader:
      fp = fptype.compute_fingerprint(mol)
      T.add_tag(mol, T.name + "_MACCS", fp.hex())
      writer.write_molecule(mol)

If for some reason you wanted to use Open Babel's MACCS code then comment out 
the "CDK-MACCS" line and un-comment the "OpenBabel-MACCS" line, and it will 
work.

Here's the first output record:

% cat ZINC_20_structures.sdf
ZINC00000036
  CDK     1120201039

 11 11  0  0  1  0  0  0  0  0999 V2000
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  1  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 O   0  5  0  0  0  0  0  0  0  0  0  0
    0.0000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0
  1  2  2  0  0  0  0
  2  3  1  0  0  0  0
  3  4  2  0  0  0  0
  4  5  1  0  0  0  0
  5  6  2  0  0  0  0
  1  6  1  0  0  0  0
  4  7  1  0  0  0  0
  7  8  1  0  0  0  0
  8  9  2  0  0  0  0
  8 10  1  0  0  0  0
  7 11  1  0  0  0  0
M  CHG  1  10  -1
M  END
> <cdk_MACCS>
00000000000001008000000000000004000482521e

$$$$


Do try it out and give me friendly feedback!

Best regards,

                                Andrew
                                da...@dalkescientific.com




_______________________________________________
Cdk-user mailing list
Cdk-user@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/cdk-user

Reply via email to