Re: [Rdkit-discuss] Request for Assistance with MACCS 166 Fingerprint Calculation for 3D QSAR Study

Andrew Dalke Tue, 30 Apr 2024 14:12:34 -0700

Hi Ariadna,

  In general the MACCS keys are not that good for comparing similarity. They 
still exist for historical reasons. Back in the 1970s the company Molecular 
Design Limited developed a program called "Molecular Access System" (MACCS) for 
structure registration, substructure search, and the like.


Substructure search is slow, so MACCS included a set of keys which acted as 
fast filters - if the query contained a key but the database entry did not, 
then the query could not match that entry.
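
As a rough illustration of that screening idea (a sketch only, not how MACCS 
itself was implemented; the key names and the function name are made up for 
this example), each structure's keys can be held as a bitmask, and an entry is 
ruled out as soon as the query sets a key that the entry lacks:

```python
def may_match(query_keys: int, entry_keys: int) -> bool:
    """Return False when the query sets a key bit that the entry lacks."""
    return (query_keys & ~entry_keys) == 0


# Hypothetical keys, chosen only for this illustration.
HAS_RING, HAS_NITROGEN, HAS_CARBONYL = 1 << 0, 1 << 1, 1 << 2

query = HAS_RING | HAS_NITROGEN
database = {
    "entry-1": HAS_RING | HAS_NITROGEN | HAS_CARBONYL,
    "entry-2": HAS_RING | HAS_CARBONYL,  # no nitrogen key, so it is screened out
}

# Only the survivors need the slow, full substructure search.
candidates = [name for name, keys in database.items() if may_match(query, keys)]
```

Passing the screen does not prove a match. It only means the entry cannot be 
rejected cheaply.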

In the 1980s when fingerprint similarity search first became popular - this is 
before the term "fingerprint" was even coined - people used the MACCS keys 
because they were already computed and sitting there, on the computer system 
they were already using.

Over time people developed other types of fingerprints, and different ways to 
compare them, and a more complete understanding of how they are coupled to the 
types of system being studied.

For example, in "Comparing structural fingerprints using a literature-based 
similarity benchmark" by O'Boyle and Sayle, "Extended-connectivity fingerprints 
of diameter 4 and 6 are among the best performing fingerprints when ranking 
diverse structures by similarity, as is the topological torsion fingerprint. 
However, when ranking very close analogues, the atom pair fingerprint 
outperforms the others tested."

They found the MACCS fingerprints to be one of the worst performers, which you 
might expect now that you know the happenstance which made them popular.

Since you are doing 3D QSAR, you should familiarize yourself with the 
fingerprints used in that area. I have no experience with 3D QSAR and cannot 
provide advice on what is appropriate. 

The first paper I found using Google Scholar to search for "3d qsar 
fingerprints" is "Docking, Interaction Fingerprint, and Three-Dimensional 
Quantitative Structure–Activity Relationship (3D-QSAR) of Sigma1 Receptor 
Ligands, Analogs of the Neuroprotective Agent RC-33" at 
https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6637851/ which uses Interaction 
fingerprints.

The second is "Novel TOPP descriptors in 3D-QSAR analysis of apoptosis inducing 
4-aryl-4H-chromenes: Comparison versus other 2D- and 3D-descriptors" at 
https://www.sciencedirect.com/science/article/pii/S0968089607005834 which I 
mention because it summarizes 7 different descriptor-based approaches, and 
places the MACCS keys in last place, far below the second worst ("TOPP > GRIND 
> BCI 4096 = ECFP > FCFP > GRID-GOLPE ≫ DRAGON ⋙ MDL 166").

No doubt there are many others for you to read through and try out.


> # Generate fingerprint descriptor database
> fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2) for m in mols]

What I can suggest is you try my chemfp package, specifically version 4.2b1, 
which I just released (bear in mind that it is beta!).

You can install it with:

   python -m pip install chemfp==4.2b1 -i https://chemfp.com/packages/

To generate Morgan fingerprints of radius 2, I suggest you compute them once 
and store them in a file, like this command-line example:

  rdkit2fps --morgan2 dataset.smi -o dataset.fps

(use "--maccs" to generate MACCS keys, "--pair" for atom pairs; and use 
"--help" to see what other options are available.)

To "Calculate pairwise Tanimoto similarity between fingerprints" as a distance, 
you can use another command-line tool to generate the matrix as a NumPy "npy" 
file, like this:

  chemfp simarray dataset.fps --as-distance -o dataset.npy
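
For reference, a small sketch of what each matrix entry represents, assuming 
the distance is 1 minus the Tanimoto similarity. The helper name and the toy 
fingerprints (sets of on-bit positions) are illustrative only and are not part 
of chemfp:

```python
def tanimoto_distance(a: set, b: set) -> float:
    """1 - |a & b| / |a | b| for fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        # Convention for this sketch only: two empty fingerprints are
        # treated as identical. Libraries may choose differently.
        return 0.0
    return 1.0 - len(a & b) / union


# Toy fingerprints, made up for the example.
fps = {"a": {1, 2, 3}, "b": {2, 3, 4}, "c": {7}}

# Full square matrix, in the same order as the identifiers.
ids = list(fps)
matrix = [[tanimoto_distance(fps[x], fps[y]) for y in ids] for x in ids]
```

Here "a" and "b" share 2 of 4 distinct bits, so their distance is 0.5, while 
"a" and "c" share none, so their distance is 1.0.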

To load this in Python:

  import numpy as np
  dists = np.load("dataset.npy")

If you also need the identifiers:

  with open("dataset.npy", "rb") as f:
    dists = np.load(f)
    metadata = np.load(f)
    ids = np.load(f)

This should make it easier to iterate over the different clustering methods 
available, since you only generate the fingerprints and distance matrix once.
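
As one way to use such a matrix, here is a minimal pure-Python sketch of a 
Butina-style sphere-exclusion clustering over a precomputed square distance 
matrix. It is not chemfp or RDKit code; the function name, the 0.3 threshold, 
and the toy matrix are assumptions for illustration. It only indexes the 
matrix as dists[i][j], so a nested list or a NumPy array should both work:

```python
def butina_like_clusters(dists, threshold):
    """Greedy sphere-exclusion clustering; returns lists of row indices."""
    n = len(dists)
    # Neighbors of each item within the distance threshold (excluding itself).
    neighbors = [
        {j for j in range(n) if j != i and dists[i][j] <= threshold}
        for i in range(n)
    ]
    unassigned = set(range(n))
    clusters = []
    while unassigned:
        # Pick the unassigned item with the most unassigned neighbors;
        # ties go to the lowest index so the result is deterministic.
        center = min(
            unassigned, key=lambda i: (-len(neighbors[i] & unassigned), i)
        )
        members = [center] + sorted(neighbors[center] & unassigned)
        clusters.append(members)
        unassigned -= set(members)
    return clusters


# Toy distance matrix: items 0-2 are close to each other, item 3 is far away.
toy = [
    [0.0, 0.1, 0.2, 0.9],
    [0.1, 0.0, 0.25, 0.8],
    [0.2, 0.25, 0.0, 0.85],
    [0.9, 0.8, 0.85, 0.0],
]
clusters = butina_like_clusters(toy, threshold=0.3)
```

Each cluster lists its center first. Trying a different threshold, or a 
different clustering method altogether, reuses the same matrix.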

If you decide to use interaction fingerprints, or some other fingerprint type 
that is not in the RDKit, you can still generate the fingerprints in FPS format 
(a simple text format) and use chemfp to generate your matrix for you, either 
on the command-line or through its Python API.
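
A minimal sketch of writing such a file by hand, assuming the FPS layout is a 
"#FPS1" signature line, optional "#key=value" header lines, then one record 
per line holding the hex-encoded fingerprint, a tab, and the identifier, with 
bit 0 stored as the lowest bit of the first byte. Treat those details as 
assumptions and check the chemfp documentation for the authoritative format. 
The function names and the toy records are made up for the example:

```python
def fps_record(on_bits, num_bits, identifier):
    """Format one record: hex-encoded fingerprint, a tab, then the id."""
    data = bytearray((num_bits + 7) // 8)
    for bit in on_bits:
        if not 0 <= bit < num_bits:
            raise ValueError(f"bit {bit} is out of range for {num_bits} bits")
        # Assumed bit order: bit 0 is the lowest bit of the first byte.
        data[bit // 8] |= 1 << (bit % 8)
    return f"{data.hex()}\t{identifier}\n"


def fps_text(records, num_bits):
    """Build the file content from (identifier, on_bits) pairs."""
    lines = ["#FPS1\n", f"#num_bits={num_bits}\n"]
    lines += [fps_record(bits, num_bits, id_) for id_, bits in records]
    return "".join(lines)


# Toy 16-bit fingerprints standing in for, e.g., interaction fingerprints.
text = fps_text([("mol1", [0, 9]), ("mol2", [15])], num_bits=16)
```

Writing text to a file such as "custom.fps" (a hypothetical name) would give 
the kind of input the tools above are meant to read.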

> However, I'm not satisfied with the results and would like to experiment with 
> MACCS Keys to see if they yield better clustering outcomes. Does anyone know 
> how to cluster compounds using MACCS fingerprints? Any insights on the best 
> approach to calculate similarities and cluster using these fingerprints would 
> be highly appreciated.

In case I was not clear enough before, MACCS keys make poor fingerprints. There 
is no reason to expect they will yield better clustering outcomes, and multiple 
papers suggest they will yield worse ones.

Best regards,

                                Andrew
                                da...@dalkescientific.com




_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
