Re: [Rdkit-discuss] MFP question about similar substructures and feature reduction

Rafael L via Rdkit-discuss Wed, 29 Sep 2021 12:34:23 -0700

Hello, your question prompted me to write a small notebook, which I hope
you may find useful:
https://github.com/rflameiro/projects/blob/main/comparing_fingerprint_bits.ipynb


In summary, bits that are active in both fingerprints usually correspond to
the same substructure, unless a bit collision occurs (two different
substructures mapped to the same bit). You can verify that by drawing the
substructure that activates a given bit with the Draw.DrawMorganBit()
function.
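
As a rough sketch of that check (not taken from the notebook), the snippet below builds Morgan fingerprints for two small molecules and lists the bits they share together with the atom environments that set them. The example molecules (ethanol and 1-propanol), the radius of 2, and the helper name morgan_bits are my own choices for illustration. I am assuming AllChem.GetMorganFingerprintAsBitVect is available in your RDKit version; recent releases may print a deprecation warning for it.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def morgan_bits(smiles, radius=2, n_bits=2048):
    """Return (mol, fingerprint, bit_info) for a SMILES string (illustrative helper)."""
    mol = Chem.MolFromSmiles(smiles)
    bit_info = {}  # filled in by RDKit: bit index -> tuple of (atom index, radius)
    fp = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius, nBits=n_bits, bitInfo=bit_info
    )
    return mol, fp, bit_info


mol_a, fp_a, info_a = morgan_bits("CCO")   # ethanol
mol_b, fp_b, info_b = morgan_bits("CCCO")  # 1-propanol

# Bits that are on in both fingerprints
shared = sorted(set(fp_a.GetOnBits()) & set(fp_b.GetOnBits()))

for bit in shared:
    # Each entry is (center atom index, radius) of an environment that set the bit
    print(bit, info_a[bit], info_b[bit])

# In a notebook, a shared bit can then be drawn for each molecule, e.g.:
# from rdkit.Chem import Draw
# Draw.DrawMorganBit(mol_a, shared[0], info_a)
```

If the drawings for the same bit look different in the two molecules, that bit is most likely a collision.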

-- What happens if the 2048 bits or substructures predesignated in rdkit do
not contain a new substructure in a molecule we are evaluating?
If I understand correctly, you want to know what a fingerprint will look
like for a molecule that has no new substructures compared to a
previously calculated fingerprint. In this case, either the new fingerprint
will be the same (although this is more common with MACCS fingerprints,
which use a predetermined set of substructures), or the new molecule will
have fewer substructures than the previous one, and fewer bits will be
active. Note that Morgan bits are not predesignated: each atom environment
is hashed to a bit position, so a substructure that was never seen before
still sets one of the 2048 bits.

-- Any advice on how to reduce features and then use that reduced feature
list for new molecules after training a model would also be appreciated.
How would the model only extract the reduced bits for a new ligand if I
remove low variance bits from the training set for example?
To build models on fingerprints, you can start with the complete set of
2048 bits and compare the performance against shorter fingerprints
(1024, 512, ... bits). A good starting point is:
https://www.moreisdifferent.com/2017/9/21/DIY-Drug-Discovery-using-molecular-fingerprints-and-machine-learning-for-solubility-prediction/
You should see a drop in performance as the bit size decreases, because bit
collisions become more likely.
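
To make the collision argument concrete, here is a toy sketch in plain Python. It assumes that a fingerprint is folded by taking each hashed identifier modulo the number of bits, and it uses random 32-bit integers as stand-ins for real atom-environment identifiers (these are made-up values, not RDKit output). The function name fold is also just for illustration.

```python
import random


def fold(identifiers, n_bits):
    """Map integer identifiers onto bit positions by taking them modulo n_bits."""
    return {ident % n_bits for ident in identifiers}


random.seed(0)
# Hypothetical identifiers, e.g. all environments pooled over a set of molecules
identifiers = {random.getrandbits(32) for _ in range(300)}

collisions = {}
for n_bits in (2048, 1024, 512, 256):
    on_bits = fold(identifiers, n_bits)
    # Every identifier that did not get its own bit was lost to a collision
    collisions[n_bits] = len(identifiers) - len(on_bits)
    print(n_bits, "bits:", len(on_bits), "on,", collisions[n_bits], "collisions")
```

Because each size here divides the previous one, the number of collisions can only stay the same or grow as the fingerprint gets shorter.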
Alternatively, you could reduce the dimensionality with a technique such as
PCA, keeping enough PCs to reach a reasonable explained variance. PCs are
easy to calculate with scikit-learn: fit the PCA on the training
fingerprints, then call .transform() on new fingerprints to apply it.
See:
https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/
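
For the low-variance question specifically, a minimal sketch of one way to do it with scikit-learn is shown below: the reduction is fitted on the training fingerprints only, and the fitted object is then reused on new molecules, so the same bits are dropped every time. The data are random placeholders standing in for real fingerprint matrices, and the 0.0 variance threshold and the 90% explained-variance target are arbitrary values I picked for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)

# Placeholder data: rows are molecules, columns are fingerprint bits
X_train = rng.integers(0, 2, size=(100, 2048))
X_train[:, :500] = 0  # make some bits constant so the filter has something to remove
X_new = rng.integers(0, 2, size=(5, 2048))

reducer = make_pipeline(
    VarianceThreshold(threshold=0.0),           # drop bits that are constant in training
    PCA(n_components=0.9, svd_solver="full"),   # keep enough PCs for 90% of the variance
)

# Fit on the training set only, then reuse the fitted steps on new molecules
reducer.fit(X_train)
X_train_reduced = reducer.transform(X_train)
X_new_reduced = reducer.transform(X_new)

# Indices of the original bits that survived the variance filter
kept_bits = reducer.named_steps["variancethreshold"].get_support(indices=True)
print(len(kept_bits), "bits kept,", X_new_reduced.shape[1], "components")
```

If you only want the bit filter and no PCA, you can use the VarianceThreshold step on its own and call its .transform() on the new fingerprints in the same way.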

On Mon, Sep 27, 2021 at 20:35, Natasha Gupta <ngupt...@gmail.com>
wrote:

> Hello,
>
> Apologies, this is a very basic question:
> If I am converting many ligands into Morgan fingerprints, could I
> theoretically stack the bit representations on top of each other to get the
> same features represented across ligands? For example, is the
> representation below correct?
>
> | sample | feature1 | feature2 | feature3 |
> |:----   |:--------:|:--------:|---------:|
> | 1      | bit 1    | bit 2    | bit 3    |
> | 2      | bit 1    | bit 2    | bit 3    |
> | 3      | bit 1    | bit 2    | bit 3    |
>
> So basically is feature 1, 2, 3 etc always one type of substructure no
> matter what the input molecule is? What happens if the 2048 bits or
> substructures predesignated in rdkit do not contain a new substructure in a
> molecule we are evaluating?
>
> Any advice on how to reduce features and then use that reduced feature
> list for new molecules after training a model would also be appreciated.
> How would the model only extract the reduced bits for a new ligand if I
> remove low variance bits from the training set for example?
>
> _______________________________________________
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>


-- 
Rafael da Fonseca Lameiro
https://orcid.org/0000-0003-4466-2682
PhD Student - Grupo de Química Medicinal e Biológica (NEQUIMED)
Instituto de Química de São Carlos - Universidade de São Paulo - Brasil
Av. Trabalhador Sancarlense, 400 - CEP: 13566-590 - São Carlos/SP

