I'd be wary of using PCA on binary fingerprints, based on Martin and Cao
(2015 <https://dx.doi.org/10.1007/s10822-014-9819-y>).

On Wed, Sep 29, 2021 at 3:34 PM Rafael L via Rdkit-discuss <
rdkit-discuss@lists.sourceforge.net> wrote:

> Hello, your question prompted me to write a small notebook, which I hope
> you may find useful:
>
> https://github.com/rflameiro/projects/blob/main/comparing_fingerprint_bits.ipynb
>
> In summary, bits that are active in both fingerprints usually correspond
> to the same substructure, unless bit collision happens. You can verify that
> by drawing the substructure that activates a certain bit using the
> function Draw.DrawMorganBit().
>
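A minimal sketch of that check, assuming RDKit is installed. The two SMILES strings, the radius of 2 and the helper name `morgan_with_info` are illustrative choices, not something taken from this thread. It computes two Morgan bit vectors, intersects their on-bits, and prints the atom environments (atom index, radius) that RDKit records in `bitInfo` for each shared bit, which is the same information Draw.DrawMorganBit() renders.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def morgan_with_info(smiles, radius=2, n_bits=2048):
    """Return (mol, fingerprint, bit_info) for one SMILES string."""
    mol = Chem.MolFromSmiles(smiles)
    bit_info = {}  # filled in by RDKit: bit -> ((atom_idx, radius), ...)
    fp = AllChem.GetMorganFingerprintAsBitVect(
        mol, radius, nBits=n_bits, bitInfo=bit_info
    )
    return mol, fp, bit_info


mol_a, fp_a, info_a = morgan_with_info("Oc1ccccc1")  # phenol (example only)
mol_b, fp_b, info_b = morgan_with_info("Cc1ccccc1")  # toluene (example only)

# Bits that are on in both fingerprints
shared = sorted(set(fp_a.GetOnBits()) & set(fp_b.GetOnBits()))
for bit in shared:
    print(bit, info_a[bit], info_b[bit])

# To inspect a bit visually (needs RDKit's drawing support):
# from rdkit.Chem import Draw
# img = Draw.DrawMorganBit(mol_a, shared[0], info_a)
```

If the environments drawn for a shared bit differ between the two molecules, that bit is a collision rather than a common substructure.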
> -- What happens if the 2048 bits or substructures predesignated in RDKit
> do not contain a new substructure in a molecule we are evaluating?
> If I understand correctly, you want to know what a fingerprint will look
> like for a molecule that has no new substructures compared to a
> previously calculated fingerprint. In this case, the new fingerprint will
> either be the same (although this is more common when working with MACCS
> fingerprints, which use a predetermined set of substructures), or the
> new molecule will have fewer substructures than the previous one, and fewer
> bits will be active.
>
> -- Any advice on how to reduce features and then use that reduced feature
> list for new molecules after training a model would also be appreciated.
> How would the model only extract the reduced bits for a new ligand if I
> remove low variance bits from the training set for example?
> To build models on fingerprints, you can start with the complete set of
> 2048 bits and compare the performance with fingerprints containing fewer
> bits (1024, 512...). A good starting point is:
>
> https://www.moreisdifferent.com/2017/9/21/DIY-Drug-Discovery-using-molecular-fingerprints-and-machine-learning-for-solubility-prediction/
> You should see a drop in performance as the bit size decreases, because bit
> collisions become more likely.
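The effect of fingerprint length on collisions can be illustrated with a toy model in plain Python. This is only a sketch under the assumption that each substructure is a large integer identifier folded into the fingerprint by taking it modulo the number of bits; the random identifiers and the function names are made up for illustration and this is not RDKit's actual hashing code.

```python
import random


def fold(feature_ids, n_bits):
    """Fold arbitrary integer feature IDs into bit positions by modulo."""
    return {fid % n_bits for fid in feature_ids}


def count_collisions(feature_ids, n_bits):
    """Number of features lost because they landed on an already occupied bit."""
    return len(set(feature_ids)) - len(fold(feature_ids, n_bits))


rng = random.Random(0)
# 60 hypothetical substructure identifiers (stand-ins for hashed environments)
features = rng.sample(range(2**32), 60)

for n_bits in (2048, 1024, 512, 128):
    print(n_bits, count_collisions(features, n_bits))
```

Because each shorter length here divides the longer ones, two features that collide at 2048 bits also collide at 1024, 512 and 128, so the collision count can only stay the same or grow as the fingerprint gets shorter.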
> Alternatively, you could try reducing the dimensionality with a
> technique such as PCA, using enough PCs to reach a reasonable explained
> variance percentage. It is easy to calculate PCs with scikit-learn. Then,
> to apply it to new fingerprints, you only have to call .transform().
> See:
> https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/
>
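A minimal sketch of the fit-then-transform pattern with scikit-learn, which also covers the question about removing low variance bits. The data here is a random binary matrix standing in for real fingerprints, and the variance threshold of 0.05 and the 90% explained-variance target are arbitrary assumptions. The PCA step is included only to show how .transform() is reused on new molecules; the caveat about PCA on binary fingerprints at the top of this thread still applies.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Stand-in data: rows are molecules, columns are fingerprint bits.
# Each bit gets its own "on" probability so that some bits are nearly constant.
n_bits = 256
p_on = rng.uniform(0.0, 0.5, size=n_bits)
X_train = (rng.random((200, n_bits)) < p_on).astype(int)
X_new = (rng.random((5, n_bits)) < p_on).astype(int)

# 1) Learn which bits to keep from the training set only
selector = VarianceThreshold(threshold=0.05)
X_train_sel = selector.fit_transform(X_train)
kept_bits = selector.get_support(indices=True)

# 2) New molecules get exactly the same columns
X_new_sel = selector.transform(X_new)

# 3) Optional PCA: a float n_components keeps enough PCs for that variance share
pca = PCA(n_components=0.9, svd_solver="full")
Z_train = pca.fit_transform(X_train_sel)
Z_new = pca.transform(X_new_sel)

print(len(kept_bits), "bits kept;", pca.n_components_, "components")
```

The fitted `selector` and `pca` objects carry everything needed for new ligands, so they should be saved alongside the trained model and never refitted on the new data.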
> On Mon, Sep 27, 2021 at 8:35 PM, Natasha Gupta <ngupt...@gmail.com>
> wrote:
>
>> Hello,
>>
>> Apologies, this is a very basic question:
>> If I am converting many ligands into Morgan fingerprints, could I
>> theoretically stack the bit representations on top of each other to get the
>> same features represented across ligands? For example, is the below
>> representation correct?
>>
>> | sample | feature1 | feature2 | feature3 |
>> |:----   |:--------:|:--------:|---------:|
>> | 1      | bit 1    | bit 2    | bit 3    |
>> | 2      | bit 1    | bit 2    | bit 3    |
>> | 3      | bit 1    | bit 2    | bit 3    |
>>
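The stacking shown in the table can be sketched as follows, assuming RDKit and NumPy are installed. The three SMILES strings, the radius of 2 and the helper name `fingerprint_row` are arbitrary examples. Each molecule becomes one row and each column is the same bit position for every molecule.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

N_BITS = 2048


def fingerprint_row(smiles, radius=2, n_bits=N_BITS):
    """Return one molecule's Morgan fingerprint as a 0/1 NumPy row."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    row = np.zeros(n_bits, dtype=np.uint8)
    row[list(fp.GetOnBits())] = 1  # switch on the bits set in the fingerprint
    return row


smiles_list = ["CCO", "Oc1ccccc1", "CC(=O)O"]  # example ligands only
X = np.vstack([fingerprint_row(smi) for smi in smiles_list])
print(X.shape)
```

Column j of `X` is bit j for every ligand, so the matrix can be passed directly to a model as long as all fingerprints use the same radius and length.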
>> So basically, are features 1, 2, 3, etc. always the same type of substructure,
>> no matter what the input molecule is? What happens if the 2048 bits or
>> substructures predesignated in RDKit do not contain a new substructure in a
>> molecule we are evaluating?
>>
>> Any advice on how to reduce features and then use that reduced feature
>> list for new molecules after training a model would also be appreciated.
>> How would the model only extract the reduced bits for a new ligand if I
>> remove low variance bits from the training set for example?
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
>
> --
> Rafael da Fonseca Lameiro
> https://orcid.org/0000-0003-4466-2682
> PhD Student - Grupo de Química Medicinal e Biológica (NEQUIMED)
> Instituto de Química de São Carlos - Universidade de São Paulo - Brasil
> Av. Trabalhador Sancarlense, 400 - CEP: 13566-590 - São Carlos/SP
>


-- 
Rajarshi Guha | http://blog.rguha.net | @rguha <https://twitter.com/rguha>