Re: [Rdkit-discuss] MFP question about similar substructures and feature reduction

2021-09-29 Thread Natasha Gupta
Thank you so much! This is so clear and very helpful.

___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] MFP question about similar substructures and feature reduction

2021-09-29 Thread Rajarshi Guha
I'd be wary of using PCA on binary fingerprints, based on Martin and Cao
(2015).



-- 
Rajarshi Guha | http://blog.rguha.net | @rguha 


Re: [Rdkit-discuss] MFP question about similar substructures and feature reduction

2021-09-29 Thread Rafael L via Rdkit-discuss
Hello, your question prompted me to write a small notebook, which I hope
you may find useful:
https://github.com/rflameiro/projects/blob/main/comparing_fingerprint_bits.ipynb

In summary, bits that are active in both fingerprints usually correspond to
the same substructure, unless a bit collision occurs. You can verify this by
drawing the substructure that activates a given bit with the
function Draw.DrawMorganBit().
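
As a rough sketch of that check (not taken from the notebook linked above), the snippet below computes Morgan fingerprints for two molecules, lists the bits they share, and prints the atom environments behind each bit. The two SMILES strings, the radius of 2, and the helper name `morgan_bits` are assumptions chosen only for illustration.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def morgan_bits(smiles, radius=2, n_bits=2048):
    """Hypothetical helper: return (mol, bit_info) for a folded Morgan fingerprint."""
    mol = Chem.MolFromSmiles(smiles)
    bit_info = {}  # filled by RDKit: bit index -> tuple of (atom index, radius)
    AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits, bitInfo=bit_info)
    return mol, bit_info


# Two example molecules that share a benzene ring (arbitrary choices).
mol_a, info_a = morgan_bits("c1ccccc1O")  # phenol
mol_b, info_b = morgan_bits("c1ccccc1C")  # toluene

# Bits that are on in both fingerprints.
shared = sorted(set(info_a) & set(info_b))
for bit in shared:
    # Each entry lists the (center atom, radius) environments that set the bit.
    print(bit, info_a[bit], info_b[bit])

# To inspect a shared bit visually in a notebook, one could call, for example:
#   from rdkit.Chem import Draw
#   Draw.DrawMorganBit(mol_a, shared[0], info_a)
```

Drawing the same bit for both molecules and comparing the pictures is one way to tell a genuinely shared substructure from a collision.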

-- What happens if the 2048 bits or substructures predesignated in RDKit do
not contain a new substructure in a molecule we are evaluating?
If I understand correctly, you want to know what a fingerprint will look
like for a molecule that has no new substructures compared to a previously
calculated fingerprint. In that case, either the new fingerprint will be
identical (although this is more common with MACCS fingerprints, which use a
predetermined set of substructures), or the new molecule will have fewer
substructures than the previous one, and fewer bits will be active.
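
For context, Morgan fingerprints do not rely on a predesignated substructure list: each atom environment is hashed to a position in a fixed-length vector, so any molecule, including one with previously unseen substructures, yields a vector of the same length. A minimal sketch of stacking such vectors into a sample-by-bit matrix follows; the SMILES strings and the helper name `fp_row` are assumptions for illustration.

```python
import numpy as np
from rdkit import Chem
from rdkit.Chem import AllChem

N_BITS = 2048  # fingerprint length used in this thread


def fp_row(smiles, radius=2, n_bits=N_BITS):
    """Hypothetical helper: Morgan fingerprint as a 0/1 NumPy row."""
    mol = Chem.MolFromSmiles(smiles)
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
    row = np.zeros(n_bits, dtype=np.uint8)
    row[list(fp.GetOnBits())] = 1  # switch on the bits RDKit set
    return row


# Arbitrary example ligands; column j always means "bit j" for every row.
smiles = ["CCO", "c1ccccc1O", "CC(=O)Nc1ccc(O)cc1"]
X = np.vstack([fp_row(s) for s in smiles])

print(X.shape)        # one row per molecule, one column per bit
print(X.sum(axis=1))  # number of active bits per molecule
```

Column j corresponds to the same hash bucket for every molecule, which usually, but not always (because of collisions), means the same substructure.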

-- Any advice on how to reduce features and then use that reduced feature
list for new molecules after training a model would also be appreciated.
How would the model only extract the reduced bits for a new ligand if I
remove low variance bits from the training set for example?
To build models on fingerprints, you can start with the complete set of
2048 bits and compare the performance against fingerprints with fewer
bits (1024, 512...). A good starting point is:
https://www.moreisdifferent.com/2017/9/21/DIY-Drug-Discovery-using-molecular-fingerprints-and-machine-learning-for-solubility-prediction/
You should see a drop in performance as the bit size decreases, because bit
collisions become more likely.
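
One way to get a feel for the collision effect, sketched here under assumptions (the molecule, the radius of 2, and the extra size of 128 are arbitrary choices, not from the original message), is to count the active bits of the same molecule at several fingerprint lengths. Fewer active bits at a shorter length means that distinct environments ended up on the same bit.

```python
from rdkit import Chem
from rdkit.Chem import AllChem

# Arbitrary example molecule (aspirin); any ligand from the data set would do.
mol = Chem.MolFromSmiles("CC(=O)Oc1ccccc1C(=O)O")

on_bits = {}
for n_bits in (2048, 1024, 512, 128):
    fp = AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=n_bits)
    on_bits[n_bits] = fp.GetNumOnBits()  # number of bits set to 1
    print(n_bits, on_bits[n_bits])
```

For a single small molecule the counts may not change at all; the effect becomes more visible for larger molecules and across a whole data set.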
Alternatively, you could try reducing the dimensionality with a technique
such as PCA, keeping enough PCs to reach a reasonable percentage of explained
variance. It is easy to calculate PCs with scikit-learn. Then, to apply the
fitted PCA to new fingerprints, you only have to call .transform().
See:
https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/
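
The fit-once, transform-later pattern can be sketched as below. This is only an illustration under stated assumptions: the fingerprint matrices are synthetic random stand-ins (not real data), the variance threshold of 0.0 and the 90% explained-variance target are arbitrary, and combining the two steps in a scikit-learn Pipeline is one possible design, not something prescribed in this thread. Note also the caution raised elsewhere in this thread about using PCA on binary fingerprints.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for a (molecules x 2048 bits) training matrix.
X_train = (rng.random((200, 2048)) < 0.05).astype(np.float64)
X_train[:, :100] = 0.0  # force some constant (zero-variance) bits
# Synthetic stand-in for fingerprints of new ligands.
X_new = (rng.random((5, 2048)) < 0.05).astype(np.float64)

reducer = Pipeline([
    # Drop bits that never vary in the training set.
    ("low_variance", VarianceThreshold(threshold=0.0)),
    # Keep enough components to explain 90% of the variance.
    ("pca", PCA(n_components=0.9, svd_solver="full")),
])
reducer.fit(X_train)  # learn everything from the training set only

# Indices of the original bits that survived the variance filter.
kept_bits = reducer.named_steps["low_variance"].get_support(indices=True)

Z_train = reducer.transform(X_train)
Z_new = reducer.transform(X_new)  # same bits, same projection
print(len(kept_bits), Z_train.shape, Z_new.shape)
```

Because the selector remembers which columns it kept, new ligands are still fingerprinted with the full 2048 bits and the fitted object drops the same bits automatically. If only bit removal is wanted, the PCA step can be left out.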

On Mon, Sep 27, 2021 at 20:35, Natasha Gupta wrote:

> Hello,
>
> Apologies, this is a very basic question:
> If I am converting many ligands into Morgan fingerprints, could I
> theoretically stack the bit representations on top of each other to get the
> same features represented across ligands? For example, is the
> representation below correct?
>
> | sample | feature1 | feature2 | feature3 |
> |:-------|:--------:|:--------:|---------:|
> | 1      | bit 1    | bit 2    | bit 3    |
> | 2      | bit 1    | bit 2    | bit 3    |
> | 3      | bit 1    | bit 2    | bit 3    |
>
> So basically, is feature 1, 2, 3, etc. always one type of substructure no
> matter what the input molecule is? What happens if the 2048 bits or
> substructures predesignated in RDKit do not contain a new substructure in a
> molecule we are evaluating?
>
> Any advice on how to reduce features and then use that reduced feature
> list for new molecules after training a model would also be appreciated.
> How would the model extract only the reduced bits for a new ligand if I
> remove low-variance bits from the training set, for example?


-- 
Rafael da Fonseca Lameiro
https://orcid.org/-0003-4466-2682
PhD student - Grupo de Química Medicinal e Biológica (NEQUIMED)
Instituto de Química de São Carlos - Universidade de São Paulo - Brasil
Av. Trabalhador Sancarlense, 400 - CEP: 13566-590 - São Carlos/SP