Thank you so much! This is so clear and very helpful.

On Wednesday, September 29, 2021, Rafael L <rafael.lame...@alumni.usp.br> wrote:
> Hello, your question prompted me to write a small notebook, which I hope
> you may find useful:
> https://github.com/rflameiro/projects/blob/main/comparing_fingerprint_bits.ipynb
>
> In summary, bits that are active in both fingerprints usually correspond
> to the same substructure, unless a bit collision occurs. You can verify
> this by drawing the substructure that activates a given bit with the
> function Draw.DrawMorganBit().
>
> -- What happens if the 2048 bits or substructures predesignated in rdkit
> do not contain a new substructure in a molecule we are evaluating?
> If I understand correctly, you want to know what a fingerprint will look
> like for a molecule that has no new substructures compared to a
> previously calculated fingerprint. In this case, the new fingerprint will
> be the same (although this is more common with MACCS fingerprints, which
> use a predetermined set of substructures), or the new molecule will have
> fewer substructures than the previous one, and fewer bits will be active.
>
> -- Any advice on how to reduce features and then use that reduced feature
> list for new molecules after training a model would also be appreciated.
> How would the model only extract the reduced bits for a new ligand if I
> remove low variance bits from the training set for example?
> To build models on fingerprints, you can start with the complete set of
> 2048 bits and compare the performance with fingerprints containing fewer
> bits (1024, 512...). A good starting point is:
> https://www.moreisdifferent.com/2017/9/21/DIY-Drug-Discovery-using-molecular-fingerprints-and-machine-learning-for-solubility-prediction/
> You should see a drop in performance as the bit size decreases, as bit
> collisions become more likely.
> Alternatively, you could try reducing the dimensionality with a technique
> such as PCA, but use enough PCs to get a reasonable explained variance
> percentage.
> It is easy to calculate PCs with scikit-learn. Then, to apply them to new
> fingerprints, you only have to call .transform(). See:
> https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/
>
> On Mon, Sep 27, 2021 at 20:35, Natasha Gupta <ngupt...@gmail.com> wrote:
>
>> Hello,
>>
>> Apologies, this is a very basic question:
>> If I am converting many ligands into Morgan fingerprints, could I
>> theoretically stack the bit representations on top of each other to get the
>> same features represented across ligands? For example, is the
>> representation below correct?
>>
>> | sample | feature1 | feature2 | feature3 |
>> |:------ |:--------:|:--------:|---------:|
>> | 1      | bit 1    | bit 2    | bit 3    |
>> | 2      | bit 1    | bit 2    | bit 3    |
>> | 3      | bit 1    | bit 2    | bit 3    |
>>
>> So basically, is feature 1, 2, 3, etc. always one type of substructure, no
>> matter what the input molecule is? What happens if the 2048 bits or
>> substructures predesignated in rdkit do not contain a new substructure in a
>> molecule we are evaluating?
>>
>> Any advice on how to reduce features and then use that reduced feature
>> list for new molecules after training a model would also be appreciated.
>> How would the model only extract the reduced bits for a new ligand if I
>> remove low variance bits from the training set for example?
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
> --
> Rafael da Fonseca Lameiro
> https://orcid.org/0000-0003-4466-2682
> PhD Student - Grupo de Química Medicinal e Biológica (NEQUIMED)
> Instituto de Química de São Carlos - Universidade de São Paulo - Brazil
> Av. Trabalhador Sancarlense, 400 - CEP: 13566-590 - São Carlos/SP
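
To make the stacking question concrete, here is a minimal sketch, assuming RDKit and NumPy are installed. The helper name morgan_matrix, the three SMILES strings, and radius 2 are illustrative choices, not something taken from the thread. Each row is one molecule and each column is one bit position. Morgan bits are not a predefined substructure list: each atom environment is hashed into one of the 2048 positions, so a column usually means the same substructure across molecules, but two different substructures can occasionally land on the same bit (a collision).

```python
import numpy as np
from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem


def morgan_matrix(smiles_list, radius=2, n_bits=2048):
    """Stack one Morgan bit vector per molecule into an (n_mols, n_bits) array."""
    rows = []
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            raise ValueError(f"could not parse SMILES: {smi}")
        fp = AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits)
        arr = np.zeros((n_bits,), dtype=np.uint8)
        DataStructs.ConvertToNumpyArray(fp, arr)  # fills arr with 0/1 values
        rows.append(arr)
    return np.vstack(rows)


# Illustrative molecules (assumption): ethanol, ethylamine, phenol.
smiles = ["CCO", "CCN", "c1ccccc1O"]
X = morgan_matrix(smiles)
print(X.shape)  # one row per molecule, one column per bit position

# Bit positions that are active in both ethanol and ethylamine.
shared = np.flatnonzero(X[0] & X[1])
print("bits shared by CCO and CCN:", shared)

# bitInfo records which atom environments switched each bit on.
mol = Chem.MolFromSmiles("CCO")
info = {}
AllChem.GetMorganFingerprintAsBitVect(mol, 2, nBits=2048, bitInfo=info)
for bit, environments in sorted(info.items()):
    print(bit, environments)  # tuples of (center atom index, radius)
```

The bitInfo dictionary is also what Draw.DrawMorganBit() takes, together with the molecule and a bit index, to draw the substructure behind a bit, which is the check suggested in the reply above.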
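
On reusing a reduced feature list for new ligands: the usual pattern is to fit the reduction once on the training matrix and then call .transform() on anything new, so the same columns (or the same PCA projection) are applied. Below is a minimal sketch, assuming NumPy and scikit-learn are installed. The random 64-bit matrix is only a stand-in for real stacked fingerprints, and the threshold of 0.0 and the 90% explained-variance target are arbitrary illustrative values.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Stand-in for stacked fingerprints (assumption): 50 training molecules x 64 bits.
X_train = (rng.random((50, 64)) < 0.2).astype(np.uint8)
X_train[:, :8] = 0  # bits never set in the training set: zero variance
X_new = (rng.random((3, 64)) < 0.2).astype(np.uint8)  # "new ligands"

# Option 1: drop low-variance bits. fit() learns which columns to keep.
selector = VarianceThreshold(threshold=0.0)
X_train_sel = selector.fit_transform(X_train)
kept_bits = selector.get_support(indices=True)  # original bit indices kept
X_new_sel = selector.transform(X_new)           # same columns, same order

# Option 2: PCA. Keep enough components to explain 90% of the variance.
pca = PCA(n_components=0.90, svd_solver="full")
Z_train = pca.fit_transform(X_train)
Z_new = pca.transform(X_new)  # projected with the training-set components

print(X_train_sel.shape, X_new_sel.shape)
print(Z_train.shape, Z_new.shape)
```

Saving the fitted selector or pca object alongside the model (or just the kept_bits indices) is enough to process new molecules later: compute the full-length fingerprint first, then apply the stored transform.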