Hello, your question prompted me to write a small notebook, which I hope you may find useful: https://github.com/rflameiro/projects/blob/main/comparing_fingerprint_bits.ipynb
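The kind of comparison it makes can be sketched roughly as follows. This is a minimal illustration and not code copied from the notebook: the two SMILES strings (ethanol and 1-propanol), the helper name morgan_bits and the radius of 2 are arbitrary choices, and newer RDKit releases may print a deprecation notice for GetMorganFingerprintAsBitVect.

```python
from rdkit import Chem
from rdkit.Chem import AllChem


def morgan_bits(smiles, radius=2, n_bits=2048):
    """Return the molecule and a dict {bit id: ((atom index, radius), ...)}."""
    mol = Chem.MolFromSmiles(smiles)
    bit_info = {}
    # bitInfo is filled with the atom environments that set each active bit
    AllChem.GetMorganFingerprintAsBitVect(mol, radius, nBits=n_bits, bitInfo=bit_info)
    return mol, bit_info


# Arbitrary example molecules: ethanol and 1-propanol
mol_a, info_a = morgan_bits("CCO")
mol_b, info_b = morgan_bits("CCCO")

# Bits that are active in both fingerprints
shared = sorted(set(info_a) & set(info_b))

for bit in shared:
    # Each entry lists (central atom index, radius) for the environments behind the bit
    print(bit, info_a[bit], info_b[bit])

# In a notebook, Draw.DrawMorganBit(mol_a, bit, info_a) (from rdkit.Chem import Draw)
# renders the fragment behind a bit, so the two molecules can be compared visually.
```

For two small, similar molecules like these, the shared bits should show the same atom environment on both sides; a collision would show up as the same bit drawn as two different fragments.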

In summary, bits that are active in both fingerprints usually correspond to the same substructure, unless a bit collision happens. You can verify this by drawing the substructure that activates a given bit with the function Draw.DrawMorganBit().

> What happens if the 2048 bits or substructures predesignated in rdkit do not
> contain a new substructure in a molecule we are evaluating?

If I understand correctly, you want to know what a fingerprint will look like for a molecule that has no new substructures compared to a previously calculated fingerprint. In that case, either the new fingerprint will be the same (although this is more common with MACCS fingerprints, which use a predetermined set of substructures), or the new molecule will have fewer substructures than the previous one and fewer bits will be active. Note that Morgan fingerprints do not use a predetermined substructure list: each atom environment is hashed to a bit position, so a substructure that has not appeared before still sets one of the 2048 bits, possibly one already used by a different substructure (a bit collision).

> Any advice on how to reduce features and then use that reduced feature list
> for new molecules after training a model would also be appreciated. How would
> the model only extract the reduced bits for a new ligand if I remove low
> variance bits from the training set for example?

To build models on fingerprints, you can start with the complete set of 2048 bits and compare the performance against fingerprints with fewer bits (1024, 512...). A good starting point is:
https://www.moreisdifferent.com/2017/9/21/DIY-Drug-Discovery-using-molecular-fingerprints-and-machine-learning-for-solubility-prediction/

You will probably see performance drop as the number of bits decreases, because bit collisions become more likely.

Alternatively, you could reduce the dimensionality with a technique such as PCA, keeping enough PCs to reach a reasonable percentage of explained variance. PCs are easy to calculate with scikit-learn: fit the PCA on the training fingerprints, and to apply it to new fingerprints you only have to call .transform(). See:
https://stackabuse.com/implementing-pca-in-python-with-scikit-learn/

On Mon, Sep 27, 2021 at 20:35, Natasha Gupta <ngupt...@gmail.com> wrote:

> Hello,
>
> Apologies. this is a very basic question:
> If I am converting many ligands into morgan fingerprints, could I
> theoretically stack the bit representations on top of each other to get the
> same features represented across ligands? For example is the below
> representation correct?
>
> | sample | feature1 | feature2 | feature3 |
> |:---- |:--------:|:--------:|---------:|
> | 1 | bit 1 | bit 2 | bit 3 |
> | 2 | bit 1 | bit 2 | bit 3 |
> | 3 | bit 1 | bit 2 | bit 3 |
>
> So basically is feature 1, 2, 3 etc always one type of substructure no
> matter what the input molecule is? What happens if the 2048 bits or
> substructures predesignated in rdkit do not contain a new substructure in a
> molecule we are evaluating?
>
> Any advice on how to reduce features and then use that reduced feature
> list for new molecules after training a model would also be appreciated.
> How would the model only extract the reduced bits for a new ligand if I
> remove low variance bits from the training set for example?

--
Rafael da Fonseca Lameiro
https://orcid.org/0000-0003-4466-2682
PhD student - Grupo de Química Medicinal e Biológica (NEQUIMED)
Instituto de Química de São Carlos - Universidade de São Paulo - Brazil
Av. Trabalhador Sancarlense, 400 - CEP: 13566-590 - São Carlos/SP
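
P.S. On the question of applying a reduced feature list to new molecules, here is a minimal sketch, assuming NumPy and scikit-learn are installed. The random 0/1 matrices are only stand-ins for real fingerprints (64 bits instead of 2048 to keep it small), and the variable names, the zero-variance threshold and the 90% explained-variance target are arbitrary choices for illustration. The point is that whatever is fitted on the training set is reused unchanged on new ligands through .transform(), so they end up with exactly the same columns.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_selection import VarianceThreshold

rng = np.random.default_rng(0)

# Stand-in data: rows are molecules, columns are fingerprint bits
X_train = (rng.random((50, 64)) < 0.3).astype(float)
X_train[:, :8] = 0.0  # pretend the first 8 bits are never set in the training set
X_new = (rng.random((5, 64)) < 0.3).astype(float)

# 1) Drop constant (zero-variance) bits; the kept columns are learned from training data only
selector = VarianceThreshold(threshold=0.0)
X_train_sel = selector.fit_transform(X_train)
kept_bits = selector.get_support(indices=True)  # original bit positions that survived

# New molecules get the same columns, whatever their own bits look like
X_new_sel = selector.transform(X_new)

# 2) Optionally compress further with PCA, keeping enough PCs for 90% explained variance
pca = PCA(n_components=0.9, svd_solver="full")
X_train_pcs = pca.fit_transform(X_train_sel)
X_new_pcs = pca.transform(X_new_sel)

print(X_train_sel.shape, X_new_sel.shape)
print(X_train_pcs.shape, X_new_pcs.shape)
```

Saving kept_bits (or the fitted selector and PCA objects) together with the trained model is what lets the same reduction be applied to any ligand later.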
_______________________________________________ Rdkit-discuss mailing list Rdkit-discuss@lists.sourceforge.net https://lists.sourceforge.net/lists/listinfo/rdkit-discuss