Re: [Rdkit-discuss] Use fingerprint do Clustering a large dataset of molecules

Wandré Thu, 11 Jan 2018 09:50:54 -0800

Thanks Andrew, I will try these steps.
So, to avoid recalculating the fingerprints, how can I calculate them once
and store them in a database?
When I calculate the AtomPair fingerprint, it returns
a rdkit.DataStructs.cDataStructs.IntSparseIntVect object.
How can I store this RDKit Python object in a database, and how can I read
it back again?
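
One possible approach to the question above, as a minimal sketch rather than an answer from the thread: serialize each fingerprint to bytes and keep it in a BLOB column. The sketch assumes that the RDKit fingerprint object can be pickled (to my knowledge the DataStructs vector classes support this, but check it against your RDKit version). To stay runnable without RDKit, a plain dict of index-to-count pairs stands in for the IntSparseIntVect; the table name, column names, and helper function names are hypothetical.

```python
import pickle
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a SQLite database with one table of fingerprints."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fingerprints ("
        "mol_id TEXT PRIMARY KEY, fp BLOB NOT NULL)"
    )
    return conn


def save_fingerprint(conn, mol_id, fp):
    """Pickle the fingerprint object and store the bytes as a BLOB."""
    blob = pickle.dumps(fp, protocol=pickle.HIGHEST_PROTOCOL)
    conn.execute(
        "INSERT OR REPLACE INTO fingerprints (mol_id, fp) VALUES (?, ?)",
        (mol_id, blob),
    )
    conn.commit()


def load_fingerprint(conn, mol_id):
    """Read the BLOB back and unpickle it into the original object."""
    row = conn.execute(
        "SELECT fp FROM fingerprints WHERE mol_id = ?", (mol_id,)
    ).fetchone()
    if row is None:
        raise KeyError(mol_id)
    return pickle.loads(row[0])


# Stand-in for an RDKit IntSparseIntVect (assumption: the real object
# pickles the same way). Keys are feature indices, values are counts.
stand_in_fp = {541730: 1, 541731: 2}

conn = open_store()
save_fingerprint(conn, "phenol", stand_in_fp)
restored = load_fingerprint(conn, "phenol")
```

With RDKit installed, the object returned by the AtomPair fingerprint call would be passed to save_fingerprint in place of stand_in_fp. Only unpickle data from a database you trust, since unpickling can execute arbitrary code.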


--
Wandré Nunes de Pinho Veloso
Assistant Professor - Unifei - Campus Avançado de Itabira-MG
PhD candidate in Bioinformatics - Universidade Federal de Minas Gerais - UFMG
Researcher at INSILICO - Grupo Interdisciplinar em Simulação e
Inteligência Computacional - UNIFEI
Member of the research group Assinaturas Biológicas at FIOCRUZ
Member of the research group Bioinformática Estrutural at UFMG
Laboratório de Bioinformática e Sistemas - LBS, DCC, UFMG

2018-01-11 12:46 GMT-02:00 Andrew Dalke <da...@dalkescientific.com>:

> On Jan 11, 2018, at 12:04, Wandré <wandrevel...@gmail.com> wrote:
> > Thanks for the link. It is very interesting. I will read it very carefully.
> > So, as input to chemfp, do I have to provide one SDF file with all the molecules?
>
> Chemfp works with fingerprint files; in your case, chemfp's text-based
> "FPS" format. You'll need to use 'rdkit2fps' to convert your InChI
> structures into fingerprints.
>
> Here's an example file, where I follow the Open Babel convention of
> allowing an identifier after the InChI string:
>
> % cat examples.inchi
> InChI=1S/C6H6O/c7-6-4-2-1-3-5-6/h1-5,7H phenol
> InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H benzene
> InChI=1S/CH4/h1H4/i1D4 deuterated methane
>
> You could also use an SDF or SMILES file.
>
> Next, I generate AtomPair fingerprints. The output goes to "examples.fps",
> which I'll then display.
>
> % rdkit2fps --pairs examples.inchi -o examples.fps
> % cat examples.fps
> #FPS1
> #num_bits=2048
> #type=RDKit-AtomPair/2 fpSize=2048 minLength=1 maxLength=30
> #software=RDKit/2016.09.3 chemfp/3.1
> #source=examples.inchi
> #date=2018-01-11T14:38:57
> 000000000000000000000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 000000000000000000001000000000000000010000000000000000000000
> 000000000000003100000000030000000000000000000000000000000000
> 000000000070030000000000000000000003000000000000000000000000
> 000000000000000730000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 00000000000000000000000000000000        phenol
> 000000000000000000000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 000000000000003000000000000000000000000000000000000000000000
> 000000000070000000000000000000000000000000000000000000000000
> 000000000000000700000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 00000000000000000000000000000000        benzene
> 000000000000000000000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000070000000
> 000000000000000000000000000000000000000000000000000000000000
> 000000700000000000000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 000000000000000000000000000000000000000000000000000000000000
> 00000000000000000000000000000000        deuterated methane
>
>
> Finally, I run the clustering program, with a low threshold so it does
> something other than the trivial output of three clusters.
>
> % python taylor_butina.py -t 0.3 examples.fps
> 0 true singletons
> =>
>
> 1 false singletons
> => deuterated methane
>
> 1 clusters
> phenol has 1 other members
> => benzene
>
> This output format is rather ad hoc. I need to figure out what format
> people want from a clustering tool; preferably one that other tools can
> import without further conversion.
>
> I'll be glad to hear any suggestions.
>
> Cheers,
>
>
>                                 Andrew
>                                 da...@dalkescientific.com
>
>
>
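
To make the FPS listing quoted above more concrete, here is a small sketch of reading FPS records and computing a Tanimoto score in plain Python. It assumes that in the actual file each record is a single line holding the hex-encoded fingerprint, a tab, and the identifier (the listing above appears wrapped by the mail client), and that header lines start with '#'. The function names and the toy 16-bit fingerprints are made up for illustration; this is not how chemfp itself implements the search.

```python
def parse_fps(lines):
    """Yield (identifier, fingerprint_bytes) for each record line."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and '#' header lines
        hex_fp, identifier = line.split("\t", 1)
        yield identifier, bytes.fromhex(hex_fp)


def tanimoto(fp1, fp2):
    """Tanimoto similarity: shared on-bits divided by total on-bits."""
    a = int.from_bytes(fp1, "little")
    b = int.from_bytes(fp2, "little")
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # convention chosen here for two empty fingerprints
    return bin(a & b).count("1") / union


# Toy records (hypothetical data, 16 bits each instead of 2048).
example = [
    "#FPS1\n",
    "#num_bits=16\n",
    "0730\ttoy-A\n",
    "0700\ttoy-B\n",
]

records = list(parse_fps(example))
(id_a, fp_a), (id_b, fp_b) = records
score = tanimoto(fp_a, fp_b)  # 3 shared bits out of 5 set in either
```

A threshold such as the -t 0.3 used above would then be compared against scores like this one when deciding which fingerprints count as neighbors.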

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
