On Jan 11, 2018, at 12:04, Wandré <wandrevel...@gmail.com> wrote:
> Thanks for the link. It is very interesting. I will read very carefully.
> So, as input on ChemFP, I have to put a file with all molecules in 1 SDF?

Chemfp works with fingerprint files, in your case, chemfp's text-based "FPS" format. You'll need to use 'rdkit2fps' to convert your InChI structures into a fingerprint.

Here's an example file, where I follow the Open Babel convention of allowing an identifier after the InChI string:

% cat examples.inchi
InChI=1S/C6H6O/c7-6-4-2-1-3-5-6/h1-5,7H phenol
InChI=1S/C6H6/c1-2-4-6-5-3-1/h1-6H benzene
InChI=1S/CH4/h1H4/i1D4 deuterated methane

You could also use an SDF or SMILES file.

Next, I generate AtomPair fingerprints. The output goes to "examples.fps", which I'll then display.

% rdkit2fps --pairs examples.inchi -o examples.fps
% cat examples.fps
#FPS1
#num_bits=2048
#type=RDKit-AtomPair/2 fpSize=2048 minLength=1 maxLength=30
#software=RDKit/2016.09.3 chemfp/3.1
#source=examples.inchi
#date=2018-01-11T14:38:57
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000100000000000000001000000000000000000000000000000000000310000000003000000000000000000000000000000000000000000007003000000000000000000000300000000000000000000000000000000000000073000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 phenol
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000300000000000000000000000000000000000000000000000000000007000000000000000000000000000000000000000000000000000000000000000070000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 benzene
00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000007000000000000000000000000000000000000000000000000000000000000000000000000070000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000 deuterated methane

Finally, I run the clustering program, with a low threshold so it does something other than the trivial output of three clusters.

% python taylor_butina.py -t 0.3 examples.fps
0 true singletons
=>

1 false singletons
=> deuterated methane

1 clusters
phenol has 1 other members
=> benzene

This output format is rather ad hoc. I need to figure out what format people want from a clustering tool; preferably one that other tools can import without further conversion. I'll be glad to hear any suggestions.

Cheers,

Andrew
da...@dalkescientific.com
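
For readers who want to see what the "true singletons", "false singletons" and "clusters" in the output above mean, here is a minimal standard-library sketch of Taylor-Butina clustering over FPS-style records. It is not the taylor_butina.py script used above, which relies on chemfp's own search code. The function names (parse_fps, tanimoto, taylor_butina), the 8-bit toy fingerprints, the tie-breaking by input order, and the choice to score two empty fingerprints as 0.0 are all assumptions made for illustration.

```python
# Illustrative sketch only: hypothetical helper names, toy 8-bit fingerprints.

def parse_fps(text):
    """Return (id, fingerprint-as-int) pairs from FPS-style text.

    Lines starting with '#' are header lines. Each record is a hex
    fingerprint, whitespace, then an identifier (which may contain spaces).
    """
    records = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        hex_fp, name = line.split(None, 1)
        records.append((name, int(hex_fp, 16)))
    return records


def tanimoto(a, b):
    """Tanimoto similarity of two fingerprints stored as Python ints."""
    union = bin(a | b).count("1")
    if union == 0:
        return 0.0  # assumed convention for two empty fingerprints
    return bin(a & b).count("1") / union


def taylor_butina(records, threshold):
    """Return (true_singletons, false_singletons, clusters).

    clusters is a list of (centroid_id, [member_ids]).
    """
    n = len(records)
    # neighbors[i] holds every other record at least `threshold` similar to i.
    neighbors = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if tanimoto(records[i][1], records[j][1]) >= threshold:
                neighbors[i].add(j)
                neighbors[j].add(i)

    # Visit candidates from most to fewest neighbors (ties: input order).
    order = sorted(range(n), key=lambda i: (-len(neighbors[i]), i))
    seen = set()
    true_singletons, false_singletons, clusters = [], [], []
    for i in order:
        if i in seen:
            continue
        seen.add(i)
        if not neighbors[i]:
            # Nothing is within the threshold at all.
            true_singletons.append(records[i][0])
            continue
        members = sorted(neighbors[i] - seen)
        if not members:
            # It had neighbors, but other clusters already claimed them.
            false_singletons.append(records[i][0])
            continue
        seen.update(members)
        clusters.append((records[i][0], [records[j][0] for j in members]))
    return true_singletons, false_singletons, clusters


# Toy data: a chain a-b-c-d of overlapping bit pairs, plus an isolated e.
TOY_FPS = "#FPS1\n#num_bits=8\n03\ta\n06\tb\n0c\tc\n18\td\nc0\te\n"

if __name__ == "__main__":
    print(taylor_butina(parse_fps(TOY_FPS), 0.3))
```

In the toy data, "d" is similar only to "c", and "c" is claimed by the cluster centered on "b", so "d" ends up as a false singleton, which is the same situation deuterated methane is in above. The all-pairs loop is quadratic and only meant for small examples.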