Dear all,

I'm attaching a proposed dataset for substructure search (SSS) testing and optimization work. The set consists of 10K molecules selected from the PubChem screening deck. Salts were stripped from molecules that had them. These are in the attached file mols.10K.txt.gz.

To generate queries for testing, I applied RECAP to 5K of these molecules (randomly selected). The RDKit RECAP implementation leaves dummy atoms on the fragments to indicate attachment points. Since these don't seem useful in an SSS test, I remove the dummies before doing duplicate removal. I didn't do this before, which is why the duplicate queries Andrew noticed were present. This process yields 4066 queries. These are in the attached file queries.5K.txt.gz.

A note about this file: the queries look like SMILES, but they should be interpreted as SMARTS. This means the query "CC" corresponds to two aliphatic carbons connected by a single bond, so it should not match things like Cc1ccccc1, where the methyl carbon is bonded to an aromatic carbon.

I've also attached histograms of the number of atoms present in the molecules and queries.

I'd really like to hear suggestions about ways to improve this set or the process for generating it. After some time for comment, I will find some place in the RDKit svn repository to deposit these files, along with information about how many times each query should match each molecule.

In case anyone is interested, the code used to decompose the molecules and create queries is here: http://pastebin.com/m8b935c

-greg
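
To make the SMARTS-versus-SMILES point and the dummy-removal step concrete, here is a minimal sketch using the RDKit Python API. It is not the code from the pastebin link. The helper names strip_dummies and count_matches are made up for illustration, and it assumes a recent RDKit in which dummy atoms are written as "*" and have atomic number 0.

```python
from rdkit import Chem


def strip_dummies(frag_smiles):
    """Delete dummy (attachment-point) atoms; return canonical SMILES.

    Hypothetical helper: assumes dummies are atoms with atomic number 0.
    """
    mol = Chem.MolFromSmiles(frag_smiles)
    dummy = Chem.MolFromSmarts("[#0]")
    return Chem.MolToSmiles(Chem.DeleteSubstructs(mol, dummy))


def count_matches(mol_smiles, query_text, as_smarts=True):
    """Count unique substructure matches of a query in a molecule.

    Hypothetical helper: the query text is parsed either as SMARTS
    (the intended reading of the query file) or as SMILES.
    """
    mol = Chem.MolFromSmiles(mol_smiles)
    parse = Chem.MolFromSmarts if as_smarts else Chem.MolFromSmiles
    query = parse(query_text)
    return len(mol.GetSubstructMatches(query))


if __name__ == "__main__":
    # Two fragments that differ only in how the attachment point is
    # written collapse to a single query once the dummies are removed.
    frags = ["*c1ccccc1", "c1ccc(*)cc1"]
    print(sorted({strip_dummies(s) for s in frags}))

    # "CC" read as SMARTS requires two aliphatic carbons: no match in toluene.
    print(count_matches("Cc1ccccc1", "CC", as_smarts=True))   # 0
    # "CC" read as SMILES ignores atom aromaticity: one match in toluene.
    print(count_matches("Cc1ccccc1", "CC", as_smarts=False))  # 1
```

The count uses GetSubstructMatches with its default uniquification, so matches that cover the same set of atoms are counted once. Whether the reference match counts should be uniquified this way is still open for discussion.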
mols.10K.txt.gz
Description: GNU Zip compressed data
queries.5K.txt.gz
Description: GNU Zip compressed data
<<attachment: MolAtomCounts.png>>
<<attachment: QueryAtomCounts.png>>

