Dear all,

I'm attaching a proposed dataset for substructure search (SSS) testing
and optimization work. The set consists of 10K molecules selected from
the PubChem screening deck. Salts were stripped from molecules that had
them. These are in the attached file mols.10K.txt.gz.
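
The salt-stripping procedure is not spelled out above, so the following is only a rough sketch of one common approach: keep the fragment with the most heavy atoms. The helper name strip_salts is made up for illustration, and the procedure actually used for this set may differ.

```python
from rdkit import Chem


def strip_salts(smiles):
    """Return canonical SMILES of the largest fragment, or None if unparsable.

    Assumption: "salt stripping" is approximated by keeping the fragment
    with the most heavy atoms. This is not necessarily what was done to
    build mols.10K.txt.gz.
    """
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    # Split a dot-separated molecule into its disconnected fragments.
    frags = Chem.GetMolFrags(mol, asMols=True)
    largest = max(frags, key=lambda m: m.GetNumHeavyAtoms())
    return Chem.MolToSmiles(largest)


stripped = strip_salts("CN(C)C.Cl")  # trimethylamine hydrochloride
print(stripped)
```

A molecule with no counterion passes through unchanged apart from canonicalization.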

To generate queries for testing, I applied RECAP to 5K of these
molecules (randomly selected). The RDKit RECAP implementation leaves
dummy atoms indicating attachment points on molecules. Since these
don't seem useful in an SSS test, I now remove the dummies before doing
duplicate removal. I didn't do this before, which is why the duplicate
queries Andrew noticed were present. This process yields 4066 queries.
These are in the attached file queries.5K.txt.gz. A note about this
file: these look like SMILES, but they should be interpreted as
SMARTS. This means the query "CC" corresponds to two aliphatic carbons
connected by a single bond, so it should not match things like
Cc1ccccc1.
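
To make the SMILES-versus-SMARTS point concrete, here is a small sketch using the RDKit Python API. It parses the same string "CC" both ways and tests it against toluene. The variable names are just for illustration.

```python
from rdkit import Chem

target = Chem.MolFromSmiles("Cc1ccccc1")  # toluene

# Parsed as SMARTS, "C" means an aliphatic carbon, so both query atoms
# must be aliphatic. Toluene has only one aliphatic carbon.
as_smarts = Chem.MolFromSmarts("CC")

# Parsed as SMILES, the query is ethane used as a substructure, and atom
# matching no longer requires the target atoms to be aliphatic.
as_smiles = Chem.MolFromSmiles("CC")

smarts_hit = target.HasSubstructMatch(as_smarts)
smiles_hit = target.HasSubstructMatch(as_smiles)
print("as SMARTS:", smarts_hit)
print("as SMILES:", smiles_hit)
```

The SMARTS interpretation is the one intended for the query file, so the first result is the one that matters for this test set.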

I've also attached histograms of the number of atoms present in the
molecules and queries.
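
For anyone who wants to regenerate the atom-count tallies behind the histograms, something like the sketch below should do. The function name atom_count_histogram is hypothetical, and I am assuming one structure string per record. Plotting is left out.

```python
from collections import Counter

from rdkit import Chem


def atom_count_histogram(strings, as_smarts=False):
    """Tally how many structures have each atom count.

    Molecules are parsed as SMILES; queries should be parsed as SMARTS
    (as_smarts=True). Unparsable strings are skipped.
    """
    parse = Chem.MolFromSmarts if as_smarts else Chem.MolFromSmiles
    counts = Counter()
    for s in strings:
        mol = parse(s)
        if mol is not None:
            # GetNumAtoms() counts explicit atoms only (no implicit Hs).
            counts[mol.GetNumAtoms()] += 1
    return counts


hist = atom_count_histogram(["CCO", "CC", "c1ccccc1", "OCC"])
print(sorted(hist.items()))
```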

I'd really like to hear suggestions about ways to improve this set or
the process for generating it. After some time for comment, I will
find some place in the RDKit svn repository to deposit these files
along with information about how many times each query should match
each molecule.
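
As a starting point for discussion, the match counts could be produced with something like the sketch below. It assumes that "how many times" means the number of unique atom sets matched (uniquify=True); counting every atom mapping instead is an equally defensible choice, so that needs to be agreed on. The function name count_matches is hypothetical.

```python
from rdkit import Chem


def count_matches(query_smarts, mol_smiles, uniquify=True):
    """Count substructure matches of a SMARTS query in a SMILES molecule.

    Returns None if either string cannot be parsed. Note that
    GetSubstructMatches caps the number of matches it returns
    (maxMatches), which could matter for very small queries.
    """
    query = Chem.MolFromSmarts(query_smarts)
    mol = Chem.MolFromSmiles(mol_smiles)
    if query is None or mol is None:
        return None
    return len(mol.GetSubstructMatches(query, uniquify=uniquify))


unique_count = count_matches("CC", "CCC")
mapping_count = count_matches("CC", "CCC", uniquify=False)
print(unique_count, mapping_count)
```

With uniquify=False, symmetric queries such as "CC" are counted once per atom ordering, so the two conventions give different numbers for the same query and molecule.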

In case anyone is interested, the code used to decompose the molecules
and create queries is here:
http://pastebin.com/m8b935c
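
The pastebin code is not reproduced here. What follows is only a minimal sketch of the steps described above (RECAP decomposition, dummy removal, then duplicate removal), not the script itself. The function name recap_queries is made up, and using the RECAP leaves rather than every node in the hierarchy is an assumption.

```python
from rdkit import Chem
from rdkit.Chem import Recap

# Dummy atoms (atomic number 0) mark the RECAP attachment points.
DUMMY = Chem.MolFromSmarts("[#0]")


def recap_queries(smiles_list):
    """Return sorted, de-duplicated fragment strings with dummies removed."""
    seen = set()
    for smi in smiles_list:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue
        # Assumption: only the leaves of the RECAP hierarchy are used.
        leaves = Recap.RecapDecompose(mol).GetLeaves()
        for node in leaves.values():
            frag = Chem.DeleteSubstructs(node.mol, DUMMY)
            if frag.GetNumAtoms() == 0:
                continue
            # Canonicalize AFTER removing dummies so that fragments that
            # differ only in attachment points collapse to one query.
            seen.add(Chem.MolToSmiles(frag))
    return sorted(seen)


queries = recap_queries(["c1ccccc1OCCOC(=O)CC", "c1ccccc1OCCOC(=O)CC"])
print(queries)
```

Doing the duplicate removal before stripping the dummies leaves duplicates behind, because fragments that differ only in where the attachment point sits are still distinct at that stage.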

-greg

Attachment: mols.10K.txt.gz
Description: GNU Zip compressed data

Attachment: queries.5K.txt.gz
Description: GNU Zip compressed data

<<attachment: MolAtomCounts.png>>

<<attachment: QueryAtomCounts.png>>
