Re: [Rdkit-discuss] Incorrect results for substructure search obtained with Tversky similarity.

2016-12-12 Thread Greg Landrum
Hi Axel,
The RDKit's Morgan Fingerprint is not a substructure screening fingerprint. If 
you want to use a fingerprint for screening, your best bet is the Pattern 
fingerprint.
As an aside, the RDKit has a function, DataStructs.AllProbeBitsMatch
(http://www.rdkit.org/Python_Docs/rdkit.DataStructs.cDataStructs-module.html#AllProbeBitsMatch),
which is a somewhat more efficient way of doing the check you're looking to do.
-greg
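
The screening approach suggested above can be sketched roughly as follows. This is only an illustration, not code from the thread: the helper name substructure_search, the three-molecule library, and the benzene query are made up for the example. It uses the Pattern fingerprint with DataStructs.AllProbeBitsMatch as a cheap pre-filter and then confirms each survivor with a real substructure match, since the fingerprint alone can only rule molecules out.

```python
from rdkit import Chem, DataStructs


def substructure_search(query_smiles, library_smiles, fp_size=2048):
    """Return the library SMILES that contain the query as a substructure.

    Hypothetical helper for illustration: screen with Pattern fingerprints,
    then verify every survivor with an explicit substructure match.
    """
    query = Chem.MolFromSmiles(query_smiles)
    query_fp = Chem.PatternFingerprint(query, fpSize=fp_size)

    hits = []
    for smi in library_smiles:
        mol = Chem.MolFromSmiles(smi)
        if mol is None:
            continue  # skip SMILES that fail to parse
        mol_fp = Chem.PatternFingerprint(mol, fpSize=fp_size)
        # Screen: if any query bit is missing, the molecule cannot match.
        if not DataStructs.AllProbeBitsMatch(query_fp, mol_fp):
            continue
        # Verify: passing the screen is necessary but not sufficient.
        if mol.HasSubstructMatch(query):
            hits.append(smi)
    return hits


# Made-up example data: phenol, ethanol, pyridine; query is benzene.
library = ["c1ccccc1O", "CCO", "c1ccncc1"]
hits = substructure_search("c1ccccc1", library)
print(hits)
```

In a real search the library fingerprints would normally be computed once and stored, so that only the bit comparison and the occasional substructure match are paid per query.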

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, SlashDot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Incorrect results for substructure search obtained with Tversky similarity.

2016-12-12 Thread Axel Rudling
Hi Brian, and thank you for your response.
Yes, so Tversky with the alpha parameter set to 1.0 and a similarity cutoff
of 1.0 (100 % of me in you) will equal substructure search, at least at a
theoretical level. I guess my question is: are imperfections in the fp model
likely to generate these kinds of results? I use ECFP4 with 2048 bits.

Regards
Axel


Re: [Rdkit-discuss] Incorrect results for substructure search obtained with Tversky similarity.

2016-12-12 Thread Brian Kelley
I'm not really sure what you mean by Tversky searching in substructure mode.

Fingerprinting methods do not guarantee the presence of an exact substructure.
You can think of Tversky as asking what percentage of me is in you, and that
percentage doesn't have to be a substructure. However, the two are correlated in
that a good screening fingerprint can throw out molecules that will never be a
substructure match. You still have to check the substructure match, however.

Using a screening fingerprint to filter out true negatives, I generally go from
5-10k substructure matches/sec to around 500-600k/sec in real-world searches.
I'm happy to provide an example of this if you need it.

I hope this helps.


Brian Kelley
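
As a rough illustration of the "what percentage of me is in you" reading (a sketch, not code from this thread): with alpha = 1.0 and beta = 0.0, Tversky reduces to |A and B| / |A|, so a score of 1.0 only says that every bit set for the query is also set for the target. The feature ids, the 64-bit folding size, and the helper names tversky and fold below are invented for the example, and the fingerprints are modelled as plain Python sets of bit indices. The example shows how folding hashed features into a fixed-length fingerprint can produce a score of 1.0 even though one query feature is absent from the target.

```python
def tversky(a_bits, b_bits, alpha, beta):
    """Tversky index on two collections of set-bit indices."""
    a, b = set(a_bits), set(b_bits)
    common = len(a & b)
    denom = common + alpha * len(a - b) + beta * len(b - a)
    return common / denom if denom else 0.0


def fold(features, n_bits):
    """Fold hashed feature ids into a fixed-length fingerprint (toy model)."""
    return {f % n_bits for f in features}


# Invented feature ids: the target lacks query feature 42, but its
# feature 106 lands on the same bit once folded to 64 bits (106 % 64 == 42).
query_features = {3, 17, 42}
target_features = {3, 17, 106, 90}

unfolded = tversky(query_features, target_features, 1.0, 0.0)
folded = tversky(fold(query_features, 64), fold(target_features, 64), 1.0, 0.0)

print(unfolded)  # 2 of 3 query features are present in the target
print(folded)    # all 3 query bits are set after folding
```

Bit collisions are only one source of false positives; even without them, matching bits do not prove that the features are connected in the same way in both molecules, which is why the explicit substructure check is still needed.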



[Rdkit-discuss] Incorrect results for substructure search obtained with Tversky similarity.

2016-12-12 Thread Axel Rudling
Hello all,

Currently I'm doing a project with Tversky searching in substructure mode,
using SMILES to create the fingerprints.

For most molecules I get the correct result, but there are some molecules
where I get an overflow of falsely predicted substructure molecules. In
brief, I get a large number of compounds as a result of the substructure
search that are not actually substructures of the query compound. I'm not
certain why, but it might have to do with the FP representation, as these
molecules have a very unusual cyclic structure, e.g.:

C1C[NH2+]CCC[NH2+]CCCNCCC[NH2+]C1


I use 2048-bit ECFP4 fingerprints.

tverskySim = DataStructs.TverskySimilarity(ffp1,ffp2,1.0,0.0)

Does anyone have an idea?


best

Axel