Hi Greg and thank you for the prompt reply!

I chose 4096 bits instead of the default because you showed in an older
blog post (
http://rdkit.blogspot.com/2013/11/fingerprint-based-substructure.html) that
it improves accuracy over the default:

            Pieces  Fragments  Leads
Pattern-2K  0.572   0.590      0.715
Pattern-4K  0.635   0.602      0.729

I presume that it would be simple and useful to add an 'fpSize' option to
the SubstructLibrary constructor.
Btw, Brian's workaround does not allow explicit specification of the
bitvector size either. Both lines below returned an error.

fps.AddFingerprint( fps.MakeFingerprint(mol2, fpSize=4096) )

fps.AddFingerprint( fps.MakeFingerprint(mol2, nBits=4096) )


For the time being, I will use 2048 bits but it would be good to be able to
control it in the future.
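
To make the size mismatch Greg describes concrete, here is a minimal toy
sketch in plain Python (it is not RDKit code). It assumes that each
fingerprint bit is chosen by reducing a feature hash modulo the fingerprint
length; the hash values and the names toy_fingerprint, all_probe_bits_match
and checked_screen are hypothetical, made up only for illustration.

```python
def toy_fingerprint(feature_hashes, n_bits=2048):
    """Fold integer feature hashes into a set of 'on' bit positions."""
    return {h % n_bits for h in feature_hashes}


def all_probe_bits_match(probe, target):
    """True if every bit set in the probe is also set in the target."""
    return probe <= target


def checked_screen(probe, probe_size, target, target_size):
    """Screen only when both fingerprints were built with the same length."""
    if probe_size != target_size:
        raise ValueError(
            f"fingerprint size mismatch: {probe_size} vs {target_size}"
        )
    return all_probe_bits_match(probe, target)


# Made-up hash values standing in for substructure features (assumption).
query_hashes = [3000, 70000, 123456]
# The target contains every query feature plus two extra ones.
target_hashes = query_hashes + [9001, 250000]

target_4k = toy_fingerprint(target_hashes, 4096)  # stored in the library
query_4k = toy_fingerprint(query_hashes, 4096)    # same length as the target
query_2k = toy_fingerprint(query_hashes, 2048)    # default length

print("same size: ", all_probe_bits_match(query_4k, target_4k))
print("mixed size:", all_probe_bits_match(query_2k, target_4k))
```

With these made-up values the same-size screen passes, while the mixed-size
screen rejects a target that really does contain all of the query features.
That is the kind of false negative I saw. The checked_screen helper shows
the sort of length check that would turn the silent failure into an error.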

Best wishes,
Thomas






On Mon, 31 Aug 2020 at 16:34, Greg Landrum <greg.land...@gmail.com> wrote:

> Hi Thomas,
>
> I agree that this is a much better place to ask a question than in the
> comments of my blog post. :-)
>
> The problem you're having here is that the PatternHolder() class assumes
> that the fingerprints being used have the default size, so you are storing
> fingerprints with 4096 bits, but when the SubstructLibrary generates a
> fingerprint for a query molecule it only generates a 2048-bit fingerprint.
> This causes the substructure screenout to fail.
> This is certainly a bug in the SubstructLibrary (it should, at the very
> least, generate an error when you try to do this), but it's easy enough to
> fix in your code: just stop specifying the length of the pattern
> fingerprints.
>
> Best,
> -greg
>
>
>
> On Mon, Aug 31, 2020 at 3:57 PM Thomas Evangelidis <teva...@gmail.com>
> wrote:
>
>> Greetings,
>>
>> Maybe I should have posted this query as a comment on Greg's blog post (
>> https://rdkit.blogspot.com/2018/02/introducing-substructlibrary.html)
>> but I am writing it here instead for greater visibility. I have many
>> active fragments against a protein target (validated by NMR) and I want
>> to screen a very large database for molecules containing those fragments.
>> Therefore I tried the SubstructLibrary for greater efficiency. However,
>> the results I get differ from direct PatternFingerprint comparison and
>> substructure search using the Mol object. Try this simple example below:
>>
>> from rdkit import Chem, DataStructs
>> from rdkit.Chem import rdSubstructLibrary
>>
>> SMILES1 = 'O=C(O)c1cccnc1'
>> SMILES2 = 'c1nccc(c1C(=O)O)-c2cc(Cl)ccc2'
>> # Remove hydrogens, otherwise you will have to modify the valence of the
>> # atoms in the fragment that can facilitate extension by hand
>> mol1 = Chem.RemoveHs( Chem.MolFromSmiles(SMILES1, sanitize=False) )
>> mol2 = Chem.RemoveHs( Chem.MolFromSmiles(SMILES2, sanitize=False) )
>>
>> # AVENUE 1: Library
>> mols2 = rdSubstructLibrary.CachedTrustedSmilesMolHolder()
>> mols2.AddSmiles( Chem.MolToSmiles(mol2) )
>> fps = rdSubstructLibrary.PatternHolder()
>> fp2 = Chem.PatternFingerprint(mol2, fpSize=4096)
>> fps.AddFingerprint( fp2 )
>> library = rdSubstructLibrary.SubstructLibrary(mols2, fps)
>> print("SubstructLibrary:", library.HasMatch(mol1, useChirality=False) )
>>
>> # AVENUE 2: PatternFingerprint comparison
>> fp1 = Chem.PatternFingerprint(mol1, fpSize=4096)
>> print("PatternFingerprint:", DataStructs.AllProbeBitsMatch(fp1, fp2))
>>
>> # AVENUE 3: HasSubstructMatch
>> print("HasSubstructMatch:", mol2.HasSubstructMatch(mol1))
>>
>>
>> I strip out the hydrogens from both molecules in order to avoid manual
>> modification of the atoms in the fragment (SMILES1 in this case) that can
>> facilitate linking or extension. Why do the results not agree in this
>> case? Am I not using SubstructLibrary correctly?
>>
>> I thank you in advance.
>> Thomas
>>
>> --
>>
>> ======================================================================
>>
>> Dr. Thomas Evangelidis
>>
>> Research Scientist
>>
>> IOCB - Institute of Organic Chemistry and Biochemistry of the Czech
>> Academy of Sciences <https://www.uochb.cz/web/structure/31.html?lang=en>
>> , Prague, Czech Republic
>>   &
>> CEITEC - Central European Institute of Technology
>> <https://www.ceitec.eu/>, Brno, Czech Republic
>>
>> email: teva...@gmail.com, Twitter: tevangelidis
>> <https://twitter.com/tevangelidis>, LinkedIn: Thomas Evangelidis
>> <https://www.linkedin.com/in/thomas-evangelidis-495b45125/>
>>
>> website: https://sites.google.com/site/thomasevangelidishomepage/
>>
>>
>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>

