Re: [Rdkit-discuss] How does rdkit cartridge work?

2020-01-24 Thread Greg Landrum
Hi Changge,

On Fri, Jan 24, 2020 at 5:14 PM Chicago Ji  wrote:

>
> I find that rdkit cartridge is quite efficient in substructure searching.
>

Glad to hear that! :-)


> Is there any paper or similar paper that describes things behind rdkit
> cartridge?
>

No, just the documentation.


> For example, what kind of substructures were indexed?
>

The Pattern fingerprint is what's used to build the index for substructure
searching. That fingerprint is described in the RDKit documentation here:
https://www.rdkit.org/docs/RDKit_Book.html#pattern-fingerprints
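
To illustrate the screening idea behind a fingerprint index, here is a minimal sketch in plain Python. It assumes fingerprints are stored as integers used as bit sets; the bit patterns and the names `all_probe_bits_match` and `screen` are made up for illustration and are not the cartridge's actual implementation.

```python
def all_probe_bits_match(probe: int, ref: int) -> bool:
    """Return True if every bit set in probe is also set in ref."""
    return probe & ref == probe


def screen(query_fp: int, library_fps: list) -> list:
    """Return indices of library fingerprints that pass the bit screen."""
    return [i for i, fp in enumerate(library_fps)
            if all_probe_bits_match(query_fp, fp)]


# Hypothetical 6-bit fingerprints, not real Pattern fingerprint output.
library_fps = [0b101101, 0b001100, 0b111111, 0b100001]
query_fp = 0b001101

candidates = screen(query_fp, library_fps)
print(candidates)
```

Entries that pass such a screen are only candidates: a fingerprint screen can produce false positives, so each candidate still needs a full substructure match.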


> Is there a way that the users can add custom defined fingerprints and
> substructures?
>

I'm not sure what you mean by substructures, but you can, from Python, use
custom fingerprints in the cartridge. That's explained here:
http://rdkit.blogspot.com/2017/04/using-custom-fingerprint-in-postgresql.html

Best,
-greg



> Many thanks for your help!
>
> Best,
> Changge
>
>
>
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>


Re: [Rdkit-discuss] Observations about RDKit performance: PatternFingerprinter, Windows, Linux and Virtual machines

2020-01-24 Thread Thomas Strunz
Hi Jan,

yeah this numpy test might not be ideal, since it depends a lot on how numpy
was built, namely on the underlying BLAS library that was used. I suspect your
Debian VM doesn't use OpenBLAS or MKL.
You can check that with

np.__config__.show()

For n = 6000 native (Windows 10) takes 2s and VM (Lubuntu 16.04) takes 2.5s.
While the difference again is roughly 50%, I'm hesitant as here we really are
not comparing apples to apples: Windows uses Intel MKL vs. OpenBLAS on Linux.
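
Besides the BLAS backend, the number of threads a process may use can also skew such a comparison. A minimal sketch, using only the standard library, for checking how many cores the current process can see and whether common BLAS/OpenMP thread-limit environment variables are set; the helper name `visible_cores` is made up for illustration.

```python
import os


def visible_cores() -> int:
    """Number of CPU cores the current process is allowed to run on."""
    if hasattr(os, "sched_getaffinity"):  # available on Linux
        return len(os.sched_getaffinity(0))
    return os.cpu_count() or 1


# Environment variables that commonly cap BLAS/OpenMP thread counts.
THREAD_VARS = ("OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS", "MKL_NUM_THREADS")
thread_limits = {name: os.environ.get(name) for name in THREAD_VARS}

print("cores visible:", visible_cores())
print("thread limits:", thread_limits)
```

A value of None means the variable is not set, so the library default applies.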

Best Regards,

Thomas


From: Jan Holst Jensen
Sent: Friday, 24 January 2020 13:43
To: Thomas Strunz; Maciek Wójcikowski
Cc: rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] Observations about RDKit performance: 
PatternFingerprinter, Windows, Linux and Virtual machines

Hi Thomas,

FWIW I ran your example code below on my VM host (CentOS 7.3, Intel(R) Xeon(R) 
CPU E3-1245 v6 @ 3.70GHz) and in a Linux VM (Debian 9).

n = 6000    Host = 3.8 secs     VM = 145 secs    ~40 times slower
n = 1000    Host = 0.03 secs    VM = 0.6 secs    ~20 times slower

So based on these timings, your 50% penalty in the VM sounds really good :-).

Now, the example maxes out all available cores on the host, but sticks to a 
single core in the VM. I don't know the reason for that, but perhaps it is due 
to differing build options for numpy on CentOS vs. Debian? That explains 
roughly a factor of 8 for me (CPU has 4 cores, 8 threads). Still, after 
correcting for active core count, the VM will end up taking 2-5 times as long 
as the host.
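
The correction above can be worked through as a short sketch. The timings are the ones quoted in this message; dividing by 8 assumes the host really gains a full factor of 8 from its 8 hardware threads, which is an idealization.

```python
# n: (host seconds, VM seconds), as reported above
timings = {6000: (3.8, 145.0), 1000: (0.03, 0.6)}
CORE_FACTOR = 8  # assumed speedup from 8 threads on the host vs. 1 in the VM

corrected = {}
for n, (host_s, vm_s) in timings.items():
    raw = vm_s / host_s                   # slowdown as measured
    corrected[n] = raw / CORE_FACTOR      # slowdown per active core
    print(f"n = {n}: raw {raw:.0f}x, corrected {corrected[n]:.1f}x")
```

This gives roughly 4.8x for n = 6000 and 2.5x for n = 1000, which is where the "2-5 times" range comes from.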

For other workloads I generally don't see such a dramatic difference; more like 
10-30% slower performance in VMs compared to native. Seems like you have hit a 
particular VM weak spot with your workload.

If container deployment is an alternative option instead of VM, perhaps that 
would improve matters ? Of course, that won't help you if you need to deploy on 
Windows.

Cheers
-- Jan


Re: [Rdkit-discuss] Highlighting some parts of a structure

2020-01-24 Thread Greg Landrum
[adding the mailing list back in]

Hi Alexis,

It's not currently possible to change the colormap that's being used with
the new drawing code.
It should be, so I created an issue for it:
https://github.com/rdkit/rdkit/issues/2904

-greg




On Thu, Jan 23, 2020 at 7:00 PM Alexis Parenty <
alexis.parenty.h...@gmail.com> wrote:

> Hi again, one last question, I used to use the colorMap argument "RdBu" to
> get a nice blue against red color coding rather than the default green /
> purple (that I also use to highlight another property). The following does
> not produce blue/red color code anymore:
>
> d = Draw.MolDraw2DCairo(400, 400)
> SimilarityMaps.GetSimilarityMapFromWeights(mol, coverage_list, draw2d=d, 
> colorMap='RdBu')
> d.FinishDrawing()
> img = show_png(d.GetDrawingText())
> img.save("applicability_domain_glowing_molecule.png")
>
> How can I recover the blue/red colors?
> Many thanks,
> Alexis
>
> On Thu, 23 Jan 2020 at 17:09, Alexis Parenty <
> alexis.parenty.h...@gmail.com> wrote:
>
>> Hi Greg, many thanks for your quick response. That's exactly what I was
>> after. In addition, the quality of the new drawing code is superb.
>> Best,
>> Alexis
>>
>> On Thu, 23 Jan 2020 at 10:03, Greg Landrum 
>> wrote:
>>
>>> Hi Alexis,
>>>
>>> It's not currently possible to control the widths or colors of
>>> particular bonds in the molecular renderings, but you can certainly
>>> highlight arbitrary bonds (and the color of those highlights) in the
>>> molecular drawing.
>>> This is controlled using the highlightBondColors argument to
>>> DrawMolecule.
>>> Here's an example of that:
>>> https://gist.github.com/greglandrum/baafb4810aab474a0dd96dae9e34fcaf
>>>
>>> The Python code that actually generates the similarity map using the new
>>> drawing code (described in this blog post:
>>> http://rdkit.blogspot.com/2020/01/similarity-maps-with-new-drawing-code.html)
>>> is here:
>>>
>>> https://github.com/rdkit/rdkit/blob/master/rdkit/Chem/Draw/SimilarityMaps.py#L152
>>>
>>> At the moment you'd need to duplicate that and add the
>>> highlightBondColors argument to the call to DrawMolecules()
>>>
>>> thinking about adding an API to allow bond widths and colors to be
>>> directly changed is interesting...
>>>
>>> -greg
>>>
>>>
>>>
>>>
>>>
>>> On Wed, Jan 22, 2020 at 6:17 PM Alexis Parenty <
>>> alexis.parenty.h...@gmail.com> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I use SimilarityMaps.GetSimilarityMapFromWeights(mol, atom_ids) to
>>>> highlight some parts of a structure, but is it also possible to change the
>>>> thickness of some bonds of a structure knowing their atom IDs? If selected
>>>> bonds cannot be bold, can we change their color?
>>>>
>>>> Many thanks and regards,
>>>>
>>>> Alexis

>>>


Re: [Rdkit-discuss] Observations about RDKit performance: PatternFingerprinter, Windows, Linux and Virtual machines

2020-01-24 Thread Thomas Strunz
Hi Maciek,

yeah I thought that this could be the issue as well, but according to the tools 
(grep flags /proc/cpuinfo | uniq on Linux, or coreinfo on Windows) the VMs also 
support SSE4.2 (and lower) and AVX.
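
The same check can be scripted. A minimal sketch, assuming a Linux guest on x86 where /proc/cpuinfo contains a "flags" line; the helper name `cpu_flags` and the sample text are made up for illustration.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the CPU flags from the first 'flags' line of cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        if line.startswith("flags"):
            return set(line.split(":", 1)[1].split())
    return set()


WANTED = ("sse4_2", "popcnt", "avx")

# Hypothetical sample so the sketch also runs on non-Linux systems.
sample = "processor\t: 0\nflags\t\t: fpu sse4_2 popcnt avx\n"
sample_flags = cpu_flags(sample)
print({name: name in sample_flags for name in WANTED})

# On a real Linux machine, read the actual file instead.
cpuinfo = Path("/proc/cpuinfo")
if cpuinfo.exists():
    real_flags = cpu_flags(cpuinfo.read_text())
    print({name: name in real_flags for name in WANTED})
```

Running this on both host and guest makes it easy to compare which instruction set extensions each one reports.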

In fact I seem to have to look further, as I noticed that Python performance in 
general (and possibly more, not tested) is much slower on the VMs. See the code 
below, which is actually a way to see the performance impact of vector 
extensions and especially of Intel MKL.

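import numpy as np
import time

# n is the matrix dimension, shown as it appears in this archive; the replies
# in this thread quote timings for n = 1000 and n = 6000
n = 2
A = np.random.randn(n,n).astype('float64')
B = np.random.randn(n,n).astype('float64')

# time one matrix product plus norm; this mostly exercises the BLAS backend
start_time = time.time()
nrm = np.linalg.norm(A@B)
print(" took {} seconds ".format(time.time() - start_time))
print(" norm = ",nrm)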


The last code fragment runs about 50% slower on the Windows VM compared to my 
laptop, accounting for clock and core count differences. It's confusing to me, 
as the performance difference is so consistent and apparent, but I would assume 
that if this was normal, people would have noticed a long time ago? Yet I can't 
find anything about it. Or does everyone run their code native?

Best Regards,

Thomas


From: Maciek Wójcikowski
Sent: Thursday, 23 January 2020 11:04
To: Thomas Strunz
Cc: Greg Landrum; rdkit-discuss@lists.sourceforge.net
Subject: Re: [Rdkit-discuss] Observations about RDKit performance: 
PatternFingerprinter, Windows, Linux and Virtual machines

Thomas,

Could you double check whether your VM exposes the same instruction set as your 
host? Hardware popcounts, which are used to accelerate fingerprint operations, 
might have a profound impact on performance. SSE4.2 is probably the one that is 
used in the RDKit (at least this is stated in the code).

For KVM, see https://www.linux-kvm.org/page/Tuning_KVM (there are Linux 
commands to check what is available on the guest, so it might be helpful for 
you too).
It also seems that in the VMware world this might be tricky, as it is 
considered to be a stability hazard: 
https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.vcenterhost.doc_50%2FGUID-8B226625-4923-410C-B7AF-51BCD2806A3B.html

Best,
Maciek


Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl


On Thu, 23 Jan 2020 at 08:15, Thomas Strunz <beginn...@hotmail.de> wrote:
Hi Greg,

reopening this old question. I can see that there are potential differences 
between rdkit versions and especially between Linux and Windows, but let's 
leave that aside for now.

After further "playing around", however, I really have the impression there is 
a real issue with running rdkit (or Python?) in a virtualized operating system. 
Since most production software, including anything in the cloud, will run in a 
virtualized operating system, I think this should be a fairly relevant topic 
worth investigating. As you showed yourself, the AWS system was also fairly 
slow.

For the following observations I'm keeping the same datasets as before, which 
are from your blog post ( /Regress/Scripts/fingerprint_screenout.py). Basically 
it's that code slightly adapted:

import gzip
from rdkit import Chem, DataStructs

# data_dir must point at the directory holding the two data files
mols = []
with gzip.open(data_dir + 'chembl21_25K.pairs.txt.gz', 'rb') as inf:
    for line in inf:
        line = line.decode().strip().split()
        smi1 = line[1]
        smi2 = line[3]
        m1 = Chem.MolFromSmiles(smi1)
        m2 = Chem.MolFromSmiles(smi2)
        mols.append(m1)
        mols.append(m2)

frags = [Chem.MolFromSmiles(x.split()[0])
         for x in open(data_dir + 'zinc.frags.500.q.smi', 'r')]

# 512-bit pattern fingerprints for the molecules and the query fragments
mfps = [Chem.PatternFingerprint(m, 512) for m in mols]
fragsfps = [Chem.PatternFingerprint(m, 512) for m in frags]

%%timeit -n1 -r1
# screen with the fingerprints first, then confirm with a full substructure match
for i, fragfp in enumerate(fragsfps):
    hits = 0
    for j, mfp in enumerate(mfps):
        if DataStructs.AllProbeBitsMatch(fragfp, mfp):
            if mols[j].HasSubstructMatch(frags[i]):
                hits = hits + 1


I want to focus on the last cell, namely on the "AllProbeBitsMatch" method:

%%timeit
DataStructs.AllProbeBitsMatch(fragsfps[10], mfps[10])

Results:

Windows 10 native, i7-8850H:
    567 ns ± 5.48 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)
Lubuntu 16.04 virtualized, i7-8850H:
    1.81 µs ± 56.7 ns per loop (mean ± std. dev. of 7 runs, 100 loops each) // the high variation is consistent
Windows Server 2012 R2 virtualized, Xeon E5-2620 v4:
    1.18 µs ± 4.09 ns per loop (mean ± std. dev. of 7 runs, 100 loops each)

So virtualization seems to reduce the performance of this specific method by 
half, which is also what I see when running the full substructure search code: 
it takes double the time on the virtualized machines. (The Windows server 
actually runs on ESX (i.e. a type 1 hypervisor) while the Lubuntu VM runs on a 
type 2 hypervisor (VMware Workstation), but both seem to suffer the same.)
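
To repeat this kind of measurement outside a notebook, the standard library's timeit module can be used. A minimal sketch; the helper name `bench` is made up, and the timed statement is a plain integer bit test standing in for the RDKit call, so the numbers are not comparable to the results above.

```python
import statistics
import timeit


def bench(stmt: str, setup: str = "pass", repeat: int = 7, number: int = 10000):
    """Return (mean, std dev) of the per-loop time in seconds over `repeat` runs."""
    totals = timeit.repeat(stmt, setup=setup, repeat=repeat, number=number)
    per_loop = [t / number for t in totals]
    return statistics.mean(per_loop), statistics.stdev(per_loop)


# Stand-in workload: check that all bits of `probe` are set in `ref`.
mean_s, std_s = bench("probe & ref == probe",
                      setup="probe, ref = 0b001101, 0b101101")
print(f"{mean_s * 1e9:.1f} ns ± {std_s * 1e9:.1f} ns per loop "
      f"(mean ± std. dev. of 7 runs, 10000 loops each)")
```

Running the same script natively and inside the VM gives directly comparable per-loop times, independent of IPython.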

We can try the same thing with

%%timeit
mols[10].HasSubstructMatch(frags[10])

The difference here is smaller but VMs