Re: [Rdkit-discuss] Observations about RDKit performance: PatternFingerprinter, Windows, Linux and Virtual machines

Maciek Wójcikowski Thu, 23 Jan 2020 02:34:23 -0800

Thomas,

Could you double-check whether your VM exposes the same CPU instruction set
as your host? Hardware popcounts, which are used to accelerate fingerprint
operations, can have a profound impact on performance. SSE4.2 is probably
the instruction set that the RDKit relies on for this (at least that is what
the code states).
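
To illustrate why bit counting matters for fingerprints, here is a minimal
pure-Python sketch of the two kinds of operation involved. It is not the
RDKit's implementation (the RDKit works in C++ on bit vectors); the function
names are made up for this example and fingerprints are modelled as plain
Python integers. The screening test is a simple AND-and-compare, while a
similarity measure such as Tanimoto needs popcounts for every comparison,
which is where a hardware instruction can help.

```python
# Illustrative only: fingerprints are modelled as Python ints, one bit per
# fingerprint position. All function names here are hypothetical.

def popcount(bits: int) -> int:
    """Count the set bits (software fallback for a hardware popcount)."""
    return bin(bits).count("1")


def tanimoto(fp_a: int, fp_b: int) -> float:
    """Tanimoto similarity: |A & B| / |A | B|, two popcounts per call."""
    union = popcount(fp_a | fp_b)
    return popcount(fp_a & fp_b) / union if union else 0.0


def probe_bits_all_set(probe: int, ref: int) -> bool:
    """Screening test: every bit set in `probe` is also set in `ref`."""
    return probe & ref == probe


if __name__ == "__main__":
    query, mol = 0b0011, 0b1011
    print(probe_bits_all_set(query, mol), round(tanimoto(query, mol), 3))
```

If the guest falls back to a software popcount, operations of the first kind
are the ones I would expect to slow down most.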


For KVM, see https://www.linux-kvm.org/page/Tuning_KVM (it lists Linux
commands for checking what is available on the guest, so it might be helpful
for you too). It also seems that in the VMware world this might be tricky,
as it is considered a stability hazard:
https://pubs.vmware.com/vsphere-50/index.jsp?topic=%2Fcom.vmware.vsphere.vcenterhost.doc_50%2FGUID-8B226625-4923-410C-B7AF-51BCD2806A3B.html
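
As a rough sketch of such a check on a Linux guest (x86 only, and assuming
/proc/cpuinfo is readable), the snippet below looks for the "sse4_2" and
"popcnt" flags. The helper names are hypothetical; running it on both host
and guest and comparing the output should show whether the VM hides either
feature.

```python
from pathlib import Path


def cpu_flags(cpuinfo_text: str) -> set:
    """Return the feature flags from /proc/cpuinfo-style text (x86 layout)."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            # The first "flags" entry is enough; all cores normally agree.
            return set(value.split())
    return set()


def missing_flags(cpuinfo_text: str, wanted=("sse4_2", "popcnt")) -> list:
    """List the wanted flags that the CPU does not advertise."""
    flags = cpu_flags(cpuinfo_text)
    return [flag for flag in wanted if flag not in flags]


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():
        print("missing flags:", missing_flags(cpuinfo.read_text()))
    else:
        print("/proc/cpuinfo not found (not a Linux system?)")
```

An empty list means both flags are visible to the guest. This only tells you
what the guest CPU advertises, not which code path your RDKit build was
compiled to use.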

Best,
Maciek

----
Best regards,
Maciek Wójcikowski
[email protected]


On Thu, 23 Jan 2020 at 08:15, Thomas Strunz <[email protected]> wrote:

> Hi Greg,
>
> Reopening this old question: I can see that there are potential
> differences between RDKit versions, and especially between Linux and
> Windows, but let's leave that aside for now.
>
> After further "playing around", however, I really have the impression that
> there is a real issue with running the RDKit (or Python?) in a virtualized
> operating system. Since most production software, and anything running in
> the cloud, will mostly run in a virtualized operating system, I think this
> is a fairly relevant topic worth investigating. As you showed yourself, the
> AWS system was also fairly slow.
>
> For the following observations I'm keeping the same dataset as before,
> which is the one from your blog post
> ( /Regress/Scripts/fingerprint_screenout.py). Basically it's that code,
> slightly adapted:
>
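> import gzip
>
> from rdkit import Chem, DataStructs
>
> # data_dir must be set beforehand to the directory with the benchmark files
> # Read both molecules of every pair from the ChEMBL pairs file
> mols = []
> with gzip.open(data_dir + 'chembl21_25K.pairs.txt.gz', 'rb') as inf:
>     for line in inf:
>         line = line.decode().strip().split()
>         smi1 = line[1]
>         smi2 = line[3]
>         m1 = Chem.MolFromSmiles(smi1)
>         m2 = Chem.MolFromSmiles(smi2)
>         mols.append(m1)
>         mols.append(m2)
>
> # Fragment queries: the first token on each line is the SMILES
> with open(data_dir + 'zinc.frags.500.q.smi', 'r') as fragf:
>     frags = [Chem.MolFromSmiles(x.split()[0]) for x in fragf]
>
> # 512-bit pattern fingerprints for the molecules and the queries
> mfps = [Chem.PatternFingerprint(m, 512) for m in mols]
> fragsfps = [Chem.PatternFingerprint(m, 512) for m in frags]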
>
> %%timeit -n1 -r1
> for i, fragfp in enumerate(fragsfps):
>     hits = 0
>     for j, mfp in enumerate(mfps):
>         if DataStructs.AllProbeBitsMatch(fragfp, mfp):
>             if mols[j].HasSubstructMatch(frags[i]):
>                 hits = hits + 1
>
>
> I want to focus on the last cell, namely on the "AllProbeBitsMatch" method:
>
> %%timeit
> DataStructs.AllProbeBitsMatch(fragsfps[10], mfps[10])
>
> Results:
>
> Windows 10 native i7-8850H:
>    567 ns ± 5.48 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
> Lubuntu 16.04 virtualized i7-8850H:
>    1.81 µs ± 56.7 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>    // the high variation is consistent
> Windows Server 2012 R2 virtualized Xeon E5-2620 v4:
>    1.18 µs ± 4.09 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
>
> So virtualization seems to cut the performance of this specific method by
> half or more, which is also what I see when running the full substructure
> search code: it takes double the time on the virtualized machines. (The
> Windows server actually runs on ESX, i.e. a type 1 hypervisor, while the
> Lubuntu VM runs on a type 2 hypervisor (VMware Workstation), but both seem
> to suffer the same.)
>
> We can try the same thing with
>
> %%timeit
> mols[10].HasSubstructMatch(frags[10])
>
> The difference here is smaller but VMs also take >50% more time.
>
> So there seems to be a consistent large performance impact in VMs.
>
> Of course the VM will be a bit slower but not by that much? What am I
> missing? Other experiences?
>
> Best Regards,
>
> Thomas
> ------------------------------
> *From:* Greg Landrum <[email protected]>
> *Sent:* Monday, 16 December 2019 17:10
> *To:* Thomas Strunz <[email protected]>
> *Cc:* [email protected] <
> [email protected]>
> *Subject:* Re: [Rdkit-discuss] Observations about RDKit performance:
> PatternFingerprinter, Windows, Linux and Virtual machines
>
> Hi Thomas,
>
> First it is important to compare equivalent major versions to each other.
> Particularly in this case. On my linux box generating the pattern
> fingerprints takes 24.2 seconds with v2019.03.x and 15.9 seconds with
> v2019.09.x (that's due to the improvements in the substructure matcher that
> the blog post you link to discusses).
>
> Comparing the same versions to each other:
>
> Performance on windows vs linux
> Windows performance with the RDKit has always lagged behind linux
> performance. There's something in the code (or in the way we use the
> compiler) that leads to big differences on some benchmarks. The most
> straightforward way I can demonstrate this is with results from my windows
> 10 laptop.
> Here's the output when running the fingerprint_screenout.py benchmark
> using the windows build:
> | 2019.09.1 | 13.6 | 0.3 | 38.1 | 0.8 | 25.5 | 25.9 | 84.1 |
> and here's the output from a linux build running on the Windows Subsystem
> for Linux:
> | 2019.09.2 | 10.7 | 0.2 | 19.3 | 0.4 | 19.4 | 19.2 | 53.2 |
> You can see the differences are not small.
> I haven't invested massive time into it, but I haven't been able to figure
> out what causes this.
>
> Performance on (linux) VMs
> I can't think of any particular reason why there should be huge
> differences and it's really difficult to compare apples to apples here.
> Since I have the numbers, here's one comparison
>
> Here's a run on my linux workstation:
> | 2019.09.2 | 7.6 | 0.3 | 15.9 | 0.4 | 21.4 | 20.4 | 55.7 |
> and here's the same thing on an AWS t3.xlarge instance:
> | 2019.09.2 | 9.6 | 0.2 | 20.3 | 0.4 | 38.4 | 38.2 | 94.7 |
> The VM is significantly slower, but t3.xlarge isn't an instance type that's
> intended to be used for compute-intensive jobs (I don't have one of those
> active and configured at the moment).
>
> Does that help at all?
> -greg
>
>
> On Mon, Dec 16, 2019 at 8:27 AM Thomas Strunz <[email protected]>
> wrote:
>
> Hi All,
>
> I was looking at a blog post from Greg:
>
>
> https://rdkit.blogspot.com/2019/07/a-couple-of-substructure-search-topics.html
>
> about fingerprint screenout. The part that confused me was the timings in
> his blog post, because the run times in my case were a lot slower.
>
> Greg's numbers:
>
> [07:21:19] INFO: mols from smiles
> [07:21:27] INFO: Results1:  7.77 seconds, 50000 mols
> [07:21:27] INFO: queries from smiles
> [07:21:27] INFO: Results2:  0.16 seconds
> *[07:21:27] INFO: generating pattern fingerprints for mols*
> *[07:21:43] INFO: Results3:  16.11 seconds*
> [07:21:43] INFO: generating pattern fingerprints for queries
> [07:21:43] INFO: Results4:  0.34 seconds
> [07:21:43] INFO: testing frags queries
> [07:22:03] INFO: Results5:  19.90 seconds. 6753 tested (0.0003 of total), 
> 3989 found,  0.59 accuracy. 0 errors.
> [07:22:03] INFO: testing leads queries
> [07:22:23] INFO: Results6:  19.77 seconds. 1586 tested (0.0001 of total), 
> 1067 found,  0.67 accuracy. 0 errors.
> [07:22:23] INFO: testing pieces queries
> [07:23:19] INFO: Results7:  55.37 seconds. 3333202 tested (0.0810 of total), 
> 1925628 found,  0.58 accuracy. 0 errors.
>
> | 2019.09.1dev1 | 7.8 | 0.2 | 16.1 | 0.3 | 19.9 | 19.8 | 55.4 |
>
>
>
>
> *Machine 1:*
> Virtual machine, Windows Server 2012 R2 with an intel xeon (4 virtual
> cores)
>
> Since the test is single-threaded it makes some sense that it isn't fast
> here, but it's not just a bit slower, it's a lot slower: depending on the
> test, almost 3x slower.
>
> [09:03:19] INFO: mols from smiles
> [09:03:38] INFO: Results1:  19.44 seconds, 50000 mols
> [09:03:38] INFO: queries from smiles
> [09:03:38] INFO: Results2:  0.36 seconds
>
> *[09:03:38] INFO: generating pattern fingerprints for mols *
> *[09:04:54] INFO: Results3:  75.99 seconds*
> [09:04:54] INFO: generating pattern fingerprints for queries
> [09:04:56] INFO: Results4:  1.55 seconds
> [09:04:56] INFO: testing frags queries
> [09:05:34] INFO: Results5:  37.59 seconds. 6753 tested (0.0003 of total),
> 3989 found,  0.59 accuracy. 0 errors.
> [09:05:34] INFO: testing leads queries
> [09:06:11] INFO: Results6:  37.34 seconds. 1586 tested (0.0001 of total),
> 1067 found,  0.67 accuracy. 0 errors.
> [09:06:11] INFO: testing pieces queries
> [09:08:39] INFO: Results7:  147.79 seconds. 3333202 tested (0.0810 of
> total), 1925628 found,  0.58 accuracy. 0 errors.
> | 2019.03.3 | 19.4 | 0.4 | 76.0 | 1.5 | 37.6 | 37.3 | 147.8 |
>
> I thought this might be another case of Windows being slow, so I tested on
> a Linux VM on my laptop:
>
> *Machine 2:*
> Virtual machine, Lubuntu 16.04 on a laptop i7-8850H 6-core
>
> [09:23:31] INFO: mols from smiles
> [09:23:54] INFO: Results1:  23.71 seconds, 50000 mols
> [09:23:54] INFO: queries from smiles
> [09:23:55] INFO: Results2:  0.48 seconds
>
> *[09:23:55] INFO: generating pattern fingerprints for mols *
> *[09:24:53] INFO: Results3:  58.31 seconds*
> [09:24:53] INFO: generating pattern fingerprints for queries
> [09:24:54] INFO: Results4:  1.19 seconds
> [09:24:54] INFO: testing frags queries
> [09:25:41] INFO: Results5:  46.22 seconds. 6753 tested (0.0003 of total),
> 3989 found,  0.59 accuracy. 0 errors.
> [09:25:41] INFO: testing leads queries
> [09:26:26] INFO: Results6:  45.84 seconds. 1586 tested (0.0001 of total),
> 1067 found,  0.67 accuracy. 0 errors.
> [09:26:26] INFO: testing pieces queries
> [09:28:33] INFO: Results7:  126.78 seconds. 3333202 tested (0.0810 of
> total), 1925628 found,  0.58 accuracy. 0 errors.
> | 2019.03.3 | 23.7 | 0.5 | 58.3 | 1.2 | 46.2 | 45.8 | 126.8 |
>
> Pretty weird: sometimes even slower, sometimes faster than the Windows VM,
> but still a lot slower than Greg's numbers (I repeated with RDKit 2019.09.2
> and got comparable results).
>
> So I also tested on above laptop directly:
>
> *Machine 3:*
> physical install, windows 10 on a laptop i7-8850H 6-core (same machine as
> 2)
>
> [09:51:43] INFO: mols from smiles
> [09:51:54] INFO: Results1:  10.59 seconds, 50000 mols
> [09:51:54] INFO: queries from smiles
> [09:51:54] INFO: Results2:  0.20 seconds
>
> *[09:51:54] INFO: generating pattern fingerprints for mols *
> *[09:52:24] INFO: Results3:  29.50 seconds*
> [09:52:24] INFO: generating pattern fingerprints for queries
> [09:52:24] INFO: Results4:  0.61 seconds
> [09:52:24] INFO: testing frags queries
> [09:52:44] INFO: Results5:  19.71 seconds. 6753 tested (0.0003 of total),
> 3989 found,  0.59 accuracy. 0 errors.
> [09:52:44] INFO: testing leads queries
> [09:53:04] INFO: Results6:  19.48 seconds. 1586 tested (0.0001 of total),
> 1067 found,  0.67 accuracy. 0 errors.
> [09:53:04] INFO: testing pieces queries
> [09:54:05] INFO: Results7:  61.94 seconds. 3333202 tested (0.0810 of
> total), 1925628 found,  0.58 accuracy. 0 errors.
> | 2019.09.1 | 10.6 | 0.2 | 29.5 | 0.6 | 19.7 | 19.5 | 61.9 |
>
> This is much closer to Greg's results, except for the fingerprinting, which
> takes almost double the time. Also notice how, relative to the other
> results, fingerprinting on the Linux VM is much faster than on the Windows
> VM?
>
> *Conclusions:*
>
>    1. From what I see, the pattern fingerprinter seems to run a lot
>    slower on Windows. Is this a known issue?
>    2. In virtual machines the RDKit's performance simply tanks. A certain
>    penalty is to be expected, but not this much. Or what am I missing?
>    Machine 1 runs on central infrastructure, so I would assume
>    virtualization is configured correctly. For the local VM, VT-x is
>    enabled. Yet it is much slower than the physical machine (even though,
>    AFAIK, the RDKit runs faster on Linux than on Windows).
>
> The virtual machine aspect is especially troubling because I would
> assume many real-world applications are deployed as VMs and hence might
> suffer from this too.
> I don't have a well-defined question; I'm more interested in other users'
> experience, especially regarding virtualization.
>
> Best Regards,
>
> Thomas
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>

