Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-10 Thread Dimitri Maziuk
On 2017-06-10 07:42, Chris Swain wrote: This sounds like the situation where a database might be a better option, tuned to store fingerprints in RAM? The issue is how much programming time it will take, how much that time is worth, and how many times the solution will be reused. A clever

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-10 Thread Chris Swain
> Message: 1 > Date: Fri, 9 Jun 2017 16:28:09 +0200 > From: Alexis Parenty <alexis.parenty.h...@gmail.com> > To: Greg Landrum <greg.land...@gmail.com> > Cc: RDKit Discuss <rdkit-discuss@lists.so

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Dimitri Maziuk
On 2017-06-09 08:12, Alexis Parenty wrote: Dear Greg and Brian, Many thanks for your response. I was also thinking of your streaming approach! I think the RAM of most machines would cope with lists of 100K mol, so we could put the threshold higher than 1000. Actually, I was thinking of monitoring

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Alexis Parenty
Yes Greg, this is what I am doing. You're right, I did not think of the possibility of building a list of mol from the shorter list and processing each of its mol against the mol of the longer list (which I would make on the fly from the SMILES). However, I wanted to store the longest list of structures

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Greg Landrum
Hi Alexis, If I understand your use case correctly, you really don't need this level of complication. If you are comparing Q molecules to M molecules and M>>Q (in the discussion so far Q = 1000, M = 500K) and you only need to compare each of the Qs to each of the Ms a single time, you can

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Brian Kelley
What exactly are you doing? Is this 1000 x 500K substructure queries or something different? Brian Kelley > On Jun 9, 2017, at 9:12 AM, Alexis Parenty wrote: > Dear Greg and Brian, > Many thanks for your response. I was also thinking of your streaming

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Alexis Parenty
Dear Greg and Brian, Many thanks for your response. I was also thinking of your streaming approach! I think the RAM of most machines would cope with lists of 100K mol, so we could put the threshold higher than 1000. Actually, I was thinking of monitoring the available RAM and only start processing the
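
For what it's worth, the "threshold" idea above can be sketched without any RAM monitoring by reading the large file in fixed-size batches. This is only a rough illustration, not code from the thread: `batched`, `process_in_batches`, and `handle_batch` are hypothetical names, and the batch size is whatever your machine tolerates.

```python
from itertools import islice


def batched(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def process_in_batches(smiles_iter, batch_size, handle_batch):
    """Feed `handle_batch` one batch at a time; only one batch is held in memory.

    `handle_batch` is a hypothetical callback; with RDKit it might convert
    each SMILES with Chem.MolFromSmiles and run the comparisons.
    Returns the total number of items seen.
    """
    total = 0
    for chunk in batched(smiles_iter, batch_size):
        handle_batch(chunk)
        total += len(chunk)
    return total
```

Since `smiles_iter` can be an open file object, the full set is never materialised as a list; each batch becomes garbage once `handle_batch` returns.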

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Brian Kelley
While not multithreaded (yet), this is the use case for the filter catalog: http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1 Look for the SmartsMatcher class in the blog post. It would be a good idea to make this multithreaded as well; I'll add this as a possible

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Greg Landrum
Hi Alexis, I would approach this by loading the 1000 queries into a list of molecules and then "stream" the others past that (so that you never attempt to load the full 500K set at once). Here's a quick sketch of one way to do this: In [4]: queries = [x for x in
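
The snippet above is cut off in the archive, so here is a minimal sketch of the same streaming idea rather than Greg's actual code. The function name `stream_matches` and its `parse` and `matches` parameters are assumptions made for illustration. With RDKit installed you could pass `Chem.MolFromSmiles` as `parse` (it returns `None` for SMILES it cannot parse) and `lambda mol, q: mol.HasSubstructMatch(q)` as `matches`.

```python
def stream_matches(queries, smiles_lines, parse, matches):
    """Stream molecules past a small in-memory list of queries.

    queries      -- the short list (e.g. the 1000 query molecules), kept in memory
    smiles_lines -- any iterable of SMILES strings, e.g. an open file
    parse        -- hypothetical hook turning a SMILES string into a molecule,
                    returning None on failure
    matches      -- hypothetical hook: matches(mol, query) -> bool

    Yields (smiles, [indices of matching queries]) for molecules with a hit.
    """
    for line in smiles_lines:
        smi = line.strip()
        if not smi:
            continue  # skip blank lines
        mol = parse(smi)
        if mol is None:
            continue  # skip entries that failed to parse
        hits = [i for i, q in enumerate(queries) if matches(mol, q)]
        if hits:
            yield smi, hits
        # `mol` is rebound on the next iteration, so at most one molecule
        # from the large set is alive at any time
```

Because this is a generator, memory use is governed by the size of `queries` and not by the 500K set; results can be written to disk as they are produced.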