Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Dimitri Maziuk
On 2017-06-09 08:12, Alexis Parenty wrote: Dear Greg and Brian, Many thanks for your response. I was also thinking of your streaming approach! I think the RAM of most machines would handle lists of 100K mols, so we could set the threshold higher than 1000. Actually, I was thinking to monitor

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Alexis Parenty
Yes Greg, this is what I am doing. You're right, I had not thought of building a list of mols from the shorter list and processing each of them against the mols of the longer list (which I would build on the fly from the SMILES). However, I wanted to store the longest list of structures

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Greg Landrum
Hi Alexis, If I understand your use case correctly, you really don't need this level of complication. If you are comparing Q molecules to M molecules and M>>Q (in the discussion so far Q = 1000, M = 500K) and you only need to compare each of the Qs to each of the Ms a single time, you can

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Brian Kelley
What exactly are you doing? Is this 1000x500k substructure queries or something different?

Brian Kelley

> On Jun 9, 2017, at 9:12 AM, Alexis Parenty wrote:
>
> Dear Greg and Brian,
> Many thanks for your response. I was also thinking of your streaming

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Alexis Parenty
Dear Greg and Brian, Many thanks for your response. I was also thinking of your streaming approach! I think the RAM of most machines would handle lists of 100K mols, so we could set the threshold higher than 1000. Actually, I was thinking of monitoring the available RAM and only start processing the

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Brian Kelley
While not multithreaded (yet), this is the use case for the filter catalog: http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1 Look for the SmartsMatcher class in the blog post. It would be a good idea to make this multithreaded as well; I'll add this as a possible

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Greg Landrum
Hi Alexis, I would approach this by loading the 1000 queries into a list of molecules and then "stream" the others past that (so that you never attempt to load the full 500K set at once). Here's a quick sketch of one way to do this:

In [4]: queries = [x for x in
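
Since the snippet above is cut off in the archive, here is a minimal, library-agnostic sketch of the streaming idea described in the message. It is not Greg's original code: the function name `stream_matches` and the `parse` and `matches` callables are hypothetical placeholders introduced only for illustration. The point it shows is that the short query list is held in memory while the long list is read one record at a time and discarded after comparison.

```python
from typing import Callable, Iterable, Iterator, List, Optional, Tuple, TypeVar

Q = TypeVar("Q")
M = TypeVar("M")


def stream_matches(
    queries: List[Q],
    lines: Iterable[str],
    parse: Callable[[str], Optional[M]],
    matches: Callable[[M, Q], bool],
) -> Iterator[Tuple[int, int]]:
    """Yield (line_index, query_index) for every target/query hit.

    Only `queries` stays in memory. Each record from `lines` is parsed,
    compared against every query, and then dropped, so memory use does
    not grow with the size of the long list.
    """
    for line_index, line in enumerate(lines):
        text = line.strip()
        if not text:
            continue  # skip blank lines
        target = parse(text)
        if target is None:
            continue  # skip records that fail to parse
        for query_index, query in enumerate(queries):
            if matches(target, query):
                yield line_index, query_index
```

With RDKit, one plausible wiring (an assumption, not taken from the thread) would be to build `queries` with `Chem.MolFromSmarts`, pass `Chem.MolFromSmiles` as `parse` (it returns `None` for SMILES it cannot parse), pass `lambda mol, q: mol.HasSubstructMatch(q)` as `matches`, and pass an open file handle with one SMILES per line as `lines`.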