Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

Brian Kelley Fri, 09 Jun 2017 06:41:44 -0700

What exactly are you doing?

Is this 1000x500k substructure queries or something different?


----
Brian Kelley

> On Jun 9, 2017, at 9:12 AM, Alexis Parenty <alexis.parenty.h...@gmail.com> 
> wrote:
> 
> Dear Greg and Brian, 
> Many thanks for your response. I was also thinking of your streaming 
> approach! I think the RAM of most machines could handle lists of 100K mols, 
> so we could set the threshold higher than 1000. Actually, I was thinking of 
> monitoring the available RAM and only processing the matrix and clearing 
> the list when less than 20% of RAM is left. This way, the best machines could 
> skip the clearing step and save time. What do you think?
> 
> 
> Best,
> 
> Alexis
> 
> 
> 
> 
> 
>> On 9 June 2017 at 14:40, Brian Kelley <fustiga...@gmail.com> wrote:
>> While not multithreaded (yet) this is the use case of the filter catalog:
>> 
>> http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1
>> 
>> Look for the SmartsMatcher class in the blog.
>> 
>> It is a good idea to make this multithreaded as well, I'll add this as a 
>> possible enhancement.
>> 
>> ----
>> Brian Kelley
>> 
>>> On Jun 9, 2017, at 7:04 AM, Greg Landrum <greg.land...@gmail.com> wrote:
>>> 
>>> Hi Alexis,
>>> 
>>> I would approach this by loading the 1000 queries into a list of molecules 
>>> and then "stream" the others past that (so that you never attempt to load 
>>> the full 500K set at once).
>>> 
>>> Here's a quick sketch of one way to do this:
>>> In [4]: queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
>>> 
>>> In [5]: matches = []
>>> 
>>> In [6]: for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
>>>    ...:     if m is None:
>>>    ...:         continue
>>>    ...:     matches.append([m.HasSubstructMatch(q) for q in queries])
>>>    ...:     
>>> 
>>> 
>>> Brian has some thoughts on making this particular use case easier/faster 
>>> (in particular by adding multi-threading support), so maybe there will be 
>>> something in the next release there.
>>> 
>>> I hope this helps,
>>> -greg
>>> 
>>> 
>>>> On Sun, Jun 4, 2017 at 10:25 PM, Alexis Parenty 
>>>> <alexis.parenty.h...@gmail.com> wrote:
>>>> Dear RDKit community,
>>>> 
>>>> I need to screen for substructure relationships between two sets of 
>>>> structures (1 000 x 500 000). I thought I should build two lists of mol 
>>>> objects from SMILES, but I keep getting a memory error when the second list 
>>>> reaches 300 000 mols. All my RAM (12 GB) gets consumed along with all my 
>>>> virtual memory.
>>>> 
>>>> Do I really have to compromise on speed and make mol objects on the fly 
>>>> from two lists of SMILES? Is there another memory-efficient way to store 
>>>> mol objects?
>>>> 
>>>> Best,
>>>> 
>>>> Alexis
>>>> 
>>>> 
>>>> ------------------------------------------------------------------------------
>>>> Check out the vibrant tech community on one of the world's most
>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>> _______________________________________________
>>>> Rdkit-discuss mailing list
>>>> Rdkit-discuss@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>> 
>>> 
>
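
The batching idea discussed in the quoted messages (collect a limited number of target molecules, compute their rows of the match matrix, then clear the list) could be sketched roughly as below. This is only a minimal sketch: the names `batched`, `screen_in_batches`, `is_match` and `batch_size` are hypothetical and not part of RDKit, and the matching function is passed in as a plain callable so the example runs without any chemistry toolkit.

```python
from itertools import islice
from typing import Callable, Iterable, Iterator, List, Sequence, Tuple, TypeVar

Q = TypeVar("Q")  # query type (an RDKit mol in the real use case)
T = TypeVar("T")  # target type


def batched(items: Iterable[T], size: int) -> Iterator[List[T]]:
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(items)
    while True:
        batch = list(islice(it, size))
        if not batch:
            return
        yield batch


def screen_in_batches(
    queries: Sequence[Q],
    targets: Iterable[T],
    is_match: Callable[[T, Q], bool],
    batch_size: int = 1000,
) -> Iterator[Tuple[int, List[List[bool]]]]:
    """Yield (index of first target in the batch, match rows for the batch).

    Only one batch of targets is held in memory at a time; each batch is
    dropped as soon as the caller has consumed its rows.
    """
    start = 0
    for batch in batched(targets, batch_size):
        rows = [[is_match(t, q) for q in queries] for t in batch]
        yield start, rows
        start += len(batch)


# Stand-in data: plain strings and substring matching (hypothetical example).
queries = ["ab", "c"]
targets = iter(["abc", "xyz", "cab"])  # a one-pass stream, like a file reader
results = list(screen_in_batches(queries, targets, lambda t, q: q in t, batch_size=2))
```

With RDKit, `targets` could be a generator that yields `Chem.MolFromSmiles(line)` for each line of a SMILES file (skipping `None` results), and `is_match` could be `lambda m, q: m.HasSubstructMatch(q)`. Replacing the fixed `batch_size` with a check on free memory (for instance via the third-party `psutil` package) would be one way to try the 20% RAM threshold idea; that variant is an assumption and is not shown here.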

