What exactly are you doing?
Is this 1000x500k substructure queries or something different?
----
Brian Kelley
> On Jun 9, 2017, at 9:12 AM, Alexis Parenty <alexis.parenty.h...@gmail.com>
> wrote:
>
> Dear Greg and Brian,
> Many thanks for your response. I was also thinking of your streaming
> approach! I think the RAM of most machines could handle lists of 100K mols,
> so we could set the threshold higher than 1000. Actually, I was thinking of
> monitoring the available RAM and only processing the matrix and clearing
> the list when less than 20% of RAM is left. This way, the best machines could
> skip the clearing step and save time. What do you think?
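A minimal sketch of the batching idea described above, assuming a fixed batch size rather than live RAM monitoring (checking free memory would need something like the third-party psutil package, which is not shown here). The names `batches`, `screen_in_batches` and `is_match` are hypothetical helpers for illustration, not RDKit functions:

```python
from itertools import islice


def batches(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        chunk = list(islice(it, size))
        if not chunk:
            return
        yield chunk


def screen_in_batches(targets, queries, is_match, batch_size=100_000):
    """Yield one row of booleans per target.

    At most `batch_size` targets are held in memory at a time; each batch
    is dropped as soon as the next one is requested.
    """
    for chunk in batches(targets, batch_size):
        for target in chunk:
            yield [is_match(target, q) for q in queries]


# Toy demonstration with strings standing in for molecules.
rows = list(screen_in_batches(["abc", "bcd", "xyz"], ["bc", "x"],
                              lambda t, q: q in t, batch_size=2))
```

With RDKit molecules, `is_match` could be something like `lambda m, q: m.HasSubstructMatch(q)`. The threshold only bounds how many targets are alive at once, so a plain one-at-a-time stream remains the lower-memory option.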
>
>
> Best,
>
> Alexis
>
>> On 9 June 2017 at 14:40, Brian Kelley <fustiga...@gmail.com> wrote:
>> While not multithreaded (yet), this is the use case for the filter catalog:
>>
>> http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1
>>
>> Look for the SmartsMatcher class in the blog.
>>
>> It is a good idea to make this multithreaded as well, I'll add this as a
>> possible enhancement.
>>
>> ----
>> Brian Kelley
>>
>>> On Jun 9, 2017, at 7:04 AM, Greg Landrum <greg.land...@gmail.com> wrote:
>>>
>>> Hi Alexis,
>>>
>>> I would approach this by loading the 1000 queries into a list of molecules
>>> and then "stream" the others past that (so that you never attempt to load
>>> the full 500K set at once).
>>>
>>> Here's a quick sketch of one way to do this:
>>> In [3]: from rdkit import Chem
>>>
>>> In [4]: queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf')
>>>    ...:            if x is not None]
>>>
>>> In [5]: matches = []
>>>
>>> In [6]: for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
>>>    ...:     if m is None:
>>>    ...:         continue
>>>    ...:     matches.append([m.HasSubstructMatch(q) for q in queries])
>>>    ...:
>>>
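A note on the `matches` list in the sketch above: with 1000 queries and 500K molecules it holds 5 x 10^8 list slots, which at 8 bytes per reference on 64-bit CPython is roughly 4 GB before any per-list overhead. One possible way to shrink it, assuming it is acceptable to store each row as a single integer bitmask, is sketched below. The helper names `pack_row` and `unpack_row` are hypothetical, not part of RDKit:

```python
def pack_row(flags):
    """Pack an iterable of booleans into one int; bit i is set when item i is true."""
    bits = 0
    for i, flag in enumerate(flags):
        if flag:
            bits |= 1 << i
    return bits


def unpack_row(bits, n):
    """Recover the first n booleans from a packed row."""
    return [bool((bits >> i) & 1) for i in range(n)]


# Toy row: queries 0 and 2 match, query 1 does not.
packed = pack_row([True, False, True])
restored = unpack_row(packed, 3)
```

In the loop above this would become something like `matches.append(pack_row(m.HasSubstructMatch(q) for q in queries))`. A 1000-bit Python int takes on the order of 150 bytes, versus about 8 KB for a 1000-element list.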
>>>
>>> Brian has some thoughts on making this particular use case easier/faster
>>> (in particular by adding multi-threading support), so maybe there will be
>>> something in the next release there.
>>>
>>> I hope this helps,
>>> -greg
>>>
>>>
>>>> On Sun, Jun 4, 2017 at 10:25 PM, Alexis Parenty
>>>> <alexis.parenty.h...@gmail.com> wrote:
>>>> Dear RDKit community,
>>>>
>>>> I need to screen for substructure relationships between two sets of
>>>> structures (1 000 X 500 000): I thought I should build two lists of mol
>>>> objects from SMILES, but I keep getting a memory error when the second list
>>>> reaches 300 000 mols. All my RAM (12G) gets consumed along with all my
>>>> virtual memory.
>>>>
>>>> Do I really have to compromise on speed and make mol objects on the fly
>>>> from two lists of SMILES? Is there another memory-efficient way to store
>>>> mol objects?
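On the storage question, one option is to keep each molecule in RDKit's binary form (`Mol.ToBinary()`) and rebuild the `Mol` only while it is being matched. The binary strings are generally much smaller than live Mol objects, and rebuilding from them avoids re-parsing SMILES. A minimal sketch, assuming a small illustrative SMILES list in place of the real 500K set:

```python
from rdkit import Chem

# Illustrative stand-in for the large SMILES set.
smiles_list = ["c1ccccc1O", "CCO", "c1ccncc1"]

# Keep only the compact binary form of each molecule that parses.
binaries = []
for smi in smiles_list:
    mol = Chem.MolFromSmiles(smi)
    if mol is not None:
        binaries.append(mol.ToBinary())

query = Chem.MolFromSmiles("c1ccccc1")  # benzene ring as the query

# Rebuild each Mol only for the duration of its match.
hits = [Chem.Mol(b).HasSubstructMatch(query) for b in binaries]
```

Here only phenol contains the benzene ring, so `hits` is `[True, False, False]`. The actual memory saving depends on the molecules and the RDKit version, so it is worth measuring on a sample first.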
>>>>
>>>> Best,
>>>>
>>>> Alexis
>>>>
>>>>
>>>> ------------------------------------------------------------------------------
>>>> Check out the vibrant tech community on one of the world's most
>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>> _______________________________________________
>>>> Rdkit-discuss mailing list
>>>> Rdkit-discuss@lists.sourceforge.net
>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>
>>>
>