Yes Greg, this is what I am doing. You’re right, I had not thought of
building a list of mols from the shorter list and processing each of them
against the mols of the longer list (which I would make on the fly from
the SMILES). However, I wanted to store the longer list of structures so
that I could access it again later for new substructure searches with a
single structure at a time… It seemed silly to have to rebuild the mol
objects from a list of 500K SMILES every time I need to run a new
substructure search. But your approach is going to help me a lot for the
batch mode search I wanted to do.
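
One possible way to avoid re-parsing the 500K SMILES for every new search
is to parse them once, keep only a compact serialized form of each
structure, and rebuild one molecule at a time during a search. The sketch
below only illustrates that idea and is not something proposed in this
thread: build_cache, search_cache and the two rdkit_* wrappers are
hypothetical names. It assumes that the bytes returned by RDKit's
Mol.ToBinary() are noticeably smaller than the live Mol objects and quicker
to rebuild than parsing SMILES, which is worth measuring on your own data.

```python
from typing import Any, Callable, Iterable, Iterator, List, Optional, Tuple

# Hypothetical helper names; only the calls inside the rdkit_* wrappers are RDKit API.
Cache = List[Tuple[str, bytes]]


def build_cache(smiles_list: Iterable[str],
                parse: Callable[[str], Optional[Any]],
                serialize: Callable[[Any], bytes]) -> Cache:
    """Parse each SMILES once and keep only (SMILES, serialized molecule)."""
    cache: Cache = []
    for smi in smiles_list:
        mol = parse(smi)
        if mol is None:  # skip entries that could not be parsed
            continue
        cache.append((smi, serialize(mol)))
    return cache


def search_cache(cache: Cache,
                 query: Any,
                 deserialize: Callable[[bytes], Any],
                 has_match: Callable[[Any, Any], bool]) -> Iterator[str]:
    """Yield the SMILES of cached structures that contain the query.

    Only one molecule is rebuilt and held in memory at a time.
    """
    for smi, blob in cache:
        mol = deserialize(blob)
        if has_match(mol, query):
            yield smi


def rdkit_build_cache(smiles_list: Iterable[str]) -> Cache:
    """RDKit wiring: store each molecule as its binary string."""
    from rdkit import Chem  # imported lazily so the helpers above stay generic
    return build_cache(smiles_list, Chem.MolFromSmiles, lambda m: m.ToBinary())


def rdkit_search(cache: Cache, query_smiles: str) -> List[str]:
    """RDKit wiring: run one substructure query against the cached set."""
    from rdkit import Chem
    query = Chem.MolFromSmiles(query_smiles)
    if query is None:
        raise ValueError("could not parse query: %r" % query_smiles)
    return list(search_cache(cache, query, Chem.Mol,
                             lambda m, q: m.HasSubstructMatch(q)))
```

With RDKit installed, the cache would be built once with
rdkit_build_cache(smiles_list) and then reused for each new query through
rdkit_search(cache, query_smiles). The cache is a plain list of
(SMILES, bytes) tuples, so it could also be written to disk with pickle and
reloaded in a later session instead of being rebuilt.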

Best,

Alexis

On 9 June 2017 at 15:42, Greg Landrum <greg.land...@gmail.com> wrote:

> Hi Alexis,
>
> If I understand your use case correctly, you really don't need this level
> of complication.
>
> If you are comparing Q molecules to M molecules and M>>Q (in the
> discussion so far Q = 1000, M = 500000) and you only need to compare each
> of the Qs to each of the Ms a single time, you can safely construct all the
> Q molecules and store them in memory and then loop over the Ms individually
> and compare them to each of the Qs (this is what I did in my little
> sample). This will have more or less exactly the same performance as
> reading all of the Ms at once and then processing them.
>
> so, on a machine with infinite memory these two snippets will take more or
> less the same amount of time to execute:
>
> low memory usage:
>
> from rdkit import Chem
>
> queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
> matches = []
> for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
>     if m is None:
>         continue
>     matches.append([m.HasSubstructMatch(q) for q in queries])
>
>
>
> high memory usage:
>
> from rdkit import Chem
>
> queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
> mols = [x for x in Chem.ForwardSDMolSupplier('./znp.50k.sdf') if x is not None]
> matches = []
> for m in mols:
>     if m is None:
>         continue
>     matches.append([m.HasSubstructMatch(q) for q in queries])
>
>
>
> The second form consumes a lot more memory without delivering any
> improvement in performance.
>
> Best,
> -greg
>
>
> On Fri, Jun 9, 2017 at 3:33 PM, Alexis Parenty <
> alexis.parenty.h...@gmail.com> wrote:
>
>> Hi again, FYI here is the memory monitoring in attachment. Thanks,
>>
>> Alexis
>>
>> On 9 June 2017 at 15:12, Alexis Parenty <alexis.parenty.h...@gmail.com>
>> wrote:
>>
>>> Dear Greg and Brian,
>>> Many thanks for your response. I was also thinking of your streaming
>>> approach! I think the RAM of most machines could handle lists of 100K mols,
>>> so we could put the threshold higher than 1000. Actually, I was thinking of
>>> monitoring the available RAM and only starting to process the matrix and
>>> clear the list when less than 20% of RAM is left. This way, the best machines
>>> could skip the clearing process and gain time. What do you think?
>>>
>>>
>>> Best,
>>>
>>> Alexis
>>>
>>>
>>>
>>>
>>>
>>> On 9 June 2017 at 14:40, Brian Kelley <fustiga...@gmail.com> wrote:
>>>
>>>> While not multithreaded (yet) this is the use case of the filter
>>>> catalog:
>>>>
>>>> http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1
>>>>
>>>> Look for the SmartsMatcher class in the blog.
>>>>
>>>> It is a good idea to make this multithreaded as well, I'll add this as
>>>> a possible enhancement.
>>>>
>>>> ----
>>>> Brian Kelley
>>>>
>>>> On Jun 9, 2017, at 7:04 AM, Greg Landrum <greg.land...@gmail.com>
>>>> wrote:
>>>>
>>>> Hi Alexis,
>>>>
>>>> I would approach this by loading the 1000 queries into a list of
>>>> molecules and then "stream" the others past that (so that you never attempt
>>>> to load the full 500K set at once).
>>>>
>>>> Here's a quick sketch of one way to do this:
>>>>
>>>> In [4]: queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
>>>>
>>>> In [5]: matches = []
>>>>
>>>> In [6]: for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
>>>>    ...:     if m is None:
>>>>    ...:         continue
>>>>    ...:     matches.append([m.HasSubstructMatch(q) for q in queries])
>>>>    ...:
>>>>
>>>>
>>>>
>>>> Brian has some thoughts on making this particular use case
>>>> easier/faster (in particular by adding multi-threading support), so maybe
>>>> there will be something in the next release there.
>>>>
>>>> I hope this helps,
>>>> -greg
>>>>
>>>>
>>>> On Sun, Jun 4, 2017 at 10:25 PM, Alexis Parenty <
>>>> alexis.parenty.h...@gmail.com> wrote:
>>>>
>>>>> Dear RDKit community,
>>>>>
>>>>> I need to screen for substructure relationships between two sets of
>>>>> structures (1 000 x 500 000). I thought I should build two lists of mol
>>>>> objects from SMILES, but I keep getting a memory error when the second
>>>>> list reaches 300 000 mols. All my RAM (12 GB) gets consumed along with all
>>>>> my virtual memory.
>>>>>
>>>>> Do I really have to compromise on speed and make the mol objects on the
>>>>> fly from two lists of SMILES? Is there another, more memory-efficient way
>>>>> to store mol objects?
>>>>>
>>>>> Best,
>>>>>
>>>>> Alexis
>>>>>
>>>>> ------------------------------------------------------------
>>>>> ------------------
>>>>> Check out the vibrant tech community on one of the world's most
>>>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>>>> _______________________________________________
>>>>> Rdkit-discuss mailing list
>>>>> Rdkit-discuss@lists.sourceforge.net
>>>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>>>
>>>>>
>>>
>>
>
