Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-10 Thread Dimitri Maziuk

On 2017-06-10 07:42, Chris Swain wrote:
> This sounds like the situation where a database might be a better
> option, tuned to store fingerprints in RAM?


The issue is how much programming time it will take, how much that time
is worth, and how many times the solution will be reused. A clever
coding solution could be preferable for other reasons, for example as a
programming exercise. If it's a one-off and you just need it done so you
can move on, throwing more hardware at it is often the most cost-effective
solution.


Dima



--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-10 Thread Chris Swain
This sounds like the situation where a database might be a better option, tuned 
to store fingerprints in RAM?

Chris
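
As a rough illustration of the database idea, here is a minimal sketch using an in-memory SQLite table from the Python standard library. The names `toy_fingerprint`, `build_db` and `screen` are made up for this example, and the "fingerprint" is a deliberately crude stand-in (one bit per distinct SMILES character), not an RDKit fingerprint. The screen only narrows down candidates; each hit would still need a real substructure check.

```python
import sqlite3


def toy_fingerprint(smiles, n_bits=32):
    """Hypothetical stand-in fingerprint: one bit per distinct character."""
    fp = 0
    for ch in set(smiles):
        fp |= 1 << (ord(ch) % n_bits)
    return fp


def build_db(smiles_list):
    """Create an in-memory SQLite table holding (id, smiles, fp)."""
    con = sqlite3.connect(":memory:")
    con.execute(
        "CREATE TABLE mols (id INTEGER PRIMARY KEY, smiles TEXT, fp INTEGER)"
    )
    con.executemany(
        "INSERT INTO mols (smiles, fp) VALUES (?, ?)",
        ((s, toy_fingerprint(s)) for s in smiles_list),
    )
    con.commit()
    return con


def screen(con, query_smiles):
    """Return stored SMILES whose fingerprint contains every query bit."""
    qfp = toy_fingerprint(query_smiles)
    rows = con.execute(
        "SELECT smiles FROM mols WHERE (fp & ?) = ? ORDER BY id", (qfp, qfp)
    )
    return [row[0] for row in rows]
```

The bitwise test `(fp & q) = q` is the usual subset screen: a molecule can only contain the query if it has all of the query's bits set.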


Dr Chris Swain BA MA (Cantab) PhD CChem FRSC
Macs in Chemistry
sw...@mac.com
http://www.macinchem.org



> On 10 Jun 2017, at 13:10, rdkit-discuss-requ...@lists.sourceforge.net wrote:
> 
> Message: 1
> Date: Fri, 9 Jun 2017 16:28:09 +0200
> From: Alexis Parenty <alexis.parenty.h...@gmail.com>
> To: Greg Landrum <greg.land...@gmail.com>
> Cc: RDKit Discuss <rdkit-discuss@lists.sourceforge.net>
> Subject: Re: [Rdkit-discuss] Memory issue when storing more than 300K
>   mol in a list
> Message-ID:
>   <cal3fkckr2zqtcjdc8qf_i4jhlhm+jectrif-gzu6ndg4aka...@mail.gmail.com>
> Content-Type: text/plain; charset="utf-8"
> 
> Yes Greg, this is what I am doing. You're right, I did not think of the
> possibility to build a list of mol from the shorter list and process each
> of its mol with the mol of the longer list (which I would make on the
> flight from the smiles). However, I wanted to store the longest list of
> structures in order to access it again later for new substructure search
> from single structure at a time… It seemed silly to have to rebuild mol
> object from a 500K list of smiles every time I need to do a new
> substructure search. But your approach is going to help me a lot for the
> batch mode search I wanted to do.
> 
> Best,
> 
> Alexis
> 



Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Dimitri Maziuk

On 2017-06-09 08:12, Alexis Parenty wrote:

> Dear Greg and Brian,
> Many thanks for your response. I was also thinking of your streaming
> approach! I think the RAM of most machine would deal with lists of 100K
> mol so we could put the threshold higher than 1000. Actually, I was
> thinking to monitor the available RAM and only start processing the
> matrix and clearing the list when less than 20% of RAM is left. This
> way, the best machines could skip the clearing process and gain time.
> What do you think?


Take $100, buy a 200GB SSD, set it up as the swap space, don't worry 
about the RAM.


Dima





Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Alexis Parenty
Yes Greg, this is what I am doing. You’re right, I did not think of the
possibility of building a list of mols from the shorter list and processing
each of them against the mols of the longer list (which I would build on the
fly from the SMILES). However, I wanted to store the longer list of
structures in order to access it again later for new substructure searches,
one query structure at a time… It seemed silly to have to rebuild the mol
objects from a 500K list of SMILES every time I need to do a new
substructure search. But your approach is going to help me a lot for the
batch mode search I wanted to do.

Best,

Alexis
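
One possible middle ground between holding 500K mol objects in RAM and re-reading a file for every query is sketched below: keep only a compact serialized form in memory and rebuild each object just while it is being tested. `LazyLibrary`, `parse` and `is_match` are hypothetical names introduced for this example; the parser and matcher are passed in, so the sketch runs without RDKit.

```python
class LazyLibrary:
    """Hold compact records (e.g. SMILES strings) and parse them on demand."""

    def __init__(self, records, parse):
        self._records = list(records)  # small: text, not full objects
        self._parse = parse            # callable: record -> object or None

    def search(self, query, is_match):
        """Return indices of records whose parsed object matches the query."""
        hits = []
        for idx, record in enumerate(self._records):
            obj = self._parse(record)  # lives only for this iteration
            if obj is None:            # skip records that fail to parse
                continue
            if is_match(obj, query):
                hits.append(idx)
        return hits
```

With RDKit, `parse` could plausibly be `Chem.MolFromSmiles` and `is_match` something like `lambda mol, q: mol.HasSubstructMatch(q)`; that wiring is an assumption, not something taken from the thread. RDKit molecules can also be serialized with `ToBinary()` and rebuilt with `Chem.Mol(...)`, which may be quicker than re-parsing SMILES. Either way this trades CPU time on each search for a much smaller resident set.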

On 9 June 2017 at 15:42, Greg Landrum  wrote:

> Hi Alexis,
>
> If I understand your use case correctly, you really don't need this level
> of complication.
>
> If you are comparing Q molecules to M molecules and M>>Q (in the
> discussion so far Q = 1000, M = 500K) and you only need to compare each
> of the Qs to each of the Ms a single time, you can safely construct all the
> Q molecules and store them in memory and then loop over the Ms individually
> and compare them to each of the Qs (this is what I did in my little
> sample). This will have more or less exactly the same performance as
> reading all of the Ms at once and then processing them.
>
> So, on a machine with infinite memory these two snippets will take more or
> less the same amount of time to execute:
>
> low memory usage:
>
> queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf')
>            if x is not None]
> matches = []
> for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
>     if m is None:
>         continue
>     matches.append([m.HasSubstructMatch(q) for q in queries])
>
>
>
> high memory usage:
>
> queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf')
>            if x is not None]
> mols = [x for x in Chem.ForwardSDMolSupplier('./znp.50k.sdf')
>         if x is not None]
> matches = []
> for m in mols:
>     if m is None:
>         continue
>     matches.append([m.HasSubstructMatch(q) for q in queries])
>
>
>
> The second form consumes a lot more memory without delivering any
> improvement in performance.
>
> Best,
> -greg
>
>
> On Fri, Jun 9, 2017 at 3:33 PM, Alexis Parenty <
> alexis.parenty.h...@gmail.com> wrote:
>
>> Hi again, FYI here is the memory monitoring in attachment. Thanks,
>>
>> Alexis
>>
>> On 9 June 2017 at 15:12, Alexis Parenty 
>> wrote:
>>
>>> Dear Greg and Brian,
>>> Many thanks for your response. I was also thinking of your streaming
>>> approach! I think the RAM of most machine would deal with lists of 100K mol
>>> so we could put the threshold higher than 1000. Actually, I was thinking to
>>> monitor the available RAM and only start processing the matrix and clearing
>>> the list when less than 20% of RAM is left. This way, the best machines
>>> could skip the clearing process and gain time. What do you think?
>>>
>>>
>>> Best,
>>>
>>> Alexis

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Greg Landrum
Hi Alexis,

If I understand your use case correctly, you really don't need this level
of complication.

If you are comparing Q molecules to M molecules and M>>Q (in the discussion
so far Q = 1000, M = 500K) and you only need to compare each of the Qs to
each of the Ms a single time, you can safely construct all the Q molecules
and store them in memory and then loop over the Ms individually and compare
them to each of the Qs (this is what I did in my little sample). This will
have more or less exactly the same performance as reading all of the Ms at
once and then processing them.

So, on a machine with infinite memory these two snippets will take more or
less the same amount of time to execute:

low memory usage:

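from rdkit import Chem

# Load the small query set once; skip records that failed to parse.
queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf')
           if x is not None]
matches = []
# Stream the large set: only one target molecule is alive at a time.
for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
    if m is None:
        continue
    matches.append([m.HasSubstructMatch(q) for q in queries])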



high memory usage:

from rdkit import Chem

queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf')
           if x is not None]
# Materialize the whole target set in memory before matching.
mols = [x for x in Chem.ForwardSDMolSupplier('./znp.50k.sdf')
        if x is not None]
matches = []
for m in mols:
    if m is None:
        continue
    matches.append([m.HasSubstructMatch(q) for q in queries])



The second form consumes a lot more memory without delivering any
improvement in performance.

Best,
-greg
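
The memory difference between the two forms can be checked on a toy scale. The sketch below is not RDKit code: `fake_supplier` is a hypothetical stand-in for a molecule supplier that yields plain strings, and the standard-library `tracemalloc` module reports the peak Python allocation for a streamed pass versus a pass that first builds the whole list.

```python
import tracemalloc


def fake_supplier(n):
    """Stand-in for a molecule supplier: lazily yields n distinct strings."""
    for i in range(n):
        yield "C" * 40 + str(i)


def count_streamed(n):
    """Test each item as it arrives; nothing is kept afterwards."""
    return sum(1 for m in fake_supplier(n) if m.endswith("7"))


def count_materialized(n):
    """Build the full list first, then test each item."""
    mols = [m for m in fake_supplier(n)]
    return sum(1 for m in mols if m.endswith("7"))


def peak_bytes(func, n):
    """Run func(n) and return (result, peak traced allocation in bytes)."""
    tracemalloc.start()
    result = func(n)
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return result, peak
```

Calling `peak_bytes(count_streamed, n)` and `peak_bytes(count_materialized, n)` with the same `n` gives identical counts, while the materialized version's peak grows with `n` and the streamed version's stays roughly flat.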


On Fri, Jun 9, 2017 at 3:33 PM, Alexis Parenty <
alexis.parenty.h...@gmail.com> wrote:

> Hi again, FYI here is the memory monitoring in attachment. Thanks,
>
> Alexis
>
> On 9 June 2017 at 15:12, Alexis Parenty 
> wrote:
>
>> Dear Greg and Brian,
>> Many thanks for your response. I was also thinking of your streaming
>> approach! I think the RAM of most machine would deal with lists of 100K mol
>> so we could put the threshold higher than 1000. Actually, I was thinking to
>> monitor the available RAM and only start processing the matrix and clearing
>> the list when less than 20% of RAM is left. This way, the best machines
>> could skip the clearing process and gain time. What do you think?
>>
>>
>> Best,
>>
>> Alexis
>>
>>
>>
>>
>>
>> On 9 June 2017 at 14:40, Brian Kelley  wrote:
>>
>>> While not multithreaded (yet) this is the use case of the filter catalog:
>>>
>>> http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1
>>>
>>> Look for the SmartsMatcher class in the blog.
>>>
>>> It is a good idea to make this multithreaded as well, I'll add this as a
>>> possible enhancement.
>>>
>>> 
>>> Brian Kelley
>>>
>>> On Jun 9, 2017, at 7:04 AM, Greg Landrum  wrote:
>>>
>>> Hi Alexis,
>>>
>>> I would approach this by loading the 1000 queries into a list of
>>> molecules and then "stream" the others past that (so that you never attempt
>>> to load the full 500K set at once).
>>>
>>> Here's a quick sketch of one way to do this:
>>>
>>> In [4]: queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
>>>
>>> In [5]: matches = []
>>>
>>> In [6]: for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
>>>    ...:     if m is None:
>>>    ...:         continue
>>>    ...:     matches.append([m.HasSubstructMatch(q) for q in queries])
>>>    ...:
>>>
>>>
>>>
>>> Brian has some thoughts on making this particular use case easier/faster
>>> (in particular by adding multi-threading support), so maybe there will be
>>> something in the next release there.
>>>
>>> I hope this helps,
>>> -greg
>>>
>>>
>>> On Sun, Jun 4, 2017 at 10:25 PM, Alexis Parenty <
>>> alexis.parenty.h...@gmail.com> wrote:
>>>
>>>> Dear RDKit community,
>>>>
>>>> I need to screen for substructure relationships between two sets of
>>>> structures (1 000 X 500 000): I thought I should build two lists of mol
>>>> objects from SMILES, but I keep having a memory error when the second list
>>>> reaches 300 000 mol. All my RAM (12G) gets consumed along with all my
>>>> virtual memory.
>>>>
>>>> Do I really have to compromise on speed and make mol object on the
>>>> flight from two lists of SMILES? Is there another memory efficient way to
>>>> store mol object?
>>>>
>>>> Best,
>>>>
>>>> Alexis

 

Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Brian Kelley
 What exactly are you doing?

Is this 1000x500k substructure queries or something different?


Brian Kelley

> On Jun 9, 2017, at 9:12 AM, Alexis Parenty  
> wrote:
> 
> Dear Greg and Brian, 
> Many thanks for your response. I was also thinking of your streaming 
> approach! I think the RAM of most machine would deal with lists of 100K mol 
> so we could put the threshold higher than 1000. Actually, I was thinking to 
> monitor the available RAM and only start processing the matrix and clearing 
> the list when less than 20% of RAM is left. This way, the best machines could 
> skip the clearing process and gain time. What do you think?
> 
> 
> Best,
> 
> Alexis
> 
> 
> 
> 
> 
>> On 9 June 2017 at 14:40, Brian Kelley  wrote:
>> While not multithreaded (yet) this is the use case of the filter catalog:
>> 
>> http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1
>> 
>> Look for the SmartsMatcher class in the blog.
>> 
>> It is a good idea to make this multithreaded as well, I'll add this as a 
>> possible enhancement.
>> 
>> 
>> Brian Kelley
>> 
>>> On Jun 9, 2017, at 7:04 AM, Greg Landrum  wrote:
>>> 
>>> Hi Alexis,
>>> 
>>> I would approach this by loading the 1000 queries into a list of molecules 
>>> and then "stream" the others past that (so that you never attempt to load 
>>> the full 500K set at once).
>>> 
>>> Here's a quick sketch of one way to do this:
>>> In [4]: queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
>>>
>>> In [5]: matches = []
>>>
>>> In [6]: for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
>>>    ...:     if m is None:
>>>    ...:         continue
>>>    ...:     matches.append([m.HasSubstructMatch(q) for q in queries])
>>>    ...:
>>> 
>>> 
>>> Brian has some thoughts on making this particular use case easier/faster 
>>> (in particular by adding multi-threading support), so maybe there will be 
>>> something in the next release there.
>>> 
>>> I hope this helps,
>>> -greg
>>> 
>>> 
>>>> On Sun, Jun 4, 2017 at 10:25 PM, Alexis Parenty wrote:
>>>> Dear RDKit community,
>>>>
>>>> I need to screen for substructure relationships between two sets of
>>>> structures (1 000 X 500 000): I thought I should build two lists of mol
>>>> objects from SMILES, but I keep having a memory error when the second list
>>>> reaches 300 000 mol. All my RAM (12G) gets consumed along with all my
>>>> virtual memory.
>>>>
>>>> Do I really have to compromise on speed and make mol object on the flight
>>>> from two lists of SMILES? Is there another memory efficient way to store
>>>> mol object?
>>>>
>>>> Best,
>>>>
>>>> Alexis
 
 


Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Alexis Parenty
Dear Greg and Brian,
Many thanks for your responses. I was also thinking of your streaming
approach! I think the RAM of most machines could cope with lists of 100K mol,
so we could put the threshold higher than 1000. Actually, I was thinking of
monitoring the available RAM and only starting to process the matrix and
clear the list when less than 20% of RAM is left. This way, the best machines
could skip the clearing process and gain time. What do you think?


Best,

Alexis
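
The "fill a list, process it, clear it" idea can be sketched with a fixed chunk size as below. `chunks` and `match_matrix` are hypothetical helper names, and the matcher is passed in so the example runs without RDKit. Triggering on free RAM instead of a fixed count would need something outside the standard library (for instance the third-party `psutil` package), which is not shown here.

```python
from itertools import islice


def chunks(iterable, size):
    """Yield successive lists of at most `size` items from any iterable."""
    it = iter(iterable)
    while True:
        block = list(islice(it, size))
        if not block:
            return
        yield block


def match_matrix(targets, queries, is_match, chunk_size=100_000):
    """Build the target-by-query match matrix one chunk of targets at a time."""
    rows = []
    for block in chunks(targets, chunk_size):
        for target in block:
            rows.append([is_match(target, q) for q in queries])
        # `block` is replaced on the next pass, so its items can be freed.
    return rows
```

Only `chunk_size` targets are held at once, so the threshold can be raised on machines with more memory without changing the rest of the code.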





On 9 June 2017 at 14:40, Brian Kelley  wrote:

> While not multithreaded (yet) this is the use case of the filter catalog:
>
> http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1
>
> Look for the SmartsMatcher class in the blog.
>
> It is a good idea to make this multithreaded as well, I'll add this as a
> possible enhancement.
>
> 
> Brian Kelley
>
> On Jun 9, 2017, at 7:04 AM, Greg Landrum  wrote:
>
> Hi Alexis,
>
> I would approach this by loading the 1000 queries into a list of molecules
> and then "stream" the others past that (so that you never attempt to load
> the full 500K set at once).
>
> Here's a quick sketch of one way to do this:
>
> In [4]: queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
>
> In [5]: matches = []
>
> In [6]: for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
>    ...:     if m is None:
>    ...:         continue
>    ...:     matches.append([m.HasSubstructMatch(q) for q in queries])
>    ...:
>
>
>
> Brian has some thoughts on making this particular use case easier/faster
> (in particular by adding multi-threading support), so maybe there will be
> something in the next release there.
>
> I hope this helps,
> -greg
>
>
> On Sun, Jun 4, 2017 at 10:25 PM, Alexis Parenty <
> alexis.parenty.h...@gmail.com> wrote:
>
>> Dear RDKit community,
>>
>> I need to screen for substructure relationships between two sets of
>> structures (1 000 X 500 000): I thought I should build two lists of mol
>> objects from SMILES, but I keep having a memory error when the second list
>> reaches 300 000 mol. All my RAM (12G) gets consumed along with all my
>> virtual memory.
>>
>> Do I really have to compromise on speed and make mol object on the flight
>> from two lists of SMILES? Is there another memory efficient way to store
>> mol object?
>>
>> Best,
>>
>> Alexis
>>


Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Brian Kelley
While not multithreaded (yet) this is the use case of the filter catalog:

http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1

Look for the SmartsMatcher class in the blog.

It is a good idea to make this multithreaded as well, I'll add this as a 
possible enhancement.


Brian Kelley

> On Jun 9, 2017, at 7:04 AM, Greg Landrum  wrote:
> 
> Hi Alexis,
> 
> I would approach this by loading the 1000 queries into a list of molecules 
> and then "stream" the others past that (so that you never attempt to load the 
> full 500K set at once).
> 
> Here's a quick sketch of one way to do this:
> In [4]: queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
>
> In [5]: matches = []
>
> In [6]: for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
>    ...:     if m is None:
>    ...:         continue
>    ...:     matches.append([m.HasSubstructMatch(q) for q in queries])
>    ...:
> 
> 
> Brian has some thoughts on making this particular use case easier/faster (in 
> particular by adding multi-threading support), so maybe there will be 
> something in the next release there.
> 
> I hope this helps,
> -greg
> 
> 
>> On Sun, Jun 4, 2017 at 10:25 PM, Alexis Parenty 
>>  wrote:
>> Dear RDKit community,
>> 
>> I need to screen for substructure relationships between two sets of 
>> structures (1 000 X 500 000): I thought I should build two lists of mol 
>> objects from SMILES, but I keep having a memory error when the second list 
>> reaches 300 000 mol. All my RAM (12G) gets consumed along with all my 
>> virtual memory.
>> 
>> Do I really have to compromise on speed and make mol object on the flight 
>> from two lists of SMILES? Is there another memory efficient way to store mol 
>> object?
>> 
>> Best,
>> 
>> Alexis
>> 


Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-09 Thread Greg Landrum
Hi Alexis,

I would approach this by loading the 1000 queries into a list of molecules
and then "stream" the others past that (so that you never attempt to load
the full 500K set at once).

Here's a quick sketch of one way to do this:

In [4]: queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]

In [5]: matches = []

In [6]: for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
   ...:     if m is None:
   ...:         continue
   ...:     matches.append([m.HasSubstructMatch(q) for q in queries])
   ...:



Brian has some thoughts on making this particular use case easier/faster
(in particular by adding multi-threading support), so maybe there will be
something in the next release there.

I hope this helps,
-greg


On Sun, Jun 4, 2017 at 10:25 PM, Alexis Parenty <
alexis.parenty.h...@gmail.com> wrote:

> Dear RDKit community,
>
> I need to screen for substructure relationships between two sets of
> structures (1 000 X 500 000): I thought I should build two lists of mol
> objects from SMILES, but I keep having a memory error when the second list
> reaches 300 000 mol. All my RAM (12G) gets consumed along with all my
> virtual memory.
>
> Do I really have to compromise on speed and make mol object on the flight
> from two lists of SMILES? Is there another memory efficient way to store
> mol object?
>
> Best,
>
> Alexis
>


[Rdkit-discuss] Memory issue when storing more than 300K mol in a list

2017-06-04 Thread Alexis Parenty
Dear RDKit community,

I need to screen for substructure relationships between two sets of
structures (1 000 X 500 000). I thought I should build two lists of mol
objects from SMILES, but I keep getting a memory error when the second list
reaches 300 000 mol. All my RAM (12G) gets consumed along with all my
virtual memory.

Do I really have to compromise on speed and build the mol objects on the fly
from two lists of SMILES? Is there another memory-efficient way to store
mol objects?

Best,

Alexis


Re: [Rdkit-discuss] Memory Issue

2015-07-15 Thread Greg Landrum
Hi,

It's not easy (for me) to read through the Java code and figure out what is
going on, but it looks to me like you are leaking rdmol in each iteration
of your loop.

The problem that the RDKit Java wrappers (really any Java wrappers created
with SWIG) have here is that the JVM doesn't know how big the underlying C++
object is, so it's not aggressive enough about cleaning up memory. I think
calling rdmol.delete() at the end of each iteration (this frees the
underlying C++ object) should help.

-greg


On Tuesday, July 14, 2015, Matthew Lardy mla...@gmail.com wrote:

> Hi all,
>
> I have had a strange issue that I can't seem to find a way around.  The
> following code block consumes a ton of memory, which is strange as just
> using the SD File reader I have no memory issues.  I think that the issue
> is related to the java garbage collection not being picked up, even though
> I have attempted to force that (to no success).
>
> All the following block does is iterate through an SD file and look for
> the highest (or lowest) scoring molecule for each molecule.  The assumption
> is that all molecules of the same type will be next to each other in the
> file (which is not my problem).  Running this on a SD file of around 400K
> molecules consumes around 23GB of memory, so if anyone has an idea I will
> be most appreciative!
>
> public static void main(String argv[]) throws IOException,
>         InterruptedException
> {
>     CommandLineParser cParser;
>     String[] modes    = {};
>     String[] parms    = {"-in", "-filterTag", "-direction", "-out"};
>     String[] reqParms = {"-in", "-filterTag", "-direction", "-out"};
>
>     String rdkitSO = System.getenv("RDKIT_SO");
>     System.load(rdkitSO);
>
>     String currentDir = System.getProperty("user.dir");
>     File dir = new File(currentDir);
>
>     cParser = new CommandLineParser(EXPLAIN, 0, 0, argv, modes, parms, reqParms);
>
>     ROMol rdmol  = null;
>     ROMol rdmol2 = null;
>
>     SDMolSupplier suppl = new SDMolSupplier(cParser.getValue("-in"));
>     SDWriter writer = new SDWriter(cParser.getValue("-out"));
>     int count = 0;
>
>     while (!suppl.atEnd())
>     {
>         count++;
>         if (count % 1000 == 0)
>         {
>             System.out.println(count);
>         }
>         rdmol = suppl.next();
>         if (rdmol2 == null)
>         {
>             // rdmol2.delete();
>             rdmol2 = new ROMol(rdmol);
>             continue;
>         }
>         if (rdmol.MolToSmiles().equals(rdmol2.MolToSmiles()))
>         {
>             if (cParser.getValue("-direction").equals("highest"))
>             {
>                 double value1 =
>                     Double.parseDouble(rdmol.getProp(cParser.getValue("-filterTag")));
>                 double value2 =
>                     Double.parseDouble(rdmol2.getProp(cParser.getValue("-filterTag")));
>                 //System.out.println("Val1 " + value1 + " Val2 " + value2);
>                 if (value1 > value2)
>                 {
>                     rdmol2.delete();
>                     rdmol2 = new ROMol(rdmol);
>                 }
>             }
>             else
>             {
>                 if (Double.parseDouble(rdmol.getProp(cParser.getValue("-filterTag"))) <
>                     Double.parseDouble(rdmol2.getProp(cParser.getValue("-filterTag"))))
>                 {
>                     rdmol2.delete();
>                     rdmol2 = new ROMol(rdmol);
>                 }
>             }
>         } else {
>             writer.write(rdmol2);
>             rdmol2.delete();
>             rdmol2 = new ROMol(rdmol);
>         }
>     }
> }


--
Don't Limit Your Business. Reach for the Cloud.
GigeNET's Cloud Solutions provide you with the tools and support that
you need to offload your IT needs and focus on growing your business.
Configured For All Businesses. Start Your Cloud Today.
https://www.gigenetcloud.com/___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Memory Issue

2015-07-15 Thread Matthew Lardy
Hi Greg,

I know what you mean.  :)  I had tried that before, but executing an
rdmol.delete() at the end of the loop didn't help.  And, I just re-tried
that to no avail.

I remember having a similar issue with the SDMolSupplier before, where just
reading the file consumed a ton of memory.  This was patched, and all of
the rest of my code runs well.  But if I want to sample from the
SDMolSupplier stream, things go weird.  I had hoped that copying each rdmol
to a new object would reduce the leak if I wanted to hold it for a time, but
that didn't help either.  I am deleting every molecule that I hold, but
there appears to be no impact on memory consumption.  I think that the JVM
is asleep when it comes to killing these objects, as forcing it to do so
(well, as much as one can) doesn't fix things.

I may just have to write this in Python, where I am pretty certain the
memory issues are non-existent.  :)  I was hopeful that someone else might
have encountered this issue, and had a path around it.

Thanks for taking a look Greg!
Matt
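
For reference, the "best record per run of identical molecules" logic described above could look roughly like this in Python. It is only a sketch, not the rewrite mentioned in this thread: `best_per_group` is a hypothetical helper, and the key and score functions are passed in so that it runs on plain tuples without RDKit.

```python
from itertools import groupby


def best_per_group(records, key, score, highest=True):
    """Yield the best-scoring record from each run of adjacent equal keys.

    Like the Java loop, this assumes records with the same key sit next
    to each other in the input.
    """
    pick = max if highest else min
    for _, group in groupby(records, key=key):
        # max/min walk the group lazily and keep only the current best.
        yield pick(group, key=score)
```

With RDKit in Python, the records could plausibly come from `Chem.ForwardSDMolSupplier`, with `key=Chem.MolToSmiles` and a score such as `lambda m: float(m.GetProp(tag))`; that wiring is an assumption about how one might connect it. Because the function is a generator, only one run of molecules is in flight at a time.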


On Wed, Jul 15, 2015 at 1:57 AM, Greg Landrum greg.land...@gmail.com
wrote:

> Hi,
>
> It's not easy (for me) to read through the Java code and figure out what
> is going on, but it looks to me like you are leaking rdmol in each
> iteration of your loop.
>
> The problem that the RDKit Java wrappers (really any Java wrapper created
> with SWIG) has here is that the JVM doesn't know how big the underlying C++
> object is, so it's not aggressive enough while cleaning up memory. I think
> calling rdmol.delete() at the end of each iteration (this frees the
> underlying C++ object) should help.
>
> -greg



Re: [Rdkit-discuss] Memory Issue

2015-07-15 Thread Matthew Lardy
Just to add, I can confirm that re-writing this in Python did indeed get rid
of the memory issue I've been having.  Total consumption never crossed 0.1% of
my system memory.  :)  Way less than the 89% I was seeing with the Java
version of the same application!

On Wed, Jul 15, 2015 at 2:05 PM, Matthew Lardy mla...@gmail.com wrote:

 Hi Greg,

 I know what you mean.  :)  I had tried that before, but executing an
 rdmol.delete() at the end of the loop didn't help.  And, I just re-tried
 that to no avail.

 I remember having a similar issue with the SDMolSupplier before, where
 just reading the file consumed a ton of memory.  This was patched, and all
 of the rest of my code runs well.  But if I want to sample from the
 SDMolSupplier stream, things go weird.  I had hoped to copy each rdmol
 to a new object (reducing the leak) if I wanted to hold it for a time, but
 that didn't help either.  I am deleting every molecule that I hold, but
 there appears to be no impact on memory consumption.  I think that the JVM
 is asleep when it comes to killing these objects, as forcing it to do so
 (well, as much as one can) doesn't fix things.

 I may just have to write this in Python, where I am pretty certain the
 memory issues are non-existent.  :)  I was hopeful that someone else may
 have encountered this issue, and had a path around it.

 Thanks for taking a look Greg!
 Matt



[Rdkit-discuss] Memory Issue

2015-07-14 Thread Matthew Lardy
Hi all,

I have had a strange issue that I can't seem to find a way around.  The
following code block consumes a ton of memory, which is strange because I
have no memory issues when just using the SD file reader.  I think the issue
is related to the Java garbage collector not kicking in, even though I have
attempted to force it (without success).

All the following block does is iterate through an SD file and look for the
highest (or lowest) scoring entry for each molecule.  The assumption is
that all entries for the same molecule will be next to each other in the file
(which is not my problem).  Running this on an SD file of around 400K
molecules consumes around 23GB of memory, so if anyone has an idea I will
be most appreciative!

    public static void main(String argv[]) throws IOException, InterruptedException
    {
        CommandLineParser cParser;
        String[] modes    = {};
        String[] parms    = {"-in", "-filterTag", "-direction", "-out"};
        String[] reqParms = {"-in", "-filterTag", "-direction", "-out"};

        String rdkitSO = System.getenv("RDKIT_SO");
        System.load(rdkitSO);

        String currentDir = System.getProperty("user.dir");
        File dir = new File(currentDir);

        cParser = new CommandLineParser(EXPLAIN, 0, 0, argv, modes, parms, reqParms);

        ROMol rdmol  = null;
        ROMol rdmol2 = null;

        SDMolSupplier suppl = new SDMolSupplier(cParser.getValue("-in"));
        SDWriter writer = new SDWriter(cParser.getValue("-out"));
        int count = 0;

        while (!suppl.atEnd())
        {
            count++;
            if (count % 1000 == 0)
            {
                System.out.println(count);
            }
            rdmol = suppl.next();
            if (rdmol2 == null)
            {
                // rdmol2.delete();
                rdmol2 = new ROMol(rdmol);
                continue;
            }
            if (rdmol.MolToSmiles().equals(rdmol2.MolToSmiles()))
            {
                if (cParser.getValue("-direction").equals("highest"))
                {
                    double value1 =
                        Double.parseDouble(rdmol.getProp(cParser.getValue("-filterTag")));
                    double value2 =
                        Double.parseDouble(rdmol2.getProp(cParser.getValue("-filterTag")));
                    //System.out.println("Val1 " + value1 + " Val2 " + value2);
                    if (value1 > value2)
                    {
                        rdmol2.delete();
                        rdmol2 = new ROMol(rdmol);
                    }
                }
                else
                {
                    if (Double.parseDouble(rdmol.getProp(cParser.getValue("-filterTag"))) <
                        Double.parseDouble(rdmol2.getProp(cParser.getValue("-filterTag"))))
                    {
                        rdmol2.delete();
                        rdmol2 = new ROMol(rdmol);
                    }
                }
            } else {
                writer.write(rdmol2);
                rdmol2.delete();
                rdmol2 = new ROMol(rdmol);
            }
        }
    }
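
For reference, the selection logic on its own (keep the best entry from each
run of adjacent entries that share a key) can be sketched without RDKit.
This is an illustration under stated assumptions, not the poster's program:
ScoredEntry and BestPerRun are hypothetical names, the key stands in for the
SMILES string and the score for the value of the filter tag. One detail worth
noting: as far as I can tell, the loop as posted never writes the last group
it is holding when the supplier reaches the end, so the sketch flushes it
explicitly.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical entry type: a grouping key (e.g. a SMILES string) plus a score. */
final class ScoredEntry {
    final String key;
    final double score;

    ScoredEntry(String key, double score) {
        this.key = key;
        this.score = score;
    }
}

final class BestPerRun {
    /**
     * Returns the best entry from each run of adjacent entries sharing a key.
     * Only one candidate is held while scanning; the output list is collected
     * here for demonstration and could be replaced by writing to a file.
     */
    static List<ScoredEntry> select(Iterable<ScoredEntry> entries, boolean highest) {
        List<ScoredEntry> out = new ArrayList<>();
        ScoredEntry best = null;
        for (ScoredEntry e : entries) {
            if (best == null) {
                best = e;                       // first entry of the input
            } else if (e.key.equals(best.key)) {
                boolean better = highest ? e.score > best.score : e.score < best.score;
                if (better) {
                    best = e;                   // same run, better score
                }
            } else {
                out.add(best);                  // run ended: emit its winner
                best = e;                       // start the next run
            }
        }
        if (best != null) {
            out.add(best);                      // flush the final run
        }
        return out;
    }

    public static void main(String[] args) {
        List<ScoredEntry> input = Arrays.asList(
            new ScoredEntry("CCO", 1.0),
            new ScoredEntry("CCO", 3.5),
            new ScoredEntry("c1ccccc1", -1.0),
            new ScoredEntry("c1ccccc1", -4.0));
        for (ScoredEntry e : select(input, true)) {
            System.out.println(e.key + " " + e.score);
        }
    }
}
```

Ties keep the earlier entry, which matches the strict comparisons in the
posted code.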
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss