Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
On 2017-06-10 07:42, Chris Swain wrote:
> This sounds like the situation where a database might be a better option,
> tuned to store fingerprints in RAM?

The issue is how much programming time it will take, how much that time is worth, and how many times the solution will be reused. A clever coding solution could be preferable for other reasons, like a programming exercise. If it's a one-off and you just need it done so you can move on, throwing more hardware at it is often the most cost-effective solution.

Dima

--
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
This sounds like the situation where a database might be a better option, tuned to store fingerprints in RAM?

Chris

Dr Chris Swain BA MA (Cantab) PhD CChem FRSC
Macs in Chemistry
sw...@mac.com
http://www.macinchem.org
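One way to read the "fingerprints in RAM" suggestion is as a substructure prescreen: a query can only match a molecule if every bit set in the query's fingerprint is also set in the molecule's fingerprint. The sketch below is an illustration under assumptions, not something posted in the thread. It keeps fingerprints as plain Python integers; `pattern_fp_as_int` is a hypothetical helper that assumes RDKit's `Chem.PatternFingerprint` and the bit vector's `ToBitString()` method.

```python
def pattern_fp_as_int(mol, fp_size=2048):
    """Hypothetical helper: RDKit pattern fingerprint packed into a Python int."""
    from rdkit import Chem  # imported lazily so the screening logic runs without RDKit
    fp = Chem.PatternFingerprint(mol, fpSize=fp_size)
    return int(fp.ToBitString(), 2)


def may_match(query_fp, mol_fp):
    """True if every bit set in the query fingerprint is also set in the molecule's."""
    return (query_fp & mol_fp) == query_fp


def candidates(query_fp, mol_fps):
    """Indices of molecules that pass the prescreen.

    Survivors still need a real substructure test: the screen can let through
    molecules that do not match, it is only meant to discard sure non-matches.
    """
    return [i for i, fp in enumerate(mol_fps) if may_match(query_fp, fp)]
```

At 2048 bits (256 bytes) per molecule, 500K fingerprints come to roughly 128 MB of raw bits before Python object overhead, which is far less than the same number of live mol objects.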
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
On 2017-06-09 08:12, Alexis Parenty wrote:
> Dear Greg and Brian,
> Many thanks for your response. I was also thinking of your streaming
> approach! I think the RAM of most machine would deal with lists of 100K mol
> so we could put the threshold higher than 1000. Actually, I was thinking to
> monitor the available RAM and only start processing the matrix and clearing
> the list when less than 20% of RAM is left. This way, the best machines
> could skip the clearing process and gain time. What do you think?

Take $100, buy a 200GB SSD, set it up as the swap space, don't worry about the RAM.

Dima
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
Yes Greg, this is what I am doing. You're right, I did not think of the possibility of building a list of mols from the shorter list and processing each of them against the mols of the longer list (which I would make on the fly from the SMILES). However, I wanted to store the longer list of structures in order to access it again later for new substructure searches, one query structure at a time. It seemed silly to have to rebuild the mol objects from a 500K list of SMILES every time I need to do a new substructure search. But your approach is going to help me a lot for the batch-mode search I wanted to do.

Best,

Alexis
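For the wish to keep the long list around between searches without re-parsing 500K SMILES each time, one option (an assumption, not something proposed in the thread) is to serialize each molecule once and keep the bytes in a small SQLite file. The table layout and function names below are hypothetical; `rows_from_smiles` assumes RDKit's `Mol.ToBinary()`.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a SQLite store holding one serialized molecule per row."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS mols (id INTEGER PRIMARY KEY, smiles TEXT, data BLOB)"
    )
    return conn


def add_blobs(conn, rows):
    """Insert (smiles, data) pairs; rows may be a generator so nothing piles up in RAM."""
    with conn:  # commits on success, rolls back on error
        conn.executemany("INSERT INTO mols (smiles, data) VALUES (?, ?)", rows)


def iter_blobs(conn):
    """Yield (id, smiles, data) one row at a time, in insertion order."""
    yield from conn.execute("SELECT id, smiles, data FROM mols ORDER BY id")


def rows_from_smiles(smiles_iter):
    """Assumed RDKit glue: parse each SMILES and serialize it with Mol.ToBinary()."""
    from rdkit import Chem  # lazy import; only needed when real molecules are stored
    for smi in smiles_iter:
        mol = Chem.MolFromSmiles(smi)
        if mol is not None:
            yield smi, mol.ToBinary()
```

A later single-query search would then loop over `iter_blobs(conn)`, rebuild each molecule with `Chem.Mol(data)`, and test it, so only one mol object is alive at a time.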
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
Hi Alexis,

If I understand your use case correctly, you really don't need this level of complication.

If you are comparing Q molecules to M molecules and M >> Q (in the discussion so far Q = 1000, M = 500K) and you only need to compare each of the Qs to each of the Ms a single time, you can safely construct all the Q molecules, store them in memory, and then loop over the Ms individually and compare them to each of the Qs (this is what I did in my little sample). This will have more or less exactly the same performance as reading all of the Ms at once and then processing them.

So, on a machine with infinite memory these two snippets will take more or less the same amount of time to execute:

low memory usage:

    from rdkit import Chem

    # Only the 1000 queries are held in memory.
    queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
    matches = []
    # The large set is streamed: one molecule alive at a time.
    for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
        if m is None:
            continue
        matches.append([m.HasSubstructMatch(q) for q in queries])

high memory usage:

    from rdkit import Chem

    queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]
    # The whole large set is built and kept in memory before any matching starts.
    mols = [x for x in Chem.ForwardSDMolSupplier('./znp.50k.sdf') if x is not None]
    matches = []
    for m in mols:
        if m is None:
            continue
        matches.append([m.HasSubstructMatch(q) for q in queries])

The second form consumes a lot more memory without delivering any improvement in performance.

Best,
-greg

On Fri, Jun 9, 2017 at 3:33 PM, Alexis Parenty <alexis.parenty.h...@gmail.com> wrote:
> Hi again, FYI here is the memory monitoring in attachment. Thanks,
>
> Alexis
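The claim that the two forms differ in memory rather than in work done can be checked without RDKit. The following is a minimal sketch using the standard library's `tracemalloc`, with `bytearray` payloads standing in for mol objects; every name in it is hypothetical and the sizes are arbitrary.

```python
import tracemalloc


def fake_supplier(n, size=1000):
    """Stand-in for a molecule supplier: yields n freshly allocated payloads."""
    for _ in range(n):
        yield bytearray(size)


def peak_bytes(func):
    """Run func() and return the peak traced allocation in bytes."""
    tracemalloc.start()
    try:
        func()
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


def streamed():
    # One payload alive at a time, as in the low-memory form.
    return [len(m) for m in fake_supplier(2000)]


def materialized():
    # All payloads alive at once, as in the high-memory form.
    mols = list(fake_supplier(2000))
    return [len(m) for m in mols]
```

Both functions return the same result; comparing `peak_bytes(streamed)` with `peak_bytes(materialized)` shows the difference in peak allocation.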
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
What exactly are you doing? Is this 1000x500k substructure queries or something different?

Brian Kelley
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
Dear Greg and Brian,

Many thanks for your response. I was also thinking of your streaming approach! I think the RAM of most machines could cope with lists of 100K mols, so we could put the threshold higher than 1000. Actually, I was thinking of monitoring the available RAM and only processing the matrix and clearing the list when less than 20% of RAM is left. This way, the best machines could skip the clearing step and gain time. What do you think?

Best,

Alexis
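The "clear the list when less than 20% of RAM is left" idea could be sketched as a batching generator. This is only an illustration under assumptions: `available_fraction` relies on the third-party `psutil` package, and the names and default thresholds are hypothetical.

```python
def available_fraction():
    """Fraction of physical RAM still available (assumes the third-party psutil package)."""
    import psutil  # lazy import so the batching logic can be used without it
    vm = psutil.virtual_memory()
    return vm.available / vm.total


def adaptive_batches(items, min_free=0.20, max_batch=100_000,
                     free_fraction=available_fraction):
    """Yield lists of items, cutting a batch when free RAM drops below min_free
    or the batch reaches max_batch items, whichever comes first."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) >= max_batch or free_fraction() < min_free:
            yield batch
            batch = []  # start a new list so the old batch can be freed by the caller
    if batch:
        yield batch
```

Polling the memory reading for every item keeps the sketch short; in practice one would probably check only every few thousand items. As the streaming replies in this thread point out, batching of this kind is only needed if the molecules must be held at all.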
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
While not multithreaded (yet) this is the use case of the filter catalog:

http://rdkit.blogspot.com/2016/04/changes-in-201603-release-filtercatalog.html?m=1

Look for the SmartsMatcher class in the blog.

It is a good idea to make this multithreaded as well, I'll add this as a possible enhancement.

Brian Kelley
Re: [Rdkit-discuss] Memory issue when storing more than 300K mol in a list
Hi Alexis,

I would approach this by loading the 1000 queries into a list of molecules and then "stream" the others past that (so that you never attempt to load the full 500K set at once).

Here's a quick sketch of one way to do this:

    from rdkit import Chem

    # Keep only the small query set in memory.
    queries = [x for x in Chem.ForwardSDMolSupplier('mols.1000.sdf') if x is not None]

    matches = []

    # Stream the large set past the queries, one molecule at a time.
    for m in Chem.ForwardSDMolSupplier('./znp.50k.sdf'):
        if m is None:
            continue
        matches.append([m.HasSubstructMatch(q) for q in queries])

Brian has some thoughts on making this particular use case easier/faster (in particular by adding multi-threading support), so maybe there will be something in the next release there.

I hope this helps,
-greg
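Since the original question starts from SMILES rather than SD files, the same streaming idea can be sketched for a SMILES file. This variant is an assumption-laden illustration, not Greg's code: it records only the indices of matching queries, because a full 1000 x 500K matrix stored as Python lists needs about 8 bytes per slot on a 64-bit build, which is several GB on its own. The file layout assumed by `mols_from_smiles_file` (one SMILES per line, optional name after whitespace) is hypothetical.

```python
def stream_hits(queries, mols):
    """Yield (mol_index, [indices of matching queries]) for molecules with at least one hit.

    `mols` can be any iterable, so molecules are built and discarded one at a time.
    """
    for i, m in enumerate(mols):
        if m is None:  # unparsable input
            continue
        hits = [j for j, q in enumerate(queries) if m.HasSubstructMatch(q)]
        if hits:
            yield i, hits


def mols_from_smiles_file(path):
    """Assumed input format: one SMILES per line, optionally followed by a name."""
    from rdkit import Chem  # lazy import; stream_hits itself only needs HasSubstructMatch
    with open(path) as handle:
        for line in handle:
            fields = line.split()
            yield Chem.MolFromSmiles(fields[0]) if fields else None
```

With real data this would be called as `stream_hits(queries, mols_from_smiles_file('big.smi'))`, where the file name is a placeholder.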
[Rdkit-discuss] Memory issue when storing more than 300K mol in a list
Dear RDKit community,

I need to screen for substructure relationships between two sets of structures (1 000 x 500 000). I thought I should build two lists of mol objects from SMILES, but I keep getting a memory error when the second list reaches 300 000 mols. All my RAM (12G) gets consumed along with all my virtual memory.

Do I really have to compromise on speed and make the mol objects on the fly from two lists of SMILES? Is there another memory-efficient way to store mol objects?

Best,

Alexis
Re: [Rdkit-discuss] Memory Issue
Hi,

It's not easy (for me) to read through the Java code and figure out what is going on, but it looks to me like you are leaking rdmol in each iteration of your loop. The problem that the RDKit Java wrappers (really any Java wrappers created with SWIG) have here is that the JVM doesn't know how big the underlying C++ object is, so it's not aggressive enough while cleaning up memory.

I think calling rdmol.delete() at the end of each iteration (this frees the underlying C++ object) should help.

-greg

On Tuesday, July 14, 2015, Matthew Lardy mla...@gmail.com wrote:
> Hi all,
>
> I have had a strange issue that I can't seem to find a way around. The
> following code block consumes a ton of memory, which is strange as just
> using the SD File reader I have no memory issues. I think that the issue is
> related to the java garbage collection not being picked up, even though I
> have attempted to force that (to no success).
>
> All the following block does is iterate through an SD file and look for the
> highest (or lowest) scoring molecule for each molecule. The assumption is
> that all molecules of the same type will be next to each other in the file
> (which is not my problem). Running this on a SD file of around 400K
> molecules consumes around 23GB of memory, so if anyone has an idea I will
> be most appreciative!
>
>     public static void main(String argv[]) throws IOException, InterruptedException {
>         CommandLineParser cParser;
>         String[] modes = {};
>         String[] parms = {"-in", "-filterTag", "-direction", "-out"};
>         String[] reqParms = {"-in", "-filterTag", "-direction", "-out"};
>
>         String rdkitSO = System.getenv("RDKIT_SO");
>         System.load(rdkitSO);
>
>         String currentDir = System.getProperty("user.dir");
>         File dir = new File(currentDir);
>
>         cParser = new CommandLineParser(EXPLAIN, 0, 0, argv, modes, parms, reqParms);
>
>         ROMol rdmol = null;
>         ROMol rdmol2 = null;
>         SDMolSupplier suppl = new SDMolSupplier(cParser.getValue("-in"));
>         SDWriter writer = new SDWriter(cParser.getValue("-out"));
>
>         int count = 0;
>         while (!suppl.atEnd()) {
>             count++;
>             if (count % 1000 == 0) {
>                 System.out.println(count);
>             }
>             rdmol = suppl.next();
>             if (rdmol2 == null) {
>                 // rdmol2.delete();
>                 rdmol2 = new ROMol(rdmol);
>                 continue;
>             }
>             if (rdmol.MolToSmiles().equals(rdmol2.MolToSmiles())) {
>                 if (cParser.getValue("-direction").equals("highest")) {
>                     double value1 = Double.parseDouble(rdmol.getProp(cParser.getValue("-filterTag")));
>                     double value2 = Double.parseDouble(rdmol2.getProp(cParser.getValue("-filterTag")));
>                     //System.out.println("Val1 " + value1 + " Val2 " + value2);
>                     if (value1 > value2) {
>                         rdmol2.delete();
>                         rdmol2 = new ROMol(rdmol);
>                     }
>                 } else {
>                     if (Double.parseDouble(rdmol.getProp(cParser.getValue("-filterTag")))
>                             < Double.parseDouble(rdmol2.getProp(cParser.getValue("-filterTag")))) {
>                         rdmol2.delete();
>                         rdmol2 = new ROMol(rdmol);
>                     }
>                 }
>             } else {
>                 writer.write(rdmol2);
>                 rdmol2.delete();
>                 rdmol2 = new ROMol(rdmol);
>             }
>         }
>     }
Re: [Rdkit-discuss] Memory Issue
Hi Greg,

I know what you mean. :) I had tried that before, but calling rdmol.delete() at the end of the loop didn't help, and I just retried it to no avail. I remember having a similar issue with the SDMolSupplier before, where just reading the file consumed a ton of memory. That was patched, and all of the rest of my code runs well. But if I want to sample from the SDMolSupplier stream, things go weird.

I had hoped that copying each rdmol to a new object (reducing the leak) whenever I wanted to hold it for a time would help, but it didn't either. I am deleting every molecule that I hold, but there appears to be no impact on memory consumption. I think the JVM is asleep on the job of killing these objects, as forcing it to do so (well, as much as one can) doesn't fix things.

I may just have to write this in Python, where I am pretty certain the memory issues are non-existent. :) I was hopeful that someone else may have encountered this issue and had a path around it.

Thanks for taking a look, Greg!
Matt

On Wed, Jul 15, 2015 at 1:57 AM, Greg Landrum greg.land...@gmail.com wrote:
> Hi,
>
> It's not easy (for me) to read through the Java code and figure out what is going on, but it looks to me like you are leaking rdmol in each iteration of your loop. The problem that the RDKit Java wrappers (really any Java wrappers created with SWIG) have here is that the JVM doesn't know how big the underlying C++ object is, so it's not aggressive enough about cleaning up memory. I think calling rdmol.delete() at the end of each iteration (this frees the underlying C++ object) should help.
>
> -greg
>
> On Tuesday, July 14, 2015, Matthew Lardy mla...@gmail.com wrote:
> > Hi all,
> >
> > I have had a strange issue that I can't seem to find a way around. The following code block consumes a ton of memory, which is strange, as I have no memory issues when just using the SD file reader. I think the issue is that the Java garbage collector is not kicking in, even though I have attempted to force it (with no success).
> >
> > All the following block does is iterate through an SD file and look for the highest (or lowest) scoring record for each molecule. The assumption is that all records of the same molecule will be next to each other in the file (which is not my problem). Running this on an SD file of around 400K molecules consumes around 23GB of memory, so if anyone has an idea I will be most appreciative!
> >
> >     public static void main(String argv[]) throws IOException, InterruptedException {
> >         CommandLineParser cParser;
> >         String[] modes = {};
> >         String[] parms = {"-in", "-filterTag", "-direction", "-out"};
> >         String[] reqParms = {"-in", "-filterTag", "-direction", "-out"};
> >
> >         // Load the RDKit native library named by the RDKIT_SO environment variable.
> >         String rdkitSO = System.getenv("RDKIT_SO");
> >         System.load(rdkitSO);
> >         String currentDir = System.getProperty("user.dir");
> >         File dir = new File(currentDir);
> >
> >         cParser = new CommandLineParser(EXPLAIN, 0, 0, argv, modes, parms, reqParms);
> >
> >         ROMol rdmol = null;   // molecule just read from the supplier
> >         ROMol rdmol2 = null;  // best-scoring copy held for the current molecule
> >
> >         SDMolSupplier suppl = new SDMolSupplier(cParser.getValue("-in"));
> >         SDWriter writer = new SDWriter(cParser.getValue("-out"));
> >
> >         int count = 0;
> >         while (!suppl.atEnd()) {
> >             count++;
> >             if (count % 1000 == 0) {
> >                 System.out.println(count);
> >             }
> >             rdmol = suppl.next();
> >             if (rdmol2 == null) {
> >                 // rdmol2.delete();
> >                 rdmol2 = new ROMol(rdmol);
> >                 continue;
> >             }
> >             if (rdmol.MolToSmiles().equals(rdmol2.MolToSmiles())) {
> >                 // Same molecule as the one held: keep whichever scores better.
> >                 if (cParser.getValue("-direction").equals("highest")) {
> >                     double value1 = Double.parseDouble(rdmol.getProp(cParser.getValue("-filterTag")));
> >                     double value2 = Double.parseDouble(rdmol2.getProp(cParser.getValue("-filterTag")));
> >                     //System.out.println("Val1 " + value1 + " Val2 " + value2);
> >                     if (value1 > value2) {
> >                         rdmol2.delete();
> >                         rdmol2 = new ROMol(rdmol);
> >                     }
> >                 } else {
> >                     if (Double.parseDouble(rdmol.getProp(cParser.getValue("-filterTag")))
> >                             < Double.parseDouble(rdmol2.getProp(cParser.getValue("-filterTag")))) {
> >                         rdmol2.delete();
> >                         rdmol2 = new ROMol(rdmol);
> >                     }
> >                 }
> >             } else {
> >                 // New molecule: write out the best record of the previous one.
> >                 writer.write(rdmol2);
> >                 rdmol2.delete();
> >                 rdmol2 = new ROMol(rdmol);
> >             }
> >         }
> >     }
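Greg's suggestion above (release the C++ object explicitly instead of waiting for the garbage collector) can be sketched roughly as follows. This is only an illustration under stated assumptions: FakeNativeMol and ReleaseDemo are hypothetical stand-ins, not RDKit classes, and the "native" allocation is simulated with a counter so the example runs without RDKit. Matt reports that adding delete() calls did not cure his case, so this shows the pattern, not a confirmed fix.

```java
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a SWIG proxy such as ROMol: a small Java object
// that owns a (simulated) native allocation which only delete() releases.
final class FakeNativeMol {
    // Number of simulated native objects that are still allocated.
    static final AtomicInteger LIVE = new AtomicInteger();

    private final String smiles;
    private boolean released = false;

    FakeNativeMol(String smiles) {
        this.smiles = smiles;
        LIVE.incrementAndGet();
    }

    // Copy constructor, playing the role of new ROMol(rdmol) in the post.
    FakeNativeMol(FakeNativeMol other) {
        this(other.smiles);
    }

    String smiles() {
        return smiles;
    }

    // Plays the role of the SWIG-generated delete(): frees the native side now.
    void delete() {
        if (!released) {
            released = true;
            LIVE.decrementAndGet();
        }
    }
}

final class ReleaseDemo {
    // Keeps a copy of the most recent record and releases everything else,
    // including the object handed out by the (simulated) supplier.
    static String keepLast(List<String> records) {
        FakeNativeMol kept = null;
        for (String r : records) {
            FakeNativeMol current = new FakeNativeMol(r); // stands in for suppl.next()
            try {
                FakeNativeMol copy = new FakeNativeMol(current);
                if (kept != null) {
                    kept.delete(); // release the previously held copy
                }
                kept = copy;
            } finally {
                current.delete(); // release the supplier's object on every iteration
            }
        }
        String result = (kept == null) ? null : kept.smiles();
        if (kept != null) {
            kept.delete(); // release the last held copy once it is no longer needed
        }
        return result;
    }
}
```

The try/finally is a design choice rather than anything the thread prescribes: it makes sure the per-iteration object is released even when the loop body hits a continue or throws. After keepLast returns, FakeNativeMol.LIVE should be back at zero, which is the property one would want from the real loop as well.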
Re: [Rdkit-discuss] Memory Issue
Just to add, I can confirm that rewriting this in Python did indeed get rid of the memory issue I've been having. Total consumption never crossed 0.1% of my system memory. :) Way less than the 89% I was seeing with the Java version of the same application!

On Wed, Jul 15, 2015 at 2:05 PM, Matthew Lardy mla...@gmail.com wrote:
> Hi Greg,
>
> I know what you mean. :) I had tried that before, but calling rdmol.delete() at the end of the loop didn't help, and I just retried it to no avail. I remember having a similar issue with the SDMolSupplier before, where just reading the file consumed a ton of memory. That was patched, and all of the rest of my code runs well. But if I want to sample from the SDMolSupplier stream, things go weird.
>
> I had hoped that copying each rdmol to a new object (reducing the leak) whenever I wanted to hold it for a time would help, but it didn't either. I am deleting every molecule that I hold, but there appears to be no impact on memory consumption. I think the JVM is asleep on the job of killing these objects, as forcing it to do so (well, as much as one can) doesn't fix things.
>
> I may just have to write this in Python, where I am pretty certain the memory issues are non-existent. :) I was hopeful that someone else may have encountered this issue and had a path around it.
>
> Thanks for taking a look, Greg!
> Matt
[Rdkit-discuss] Memory Issue
Hi all,

I have had a strange issue that I can't seem to find a way around. The following code block consumes a ton of memory, which is strange, as I have no memory issues when just using the SD file reader. I think the issue is that the Java garbage collector is not kicking in, even though I have attempted to force it (with no success).

All the following block does is iterate through an SD file and look for the highest (or lowest) scoring record for each molecule. The assumption is that all records of the same molecule will be next to each other in the file (which is not my problem). Running this on an SD file of around 400K molecules consumes around 23GB of memory, so if anyone has an idea I will be most appreciative!

    public static void main(String argv[]) throws IOException, InterruptedException {
        CommandLineParser cParser;
        String[] modes = {};
        String[] parms = {"-in", "-filterTag", "-direction", "-out"};
        String[] reqParms = {"-in", "-filterTag", "-direction", "-out"};

        // Load the RDKit native library named by the RDKIT_SO environment variable.
        String rdkitSO = System.getenv("RDKIT_SO");
        System.load(rdkitSO);
        String currentDir = System.getProperty("user.dir");
        File dir = new File(currentDir);

        cParser = new CommandLineParser(EXPLAIN, 0, 0, argv, modes, parms, reqParms);

        ROMol rdmol = null;   // molecule just read from the supplier
        ROMol rdmol2 = null;  // best-scoring copy held for the current molecule

        SDMolSupplier suppl = new SDMolSupplier(cParser.getValue("-in"));
        SDWriter writer = new SDWriter(cParser.getValue("-out"));

        int count = 0;
        while (!suppl.atEnd()) {
            count++;
            if (count % 1000 == 0) {
                System.out.println(count);
            }
            rdmol = suppl.next();
            if (rdmol2 == null) {
                // rdmol2.delete();
                rdmol2 = new ROMol(rdmol);
                continue;
            }
            if (rdmol.MolToSmiles().equals(rdmol2.MolToSmiles())) {
                // Same molecule as the one held: keep whichever scores better.
                if (cParser.getValue("-direction").equals("highest")) {
                    double value1 = Double.parseDouble(rdmol.getProp(cParser.getValue("-filterTag")));
                    double value2 = Double.parseDouble(rdmol2.getProp(cParser.getValue("-filterTag")));
                    //System.out.println("Val1 " + value1 + " Val2 " + value2);
                    if (value1 > value2) {
                        rdmol2.delete();
                        rdmol2 = new ROMol(rdmol);
                    }
                } else {
                    if (Double.parseDouble(rdmol.getProp(cParser.getValue("-filterTag")))
                            < Double.parseDouble(rdmol2.getProp(cParser.getValue("-filterTag")))) {
                        rdmol2.delete();
                        rdmol2 = new ROMol(rdmol);
                    }
                }
            } else {
                // New molecule: write out the best record of the previous one.
                writer.write(rdmol2);
                rdmol2.delete();
                rdmol2 = new ROMol(rdmol);
            }
        }
    }
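The selection logic described in this post (records of the same molecule sit next to each other, and only the highest or lowest scoring one is kept) can be sketched without RDKit. The following is a minimal sketch under stated assumptions: Scored and BestPerRun are hypothetical names introduced for illustration, the key stands in for whatever identifies a molecule (for example its SMILES), and the score stands in for the parsed value of the filter tag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical record type standing in for a molecule: an identifying key
// (e.g. its SMILES) plus the numeric value of the tag being filtered on.
final class Scored {
    final String key;
    final double score;

    Scored(String key, double score) {
        this.key = key;
        this.score = score;
    }
}

final class BestPerRun {
    // Returns one record per run of consecutive equal keys: the one with the
    // highest score when highest is true, otherwise the one with the lowest.
    static List<Scored> select(Iterable<Scored> records, boolean highest) {
        List<Scored> out = new ArrayList<>();
        Scored best = null; // best record seen so far in the current run
        for (Scored r : records) {
            if (best == null) {
                best = r; // first record overall
            } else if (r.key.equals(best.key)) {
                boolean better = highest ? r.score > best.score : r.score < best.score;
                if (better) {
                    best = r;
                }
            } else {
                out.add(best); // key changed: emit the finished run
                best = r;
            }
        }
        if (best != null) {
            out.add(best); // flush the final run after the loop
        }
        return out;
    }
}
```

One detail the sketch makes explicit is the flush after the loop. As posted, the Java loop appears to end without writing the last held molecule, so the final group would be missing from the output file regardless of the memory behaviour.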