Kirk,

On Mon, Nov 21, 2011 at 8:42 PM, Robert DeLisle <rkdeli...@gmail.com> wrote:
>
> In thinking about this, an unsigned 32-bit integer should give me over 4
> billion values, and a signed 32-bit gives 2 billion. I know that the file
> has slightly over 5 million structures and ~300 million lines. Neither of
> these is over the limit, so I wouldn't expect an overflow.

The determining factor is, unfortunately, the file size, not the number of
lines.

If you're willing to live on the bleeding edge for a bit, there's an RDKit
branch that contains a new way of working with SD files that is well suited
to dealing with large files:
https://rdkit.svn.sourceforge.net/svnroot/rdkit/branches/StreambufSupport_18Nov2011

The new feature is the ForwardSDMolSupplier, which can be initialized from a
filename:

In [3]: suppl = Chem.ForwardSDMolSupplier('PubChemBackground.sdf')

or a python file-like object:

In [4]: suppl2 = Chem.ForwardSDMolSupplier(file('PubChemBackground.sdf'))

You can read out molecules by looping over the supplier:

In [5]: for mol in suppl2:
   ...:     if mol is None: continue
   ...:     print mol.GetNumAtoms()
   ...:
24
17
....

Since these work using file-like objects, you can directly read from
compressed files:

In [6]: suppl3 = Chem.ForwardSDMolSupplier(gzip.open('bigfile.sdf.gz'))

The differences from the standard SDMolSupplier:
- the ForwardSDMolSupplier is not random access; you cannot ask for a
  particular item
- there's no reset method; if you want to go through the molecules more
  than once, you have to create the supplier from scratch.

Coincidentally, this was inspired by some suggestions Andrew has made in the
last week or so.

I will be merging this branch back into the trunk sometime in the next week,
but the code is there, mostly tested, and usable now.

-greg

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
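
To make the file-size point concrete, here is a small back-of-the-envelope sketch. It assumes (this is not spelled out in the message) that the limit being hit is a byte offset into the file held in a 32-bit integer. The names INT32_MAX, UINT32_MAX and max_avg_line_length are made up for illustration and are not RDKit identifiers; the line count is the ~300 million from the quoted question.

```python
# Largest byte offset a 32-bit integer can hold (assumption: offsets into
# the file are stored in a 32-bit integer).
INT32_MAX = 2**31 - 1    # signed, just under 2 GiB
UINT32_MAX = 2**32 - 1   # unsigned, just under 4 GiB

# "~300 million lines", taken from the quoted question.
n_lines = 300_000_000


def max_avg_line_length(limit, lines):
    """Average bytes per line (newline included) at which a file with
    `lines` lines reaches `limit` bytes."""
    return limit / lines


signed_threshold = max_avg_line_length(INT32_MAX, n_lines)
unsigned_threshold = max_avg_line_length(UINT32_MAX, n_lines)

print(f"signed 32-bit limit reached at   {signed_threshold:.2f} bytes/line")
print(f"unsigned 32-bit limit reached at {unsigned_threshold:.2f} bytes/line")
```

Under that assumption, a file with ~300 million lines passes the signed limit once its lines average a little over 7 bytes, newline included, and the unsigned limit at a little over 14. Atom lines in an SD file are considerably longer than that, so the file can be out of range even though neither the line count nor the structure count is.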