Kirk,

On Mon, Nov 21, 2011 at 8:42 PM, Robert DeLisle <rkdeli...@gmail.com> wrote:
>
> In thinking about this, an unsigned 32-bit integer should give me over 4
> billion values, and a signed 32-bit gives 2 billion.  I know that the file
> has slightly over 5 million structures and ~300 million lines.  Neither of
> these is over the limit, so I wouldn't expect an overflow.

The determining factor is, unfortunately, the file size in bytes, not
the number of structures or lines.
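
To put rough numbers on that, here is a small back-of-the-envelope
sketch. It assumes (this is my assumption for illustration, not
something confirmed here) that positions in the file are tracked as
32-bit byte offsets; the helper name avg_line_bytes_at_limit is made up
for this example. It computes the average line length at which a file
with ~300 million lines reaches each 32-bit limit:

```python
# Largest values representable as signed / unsigned 32-bit integers.
INT32_MAX = 2**31 - 1
UINT32_MAX = 2**32 - 1

def avg_line_bytes_at_limit(n_lines, limit):
    """Average bytes per line (newline included) at which the total
    file size reaches `limit` bytes. Hypothetical helper."""
    return limit / n_lines

n_lines = 300_000_000  # "~300 million lines" from the original question

signed_threshold = avg_line_bytes_at_limit(n_lines, INT32_MAX)
unsigned_threshold = avg_line_bytes_at_limit(n_lines, UINT32_MAX)

print("signed 32-bit offset is exceeded above %.2f bytes/line" % signed_threshold)
print("unsigned 32-bit offset is exceeded above %.2f bytes/line" % unsigned_threshold)
```

So even though the line and structure counts fit comfortably in 32
bits, a byte offset into the file stops fitting once the lines average
more than a handful of characters.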

If you're willing to live on the bleeding edge for a bit, there's an
RDKit branch that contains a new way of working with SD files that is
well suited to dealing with large files:
https://rdkit.svn.sourceforge.net/svnroot/rdkit/branches/StreambufSupport_18Nov2011

The new feature is the ForwardSDMolSupplier, which can be initialized
from a filename:
In [3]: suppl = Chem.ForwardSDMolSupplier('PubChemBackground.sdf')

or from a Python file-like object:
In [4]: suppl2 = Chem.ForwardSDMolSupplier(file('PubChemBackground.sdf'))

You can read out molecules by looping over the supplier:
In [5]: for mol in suppl2:
   ...:     if mol is None: continue
   ...:     print mol.GetNumAtoms()
   ...:
24
17
 ....

Since these work using file-like objects, you can directly read from
compressed files:

In [6]: suppl3 = Chem.ForwardSDMolSupplier(gzip.open('bigfile.sdf.gz'))

The differences from the standard SDMolSupplier:
  - the ForwardSDMolSupplier is not random access; you cannot ask for
a particular item
  - there's no reset method; if you want to go through the molecules
more than once, you have to create the supplier from scratch.
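
To illustrate what forward-only access means in practice, here is a
minimal pure-Python sketch. It is not the RDKit implementation; the
function iter_sd_records and the tiny in-memory test data are made up
for this example. It only relies on SD records ending with a "$$$$"
line, and it reads through gzip the same way a file-like object would
be passed to the supplier:

```python
import gzip
import io

def iter_sd_records(fileobj):
    """Yield one SD record (as text) at a time from a file-like object
    that produces text lines. Hypothetical helper, forward-only."""
    lines = []
    for line in fileobj:
        lines.append(line)
        if line.strip() == "$$$$":  # SD record terminator
            yield "".join(lines)
            lines = []

# Two placeholder records, gzip-compressed in memory (not real molecules).
raw = "mol1\n...\n$$$$\nmol2\n...\n$$$$\n"
buf = io.BytesIO(gzip.compress(raw.encode("ascii")))

with gzip.open(buf, "rt") as fh:
    records = iter_sd_records(fh)
    # First pass: take the title line of each record.
    first_pass = [rec.splitlines()[0] for rec in records]
    # Second pass: the iterator is exhausted, so nothing comes back.
    second_pass = list(records)

print(first_pass)
print(second_pass)
```

The second pass comes back empty, which is the same behaviour to expect
from a supplier with no reset method: to iterate again, reopen the file
and build a new one.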

Coincidentally, this was inspired by some suggestions Andrew has made
in the last week or so.

I will be merging this branch back into the trunk sometime in the next
week, but the code is there, mostly tested, and usable now.

-greg

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
