Andrew - thank you for the clarification. Obviously a character offset
into the file makes much more sense than a line offset. oops. 8^)
Greg - thanks for the link. I may give that a try. I have a different
approach in place now, so this file is taken care of. I genuinely hope I
don't have to process this many structures too often 8^) but I'll
certainly give the ForwardSDMolSupplier a try just in case I do.
On Mon, Nov 21, 2011 at 1:00 PM, Greg Landrum <greg.land...@gmail.com> wrote:
> Kirk,
>
> On Mon, Nov 21, 2011 at 8:42 PM, Robert DeLisle <rkdeli...@gmail.com>
> wrote:
> >
> > In thinking about this, an unsigned 32-bit integer should give me over 4
> > billion values, and a signed 32-bit gives 2 billion. I know that the file
> > has slightly over 5 million structures and ~300 million lines. Neither of
> > these is over the limit, so I wouldn't expect an overflow.
>
> The determining factor is, unfortunately, the file size, not the
> number of lines.
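
A rough back-of-the-envelope check of why the file size is what matters. This assumes the supplier records each structure's position as a byte offset held in a 32-bit integer (my assumption; the thread only says the file size is the determining factor) and uses the approximate line count quoted above:

```python
# Largest values that 32-bit integers can hold.
INT32_MAX = 2**31 - 1    # signed:   2,147,483,647
UINT32_MAX = 2**32 - 1   # unsigned: 4,294,967,295

approx_lines = 300_000_000  # "~300 million lines" from the quoted message

# Average bytes per line (newline included) above which the byte offset
# of the last line no longer fits in the integer type.
signed_threshold = INT32_MAX / approx_lines
unsigned_threshold = UINT32_MAX / approx_lines

print(f"signed 32-bit offsets overflow above ~{signed_threshold:.1f} bytes/line")
print(f"unsigned 32-bit offsets overflow above ~{unsigned_threshold:.1f} bytes/line")
```

So even though the line count itself fits comfortably, a file with that many lines only needs to average a handful of bytes per line before a 32-bit character offset runs out of room.
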
>
> If you're willing to live on the bleeding edge for a bit, there's an
> RDKit branch that contains a new way of working with SD files that is
> well suited to dealing with large files:
>
> https://rdkit.svn.sourceforge.net/svnroot/rdkit/branches/StreambufSupport_18Nov2011
>
> The new feature is the ForwardSDMolSupplier, which can be initialized
> from a filename:
> In [3]: suppl = Chem.ForwardSDMolSupplier('PubChemBackground.sdf')
>
> or a python file-like object:
> In [4]: suppl2 = Chem.ForwardSDMolSupplier(file('PubChemBackground.sdf'))
>
> You can read out molecules by looping over the supplier:
> In [5]: for mol in suppl2:
>    ...:     if mol is None: continue
>    ...:     print mol.GetNumAtoms()
>    ...:
> 24
> 17
> ....
>
> Since these work using file-like objects, you can directly read from
> compressed files:
>
> In [6]: suppl3 = Chem.ForwardSDMolSupplier(gzip.open('bigfile.sdf.gz'))
>
> The differences from the standard SDMolSupplier:
> - the ForwardSDMolSupplier is not random access; you cannot ask for
>   a particular item
> - there's no reset method; if you want to go through the molecules
>   more than once, you have to create the supplier from scratch.
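
To make the forward-only behaviour concrete, here is a minimal pure-Python sketch. It is not RDKit's implementation and it does not parse molecules; the helper name iter_sd_records is made up for illustration. It simply splits a text-mode file-like object on the `$$$$` record delimiter used in SD files, which is enough to show why there is no random access and no reset, and why a gzip stream works as well as a plain file:

```python
import gzip
import io


def iter_sd_records(fileobj):
    """Yield SD records (as text) one at a time from a text-mode file-like object.

    Hypothetical helper for illustration only; it never seeks, so it only
    needs the file-like object to support forward iteration over lines.
    """
    lines = []
    for line in fileobj:
        if line.strip() == "$$$$":  # SD record delimiter
            yield "".join(lines)
            lines = []
        else:
            lines.append(line)
    # Emit a trailing record that lacks a final delimiter, if any.
    if any(l.strip() for l in lines):
        yield "".join(lines)


# Two tiny fake records; real SD records contain full mol blocks.
demo = "mol1\nline\n$$$$\nmol2\nline\n$$$$\n"

records = iter_sd_records(io.StringIO(demo))
first_pass = [r.splitlines()[0] for r in records]
second_pass = list(records)  # already exhausted: a second pass yields nothing

# The same reader works on a compressed stream opened in text mode.
compressed = io.BytesIO(gzip.compress(demo.encode("ascii")))
with gzip.open(compressed, "rt") as fh:
    gz_names = [r.splitlines()[0] for r in iter_sd_records(fh)]

print(first_pass, second_pass, gz_names)
```

Because nothing is ever indexed or rewound, memory use stays at one record at a time regardless of the file size; the cost is that a second pass means reopening the file and building a new iterator.
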
>
> Coincidentally, this was inspired by some suggestions Andrew has made
> in the last week or so.
>
> I will be merging this branch back into the trunk sometime in the next
> week, but the code is there, mostly tested, and usable now.
>
> -greg
>
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss