Dear all,

The changes to add support for a ForwardSDMolSupplier that can work with very large files and read directly from gzipped SD files have now been merged onto the trunk.
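As a rough illustration of what forward-only reading means, here is a pure-Python sketch that streams SD records out of a plain or gzipped file one at a time. It is not RDKit code and says nothing about how RDKit implements the supplier; the names `iter_sd_records` and `count_records` are made up for this example. The only thing it relies on is that each SD record ends with a `$$$$` line.

```python
import gzip
from typing import IO, Iterator


def iter_sd_records(stream: IO[str]) -> Iterator[str]:
    """Yield one SD record at a time, up to and including its '$$$$' line."""
    lines = []
    for line in stream:  # strictly sequential: no seek() or tell() needed
        lines.append(line)
        if line.rstrip("\r\n") == "$$$$":
            yield "".join(lines)
            lines = []
    # Hand back a trailing record that lacks the '$$$$' terminator, if any.
    if any(text.strip() for text in lines):
        yield "".join(lines)


def count_records(path: str) -> int:
    """Count the records in an SD file, transparently handling '.gz' files."""
    opener = gzip.open if path.endswith(".gz") else open
    with opener(path, "rt") as handle:
        return sum(1 for _ in iter_sd_records(handle))
```

Because only one record is held in memory at a time and the stream is never rewound, the same loop works on a compressed stream, at the cost of random access.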
I also checked in modifications to the SDWriter, SmilesWriter, and TDTWriter classes so that they can now write to file-like objects as well as named files. This means you can directly generate gzipped SD files or SD text from within python.

There is currently no support for a ForwardSmilesMolSupplier since that turns out to be more work than the ForwardSDMolSupplier, but if there is demand it can be added in the future.

Best,
-greg

On Tue, Dec 6, 2011 at 5:05 PM, Andrew Dalke <[email protected]> wrote:
> Hi Jean-Paul,
>
> On Dec 6, 2011, at 5:00 PM, JP wrote:
>> RDKit - v2011.09.01 - chokes on massive SDF files when using
>> Chem.SDMolSupplier(input_file)
> ...
>> Has anyone else noticed this? Are there any known limitations? (buffer
>> sizes etc maybe)
>
> This came up a couple of weeks ago on the list. The current reader does
> tell()/seek() operations on the file, with a 32-bit integer. This can't
> handle files larger than 2**32-1 bytes long.
>
> If you want a solution now, Greg wrote:
>
> On Nov 21, 2011, at 9:00 PM, Greg Landrum wrote:
>> If you're willing to live on the bleeding edge for a bit, there's an
>> RDKit branch that contains a new way of working with SD files that is
>> well suited to dealing with large files:
>> https://rdkit.svn.sourceforge.net/svnroot/rdkit/branches/StreambufSupport_18Nov2011
>>
>> The new feature is the ForwardSDMolSupplier, this can be initialized
>> from a filename:
>> In [3]: suppl = Chem.ForwardSDMolSupplier('PubChemBackground.sdf')
>>
>> or a python file-like object:
>> In [4]: suppl2 = Chem.ForwardSDMolSupplier(file('PubChemBackground.sdf'))
>>
>> You can read out molecules by looping over the supplier:
>> In [5]: for mol in suppl2:
>>    ...:     if mol is None: continue
>>    ...:     print mol.GetNumAtoms()
>>    ...:
>> 24
>> 17
>> ....
>>
>> Since these work using file-like objects, you can directly read from
>> compressed files:
>>
>> In [6]: suppl3 = Chem.ForwardSDMolSupplier(gzip.open('bigfile.sdf.gz'))
>>
>> The differences to the standard SDMolSupplier :
>> - the ForwardSDMolSupplier is not random access; you cannot ask for
>> a particular item
>> - there's no reset method, if you want to go through the molecules
>> more than once, you have to create the supplier from scratch.
>>
>> Coincidentally, this was inspired by some suggestions Andrew has made
>> in the last week or so.
>>
>> I will be merging this branch back into the trunk sometime in the next
>> week, but the code is there, mostly tested, and usable now.
>
> Andrew
> [email protected]

_______________________________________________
Rdkit-discuss mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
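The writer change announced at the top of this message (SDWriter accepting a file-like object) can be sketched as follows. This is an illustration only, written for Python 3 and a current RDKit rather than the 2011 trunk. The helper names `mols_to_sd_text`, `write_gzipped_sd` and `atom_counts_from_gz` are hypothetical. The sketch writes into an in-memory `io.StringIO` buffer and gzips the text in a separate step, which is one way among several; RDKit is imported inside the functions so the definitions load even where it is not installed.

```python
import gzip
import io


def mols_to_sd_text(mols):
    """Return SD text for the given molecules via an in-memory buffer."""
    from rdkit import Chem  # imported lazily; needs RDKit installed

    buffer = io.StringIO()
    writer = Chem.SDWriter(buffer)  # a file-like object instead of a filename
    for mol in mols:
        writer.write(mol)
    writer.flush()  # push pending output into the buffer before reading it
    return buffer.getvalue()


def write_gzipped_sd(mols, path):
    """Write the molecules to a gzipped SD file at `path`."""
    with gzip.open(path, "wt") as handle:
        handle.write(mols_to_sd_text(mols))


def atom_counts_from_gz(path):
    """Read a gzipped SD file front to back and return the atom counts."""
    from rdkit import Chem  # imported lazily; needs RDKit installed

    counts = []
    with gzip.open(path) as handle:
        # Forward-only supplier: iterate once, no indexing and no reset.
        for mol in Chem.ForwardSDMolSupplier(handle):
            if mol is None:  # records that fail to parse come back as None
                continue
            counts.append(mol.GetNumAtoms())
    return counts
```

A round trip would then be `write_gzipped_sd(mols, 'out.sdf.gz')` followed by `atom_counts_from_gz('out.sdf.gz')`, where `mols` is any list of RDKit molecules and the file name is a placeholder.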

