Hi Tim,

You might also consider using chemfp, which has this sort of functionality available through its toolkit wrapper API:

  from chemfp import rdkit_toolkit as T
  import itertools

  with T.read_ids_and_molecules("chembl_28.sdf.gz") as reader:
      loc = reader.location
      for id, mol in itertools.islice(reader, 5):
          print(f"Record: {loc.recno} ({id}) line: {loc.lineno} offsets: {loc.offsets}")
          counts_line = loc.record.splitlines()[3]
          num_atoms, num_bonds = int(counts_line[:3]), int(counts_line[3:6])
          print(f" counts line #atoms: {num_atoms} #bonds: {num_bonds}")
          print(f" RDKit #atoms: {mol.GetNumAtoms()} #bonds: {mol.GetNumBonds()}")

The output in this case is:

  Record: 1 (CHEMBL153534) line: 1 offsets: (0, 1458)
   counts line #atoms: 16 #bonds: 17
   RDKit #atoms: 16 #bonds: 17
  Record: 2 (CHEMBL440060) line: 43 offsets: (1458, 18699)
   counts line #atoms: 206 #bonds: 208
   RDKit #atoms: 202 #bonds: 204
  Record: 3 (CHEMBL440245) line: 466 offsets: (18699, 39688)
   counts line #atoms: 251 #bonds: 254
   RDKit #atoms: 251 #bonds: 254
  Record: 4 (CHEMBL440249) line: 980 offsets: (39688, 56050)
   counts line #atoms: 194 #bonds: 205
   RDKit #atoms: 185 #bonds: 196
  Record: 5 (CHEMBL405398) line: 1388 offsets: (56050, 58447)
   counts line #atoms: 27 #bonds: 30
   RDKit #atoms: 27 #bonds: 30

You can also work more directly at the record tokenization level, and pass each record to the rdkit_toolkit wrapper:

  from chemfp import text_toolkit

  with text_toolkit.read_sdf_records("chembl_28.sdf.gz") as reader:
      for rec in itertools.islice(reader, 5):
          mol = T.parse_molecule(rec, "sdf")
          print(mol.GetProp("chembl_id"), "has", len(rec), "bytes")

which prints:

  CHEMBL153534 has 1458 bytes
  CHEMBL440060 has 17241 bytes
  CHEMBL440245 has 20989 bytes
  CHEMBL440249 has 16362 bytes
  CHEMBL405398 has 2397 bytes

Andrew
da...@dalkescientific.com

> On Nov 4, 2021, at 17:55, Tim Dudgeon <tdudgeon...@gmail.com> wrote:
>
> Thanks Paolo, that's fantastic.
> The first option was what I needed.
> Tim
>
> On Thu, Nov 4, 2021 at 4:36 PM Paolo Tosco <paolo.tosco.m...@gmail.com> wrote:
> Hi Tim,
>
> if you need access to the original text, you'll have to do the chunking
> yourself, e.g.:
>
> import gzip
>
> def molgen(hnd):
>     mol_text_tmp = ""
>     while 1:
>         line = hnd.readline()
>         if not line:
>             return
>         line = line.decode("utf-8")
>         mol_text_tmp += line
>         if line.startswith("$$$$"):
>             mol_text = mol_text_tmp
>             mol_text_tmp = ""
>             yield mol_text

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
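
For anyone who wants to try the do-it-yourself chunking idea from this thread without installing a toolkit, here is a minimal standard-library sketch. It is not code from chemfp or RDKit: the function names iter_sdf_records and counts_from_record and the two-record sample data are made up for illustration. It also assumes V2000 records, where the counts line is the 4th line and the atom and bond counts sit in its first two 3-character fields.

```python
import gzip
import io


def iter_sdf_records(handle):
    """Yield each SD record, terminator line included, from a text-mode handle."""
    lines = []
    for line in handle:
        lines.append(line)
        if line.startswith("$$$$"):
            yield "".join(lines)
            lines = []
    # Any trailing text without a "$$$$" terminator is dropped in this sketch.


def counts_from_record(record):
    """Read (num_atoms, num_bonds) from a V2000 counts line (4th line of the record)."""
    counts_line = record.splitlines()[3]
    return int(counts_line[:3]), int(counts_line[3:6])


# Hypothetical two-record SD file, built in memory so the sketch runs anywhere.
SAMPLE_LINES = [
    "methane",
    "  sketch",
    "",
    "  1  0  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "M  END",
    "$$$$",
    "ethane",
    "  sketch",
    "",
    "  2  1  0  0  0  0  0  0  0  0999 V2000",
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "    1.5000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0",
    "  1  2  1  0",
    "M  END",
    "$$$$",
]
SAMPLE_GZ = gzip.compress(("\n".join(SAMPLE_LINES) + "\n").encode("utf-8"))

# Opening in text mode ("rt") avoids decoding every line by hand.
with gzip.open(io.BytesIO(SAMPLE_GZ), "rt", encoding="utf-8") as handle:
    records = list(iter_sdf_records(handle))

for record in records:
    title = record.splitlines()[0]
    num_atoms, num_bonds = counts_from_record(record)
    print(f"{title}: {len(record)} chars, #atoms: {num_atoms} #bonds: {num_bonds}")
```

Each yielded string is the original record text, so it can be kept alongside whatever molecule object a toolkit builds from it. For a real file, replace the in-memory buffer with a path such as "chembl_28.sdf.gz".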