Hi Tim,

  You might also consider using chemfp, which has this sort of functionality 
available through its toolkit wrapper API:

from chemfp import rdkit_toolkit as T
import itertools

with T.read_ids_and_molecules("chembl_28.sdf.gz") as reader:
  loc = reader.location
  for id, mol in itertools.islice(reader, 5):
    print(f"Record: {loc.recno} ({id}) line: {loc.lineno} offsets: {loc.offsets}")
    counts_line = loc.record.splitlines()[3]
    num_atoms, num_bonds = int(counts_line[:3]), int(counts_line[3:6])
    print(f"  counts line #atoms: {num_atoms} #bonds: {num_bonds}")
    print(f"        RDKit #atoms: {mol.GetNumAtoms()} #bonds: {mol.GetNumBonds()}")

The output in this case is:

Record: 1 (CHEMBL153534) line: 1 offsets: (0, 1458)
  counts line #atoms: 16 #bonds: 17
        RDKit #atoms: 16 #bonds: 17
Record: 2 (CHEMBL440060) line: 43 offsets: (1458, 18699)
  counts line #atoms: 206 #bonds: 208
        RDKit #atoms: 202 #bonds: 204
Record: 3 (CHEMBL440245) line: 466 offsets: (18699, 39688)
  counts line #atoms: 251 #bonds: 254
        RDKit #atoms: 251 #bonds: 254
Record: 4 (CHEMBL440249) line: 980 offsets: (39688, 56050)
  counts line #atoms: 194 #bonds: 205
        RDKit #atoms: 185 #bonds: 196
Record: 5 (CHEMBL405398) line: 1388 offsets: (56050, 58447)
  counts line #atoms: 27 #bonds: 30
        RDKit #atoms: 27 #bonds: 30
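
To show the counts-line slicing on its own, without chemfp or RDKit installed, here is a minimal sketch run on a hand-written, hypothetical two-atom record. The helper name parse_counts_line is made up for this example. It assumes the V2000 fixed-width layout, where the first two three-character fields of the fourth line hold the atom and bond counts; V3000 records keep their counts elsewhere, so the sketch rejects them. Where RDKit reports fewer atoms and bonds than the counts line in the output above, a likely explanation is that explicit hydrogens are removed during parsing, but that is an assumption rather than something checked here.

```python
def parse_counts_line(record_text):
    """Return (num_atoms, num_bonds) from the counts line of a V2000 record.

    Hypothetical helper for illustration; expects the record as a str.
    """
    # Line 4 of a molfile (index 3) is the counts line.
    counts_line = record_text.splitlines()[3]
    if counts_line.rstrip().endswith("V3000"):
        raise ValueError("V3000 records do not store counts in the counts line")
    # Fixed-width fields: columns 1-3 are the atom count, 4-6 the bond count.
    return int(counts_line[0:3]), int(counts_line[3:6])


# Hand-written test record (not taken from ChEMBL).
record = (
    "example\n"
    "  hypothetical\n"
    "\n"
    "  2  1  0  0  0  0  0  0  0  0999 V2000\n"
    "    0.0000    0.0000    0.0000 C   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "    1.5000    0.0000    0.0000 O   0  0  0  0  0  0  0  0  0  0  0  0\n"
    "  1  2  1  0\n"
    "M  END\n"
    "$$$$\n"
)

num_atoms, num_bonds = parse_counts_line(record)
print(f"counts line #atoms: {num_atoms} #bonds: {num_bonds}")
```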

You can also work more directly at the record tokenization level, and pass each 
record to the rdkit_toolkit wrapper:


import itertools
from chemfp import rdkit_toolkit as T
from chemfp import text_toolkit

with text_toolkit.read_sdf_records("chembl_28.sdf.gz") as reader:
  for rec in itertools.islice(reader, 5):
    mol = T.parse_molecule(rec, "sdf")
    print(mol.GetProp("chembl_id"), "has", len(rec), "bytes")

which prints

CHEMBL153534 has 1458 bytes
CHEMBL440060 has 17241 bytes
CHEMBL440245 has 20989 bytes
CHEMBL440249 has 16362 bytes
CHEMBL405398 has 2397 bytes
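
For comparison, the same kind of bookkeeping (record number, starting line, byte offsets) can be sketched with only the standard library. This is not how chemfp does it internally; it is a minimal illustration under a few assumptions: records end with a "$$$$" line, offsets are counted in the uncompressed data, and any trailing text without a terminator is dropped. The names iter_sdf_records and make_record are hypothetical, and the test data is hand-written rather than taken from ChEMBL.

```python
import gzip
import io


def iter_sdf_records(fileobj):
    """Yield (recno, lineno, (start, end), record_bytes) for each SD record.

    `fileobj` must be a binary file-like object that iterates over lines.
    """
    recno = 0
    lineno = 1   # 1-based number of the next line to be read
    offset = 0   # byte offset (uncompressed) of the next line to be read
    start_line, start_offset = lineno, offset
    lines = []
    for line in fileobj:
        lines.append(line)
        lineno += 1
        offset += len(line)
        if line.rstrip(b"\r\n") == b"$$$$":
            recno += 1
            yield recno, start_line, (start_offset, offset), b"".join(lines)
            lines = []
            start_line, start_offset = lineno, offset
    # Anything left in `lines` has no "$$$$" terminator and is ignored.


def make_record(title):
    # Hypothetical atom-free V2000 record, used only as test data.
    return (title + b"\n\n\n"
            b"  0  0  0  0  0  0  0  0  0  0999 V2000\n"
            b"M  END\n"
            b"$$$$\n")


rec1 = make_record(b"mol1")
rec2 = make_record(b"molecule2")
compressed = gzip.compress(rec1 + rec2)

# Read the gzip-compressed data from memory instead of a file on disk.
with gzip.GzipFile(fileobj=io.BytesIO(compressed)) as infile:
    results = list(iter_sdf_records(infile))

for recno, lineno, offsets, rec in results:
    print(f"Record: {recno} line: {lineno} offsets: {offsets} ({len(rec)} bytes)")
```

Each yielded record can then be handed to whichever parser you prefer, in the same way the chemfp example above passes rec to T.parse_molecule.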



                                Andrew
                                da...@dalkescientific.com

> On Nov 4, 2021, at 17:55, Tim Dudgeon <tdudgeon...@gmail.com> wrote:
> 
> Thanks Paolo, that's fantastic.
> The first option was what I needed.
> Tim
> 
> On Thu, Nov 4, 2021 at 4:36 PM Paolo Tosco <paolo.tosco.m...@gmail.com> wrote:
> Hi Tim,
> 
> if you need access to the original text, you'll have to do the chunking 
> yourself, e.g.:
> 
> import gzip
> 
> def molgen(hnd):
>     mol_text_tmp = ""
>     while 1:
>         line = hnd.readline()
>         if not line:
>             return
>         line = line.decode("utf-8")
>         mol_text_tmp += line
>         if line.startswith("$$$$"):
>             mol_text = mol_text_tmp
>             mol_text_tmp = ""
>             yield mol_text






_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss