Hello everyone,
We have been writing a script that searches through a large number of molecules, spread across different files, for a common substructure. To speed this up we have been trying to run the script in parallel (see the scripts below). However, the online tutorial notes mention problems with using the SDMolSupplier in parallel. We were wondering what the issue is and how we could circumvent it to speed up some of our calculations.

Non-parallel:

    from __future__ import print_function
    from rdkit import Chem
    import os
    from progressbar import ProgressBar

    pbar = ProgressBar()
    matches = []
    directory = r'Q:\Data2'
    patt = Chem.MolFromSmarts('NC(N****NC=O)=O')

    # Collect every molecule that contains the substructure
    for file in pbar(os.listdir(directory)):
        filename = os.fsdecode(file)
        if filename.endswith(".sdf"):
            f = os.path.join(directory, filename)
            suppl = Chem.SDMolSupplier(f)
            for mol in suppl:
                if mol is None:
                    continue
                if mol.HasSubstructMatch(patt):
                    matches.append(mol)

    # Write all matches to a single SDF file
    w = Chem.SDWriter(r'C:\Users\tom.watts\Desktop\datasmarts4c.sdf')
    for m in matches:
        w.write(m)
    w.close()
    print(filename)

Parallel:

    import multiprocessing
    import os

    # Parallel and delayed come from joblib
    from joblib import Parallel, delayed
    from progressbar import ProgressBar
    from rdkit import Chem

    pbar = ProgressBar()
    matches = []
    directory = r'E:\Data'
    patt = Chem.MolFromSmarts('NC(N****NC=O)=O')
    w = Chem.SDWriter(r'C:\Users\tom.watts\Desktop\SearchDataNonly.sdf')

    # Build the list of SDF files to search
    l = []
    for file in pbar(os.listdir(directory)):
        filename = os.fsdecode(file)
        if filename.endswith(".sdf"):
            f = os.path.join(directory, filename)
            l.append(f)

    num_cores = multiprocessing.cpu_count()
    print(num_cores)
    lock = multiprocessing.Lock()

    def Search(i):
        # Each call opens its own supplier for one file
        suppl = Chem.SDMolSupplier(i)
        for mol in suppl:
            if mol is None:
                continue
            if mol.HasSubstructMatch(patt):
                matches.append(mol)
        return matches

    results = Parallel(n_jobs=20)(delayed(Search)(i) for i in l)

We also wish to use a second script that opens one SDF file and then runs a loop over each molecule in the file. This is currently done serially, and we were wondering if it could be made parallel.

    suppl = Chem.SDMolSupplier('Red3.sdf')
    for mol in suppl:
        patt = Chem.MolFromSmarts('NC(N)=O')
        num = mol.GetSubstructMatches(patt)
        logger.debug(Chem.MolToSmiles(mol))
        h = len(num)
        m3 = Chem.AddHs(mol)
        cids = AllChem.EmbedMultipleConfs(m3, numConfs)

Any comments would be useful.

Thanks a lot,
Stamatia Zavitsanou
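
P.S. To make the single-file question more concrete, here is a rough sketch of the kind of pattern we have in mind. It is not something we have benchmarked. It assumes the difficulty lies in sharing one supplier object between processes, so it avoids the supplier altogether: the SDF file is split into text records on the "$$$$" terminator line, and each record is handed to a worker process as a plain string. The names iter_sdf_records, count_lines and run_parallel are our own placeholders, and count_lines is only a stand-in for the real per-molecule work.

```python
from concurrent.futures import ProcessPoolExecutor


def iter_sdf_records(lines):
    """Yield each SDF record as one string, split on the '$$$$' terminator."""
    record = []
    for line in lines:
        record.append(line)
        if line.strip() == "$$$$":
            yield "".join(record)
            record = []
    # Keep a trailing record that lacks the terminator, if it has any content.
    if any(item.strip() for item in record):
        yield "".join(record)


def count_lines(record):
    """Hypothetical stand-in worker: returns the number of lines in a record.

    The real worker would parse the record with RDKit and do the
    substructure search and conformer embedding there.
    """
    return len(record.splitlines())


def run_parallel(sdf_path, worker=count_lines, n_procs=4):
    """Apply worker to every record of one SDF file using n_procs processes."""
    with open(sdf_path) as handle:
        records = list(iter_sdf_records(handle))
    with ProcessPoolExecutor(max_workers=n_procs) as pool:
        # Only plain strings cross the process boundary, never Mol objects.
        return list(pool.map(worker, records, chunksize=64))
```

In the real worker we would presumably rebuild each molecule with Chem.MolFromMolBlock(record) and return something simple such as a SMILES string. We have not checked how that call treats the data fields that follow "M  END", so please treat this as an assumption. On Windows the call to run_parallel would also need to sit under an if __name__ == "__main__": guard.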
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss