Thanks, Greg. Yutong Zhao sent me the same solution and I was just about to post his fix to the list. It's funny how I posted to the list and a colleague had the answer.
Thanks all, the RDKit community is awesome!

On Mon, Mar 22, 2021 at 9:55 AM Greg Landrum <greg.land...@gmail.com> wrote:

> Hi Pat,
>
> Solution, either change your calc_bcut function to:
> def calc_bcut(smi):
>     from rdkit.Chem.rdMolDescriptors import BCUT2D
>     mol = Chem.MolFromSmiles(smi)
>     return BCUT2D(mol)
>
> or change the import on line 8 at the top to:
> from rdkit.Chem import rdMolDescriptors
>
> and do:
> def calc_bcut(smi):
>     mol = Chem.MolFromSmiles(smi)
>     return rdMolDescriptors.BCUT2D(mol)
>
> The second approach is probably more efficient.
>
> I'm not 100% sure what's happening, but it looks like dask is trying to
> somehow package up whatever is being used in calc_bcut() and is having a
> problem when it sees the BCUT2D object, which is a Boost.Python.function
> instead of a normal Python function:
>
> In [3]: type(MolWt)
> Out[3]: function
>
> In [4]: type(BCUT2D)
> Out[4]: Boost.Python.function
>
> By either explicitly doing the import in calc_bcut() or referencing the
> function through the module, dask seems to be able to figure out how to do
> the right thing.
>
> -greg
> p.s. in case you see different behavior:
> In [2]: dask.__version__
> Out[2]: '2020.12.0'
>
> On Mon, Mar 22, 2021 at 1:51 PM Patrick Walters <wpwalt...@gmail.com>
> wrote:
>
>> Apologies, there was a bug in the code I sent in my previous message.
>> The problem is the same. Here is the corrected code in a gist.
>>
>> https://gist.github.com/PatWalters/ca41289a6990ebf7af1e5c44e188fccd
>>
>> On Mon, Mar 22, 2021 at 8:16 AM Patrick Walters <wpwalt...@gmail.com>
>> wrote:
>>
>>> Hi All,
>>>
>>> I've been trying to calculate BCUT2D descriptors in parallel with Dask
>>> and get this error with the code below.
>>> TypeError: cannot pickle 'Boost.Python.function' object
>>>
>>> Everything works if I call mw_df, which calculates molecular weight, but
>>> I get the error above if I call bcut_df. Does anyone have a workaround?
>>>
>>> Thanks,
>>>
>>> Pat
>>>
>>> #!/usr/bin/env python
>>>
>>> import sys
>>> import dask.dataframe as dd
>>> import pandas as pd
>>> from rdkit import Chem
>>> from rdkit.Chem.Descriptors import MolWt
>>> from rdkit.Chem.rdMolDescriptors import BCUT2D
>>> import time
>>>
>>> # -- molecular weight functions
>>> def calc_mw(smi):
>>>     mol = Chem.MolFromSmiles(smi)
>>>     return MolWt(mol)
>>>
>>> def mw_df(df):
>>>     return df.SMILES.apply(calc_mw)
>>>
>>> # -- bcut functions
>>> def bcut_df(df):
>>>     return df.apply(calc_bcut)
>>>
>>> def calc_bcut(smi):
>>>     mol = Chem.MolFromSmiles(smi)
>>>     return BCUT2D(mol)
>>>
>>> def main():
>>>     start = time.time()
>>>     df = pd.read_csv(sys.argv[1],sep=" ",names=["SMILES","Name"])
>>>     ddf = dd.from_pandas(df,npartitions=16)
>>>     ddf['MW'] = ddf.map_partitions(mw_df,meta='float').compute(scheduler='processes')
>>>     ddf['BCUT'] = ddf.map_partitions(bcut_df,meta='float').compute(scheduler='processes')
>>>     print(time.time()-start)
>>>     print(ddf.head())
>>>
>>>
>>> if __name__ == "__main__":
>>>     main()
>>>
>> _______________________________________________
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
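Greg's explanation in the quoted thread suggests why going through the module helps: the function object itself cannot be serialized, but a reference to it by name can. Below is a minimal standard-library sketch of that idea. It is an analogy only and does not reproduce what dask does internally. The helper names (make_local_function, can_pickle, call_by_name) are hypothetical, a locally defined function stands in for the unpicklable Boost.Python.function, and math.sqrt stands in for rdMolDescriptors.BCUT2D.

```python
import importlib
import pickle


def make_local_function():
    """Return a function that the standard pickle module refuses to serialize.

    Hypothetical stand-in for the Boost.Python.function in the traceback:
    pickle cannot serialize a locally defined function either.
    """
    def doubler(x):
        return x * 2
    return doubler


def can_pickle(obj):
    """Return True if the standard pickle module can serialize obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


def call_by_name(module_name, attr_name, *args):
    """Import the module at call time and look the function up by name."""
    module = importlib.import_module(module_name)
    return getattr(module, attr_name)(*args)


# Shipping the function object itself fails.
direct_ok = can_pickle(make_local_function())

# A (module, attribute) reference is just two strings, so it pickles fine.
reference = ("math", "sqrt")
reference_ok = can_pickle(reference)

# A worker process could unpickle the reference and resolve the function locally.
module_name, attr_name = pickle.loads(pickle.dumps(reference))
result = call_by_name(module_name, attr_name, 16.0)

print(direct_ok, reference_ok, result)
```

Both of Greg's suggested fixes follow the same pattern: the worker finds BCUT2D by importing it or by looking it up on the module, so the function object never has to be pickled alongside calc_bcut.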