Re: [Rdkit-discuss] Using the RDKit with Dask

Greg Landrum Mon, 22 Mar 2021 06:57:59 -0700

Hi Pat,

Solution: either change your calc_bcut function to:
def calc_bcut(smi):
    from rdkit.Chem.rdMolDescriptors import BCUT2D
    mol = Chem.MolFromSmiles(smi)
    return BCUT2D(mol)


or change the import on line 8 at the top to:
from rdkit.Chem import rdMolDescriptors

and do:
def calc_bcut(smi):
    mol = Chem.MolFromSmiles(smi)
    return rdMolDescriptors.BCUT2D(mol)

The second approach is probably more efficient.
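
A rough way to compare the per-call overhead of the two styles is sketched below. It uses math.sqrt as a stand-in for BCUT2D so that it runs without the RDKit, and the function names are made up for illustration. Timings will vary by machine, and for a real descriptor calculation the difference is likely to be small compared to the calculation itself.

```python
import math
import timeit


def sqrt_local_import(x):
    # First style: the import statement runs on every call
    # (after the first call it is a lookup of an already-loaded module).
    from math import sqrt
    return sqrt(x)


def sqrt_module_attr(x):
    # Second style: the module is imported once at the top and the
    # function is reached through the module on each call.
    return math.sqrt(x)


if __name__ == "__main__":
    n = 100_000
    t_local = timeit.timeit(lambda: sqrt_local_import(2.0), number=n)
    t_attr = timeit.timeit(lambda: sqrt_module_attr(2.0), number=n)
    print(f"import inside function: {t_local:.4f} s for {n} calls")
    print(f"module attribute:       {t_attr:.4f} s for {n} calls")
```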

I'm not 100% sure what's happening, but it looks like dask is trying to
somehow package up whatever is being used in calc_bcut() and is having a
problem when it sees the BCUT2D object, which is a Boost.Python.function
instead of a normal Python function:

In [3]: type(MolWt)
Out[3]: function

In [4]: type(BCUT2D)
Out[4]: Boost.Python.function

By either explicitly doing the import in calc_bcut() or referencing the
function through the module, dask seems to be able to figure out how to do
the right thing.
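
To check which object is causing a pickling failure like this one, a small probe along the following lines can help. This is only a sketch: it uses the standard-library pickle, which is an assumption since dask may serialize things differently, and pickle_error is a hypothetical helper name. The lock is just a stand-in for an extension object with no pickle support; with the RDKit installed, BCUT2D could be passed in its place.

```python
import pickle
import threading


def pickle_error(obj):
    """Return None if pickle can serialize obj, otherwise the error text."""
    try:
        pickle.dumps(obj)
    except Exception as exc:  # PicklingError, TypeError, AttributeError, ...
        return f"{type(exc).__name__}: {exc}"
    return None


if __name__ == "__main__":
    # Built-in functions are pickled by reference (module plus name).
    print(pickle_error(len))
    # Stand-in for an object that has no pickle support at all.
    print(pickle_error(threading.Lock()))
```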

-greg
p.s. in case you see different behavior:
In [2]: dask.__version__
Out[2]: '2020.12.0'




On Mon, Mar 22, 2021 at 1:51 PM Patrick Walters <[email protected]> wrote:

> Apologies, there was a bug in the code I sent in my previous message.  The
> problem is the same.  Here is the corrected code in a gist.
>
> https://gist.github.com/PatWalters/ca41289a6990ebf7af1e5c44e188fccd
>
>
>
> On Mon, Mar 22, 2021 at 8:16 AM Patrick Walters <[email protected]>
> wrote:
>
>> Hi All,
>>
>> I've been trying to calculate BCUT2D descriptors in parallel with Dask
>> and get this error with the code below.
>> TypeError: cannot pickle 'Boost.Python.function' object
>>
>> Everything works if I call mw_df, which calculates molecular weight, but
>> I get the error above if I call bcut_df.  Does anyone have a workaround?
>>
>> Thanks,
>>
>> Pat
>>
>> #!/usr/bin/env python
>>
>> import sys
>> import dask.dataframe as dd
>> import pandas as pd
>> from rdkit import Chem
>> from rdkit.Chem.Descriptors import MolWt
>> from rdkit.Chem.rdMolDescriptors import BCUT2D
>> import time
>>
>> # --  molecular weight functions
>> def calc_mw(smi):
>>     mol = Chem.MolFromSmiles(smi)
>>     return MolWt(mol)
>>
>> def mw_df(df):
>>     return df.SMILES.apply(calc_mw)
>>
>> # -- bcut functions
>> def bcut_df(df):
>>     return df.SMILES.apply(calc_bcut)
>>
>> def calc_bcut(smi):
>>     mol = Chem.MolFromSmiles(smi)
>>     return BCUT2D(mol)
>>
>> def main():
>>     start = time.time()
>>     df = pd.read_csv(sys.argv[1],sep=" ",names=["SMILES","Name"])
>>     ddf = dd.from_pandas(df,npartitions=16)
>>     ddf['MW'] = ddf.map_partitions(
>>         mw_df, meta='float').compute(scheduler='processes')
>>     ddf['BCUT'] = ddf.map_partitions(
>>         bcut_df, meta='float').compute(scheduler='processes')
>>     print(time.time()-start)
>>     print(ddf.head())
>>
>>
>> if __name__ == "__main__":
>>     main()
>>
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>

