Hi Jennifer,

I think what you're going to want to do is start by collecting the bit
indices and counts that are present across the entire dataset, and then use
those to decide what the columns of your pandas table will be. Here's some
code that may help get you started:

from collections import defaultdict

from rdkit.Chem import AllChem

def get_ecfp(mols, n=3, cut=10):
    # first pass: keep each molecule's sparse {bit: count} dict and
    # tally how often every bit occurs across the whole dataset
    # (cut isn't used here; it's applied in the second step below)
    allfps = []
    counts = defaultdict(int)
    for mol in mols:
        fp = AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements()
        allfps.append(fp)
        for idx, count in fp.items():
            counts[idx] += count
    return allfps, counts
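
To get from there to the dataframe you described, one possibility (an
untested sketch; fps_to_frame and its arguments are just names I made up)
is to keep only the bits whose total count reaches your cutoff and then
fill a dense matrix with one column per surviving bit:

import numpy as np
import pandas as pd

def fps_to_frame(allfps, counts, cut=10, index=None):
    # columns = bits that occur at least `cut` times across the dataset
    frequent = sorted(idx for idx, c in counts.items() if c >= cut)
    col_pos = {idx: j for j, idx in enumerate(frequent)}
    # uint16 rather than uint8 so per-molecule counts above 255 don't wrap
    mat = np.zeros((len(allfps), len(frequent)), dtype='uint16')
    for i, fp in enumerate(allfps):
        for idx, count in fp.items():
            j = col_pos.get(idx)
            if j is not None:
                mat[i, j] = count
    return pd.DataFrame(mat, columns=frequent, index=index)

Assuming mols is the pandas Series from your code, you'd use it like:

allfps, counts = get_ecfp(mols)
df = fps_to_frame(allfps, counts, cut=10, index=mols.index)

Because the counts go into a single preallocated numpy array instead of
100,000 one-row dataframes, there's nothing to concatenate, and memory use
stays proportional to len(mols) times the number of frequent bits.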

If that's not enough, let me know and I will try to do a full version
tomorrow.

-greg




On Fri, May 18, 2018 at 9:41 AM Jennifer Hemmerich <
jennyhemmeric...@gmail.com> wrote:

> Hi all,
>
> I am trying to calculate Morgan fingerprints for approximately 100,000
> molecules. I do not want to use the folded FPs; I want to use the counts
> for the bits which are on, and afterwards drop all the infrequent features.
> So I would ultimately need a dataframe with the counts for each molecule
> assigned to the respective bit of the FP, like this:
> Molecule     1   2...  6...  ...n
> Structure1   0   0     4     1
> Structure2   1   0     0     8
>
> The function I am currently using is:
>
> def get_ecfp(mols, n=3, cut=10):
>
>     df = []
>     all_dfs = pd.DataFrame()
>     for i, mol in enumerate(mols):
>         d = pd.DataFrame(AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements(),
>                          index=[mols.index[i]], dtype='uint8')
>         df.append(d)
>         logger.debug('Calculating %s', str(i))
>
>         # append all collected dataframes in between to prevent memory issue
>         if i % 20000 == 0 or i == len(mols):
>             logger.debug('Concatenation %s', str(i))
>             part = pd.concat(df)
>             all_dfs = pd.concat([part, all_dfs])
>             logger.debug('Datatypes %s', str(all_dfs.dtypes))
>             del part
>             df = []
>
>         all_dfs = all_dfs[all_dfs.columns[all_dfs.sum(axis=0) >= cut]]  # drop columns where count < 10
>
>     return df
>
> But the concatenation is awfully slow. Without the intermediate
> concatenation I quickly run out of memory trying to concatenate all the
> dataframes, even on a machine with 128 GB of RAM.
>
> I found the possibility of converting the fingerprint to a numpy array.
> That requires me to allocate a numpy array of a certain length, which is
> impossible, as I do not know how long the final array has to be. Converting
> to an array without predefining the length just never finishes the
> computation. If I check the length of the fp with fp.GetLength() I get
> 4294967295, which just seems to be the maximum value of a 32-bit int. This
> means that converting all of the FPs to such long numpy arrays is also not
> really an option.
>
> Is there any way I did not see to get the desired DataFrame or
> ndarray out of RDKit directly? Or any better conversion? I assume that the
> keys of the dict I get with GetNonzeroElements() are the set bits of the
> 4294967295-bit-long vector?
>
> Thanks in advance!
>
> Jennifer
>