Re: [Rdkit-discuss] Calculating MorganFingerprint Counts for large number of Molecules

Andrew Dalke Fri, 18 May 2018 10:44:07 -0700

On May 18, 2018, at 17:48, Jennifer Hemmerich <jennyhemmeric...@gmail.com> 
wrote:
> I really liked the idea and I implemented it as follows:



>     df = pd.DataFrame(columns=counts.keys())
>     for i,fp in enumerate(allfps):
>         logger.debug('appending %s', str(i))
>         df.append(fp, ignore_index=True)


While this saves a bit over creating an (N=100,000 x 2**32) data frame, it 
still creates a dataframe with (N x #unique keys).

This is probably where your memory is going.
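
As a rough illustration of the scale, here is a back-of-the-envelope estimate. The number of unique keys and the 8 bytes per cell are assumed figures for the sake of the arithmetic, not measurements from your data set.

```python
# Rough size of a dense (N x #unique keys) table of counts.
# n_unique_keys is an assumption for illustration; the real value
# depends entirely on the molecules in the data set.
n_molecules = 100_000
n_unique_keys = 200_000
bytes_per_cell = 8  # assumes one 64-bit value per cell


def dense_size_gib(n_rows, n_cols, cell_bytes=bytes_per_cell):
    """Return the size in GiB of a dense n_rows x n_cols table."""
    return n_rows * n_cols * cell_bytes / 2**30


print(f"{dense_size_gib(n_molecules, n_unique_keys):.0f} GiB")
```

Even if the real number of unique keys is several times smaller than assumed here, the dense table is still far larger than the handful of columns which survive the cut.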

Your next operation keeps only those columns whose total count is at least 
"cut", and throws away the rest of the data.

>     df = df[df.columns[df.sum(axis=0) >= cut]]

What you can do instead is to select only those keys with at least cut 
elements, before making the DataFrame.

One way to do that is to use the collections.Counter class, which is like a 
collections.defaultdict(int) with the addition that it implements a 
"most_common()" method.

When called with no arguments it returns the list of (key, value) pairs, 
sorted from largest value to smallest.

That means you can use a loop like:

    selected_keys = []
    for idx, count in counts.most_common():
        if count < cut:
            break
        selected_keys.append(idx)

to select only the keys which are at or above the cut value.
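
Here is a small self-contained version of that loop which you can run on its own. The feature ids and counts are made up for illustration. It also shows a list comprehension which picks the same keys without relying on the sorted order.

```python
from collections import Counter

# Toy feature counts; the integer keys stand in for Morgan feature ids.
counts = Counter({101: 7, 202: 3, 303: 3, 404: 1})
cut = 3

# most_common() yields pairs from largest count to smallest, so the
# loop can stop at the first count which falls below the cut.
selected_keys = []
for idx, count in counts.most_common():
    if count < cut:
        break
    selected_keys.append(idx)

# Alternative: filter directly. This picks the same keys, but the
# result follows dictionary order rather than count order.
filtered_keys = [idx for idx, count in counts.items() if count >= cut]

print(selected_keys)
print(filtered_keys)
```

The most_common() version has the side effect that the resulting columns are ordered from most to least frequent, which can be convenient when inspecting the table.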

Once you have those keys, you can create the rows which are then passed to the 
DataFrame.

The following seems to do what you are looking for. At the very least, it gets 
you more of the way there.


============================

from rdkit.Chem import AllChem

from collections import Counter
import pandas as pd

def get_ecfp(mols, n=3, cut=10):
    allfps = []
    counts = Counter()
    for i, mol in enumerate(mols):
        fp = AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements()
        allfps.append(fp)
        for idx, count in fp.items():
            # use "+= count" for the most frequent feature
            counts[idx] += count
            # use "+= 1" for the feature in the most fingerprints
            #counts[idx] += 1

    selected_keys = []
    for idx, count in counts.most_common():
        if count < cut:
            break
        selected_keys.append(idx)
        
    rows = []
    for fp in allfps:
        rows.append([fp.get(key, 0) for key in selected_keys])
        
    df = pd.DataFrame(
        rows,
        columns=selected_keys)

    return df

if __name__ == "__main__":
    mols = []
    for smi in ("c1ccccc1O", "CNO", "C#N"):
        mol = AllChem.MolFromSmiles(smi)
        mols.append(mol)

    print(get_ecfp(mols, cut=2))

============================
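
The two counting options mentioned in the comments of get_ecfp() can select different keys. Here is a minimal sketch of the difference, using made-up {feature id: count} dictionaries in place of real fingerprints:

```python
from collections import Counter

# Hypothetical fingerprints as {feature id: count} dictionaries.
allfps = [{1: 5}, {2: 1}, {2: 1}, {2: 1}]

total_counts = Counter()     # "+= count": occurrences summed over all molecules
presence_counts = Counter()  # "+= 1": number of fingerprints containing the feature
for fp in allfps:
    for idx, count in fp.items():
        total_counts[idx] += count
        presence_counts[idx] += 1

# Feature 1 occurs most often in total; feature 2 is in the most fingerprints.
print(total_counts.most_common(1))
print(presence_counts.most_common(1))
```

Which one you want depends on whether "cut" is meant as a minimum number of occurrences or a minimum number of molecules.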

Note that I construct a list of rows which I then pass in to the DataFrame all 
at once, rather than append() one row at a time as your code does.

This is because the Pandas documentation, at 
https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html
says: "Iteratively appending rows to a DataFrame can be more computationally 
intensive than a single concatenate. A better solution is to append those rows 
to a list and then concatenate the list with the original DataFrame all at 
once."
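
As a sketch of the pattern the documentation describes, the following builds the same small table two ways: from a list of rows in one DataFrame() call, and with a single pd.concat() of one-row frames. The fingerprint dictionaries and selected keys are made up for illustration. Note also that DataFrame.append() was deprecated in pandas 1.4 and removed in pandas 2.0, so on current versions collecting rows first is the only option of the two.

```python
import pandas as pd

# Hypothetical per-molecule count dictionaries (feature id -> count).
allfps = [{101: 2, 202: 1}, {101: 1}, {202: 4, 303: 1}]
selected_keys = [101, 202]

# Build every row first, then create the DataFrame in one call.
rows = [[fp.get(key, 0) for key in selected_keys] for fp in allfps]
df = pd.DataFrame(rows, columns=selected_keys)

# The same table via a single concatenation of one-row frames.
pieces = [pd.DataFrame([row], columns=selected_keys) for row in rows]
df_concat = pd.concat(pieces, ignore_index=True)

print(df)
```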


Cheers,


                                Andrew
                                da...@dalkescientific.com


