On May 18, 2018, at 17:48, Jennifer Hemmerich <jennyhemmeric...@gmail.com> wrote:
> I really liked the idea and I implemented it as follows:
> df = pd.DataFrame(columns=counts.keys())
> for i,fp in enumerate(allfps):
>     logger.debug('appending %s', str(i))
>     df.append(fp, ignore_index=True)

While this saves a bit over creating an (N=100,000 x 2**32) data frame, it still creates a DataFrame of size (N x number of unique keys). This is probably where your memory is going.

Your next operation selects only those columns with a given number of fingerprints in them, and throws away the rest of the data.

> df = df[df.columns[df.sum(axis=0) >= cut]]

What you can do instead is select only those keys with at least "cut" elements, before making the DataFrame.

One way to do that is to use the collections.Counter class, which is like a collections.defaultdict(int) with the addition that it implements a "most_common()" method. When called with no arguments it returns the list of (key, value) pairs, sorted from largest value to smallest. That means you can use a loop like:

  selected_keys = []
  for idx, count in counts.most_common():
      if count < cut:
          break
      selected_keys.append(idx)

to select only the keys which are at or above the cut value.

Once you have those keys, you can create the rows which are then passed to the DataFrame.

The following seems to do what you are looking for. At the very least, it gets you more of the way there.

============================
from rdkit.Chem import AllChem
from collections import Counter
import pandas as pd

def get_ecfp(mols, n=3, cut=10):
    allfps = []
    counts = Counter()
    for i, mol in enumerate(mols):
        fp = AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements()
        allfps.append(fp)
        for idx, count in fp.items():
            # use "+= count" for the most frequent feature
            counts[idx] += count
            # use "+= 1" for the feature in the most fingerprints
            #counts[idx] += 1

    # keep only the keys at or above the cut value
    selected_keys = []
    for idx, count in counts.most_common():
        if count < cut:
            break
        selected_keys.append(idx)

    # one row per fingerprint, one column per selected key
    rows = []
    for fp in allfps:
        rows.append([fp.get(key, 0) for key in selected_keys])

    df = pd.DataFrame(rows, columns=selected_keys)
    return df

if __name__ == "__main__":
    mols = []
    for smi in ("c1ccccc1O", "CNO", "C#N"):
        mol = AllChem.MolFromSmiles(smi)
        mols.append(mol)

    print(get_ecfp(mols, cut=2))
============================

Note that I construct a list of rows which I then pass in to the DataFrame at once, rather than append() one row at a time as your code does. This is because the Pandas documentation says, at https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.append.html :

  Iteratively appending rows to a DataFrame can be more
  computationally intensive than a single concatenate. A better
  solution is to append those rows to a list and then concatenate
  the list with the original DataFrame all at once.

Cheers,

                Andrew
                da...@dalkescientific.com

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
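
As a small stand-alone illustration of the key-selection step described in the message, here is a minimal sketch that runs without RDKit or pandas. The fingerprints are made-up {feature_id: count} dictionaries standing in for the GetNonzeroElements() output, and the names select_keys, build_rows, and per_fingerprint are hypothetical helpers introduced only for this example. It shows how the two counting choices ("+= count" versus "+= 1") can select different keys for the same cut.

```python
from collections import Counter


def select_keys(fingerprints, cut, per_fingerprint=False):
    """Return the feature keys whose tally is at least `cut`.

    fingerprints: list of {feature_id: count} dicts (made-up sample data
    in this sketch, standing in for real fingerprint output).
    per_fingerprint: if True, tally each feature once per fingerprint
    ("+= 1"); otherwise tally the raw counts ("+= count").
    """
    counts = Counter()
    for fp in fingerprints:
        for idx, count in fp.items():
            counts[idx] += 1 if per_fingerprint else count

    selected = []
    # most_common() yields (key, tally) pairs from largest to smallest,
    # so the loop can stop at the first tally below the cut.
    for idx, tally in counts.most_common():
        if tally < cut:
            break
        selected.append(idx)
    return selected


def build_rows(fingerprints, selected_keys):
    """One row per fingerprint, one column per selected key (0 if absent)."""
    return [[fp.get(key, 0) for key in selected_keys] for fp in fingerprints]


# Hypothetical sample fingerprints, not real RDKit output.
fps = [{10: 3, 20: 1}, {10: 1, 30: 2}, {20: 1, 40: 1}]

keys_by_total = select_keys(fps, cut=2)                           # "+= count"
keys_by_presence = select_keys(fps, cut=2, per_fingerprint=True)  # "+= 1"
rows = build_rows(fps, keys_by_total)

print(keys_by_total)
print(keys_by_presence)
print(rows)
```

With this sample data, feature 30 passes the cut when raw counts are summed (it occurs twice in one fingerprint) but not when each fingerprint counts once, which is the distinction the two commented alternatives in get_ecfp() are about. The resulting rows list is what would be handed to pd.DataFrame(rows, columns=...) in a single call.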