[Rdkit-discuss] Calculating MorganFingerprint Counts for large number of Molecules

Jennifer Hemmerich Fri, 18 May 2018 00:41:44 -0700

Hi all,

I am trying to calculate Morgan fingerprints for approximately 100,000 molecules. I do not want to use the folded FPs. I want to use the counts for the bits which are on and afterwards drop all the infrequent features. So I would ultimately need a dataframe with the counts for each molecule assigned to the respective bit of the FP, like this:


Molecule      1    2...   6...   ...n
Structure1    0    0      4      1
Structure2    1    0      0      8

The function I am currently using is:

import logging

import pandas as pd
from rdkit.Chem import AllChem

logger = logging.getLogger(__name__)

def get_ecfp(mols, n=3, cut=10):
    # mols is indexed via mols.index below, so it is presumably a pandas Series of RDKit molecules
    df = []
    all_dfs = pd.DataFrame()
    for i, mol in enumerate(mols):
        d = pd.DataFrame(AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements(),
                         index=[mols.index[i]], dtype='uint8')
        df.append(d)
        logger.debug('Calculating %s', str(i))

        # append all collected dataframes in between to prevent memory issue
        if i % 20000 == 0 or i == len(mols) - 1:
            logger.debug('Concatenation %s', str(i))
            part = pd.concat(df)
            all_dfs = pd.concat([part, all_dfs])
            logger.debug('Datatypes %s', str(all_dfs.dtypes))
            del part
            df = []

    # drop columns where the total count is below cut (default 10)
    all_dfs = all_dfs[all_dfs.columns[all_dfs.sum(axis=0) >= cut]]

    return all_dfs

But the concatenation is awfully slow. Without the intermediate concatenation I quickly run out of memory trying to concatenate all dataframes, even on a machine with 128 GB of RAM.

I found that a fingerprint can be converted to a numpy array. That requires preallocating an array of a fixed length, which is impossible because I do not know how long the final array has to be. Assigning it to an array without predefining the length just never finishes the computation. If I check the length of the fp with fp.GetLength() I get 4294967295, which is the maximum value of an unsigned 32-bit integer (2^32 - 1). This means that converting all of the FPs to numpy arrays of that length is also not really an option.
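To make the size problem concrete, here is a small back-of-the-envelope calculation. It only uses the length reported above; the one-byte-per-count storage (uint8, as in the function above) and the helper name dense_bytes are assumptions for illustration.

```python
# Length reported by fp.GetLength() in this post.
FP_LENGTH = 4294967295

# It equals the largest unsigned 32-bit integer, 2**32 - 1.
assert FP_LENGTH == 2**32 - 1


def dense_bytes(n_molecules, bytes_per_count=1, length=FP_LENGTH):
    """Bytes needed to store one dense count vector per molecule."""
    return n_molecules * length * bytes_per_count


GIB = 1024**3
one_mol = dense_bytes(1) / GIB          # just under 4 GiB for a single uint8 vector
all_mols = dense_bytes(100_000) / GIB   # roughly 400,000 GiB for 100,000 molecules
print(f"{one_mol:.2f} GiB per molecule, {all_mols:,.0f} GiB in total")
```

Even a single dense vector would take about 4 GiB, so any workable representation has to store only the nonzero entries.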

Is there any way I have overlooked to get the desired DataFrame or ndarray out of RDKit directly? Or is there a better conversion? I assume that the keys of the dict I get with GetNonzeroElements() are the set bits of the 4294967295-bit-long vector?
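One possible direction, sketched under the assumption stated above (the dict keys are feature identifiers and the values are counts), is to skip the per-molecule DataFrames and collect sparse (row, column, count) triples in two passes. The function name sparse_counts and the input name fps are hypothetical. The sketch works on plain dicts so that it runs without RDKit; in practice each dict would come from AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements(), as in the function above.

```python
from collections import Counter


def sparse_counts(fps, cut=10):
    """Turn per-molecule {feature_id: count} dicts into sparse triples.

    fps: list of dicts, one per molecule (hypothetical input name).
    cut: keep only features whose total count over all molecules is >= cut.
    Returns (columns, rows, cols, data) in coordinate (COO) layout.
    """
    # Pass 1: total count of each feature over all molecules.
    totals = Counter()
    for fp in fps:
        totals.update(fp)

    # Keep frequent features and give each one a compact column index.
    columns = sorted(f for f, total in totals.items() if total >= cut)
    col_of = {f: j for j, f in enumerate(columns)}

    # Pass 2: emit one (row, column, count) triple per retained nonzero entry.
    rows, cols, data = [], [], []
    for i, fp in enumerate(fps):
        for f, count in fp.items():
            j = col_of.get(f)
            if j is not None:
                rows.append(i)
                cols.append(j)
                data.append(count)
    return columns, rows, cols, data


# Toy input standing in for GetNonzeroElements() results.
toy = [{6: 4, 4000000000: 1}, {1: 1, 4000000000: 8}]
columns, rows, cols, data = sparse_counts(toy, cut=2)
```

If SciPy is available, the triples can be handed to scipy.sparse.coo_matrix((data, (rows, cols)), shape=(len(fps), len(columns))), so memory grows with the number of nonzero entries rather than with the 2^32 possible feature identifiers. The cut filter uses total counts per feature, which mirrors the all_dfs.sum(axis=0) >= cut step in the function above.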


Thanks in advance!

Jennifer
