Hi all,

I am trying to calculate Morgan fingerprints for approximately 100,000 molecules. I do not want to use the folded FPs. Instead, I want to use the counts for the bits that are on and afterwards drop all the infrequent features. So I would ultimately need a DataFrame with one row per molecule and one column per FP bit, holding the counts, like this:

Molecule      1      2...   6...   ...n
Structure1    0      0      4      1
Structure2    1      0      0      8

The function I am currently using is:

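import logging

import pandas as pd
from rdkit.Chem import AllChem

logger = logging.getLogger(__name__)


def get_ecfp(mols, n=3, cut=10):
    # mols: pandas Series of RDKit molecules; n: Morgan radius;
    # cut: minimum total count a feature needs in order to be kept
    df = []
    all_dfs = pd.DataFrame()
    for i, mol in enumerate(mols):
        # one single-row frame per molecule: columns are feature IDs, values are counts
        d = pd.DataFrame(AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements(),
                         index=[mols.index[i]], dtype='uint8')
        df.append(d)
        logger.debug('Calculating %s', i)

        # append all collected dataframes in between to prevent memory issues
        if i % 20000 == 0 or i == len(mols) - 1:
            logger.debug('Concatenation %s', i)
            part = pd.concat(df)
            all_dfs = pd.concat([part, all_dfs])
            logger.debug('Datatypes %s', all_dfs.dtypes)
            del part
            df = []

    # drop columns where the total count is below cut (done once, after the loop)
    all_dfs = all_dfs[all_dfs.columns[all_dfs.sum(axis=0) >= cut]]

    return all_dfs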

But the concatenation is awfully slow. Without the intermediate concatenation I quickly run out of memory when trying to concatenate all the dataframes, even though I am using a machine with 128 GB of RAM.
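One possible way to avoid the repeated concatenation is a two-pass approach: first sum the counts per feature over all molecules, then build rows only for the features that survive the cutoff. The sketch below uses plain dicts as stand-ins for the output of GetNonzeroElements(); the function name build_count_table and the toy data are made up for illustration, and this has not been benchmarked on 100,000 molecules.

```python
from collections import Counter


def build_count_table(count_dicts, cut=10):
    """Return (kept_features, rows) for a list of {feature_id: count} dicts.

    Hypothetical helper: each dict stands in for one molecule's
    GetNonzeroElements() result.
    """
    # Pass 1: total count of every feature across all molecules.
    totals = Counter()
    for counts in count_dicts:
        totals.update(counts)

    # Keep only features whose total count reaches the cutoff.
    kept = sorted(f for f, total in totals.items() if total >= cut)
    column_of = {f: j for j, f in enumerate(kept)}

    # Pass 2: one dense row per molecule, restricted to the kept features.
    rows = []
    for counts in count_dicts:
        row = [0] * len(kept)
        for feature_id, count in counts.items():
            j = column_of.get(feature_id)
            if j is not None:
                row[j] = count
        rows.append(row)
    return kept, rows


# Toy stand-ins for two molecules (feature IDs and counts are invented).
toy = [
    {11: 4, 4000000000: 1, 7: 2},
    {11: 8, 4000000000: 1, 99: 1},
]
kept, rows = build_count_table(toy, cut=2)
```

With real data, kept could serve as the column labels and rows as the data of a single DataFrame, so that only one frame is ever constructed.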

I found that the fingerprint can be converted to a numpy array. That requires me to allocate a numpy array of a fixed length up front, which is impossible because I do not know how long the final array has to be. Assigning it to an array without predefining the length just never finishes the computation. If I check the length of the fp with fp.GetLength() I get 4294967295, which seems to be just the maximum value of an unsigned 32-bit integer. This means that converting all of the FPs to numpy arrays of that length is not really an option either.
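A quick back-of-the-envelope check of why dense arrays of that length are out of reach, assuming one byte (uint8) per position and the roughly 100,000 molecules mentioned above:

```python
# Length reported by fp.GetLength(): the largest unsigned 32-bit value.
n_positions = 2**32 - 1
n_molecules = 100_000  # approximate dataset size from this post

# Assumption: one byte (uint8) per position in a dense array.
bytes_per_fp = n_positions * 1
total_bytes = bytes_per_fp * n_molecules

gib_per_fp = bytes_per_fp / 2**30
tib_total = total_bytes / 2**40

print(n_positions)
print(f"{gib_per_fp:.1f} GiB per dense fingerprint")
print(f"{tib_total:.1f} TiB for all molecules")
```

So a single dense fingerprint would already need about 4 GiB, far more than 128 GB of RAM can hold for the whole set.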

Is there a way I have overlooked to get the desired DataFrame or ndarray out of RDKit directly? Or is there a better conversion? I assume that the keys of the dict I get from GetNonzeroElements() are the positions of the set bits in the 4294967295-bit-long vector?
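If that assumption about the dict keys holds, the huge feature IDs could be remapped to compact column numbers and stored as sparse (row, column, value) triplets. This is only a sketch: plain dicts stand in for the GetNonzeroElements() output, and the name to_coo_triplets and the toy data are hypothetical.

```python
def to_coo_triplets(count_dicts):
    """Remap sparse feature IDs to compact columns; return COO-style triplets.

    Hypothetical helper: each dict stands in for one molecule's
    {feature_id: count} mapping.
    """
    column_of = {}  # feature_id -> compact column index, in order of first appearance
    row_idx, col_idx, values = [], [], []
    for i, counts in enumerate(count_dicts):
        for feature_id, count in counts.items():
            # Assign the next free column the first time a feature is seen.
            j = column_of.setdefault(feature_id, len(column_of))
            row_idx.append(i)
            col_idx.append(j)
            values.append(count)
    return column_of, row_idx, col_idx, values


# Toy stand-ins for two molecules (feature IDs and counts are invented).
toy = [
    {4000000000: 1, 11: 4},
    {11: 8, 99: 1},
]
column_of, row_idx, col_idx, values = to_coo_triplets(toy)
```

The three lists have the shape expected by scipy.sparse.coo_matrix((values, (row_idx, col_idx)), shape=(len(toy), len(column_of))), which would keep memory proportional to the number of nonzero counts rather than to the 2**32 - 1 possible positions.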

Thanks in advance!

Jennifer

_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
