Hi all,
I am trying to calculate Morgan fingerprints for approximately 100,000
molecules. I do not want to use the folded FPs. Instead, I want to use the
counts for the bits that are on and afterwards drop all the infrequent
features. So I would ultimately need a DataFrame with the counts for
each molecule assigned to the respective bit of the FP, like this:

    Molecule      1     2...  6...  ...n
    Structure1    0     0     4     1
    Structure2    1     0     0     8

The function I am currently using is:

    import logging
    import pandas as pd
    from rdkit.Chem import AllChem

    logger = logging.getLogger(__name__)

    def get_ecfp(mols, n=3, cut=10):
        # mols is a pandas Series of RDKit molecules
        df = []
        all_dfs = pd.DataFrame()
        for i, mol in enumerate(mols):
            d = pd.DataFrame(AllChem.GetMorganFingerprint(mol, n).GetNonzeroElements(),
                             index=[mols.index[i]], dtype='uint8')
            df.append(d)
            logger.debug('Calculating %s', str(i))
            # append all collected dataframes in between to prevent memory issue
            if i % 20000 == 0 or i == len(mols) - 1:
                logger.debug('Concatenation %s', str(i))
                part = pd.concat(df)
                all_dfs = pd.concat([part, all_dfs])
                logger.debug('Datatypes %s', str(all_dfs.dtypes))
                del part
                df = []
        # drop columns where count < cut (default 10)
        all_dfs = all_dfs[all_dfs.columns[all_dfs.sum(axis=0) >= cut]]
        return all_dfs

But the concatenation is awfully slow. Without the intermediate
concatenation I quickly run out of memory when trying to concatenate
all DataFrames at once, even on a machine with 128 GB of RAM.
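
For illustration, here is a minimal sketch of what I am trying to end up with, written without the repeated concatenation. It only assumes that each fingerprint is available as a plain {feature_id: count} dict, which is what GetNonzeroElements() gives me. The function name count_matrix and the toy feature ids are made up for this example, and I have not tried it on the full data set.

```python
from collections import Counter

def count_matrix(fp_dicts, cut=10):
    """Build a count table from per-molecule {feature_id: count} dicts.

    fp_dicts: list of dicts, one per molecule (hypothetical input name).
    cut: keep a feature only if its summed count over all molecules is >= cut.
    """
    # First pass: total count of every feature across all molecules.
    totals = Counter()
    for d in fp_dicts:
        totals.update(d)

    # Keep the frequent features and give each one a compact column index.
    kept = sorted(f for f, total in totals.items() if total >= cut)
    col = {f: j for j, f in enumerate(kept)}

    # Second pass: fill one short row per molecule.
    rows = []
    for d in fp_dicts:
        row = [0] * len(kept)
        for f, c in d.items():
            j = col.get(f)
            if j is not None:
                row[j] = c
        rows.append(row)
    return kept, rows

# Toy input standing in for real fingerprints (feature ids are invented).
toy = [{98513984: 4, 2246728737: 1}, {2246728737: 8, 3217380708: 1}]
kept, rows = count_matrix(toy, cut=5)
```

The result could then be wrapped as pd.DataFrame(rows, index=mols.index, columns=kept). Because the infrequent features are dropped before any table is built, only the kept columns are ever stored per molecule.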
I found that the fingerprint can be converted to a NumPy array.
That requires preallocating an array of a fixed length, which is not
feasible, as I do not know how long the final array has to be.
Converting into an array without predefining the length just never
finishes the computation. If I check the length of the fp with
fp.GetLength() I get 4294967295, which is 2^32 - 1, the maximum value
of an unsigned 32-bit integer. This means that converting all of the FPs
to NumPy arrays of that length is not really an option either.
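
To put that length in numbers, here is a back-of-the-envelope calculation. It assumes one uint8 per cell and a dense table covering all 100,000 molecules; the variable names are only for this example.

```python
n_features = 4294967295   # value reported by fp.GetLength()
n_mols = 100_000          # approximate number of molecules

# The reported length is the largest unsigned 32-bit integer.
is_uint32_max = (n_features == 2**32 - 1)

# Bytes needed for a dense table with one byte (uint8) per cell.
dense_bytes = n_features * n_mols
dense_tib = dense_bytes / 2**40

print(is_uint32_max, round(dense_tib, 1), "TiB")
```

That comes to roughly 390 TiB, far beyond the 128 GB of RAM available, so any dense representation at full length is out of the question.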
Is there any way I have overlooked to get the desired DataFrame or
ndarray out of RDKit directly? Or is there a better conversion? I assume
that the keys of the dict I get from GetNonzeroElements() are the indices
of the set bits of the 4294967295-bit-long vector?
Thanks in advance!
Jennifer
_______________________________________________
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss