> On Mar 13, 2021, at 20:29, Marawan Hussien via Rdkit-discuss
> <rdkit-discuss@lists.sourceforge.net> wrote:
>
> my question is if this is the valid approach of comparison, particularly if
> the class sizes vary widely and the average similarity will be inevitably
> affected by the size of each item in each pair. As a check, it looks that the
> diagonal is having the highest inter-classes similarity overall, which is
> anyway expected.
>
> I am also wondering if a size-weighted normalization approach could handle
> this situation?

What about a Z-score? That is:

  zscore = (score - background_score) / background_standard_deviation

rather than using the mean score.

I worked out something like that a few years ago, using chemfp, at
http://www.dalkescientific.com/writings/diary/archive/2017/03/27/chembl_target_sets_association_network.html .

If that's a reasonable approach, then it could all be done in RDKit, if you
don't want to use chemfp.

Best regards,

Andrew
da...@dalkescientific.com
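
P.S. For illustration only, here is a minimal sketch of the Z-score idea in plain Python. It is not the code from the linked post. It assumes you already have a square similarity matrix `sim` (for example, Tanimoto scores that could be computed with RDKit's `DataStructs.BulkTanimotoSimilarity`). It also assumes the background can be estimated from random index sets of the same sizes as the two classes, which is one possible choice of background and not necessarily the right one for your data. The function names and the `n_background` and `seed` parameters are hypothetical.

```python
import random
import statistics


def mean_pair_score(sim, rows, cols):
    """Mean similarity between the items indexed by rows and by cols."""
    return statistics.fmean(sim[i][j] for i in rows for j in cols)


def zscore(score, background_scores):
    """(score - background mean) / background standard deviation."""
    mu = statistics.fmean(background_scores)
    sigma = statistics.stdev(background_scores)  # sample standard deviation
    if sigma == 0:
        raise ValueError("background scores have zero variance")
    return (score - mu) / sigma


def class_pair_zscore(sim, class_a, class_b, n_background=200, seed=0):
    """Z-score of the mean similarity between two classes.

    Assumption: the background is the distribution of mean scores between
    random index sets with the same sizes as class_a and class_b, drawn
    from the whole collection.
    """
    rng = random.Random(seed)
    n = len(sim)
    observed = mean_pair_score(sim, class_a, class_b)
    background = []
    for _ in range(n_background):
        rand_a = rng.sample(range(n), len(class_a))
        rand_b = rng.sample(range(n), len(class_b))
        background.append(mean_pair_score(sim, rand_a, rand_b))
    return zscore(observed, background)
```

Because each background sample uses the same set sizes as the pair being scored, the class-size effect on the mean is part of the background and is divided out, which is the motivation for using a Z-score here instead of the raw mean.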