Re: [Rdkit-discuss] Clustering

2022-05-02 Thread Tristan Camilleri
Dear Giovanni,

Many thanks for the feedback which is very insightful. I had skipped Greg
Landrum’s post on sphere exclusion clustering, I will certainly give it a
try.

Regards,
Tristan

On Mon, 2 May 2022 at 09:45, Giovanni Tricarico 
wrote:

> Hello Tristan,
>
> I imagine you have seen Greg Landrum’s post on sphere exclusion
> clustering:
> https://greglandrum.github.io/rdkit-blog/similarity/tutorial/2020/11/18/sphere-exclusion-clustering.html
>
> The initial MaxMin selection is very fast, especially Roger Sayle’s
> implementation, and not memory-hungry, and that gives you maximally diverse
> cluster centres (or ‘representatives’ as you call them).
>
> [Yes, it’s true that the centres can be ‘weird’, but if they *are* in
> your set, maybe the issue is in the composition of the set itself, not in
> what you do with it; maybe you should remove singletons (initially picked
> molecules that have Tanimoto distance 1 or very close to 1 from all other
> molecules) upfront].
>
> Unless I am missing something, it sounds surprising that you cannot use
> BulkTanimotoSimilarity to do the clustering.
>
> The size of the similarity matrix is not (4M)^2, but only ~4M *
> len(picks); so unless you picked almost everything…(?)
>
> In fact, to counteract even this potential issue, we made our own simple
> variation of the above code that only processes one molecule at a time, and
> applied it without issues to a 1M set; surely you should be able to make a
> temporary list of floating point numbers of length len(picks).
>
>
>
> On the other hand, like others also commented, it’s not clear why you want
> to do the clustering in the first place.
>
> Searching by fingerprint similarity in a set of 4 M molecules should not
> be that slow.
>
>
>
> *As for:*
>
> *“I am now working with fpsim2 and chemfp to get a distance matrix (sparse
> matrix)”*
>
>
>
> IMO you do not necessarily need to have a full N^2 distance matrix to do
> what you describe, as indeed shown by Greg’s post.
>
> And more importantly, a distance matrix for a molecular set is rarely
> ‘sparse’ (where ‘sparse’ in this context means mostly full of 0’s, and with
> only few non-0 elements), because that only happens when most molecules
> have identical FP’s, which should be very rare, unless your set is very
> redundant.
>
>
>
> In this sense, I would suggest an alternative clustering approach (if you
> are convinced that you *do* want to do clustering).
>
> Make a sparse feature matrix (binary matrix encoding the ‘on’ FP bits of
> your molecules vs their index).
>
> From that, make not a *distance*, but a (Tanimoto) *similarity* matrix,
> with a threshold below which it is capped to 0 (I can only suggest R
> package proxyC for this
> https://cran.r-project.org/web/packages/proxyC/index.html; maybe there is
> a python implementation, I don’t know).
>
> That matrix can indeed be very sparse: only molecules that have a
> similarity higher than your threshold have a value, the rest is all 0 and
> is not stored. Obviously you should not use too low a threshold, otherwise
> you are back to a dense (4M)^2 matrix, which I assume you cannot make or
> store.
>
> From this (which is essentially an *adjacency* matrix - telling you which
> molecules are or are not ‘linked’) you can do graph representations and
> even clustering, e.g. using igraph (again, I can only suggest an R package
> for that https://igraph.org/r/, https://kateto.net/netscix2016.html).
>
>
>
> Hope this helps.
>
> Giovanni
>
>
>
> *From:* Tristan Camilleri 
> *Sent:* 02 May 2022 07:03
> *To:* Patrick Walters 
> *Cc:* RDKit Discuss 
> *Subject:* Re: [Rdkit-discuss] Clustering
>
>
>
>
> Thanks for the feedback. Rather than an explicit need to perform
> clustering, it is more for me to learn how to do it.
>
>
>
> Any pointers to this effect would be greatly appreciated.
>
>
>
> Tristan
>
>
>
> On Sun, 1 May 2022 at 18:18, Patrick Walters  wrote:
>
> Similarity search on a database of 4 million is pretty quick with ChemFp
> or fpsim2.  Do you need to do the clustering?
>
>
>
> Here are a couple of relevant blog posts.
>
>
>
>
> http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html

Re: [Rdkit-discuss] Clustering

2022-05-02 Thread Giovanni Tricarico
Hello Tristan,
I imagine you have seen Greg Landrum's post on sphere exclusion clustering: 
https://greglandrum.github.io/rdkit-blog/similarity/tutorial/2020/11/18/sphere-exclusion-clustering.html
The initial MaxMin selection is very fast, especially Roger Sayle's 
implementation, and not memory-hungry, and that gives you maximally diverse 
cluster centres (or 'representatives' as you call them).
[Yes, it's true that the centres can be 'weird', but if they are in your set, 
maybe the issue is in the composition of the set itself, not in what you do 
with it; maybe you should remove singletons (initially picked molecules that 
have Tanimoto distance 1 or very close to 1 from all other molecules) upfront].
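For illustration, a minimal RDKit sketch of that initial pick plus a singleton
screen (fps, the pick size and the similarity cut-off here are just
placeholders, not something from this thread):

from rdkit import DataStructs
from rdkit.SimDivFilters.rdSimDivPickers import MaxMinPicker

# fps: list of RDKit bit-vector fingerprints (e.g. Morgan) for the whole set
picker = MaxMinPicker()
pick_ids = list(picker.LazyBitVectorPick(fps, len(fps), 1000))

# drop picks whose nearest neighbour in the set is still essentially dissimilar
def is_singleton(i, min_sim=0.05):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps)
    sims[i] = 0.0   # ignore self-similarity
    return max(sims) < min_sim

picks = [i for i in pick_ids if not is_singleton(i)]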
Unless I am missing something, it sounds surprising that you cannot use 
BulkTanimotoSimilarity to do the clustering.
The size of the similarity matrix is not (4M)^2, but only ~4M * len(picks); so 
unless you picked almost everything...(?)
In fact, to counteract even this potential issue, we made our own simple
variation of the above code that only processes one molecule at a time, and
applied it without issues to a 1M set; surely you should be able to make a
temporary list of floating point numbers of length len(picks).
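A sketch of that one-molecule-at-a-time variation (illustrative only; fps and
pick_fps are assumed lists of fingerprints for the full set and for the picked
centres):

import numpy as np
from rdkit import DataStructs

assignments = np.empty(len(fps), dtype=np.int32)
for i, fp in enumerate(fps):
    # one temporary list of len(picks) floats per molecule, never an N x N matrix
    sims = DataStructs.BulkTanimotoSimilarity(fp, pick_fps)
    assignments[i] = int(np.argmax(sims))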

On the other hand, like others also commented, it's not clear why you want to 
do the clustering in the first place.
Searching by fingerprint similarity in a set of 4 M molecules should not be 
that slow.

As for:
"I am now working with fpsim2 and chemfp to get a distance matrix (sparse 
matrix)"

IMO you do not necessarily need to have a full N^2 distance matrix to do what 
you describe, as indeed shown by Greg's post.
And more importantly, a distance matrix for a molecular set is rarely 'sparse' 
(where 'sparse' in this context means mostly full of 0's, and with only few 
non-0 elements), because that only happens when most molecules have identical 
FP's, which should be very rare, unless your set is very redundant.

In this sense, I would suggest an alternative clustering approach (if you are 
convinced that you do want to do clustering).
Make a sparse feature matrix (binary matrix encoding the 'on' FP bits of your 
molecules vs their index).
From that, make not a distance, but a (Tanimoto) similarity matrix, with a
threshold below which it is capped to 0 (I can only suggest R package proxyC
for this https://cran.r-project.org/web/packages/proxyC/index.html; maybe
there is a python implementation, I don't know).
That matrix can indeed be very sparse: only molecules that have a similarity 
higher than your threshold have a value, the rest is all 0 and is not stored. 
Obviously you should not use too low a threshold, otherwise you are back to a 
dense (4M)^2 matrix, which I assume you cannot make or store.
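In Python, the same construction can be sketched with scipy (an illustrative
sketch only: X is an assumed sparse binary molecule-by-bit matrix, and the
intermediate X @ X.T holds every pair sharing at least one bit, so this
brute-force version is only practical for moderate set sizes):

import numpy as np
from scipy import sparse

def thresholded_tanimoto(X, threshold=0.6):
    # X: sparse binary matrix, one row per molecule, one column per FP bit
    X = sparse.csr_matrix(X, dtype=np.float64)
    on_bits = np.asarray(X.sum(axis=1)).ravel()      # |a| for every molecule
    common = (X @ X.T).tocoo()                       # |a & b| for pairs sharing bits
    sim = common.data / (on_bits[common.row] + on_bits[common.col] - common.data)
    keep = sim >= threshold
    return sparse.csr_matrix((sim[keep], (common.row[keep], common.col[keep])),
                             shape=(X.shape[0], X.shape[0]))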
From this (which is essentially an adjacency matrix - telling you which
molecules are or are not 'linked') you can do graph representations and even
clustering, e.g. using igraph (again, I can only suggest an R package for that
https://igraph.org/r/, https://kateto.net/netscix2016.html).
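A hypothetical python-igraph follow-on, treating the stored entries of the
thresholded matrix as edges (illustrative only):

import igraph as ig

S = thresholded_tanimoto(X, threshold=0.6)   # from the sketch above
coo = S.tocoo()
edges = [(int(r), int(c)) for r, c in zip(coo.row, coo.col) if r < c]
g = ig.Graph(n=S.shape[0], edges=edges)
clusters = list(g.connected_components())    # g.clusters() in older python-igraph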

Hope this helps.
Giovanni

From: Tristan Camilleri 
Sent: 02 May 2022 07:03
To: Patrick Walters 
Cc: RDKit Discuss 
Subject: Re: [Rdkit-discuss] Clustering

Thanks for the feedback. Rather than an explicit need to perform clustering, it 
is more for me to learn how to do it.

Any pointers to this effect would be greatly appreciated.

Tristan

On Sun, 1 May 2022 at 18:18, Patrick Walters wrote:
Similarity search on a database of 4 million is pretty quick with ChemFp or 
fpsim2.  Do you need to do the clustering?

Here are a couple of relevant blog posts.

http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html

http://practicalcheminformatics.blogspot.com/2021/09/similarity-search-and-some-cool-pandas.html

Re: [Rdkit-discuss] Clustering

2022-05-01 Thread Tristan Camilleri
Thanks for the feedback. Rather than an explicit need to perform
clustering, it is more for me to learn how to do it.

Any pointers to this effect would be greatly appreciated.

Tristan

On Sun, 1 May 2022 at 18:18, Patrick Walters  wrote:

> Similarity search on a database of 4 million is pretty quick with ChemFp
> or fpsim2.  Do you need to do the clustering?
>
> Here are a couple of relevant blog posts.
>
>
> http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html
>
>
> http://practicalcheminformatics.blogspot.com/2021/09/similarity-search-and-some-cool-pandas.html
>
> Pat
>
>
>
> On Sun, May 1, 2022 at 12:12 PM Tristan Camilleri <
> tristan.camilleri...@um.edu.mt> wrote:
>
>> Thank you both for the feedback.
>>
>> My primary aim is to run an LBVS experiment (similarity search) using a
>> set of actives and the dataset of cluster representatives.
>>
>>
>>
>> On Sun, 1 May 2022, 17:09 Patrick Walters,  wrote:
>>
>>> For me, a lot of this depends on what you intend to do with the
>>> clustering.  If you want to pick a "representative" subset from a larger
>>> dataset, k-means may do the trick.  As Rajarshi mentioned, Practical
>>> Cheminformatics has a k-means implementation that runs with FAISS.
>>> Depending on your goal, choosing a subset with a diversity picker may fit
>>> the bill.  One annoying aspect of diversity pickers is that the initial
>>> selections tend to consist of strange molecules.
>>>
>>> @Tristan can you provide more information on what you want to do with
>>> the clustering results?
>>>
>>>
>>> Pat
>>>
>>> On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha 
>>> wrote:
>>>
 You could consider using FAISS. An example of clustering 2.1M cmpds is
 described at
 http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html


 On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
 tristan.camilleri...@um.edu.mt> wrote:

> Hi,
>
> I am attempting to cluster a database of circa 4M small molecules and
> I have hit several snags.
> Using BulkTanimoto is not possible due to the resources that are required.
> I am now working with fpsim2 and chemfp to get a distance matrix (sparse
> matrix). However, I am finding it very challenging to identify an
> appropriate clustering algorithm. I have considered both k-medoids and
> DBSCAN. Each of these has its own limitations: having to specify the number
> of clusters for k-medoids and not obtaining centroids for DBSCAN.
>
> I was wondering whether there is an implementation of the stochastic
> clustering analysis for clustering purposes, described in
> https://doi.org/10.1021/ci970056l .
>
> Any suggestions on the best method for clustering large datasets, with
> code suggestions, would be greatly appreciated. I am new to the subject 
> and
> would appreciate any help.
>
> Regards,
> Tristan
>
>
>


 --
 Rajarshi Guha | http://blog.rguha.net | @rguha
 


>>>


Re: [Rdkit-discuss] Clustering

2022-05-01 Thread Tristan Camilleri
Thank you both for the feedback.

My primary aim is to run an LBVS experiment (similarity search) using a set
of actives and the dataset of cluster representatives.



On Sun, 1 May 2022, 17:09 Patrick Walters,  wrote:

> For me, a lot of this depends on what you intend to do with the
> clustering.  If you want to pick a "representative" subset from a larger
> dataset, k-means may do the trick.  As Rajarshi mentioned, Practical
> Cheminformatics has a k-means implementation that runs with FAISS.
> Depending on your goal, choosing a subset with a diversity picker may fit
> the bill.  One annoying aspect of diversity pickers is that the initial
> selections tend to consist of strange molecules.
>
> @Tristan can you provide more information on what you want to do with the
> clustering results?
>
>
> Pat
>
> On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha 
> wrote:
>
>> You could consider using FAISS. An example of clustering 2.1M cmpds is
>> described at
>> http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
>>
>>
>> On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
>> tristan.camilleri...@um.edu.mt> wrote:
>>
>>> Hi,
>>>
>>> I am attempting to cluster a database of circa 4M small molecules and I
>>> have hit several snags.
>>> Using BulkTanimoto is not possible due to the resources that are required. I
>>> am now working with fpsim2 and chemfp to get a distance matrix (sparse
>>> matrix). However, I am finding it very challenging to identify an
>>> appropriate clustering algorithm. I have considered both k-medoids and
>>> DBSCAN. Each of these has its own limitations: having to specify the number
>>> of clusters for k-medoids and not obtaining centroids for DBSCAN.
>>>
>>> I was wondering whether there is an implementation of the stochastic
>>> clustering analysis for clustering purposes, described in
>>> https://doi.org/10.1021/ci970056l .
>>>
>>> Any suggestions on the best method for clustering large datasets, with
>>> code suggestions, would be greatly appreciated. I am new to the subject and
>>> would appreciate any help.
>>>
>>> Regards,
>>> Tristan
>>>
>>>
>>>
>>
>>
>> --
>> Rajarshi Guha | http://blog.rguha.net | @rguha
>> 
>>
>>
>


Re: [Rdkit-discuss] Clustering

2022-05-01 Thread Patrick Walters
Similarity search on a database of 4 million is pretty quick with ChemFp or
fpsim2.  Do you need to do the clustering?

Here are a couple of relevant blog posts.

http://practicalcheminformatics.blogspot.com/2020/10/what-do-molecules-that-look-like-this.html

http://practicalcheminformatics.blogspot.com/2021/09/similarity-search-and-some-cool-pandas.html

Pat



On Sun, May 1, 2022 at 12:12 PM Tristan Camilleri <
tristan.camilleri...@um.edu.mt> wrote:

> Thank you both for the feedback.
>
> My primary aim is to run an LBVS experiment (similarity search) using a
> set of actives and the dataset of cluster representatives.
>
>
>
> On Sun, 1 May 2022, 17:09 Patrick Walters,  wrote:
>
>> For me, a lot of this depends on what you intend to do with the
>> clustering.  If you want to pick a "representative" subset from a larger
>> dataset, k-means may do the trick.  As Rajarshi mentioned, Practical
>> Cheminformatics has a k-means implementation that runs with FAISS.
>> Depending on your goal, choosing a subset with a diversity picker may fit
>> the bill.  One annoying aspect of diversity pickers is that the initial
>> selections tend to consist of strange molecules.
>>
>> @Tristan can you provide more information on what you want to do with the
>> clustering results?
>>
>>
>> Pat
>>
>> On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha 
>> wrote:
>>
>>> You could consider using FAISS. An example of clustering 2.1M cmpds is
>>> described at
>>> http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
>>>
>>>
>>> On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
>>> tristan.camilleri...@um.edu.mt> wrote:
>>>
 Hi,

 I am attempting to cluster a database of circa 4M small molecules and I
 have hit several snags.
 Using BulkTanimoto is not possible due to the resources that are required.
 I am now working with fpsim2 and chemfp to get a distance matrix (sparse
 matrix). However, I am finding it very challenging to identify an
 appropriate clustering algorithm. I have considered both k-medoids and
 DBSCAN. Each of these has its own limitations: having to specify the number
 of clusters for k-medoids and not obtaining centroids for DBSCAN.

 I was wondering whether there is an implementation of the stochastic
 clustering analysis for clustering purposes, described in
 https://doi.org/10.1021/ci970056l .

 Any suggestions on the best method for clustering large datasets, with
 code suggestions, would be greatly appreciated. I am new to the subject and
 would appreciate any help.

 Regards,
 Tristan



>>>
>>>
>>> --
>>> Rajarshi Guha | http://blog.rguha.net | @rguha
>>> 
>>>
>>>
>>


Re: [Rdkit-discuss] Clustering

2022-05-01 Thread Patrick Walters
For me, a lot of this depends on what you intend to do with the
clustering.  If you want to pick a "representative" subset from a larger
dataset, k-means may do the trick.  As Rajarshi mentioned, Practical
Cheminformatics has a k-means implementation that runs with FAISS.
Depending on your goal, choosing a subset with a diversity picker may fit
the bill.  One annoying aspect of diversity pickers is that the initial
selections tend to consist of strange molecules.

@Tristan can you provide more information on what you want to do with the
clustering results?


Pat

On Sun, May 1, 2022 at 10:46 AM Rajarshi Guha 
wrote:

> You could consider using FAISS. An example of clustering 2.1M cmpds is
> described at
> http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html
>
>
> On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
> tristan.camilleri...@um.edu.mt> wrote:
>
>> Hi,
>>
>> I am attempting to cluster a database of circa 4M small molecules and I
>> have hit several snags.
>> Using BulkTanimoto is not possible due to the resources that are required. I
>> am now working with fpsim2 and chemfp to get a distance matrix (sparse
>> matrix). However, I am finding it very challenging to identify an
>> appropriate clustering algorithm. I have considered both k-medoids and
>> DBSCAN. Each of these has its own limitations: having to specify the number
>> of clusters for k-medoids and not obtaining centroids for DBSCAN.
>>
>> I was wondering whether there is an implementation of the stochastic
>> clustering analysis for clustering purposes, described in
>> https://doi.org/10.1021/ci970056l .
>>
>> Any suggestions on the best method for clustering large datasets, with
>> code suggestions, would be greatly appreciated. I am new to the subject and
>> would appreciate any help.
>>
>> Regards,
>> Tristan
>>
>>
>>
>
>
> --
> Rajarshi Guha | http://blog.rguha.net | @rguha 
>
>


Re: [Rdkit-discuss] Clustering

2022-05-01 Thread Rajarshi Guha
You could consider using FAISS. An example of clustering 2.1M cmpds is
described at
http://practicalcheminformatics.blogspot.com/2019/04/clustering-21-million-compounds-for-5.html


On Sun, May 1, 2022 at 9:23 AM Tristan Camilleri <
tristan.camilleri...@um.edu.mt> wrote:

> Hi,
>
> I am attempting to cluster a database of circa 4M small molecules and I
> have hit several snags.
> Using BulkTanimoto is not possible due to the resources that are required. I
> am now working with fpsim2 and chemfp to get a distance matrix (sparse
> matrix). However, I am finding it very challenging to identify an
> appropriate clustering algorithm. I have considered both k-medoids and
> DBSCAN. Each of these has its own limitations: having to specify the number
> of clusters for k-medoids and not obtaining centroids for DBSCAN.
>
> I was wondering whether there is an implementation of the stochastic
> clustering analysis for clustering purposes, described in
> https://doi.org/10.1021/ci970056l .
>
> Any suggestions on the best method for clustering large datasets, with
> code suggestions, would be greatly appreciated. I am new to the subject and
> would appreciate any help.
>
> Regards,
> Tristan
>
>
>


-- 
Rajarshi Guha | http://blog.rguha.net | @rguha 


Re: [Rdkit-discuss] Clustering changes conformer ID?

2019-12-04 Thread topgunhaides .
Hi Greg,

Thanks for the help!

Sorry for the confusion. I was trying to get a symmetric RMS matrix
using GetBestRMS, because GetConformerRMSMatrix uses the standard RMS method
without considering symmetry.

A further question: is it possible to include a "GetBestRMS" option for
"EmbedMultipleConfs" in the near future?
I found that I lose a significant number of the conformers retained by
"EmbedMultipleConfs" after I do a "post-pruning" using "GetBestRMS" (with the
same RMS threshold). My "post-pruning" code does the same type of pruning
(the first conformer is retained and, from then on, only those that are at
least rmsd_threshold away from all retained conformations are kept).
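For context, a minimal sketch of that kind of post-pruning loop (illustrative
only, not the actual code referred to above):

from rdkit.Chem import AllChem

def prune_conformers(mol, rms_threshold=1.0):
    # keep the first conformer, then keep each later one only if its best
    # (symmetry-aware) RMS to every retained conformer is above the threshold
    keep = []
    for cid in [c.GetId() for c in mol.GetConformers()]:
        if all(AllChem.GetBestRMS(mol, mol, prbId=cid, refId=kid) >= rms_threshold
               for kid in keep):
            keep.append(cid)
        else:
            mol.RemoveConformer(cid)
    return keep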

Thank you!
Leon




On Mon, Dec 2, 2019 at 2:37 AM Greg Landrum  wrote:

> Hi Leon,
>
> I'm not sure I understand the question. The clustering code returns a
> tuple of indices of the clusters. Those indices are relative to the
> indexing of the distance matrix. The `ClusterData` function doesn't know
> what you're clustering, so there's no way it could know anything about
> cluster IDs.
>
> In your case, the way to get the conformer IDs of the conformers in the
> first cluster would be something like (not tested):
>
> confs = mh.GetConformers()
> print([confs[x].GetId() for x in clusters_a[0]])
>
> -greg
>
>
>
> On Mon, Nov 25, 2019 at 6:12 PM topgunhaides . 
> wrote:
>
>> Hi guys,
>>
>> Does clustering change conformer ID? See code below:
>>
>> from rdkit import Chem
>> from rdkit.Chem import AllChem, TorsionFingerprints
>> from rdkit.ML.Cluster import Butina
>>
>> mh = Chem.AddHs(Chem.MolFromSmiles('OCCCN'))
>> AllChem.EmbedMultipleConfs(mh, numConfs=5, maxAttempts=1000,
>>pruneRmsThresh=1.0, numThreads=0,
>> randomSeed=-1)
>>
>> print([conf.GetId() for conf in mh.GetConformers()])
>>
>> mh.RemoveConformer(0)
>> mh.RemoveConformer(1)
>>
>> print([conf.GetId() for conf in mh.GetConformers()])
>>
>> m = Chem.RemoveHs(mh)
>> mat_a = AllChem.GetConformerRMSMatrix(m, prealigned=False)
>> mat_b = TorsionFingerprints.GetTFDMatrix(m)
>> num = m.GetNumConformers()
>> clusters_a = Butina.ClusterData(mat_a, num, distThresh=2.0,
>> isDistData=True, reordering=False)
>> clusters_b = Butina.ClusterData(mat_b, num, distThresh=2.0,
>> isDistData=True, reordering=False)
>>
>> print(clusters_a)
>> print(clusters_b)
>>
>> print([conf.GetId() for conf in mh.GetConformers()])
>>
>>
>> Here is the result:
>> [0, 1, 2, 3, 4]
>> [2, 3, 4]
>> ((2, 0, 1),)
>> ((2, 0, 1),)
>> [2, 3, 4]
>>
>> You see it does not actually change the id in mh, but the results in the
>> tuple from clustering are actually indices. Is this a bug? This could be
>> misleading when you try to grab conformer ids from the clustering result.
>> Thank you!
>>
>> Best,
>> Leon
>>
>>
>>
>>
>>
>


Re: [Rdkit-discuss] Clustering changes conformer ID?

2019-12-02 Thread Greg Landrum
Hi Leon,

I'm not sure I understand the question. The clustering code returns a tuple
of indices of the clusters. Those indices are relative to the indexing of
the distance matrix. The `ClusterData` function doesn't know what you're
clustering, so there's no way it could know anything about cluster IDs.

In your case, the way to get the conformer IDs of the conformers in the
first cluster would be something like (not tested):

confs = mh.GetConformers()
print([confs[x].GetId() for x in clusters_a[0]])

-greg



On Mon, Nov 25, 2019 at 6:12 PM topgunhaides .  wrote:

> Hi guys,
>
> Does clustering change conformer ID? See code below:
>
> from rdkit import Chem
> from rdkit.Chem import AllChem, TorsionFingerprints
> from rdkit.ML.Cluster import Butina
>
> mh = Chem.AddHs(Chem.MolFromSmiles('OCCCN'))
> AllChem.EmbedMultipleConfs(mh, numConfs=5, maxAttempts=1000,
>pruneRmsThresh=1.0, numThreads=0, randomSeed=-1)
>
> print([conf.GetId() for conf in mh.GetConformers()])
>
> mh.RemoveConformer(0)
> mh.RemoveConformer(1)
>
> print([conf.GetId() for conf in mh.GetConformers()])
>
> m = Chem.RemoveHs(mh)
> mat_a = AllChem.GetConformerRMSMatrix(m, prealigned=False)
> mat_b = TorsionFingerprints.GetTFDMatrix(m)
> num = m.GetNumConformers()
> clusters_a = Butina.ClusterData(mat_a, num, distThresh=2.0,
> isDistData=True, reordering=False)
> clusters_b = Butina.ClusterData(mat_b, num, distThresh=2.0,
> isDistData=True, reordering=False)
>
> print(clusters_a)
> print(clusters_b)
>
> print([conf.GetId() for conf in mh.GetConformers()])
>
>
> Here is the result:
> [0, 1, 2, 3, 4]
> [2, 3, 4]
> ((2, 0, 1),)
> ((2, 0, 1),)
> [2, 3, 4]
>
> You see it does not actually change the id in mh, but the results in the
> tuple from clustering are actually indices. Is this a bug? This could be
> misleading when you try to grab conformer ids from the clustering result.
> Thank you!
>
> Best,
> Leon
>
>
>
>
>


Re: [Rdkit-discuss] Clustering

2017-06-14 Thread Andrew Dalke
Following up on myself,

On Jun 6, 2017, at 04:00, Andrew Dalke  wrote:
> I've fleshed out that algorithm so it's a command-line program that can be 
> used for benchmarking purposes. It's available from 
> http://dalkescientific.com/writings/taylor_butina.py .
> 
> If anyone uses it for benchmarking, or has improvements, let me know. If I 
> get useful feedback about it, I'll include it in an upcoming chemfp 1.3 
> release.

Based on the discussion, I've decided to add a way to convert a chemfp 
SearchResults into a scipy sparse row matrix. This would be part of the 
upcoming 1.3 release (the no-cost version) and in 3.1 (the commercial version).

I need feedback because I have no experience with scipy or the clustering tools 
in scikit-learn. I have put a prototype version of the code, which works with 
chemfp-1.1, at

  http://dalkescientific.com/chemfp_to_scipy_csr.py 

In theory (see previous disclaimer), when run as a command-line tool it will 
use DBSCAN to cluster the specified fingerprint file.
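The core of the idea, as a rough illustrative sketch (not the prototype linked
above; it assumes chemfp's per-row get_indices()/get_scores() accessors and
scikit-learn's support for precomputed sparse distance matrices):

import chemfp
from chemfp import search
from scipy.sparse import csr_matrix
from sklearn.cluster import DBSCAN

arena = chemfp.load_fingerprints("fingerprints.fps")      # filename is illustrative
results = search.threshold_tanimoto_search_symmetric(arena, threshold=0.8)

def search_results_to_csr(results):
    # build a sparse distance matrix from the neighbour lists
    n = len(results)
    indptr, indices, data = [0], [], []
    for i in range(n):
        row = results[i]
        indices.extend(row.get_indices())
        # store distances; a tiny floor keeps identical fingerprints "stored"
        data.extend(max(1.0 - s, 1e-9) for s in row.get_scores())
        indptr.append(len(indices))
    return csr_matrix((data, indices, indptr), shape=(n, n))

dist = search_results_to_csr(results)
labels = DBSCAN(eps=0.1, min_samples=5, metric="precomputed").fit_predict(dist)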

Here's the command-line --help:

usage: chemfp_to_scipy_csr.py [-h] [-t FLOAT] [--eps FLOAT]
  [--min-samples INT] [--num-jobs INT]
  FILENAME

test prototype adapter between chemfp and scipy.cluster using DBSCAN

positional arguments:
  FILENAME

optional arguments:
  -h, --help            show this help message and exit
  -t FLOAT, --threshold FLOAT
minimum similarity threshold (default: 0.8)
  --eps FLOAT   The maximum distance between two samples for them to
be considered as in the same neighborhood. (default:
0.1)
  --min-samples INT The number of samples (or total weight) in a
neighborhood for a point to be considered as a core
point. This includes the point itself. (default: 5)
  --num-jobs INT, -j INT
The number of parallel jobs to run. If -1, then the
number of jobs is set to the number of CPU cores.
(default: 1)

This is off-topic for the RDKit list so please follow up with me via email.


Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Clustering

2017-06-12 Thread Peter S. Shenkin
" A clustering algorithm, that does not require specifying the number
of classes upfront (so not K-means)."

A general approach to O(N) hierarchical clustering is:

1. Pick a random sqrt(N) structures.
2. Do full hierarchical O(N^2) clustering on these.
3. Select your favored clustering level to define clusters, and store the
centroid (or most representative member) of each.
4. For all N structures, associate each with the cluster whose centroid (or
most representative member) is closest to it.

I've never tried this, but I've heard it suggested at talks, which were
not, however, about molecular clustering; but the method should be general.

Step 3 gives you some control.
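A rough scipy sketch of the scheme (X is an assumed N x d numeric feature
matrix, e.g. folded fingerprints or PCA scores; the cut level is yours to pick):

import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist, cdist

def sample_then_assign(X, cut_distance):
    sample_idx = np.random.choice(len(X), size=int(np.sqrt(len(X))), replace=False)
    sample = X[sample_idx]
    # steps 1-3: full hierarchical clustering on the sqrt(N) sample, then cut it
    labels = fcluster(linkage(pdist(sample), method="average"),
                      t=cut_distance, criterion="distance")
    centroids = np.array([sample[labels == c].mean(axis=0)
                          for c in np.unique(labels)])
    # step 4: assign every structure to its nearest centroid
    return cdist(X, centroids).argmin(axis=1)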

-P.

On Mon, Jun 12, 2017 at 10:06 AM, Michał Nowotka  wrote:

> Hi,
>
> Thanks for all the answers, especially those pointing to code
> examples, very useful.
> I should be more specific when asking about clustering >2M compounds.
>
> An example I would like to see would use:
>
> 1. A clustering algorithm, that does not require specifying the number
> of classes upfront (so not K-means).
> 2. An algorithm that is a bit more sophisticated than Taylor-Butina
> 3. Preferably one from pyclustering (NOT from scipy.cluster, sorry for
> mistake)
>
> In those somewhat more sophisticated algorithms, running PCA will not
> help. You can try to cluster >2M points on a 2D surface and you will
> find out that this is not a trivial task.
>
> That being said, I don't expect any amazing results when doing
> compound clustering using those algorithms. And I agree that
> clustering a random sample can give similar results. This question is
> more out of curiosity.
>
> Michał
>
> On Sun, Jun 11, 2017 at 7:58 PM, Samo Turk  wrote:
> > Hi All,
> >
> > I have to admit I was commenting about PCA->k-means without actually
> trying.
> > Out of curiosity I implemented it here:
> > https://github.com/samoturk/cheminf-notebooks/tree/master/
> Python#pca-k-meanspy
> >
> > It can process 4M compounds in ~60 minutes on desktop i5 and it should
> work
> > with 16GB of RAM. Clusters that come out make (some) sense but in this
> > regard Butina is better.
> >
> > PS. DataWarrior can easily load the results (it takes 30 min) but then it
> > works smoothly.
> >
> > Cheers,
> > Samo
> >
> > On Mon, Jun 5, 2017 at 7:46 PM, Abhik Seal  wrote:
> >>
> >> Hello all ,
> >>
> >> How about doing some dimension reduction using  pca or Tsne and then run
> >> clustering using some selected top components like top 20 and I think
> then
> >> the clustering would be fast .
> >>
> >> Thanks
> >> Abhik
> >>
> >> On Mon, Jun 5, 2017 at 6:11 AM David Cosgrove <
> davidacosgrov...@gmail.com>
> >> wrote:
> >>>
> >>> Hi,
> >>> I have used this algorithm for many years clustering sets of several
> >>> millions of compounds.  Indeed, I am old enough to know it as the
> Taylor
> >>> algorithm.  It is slow but reliable.  A crucial setting is the
> similarity
> >>> threshold for the clusters, which dictates the size of the neighbour
> lists
> >>> and hence the amount of RAM required.  It also, of course, determines
> the
> >>> quality of the clusters.  My implementation is at
> >>> https://github.com/OpenEye-Contrib/Flush.git.  This repo has a number
> of
> >>> programs of relevance, the one you want is called cluster.  I have just
> >>> confirmed that it compiles on ubuntu 16.  It needs the fingerprints as
> ascii
> >>> bitstrings, I don't have code for turning RDKit fingerprints into this
> >>> format, but I would imagine it's quite straightforward.  The program
> runs in
> >>> parallel using OpenMPI.  That's valuable for two reasons.  One is
> speed, but
> >>> the more important one is memory use.  If you can spread the slave
> processes
> >>> over several machines you can cluster much larger sets of molecules as
> you
> >>> are effectively expanding the RAM of the machine.  When I wrote the
> >>> original, 64MB was a lot of RAM, it is less of an issue these days but
> still
> >>> matters if clustering millions of fingerprints.  Note that the program
> >>> cluster doesn't ever store the distance matrix, just the lists of
> neighbours
> >>> for each molecule within the threshold.  This reduces the memory
> footprint
> >>> substantially if you have a tight-enough cluster threshold.
> >>> HTH,
> >>> Dave
> >>>
> >>>
> >>>
> >>> On Mon, Jun 5, 2017 at 11:22 AM, Nils Weskamp 
> >>> wrote:
> 
>  Hi Michal,
> 
>  I have done this a couple of times for compound sets up to 10M+ using
> a
>  simplified variant of the Taylor-Butina algorithm. The overall run
> time
>  was in the range of hours to a few days (which could probably be
>  optimized, but was fast enough for me).
> 
>  As you correctly mentioned, getting the (sparse) similarity matrix is
>  fairly simple (and can be done in parallel on a cluster).
> Unfortunately,
>  this matrix gets very large (even the sparse version). 

Re: [Rdkit-discuss] Clustering

2017-06-12 Thread Michał Nowotka
Hi,

Thanks for all the answers, especially those pointing to code
examples, very useful.
I should be more specific when asking about clustering >2M compounds.

An example I would like to see would use:

1. A clustering algorithm, that does not require specifying the number
of classes upfront (so not K-means).
2. An algorithm that is a bit more sophisticated than Taylor-Butina
3. Preferably one from pyclustering (NOT from scipy.cluster, sorry for mistake)

In those somewhat more sophisticated algorithms, running PCA will not
help. You can try to cluster >2M points on a 2D surface and you will
find out that this is not a trivial task.

That being said, I don't expect any amazing results when doing
compound clustering using those algorithms. And I agree that
clustering a random sample can give similar results. This question is
more out of curiosity.

Michał

On Sun, Jun 11, 2017 at 7:58 PM, Samo Turk  wrote:
> Hi All,
>
> I have to admit I was commenting about PCA->k-means without actually trying.
> Out of curiosity I implemented it here:
> https://github.com/samoturk/cheminf-notebooks/tree/master/Python#pca-k-meanspy
>
> It can process 4M compounds in ~60 minutes on desktop i5 and it should work
> with 16GB of RAM. Clusters that come out make (some) sense but in this
> regard Butina is better.
>
> PS. DataWarrior can easily load the results (it takes 30 min) but then it
> works smoothly.
>
> Cheers,
> Samo
>
> On Mon, Jun 5, 2017 at 7:46 PM, Abhik Seal  wrote:
>>
>> Hello all ,
>>
>> How about doing some dimension reduction using  pca or Tsne and then run
>> clustering using some selected top components like top 20 and I think then
>> the clustering would be fast .
>>
>> Thanks
>> Abhik
>>
>> On Mon, Jun 5, 2017 at 6:11 AM David Cosgrove 
>> wrote:
>>>
>>> Hi,
>>> I have used this algorithm for many years clustering sets of several
>>> millions of compounds.  Indeed, I am old enough to know it as the Taylor
>>> algorithm.  It is slow but reliable.  A crucial setting is the similarity
>>> threshold for the clusters, which dictates the size of the neighbour lists
>>> and hence the amount of RAM required.  It also, of course, determines the
>>> quality of the clusters.  My implementation is at
>>> https://github.com/OpenEye-Contrib/Flush.git.  This repo has a number of
>>> programs of relevance, the one you want is called cluster.  I have just
>>> confirmed that it compiles on ubuntu 16.  It needs the fingerprints as ascii
>>> bitstrings, I don't have code for turning RDKit fingerprints into this
>>> format, but I would imagine it's quite straightforward.  The program runs in
>>> parallel using OpenMPI.  That's valuable for two reasons.  One is speed, but
>>> the more important one is memory use.  If you can spread the slave processes
>>> over several machines you can cluster much larger sets of molecules as you
>>> are effectively expanding the RAM of the machine.  When I wrote the
>>> original, 64MB was a lot of RAM, it is less of an issue these days but still
>>> matters if clustering millions of fingerprints.  Note that the program
>>> cluster doesn't ever store the distance matrix, just the lists of neighbours
>>> for each molecule within the threshold.  This reduces the memory footprint
>>> substantially if you have a tight-enough cluster threshold.
>>> HTH,
>>> Dave
>>>
>>>
>>>
>>> On Mon, Jun 5, 2017 at 11:22 AM, Nils Weskamp 
>>> wrote:

 Hi Michal,

 I have done this a couple of times for compound sets up to 10M+ using a
 simplified variant of the Taylor-Butina algorithm. The overall run time
 was in the range of hours to a few days (which could probably be
 optimized, but was fast enough for me).

 As you correctly mentioned, getting the (sparse) similarity matrix is
 fairly simple (and can be done in parallel on a cluster). Unfortunately,
 this matrix gets very large (even the sparse version). Most clustering
 algorithms require random access to the matrix, so you have to keep it
 in main memory (which then has to be huge) or calculate it on-the-fly
 (takes forever).

 My implementation (in C++, not sure if I can share it) assumes that the
 similarity matrix has been pre-calculated and is stored in one (or
 multiple) files. It reads these files sequentially and whenever a
 compound pair with a similarity beyond the threshold is found, it checks
 whether one of the cpds. is already a centroid (in which case the other
 is assigned to it). Otherwise, one of the compounds is randomly chosen
 as centroid and the other is assigned to it.

 This procedure is highly order-dependent and thus not optimal, but has
 to read the whole similarity matrix only once and has limited memory
 consumption (you only need to keep a list of centroids). If you still
 run into memory issues, you can start by clustering with a 

Re: [Rdkit-discuss] Clustering

2017-06-11 Thread Samo Turk
Hi All,

I have to admit I was commenting about PCA->k-means without actually
trying. Out of curiosity I implemented it here:
https://github.com/samoturk/cheminf-notebooks/tree/master/Python#pca-k-meanspy

It can process 4M compounds in ~60 minutes on desktop i5 and it should work
with 16GB of RAM. Clusters that come out make (some) sense but in this
regard Butina is better.
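The gist, as a rough sketch (the parameters here are illustrative and not the
ones in the repo; fps_array is an assumed (n_mols, n_bits) array of folded
fingerprints):

from sklearn.decomposition import IncrementalPCA
from sklearn.cluster import MiniBatchKMeans

# reduce the fingerprints to a few dense components, then mini-batch k-means
pca = IncrementalPCA(n_components=20, batch_size=10000)
X = pca.fit_transform(fps_array)

km = MiniBatchKMeans(n_clusters=5000, batch_size=10000, random_state=0)
labels = km.fit_predict(X)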

PS. DataWarrior can easily load the results (it takes 30 min) but then it
works smoothly.

Cheers,
Samo

On Mon, Jun 5, 2017 at 7:46 PM, Abhik Seal  wrote:

> Hello all ,
>
> How about doing some dimension reduction using  pca or Tsne and then run
> clustering using some selected top components like top 20 and I think then
> the clustering would be fast .
>
> Thanks
> Abhik
>
> On Mon, Jun 5, 2017 at 6:11 AM David Cosgrove 
> wrote:
>
>> Hi,
>> I have used this algorithm for many years clustering sets of several
>> millions of compounds.  Indeed, I am old enough to know it as the Taylor
>> algorithm.  It is slow but reliable.  A crucial setting is the similarity
>> threshold for the clusters, which dictates the size of the neighbour lists
>> and hence the amount of RAM required.  It also, of course, determines the
>> quality of the clusters.  My implementation is at
>> https://github.com/OpenEye-Contrib/Flush.git.  This repo has a number of
>> programs of relevance, the one you want is called cluster.  I have just
>> confirmed that it compiles on ubuntu 16.  It needs the fingerprints as
>> ascii bitstrings, I don't have code for turning RDKit fingerprints into
>> this format, but I would imagine it's quite straightforward.  The program
>> runs in parallel using OpenMPI.  That's valuable for two reasons.  One is
>> speed, but the more important one is memory use.  If you can spread the
>> slave processes over several machines you can cluster much larger sets of
>> molecules as you are effectively expanding the RAM of the machine.  When I
>> wrote the original, 64MB was a lot of RAM, it is less of an issue these
>> days but still matters if clustering millions of fingerprints.  Note that
>> the program cluster doesn't ever store the distance matrix, just the lists
>> of neighbours for each molecule within the threshold.  This reduces the
>> memory footprint substantially if you have a tight-enough cluster threshold.
>> HTH,
>> Dave
>>
>>
>>
>> On Mon, Jun 5, 2017 at 11:22 AM, Nils Weskamp 
>> wrote:
>>
>>> Hi Michal,
>>>
>>> I have done this a couple of times for compound sets up to 10M+ using a
>>> simplified variant of the Taylor-Butina algorithm. The overall run time
>>> was in the range of hours to a few days (which could probably be
>>> optimized, but was fast enough for me).
>>>
>>> As you correctly mentioned, getting the (sparse) similarity matrix is
>>> fairly simple (and can be done in parallel on a cluster). Unfortunately,
>>> this matrix gets very large (even the sparse version). Most clustering
>>> algorithms require random access to the matrix, so you have to keep it
>>> in main memory (which then has to be huge) or calculate it on-the-fly
>>> (takes forever).
>>>
>>> My implementation (in C++, not sure if I can share it) assumes that the
>>> similarity matrix has been pre-calculated and is stored in one (or
>>> multiple) files. It reads these files sequentially and whenever a
>>> compound pair with a similarity beyond the threshold is found, it checks
>>> whether one of the cpds. is already a centroid (in which case the other
>>> is assigned to it). Otherwise, one of the compounds is randomly chosen
>>> as centroid and the other is assigned to it.
>>>
>>> This procedure is highly order-dependent and thus not optimal, but has
>>> to read the whole similarity matrix only once and has limited memory
>>> consumption (you only need to keep a list of centroids). If you still
>>> run into memory issues, you can start by clustering with a high
>>> similarity threshold and then re-cluster centroids and singletons on a
>>> lower threshold level.
>>>
>>> I also played around with DBSCAN for large compound databases, but (as
>>> previously mentioned by Samo) found it difficult to find the right
>>> parameters and ended up with a single huge cluster covering 90 percent
>>> of the database in many cases.
>>>
>>> Hope this helps,
>>> Nils
>>>
>>> On 05.06.2017 at 11:02, Michał Nowotka wrote:
>>> > Is there anyone who has actually done this: clustered >2M compounds using
>>> > any well-known clustering algorithm and is willing to share a code and
>>> > some performance statistics?
>>>
>>>
>>> 
>>> 

Re: [Rdkit-discuss] Clustering

2017-06-06 Thread Abhik Seal
Hello all ,

How about doing some dimension reduction using  pca or Tsne and then run
clustering using some selected top components like top 20 and I think then
the clustering would be fast .

Thanks
Abhik

On Mon, Jun 5, 2017 at 6:11 AM David Cosgrove 
wrote:

> Hi,
> I have used this algorithm for many years clustering sets of several
> millions of compounds.  Indeed, I am old enough to know it as the Taylor
> algorithm.  It is slow but reliable.  A crucial setting is the similarity
> threshold for the clusters, which dictates the size of the neighbour lists
> and hence the amount of RAM required.  It also, of course, determines the
> quality of the clusters.  My implementation is at
> https://github.com/OpenEye-Contrib/Flush.git.  This repo has a number of
> programs of relevance, the one you want is called cluster.  I have just
> confirmed that it compiles on ubuntu 16.  It needs the fingerprints as
> ascii bitstrings, I don't have code for turning RDKit fingerprints into
> this format, but I would imagine it's quite straightforward.  The program
> runs in parallel using OpenMPI.  That's valuable for two reasons.  One is
> speed, but the more important one is memory use.  If you can spread the
> slave processes over several machines you can cluster much larger sets of
> molecules as you are effectively expanding the RAM of the machine.  When I
> wrote the original, 64MB was a lot of RAM, it is less of an issue these
> days but still matters if clustering millions of fingerprints.  Note that
> the program cluster doesn't ever store the distance matrix, just the lists
> of neighbours for each molecule within the threshold.  This reduces the
> memory footprint substantially if you have a tight-enough cluster threshold.
> HTH,
> Dave
>
>
>
> On Mon, Jun 5, 2017 at 11:22 AM, Nils Weskamp 
> wrote:
>
>> Hi Michal,
>>
>> I have done this a couple of times for compound sets up to 10M+ using a
>> simplified variant of the Taylor-Butina algorithm. The overall run time
>> was in the range of hours to a few days (which could probably be
>> optimized, but was fast enough for me).
>>
>> As you correctly mentioned, getting the (sparse) similarity matrix is
>> fairly simple (and can be done in parallel on a cluster). Unfortunately,
>> this matrix gets very large (even the sparse version). Most clustering
>> algorithms require random access to the matrix, so you have to keep it
>> in main memory (which then has to be huge) or calculate it on-the-fly
>> (takes forever).
>>
>> My implementation (in C++, not sure if I can share it) assumes that the
>> similarity matrix has been pre-calculated and is stored in one (or
>> multiple) files. It reads these files sequentially and whenever a
>> compound pair with a similarity beyond the threshold is found, it checks
>> whether one of the cpds. is already a centroid (in which case the other
>> is assigned to it). Otherwise, one of the compounds is randomly chosen
>> as centroid and the other is assigned to it.
>>
>> This procedure is highly order-dependent and thus not optimal, but has
>> to read the whole similarity matrix only once and has limited memory
>> consumption (you only need to keep a list of centroids). If you still
>> run into memory issues, you can start by clustering with a high
>> similarity threshold and then re-cluster centroids and singletons on a
>> lower threshold level.
>>
>> I also played around with DBSCAN for large compound databases, but (as
>> previously mentioned by Samo) found it difficult to find the right
>> parameters and ended up with a single huge cluster covering 90 percent
>> of the database in many cases.
>>
>> Hope this helps,
>> Nils
>>
>> On 05.06.2017 at 11:02, Michał Nowotka wrote:
>> > Is there anyone who has actually done this: clustered >2M compounds using
>> > any well-known clustering algorithm and is willing to share a code and
>> > some performance statistics?
>>
>>
>>
>>
>
>
>
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
>
>
>
-- 

Cheers,
Abhik Seal  Ph.D. (Cheminformatics)

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Andrew Dalke
On Jun 5, 2017, at 11:02, Michał Nowotka  wrote:
> Is there anyone who has actually done this: clustered >2M compounds using
> any well-known clustering algorithm and is willing to share a code and
> some performance statistics?

Yes. People regularly use chemfp (http://chemfp.com/) to cluster >2M compound 
data sets using the Taylor-Butina algorithm.
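In outline it looks roughly like this (a sketch against chemfp's search API,
not the full benchmarking script linked below):

import chemfp
from chemfp import search

arena = chemfp.load_fingerprints("compounds.fps")   # filename is illustrative
results = search.threshold_tanimoto_search_symmetric(arena, threshold=0.8)

# classic Taylor-Butina: largest neighbour lists become centroids first
candidates = sorted(((len(idxs), i, idxs)
                     for i, idxs in enumerate(results.iter_indices())),
                    reverse=True)
assigned = [False] * len(arena)
clusters = []
for _, centroid, members in candidates:
    if assigned[centroid]:
        continue
    cluster = [centroid] + [m for m in members if not assigned[m]]
    for m in cluster:
        assigned[m] = True
    clusters.append(cluster)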

> It's easy to get a sparse distance matrix using chemfp. But if you
> take this matrix and feed it into any scipy.cluster you won't get any
> results in a reasonable time.


On Jun 5, 2017, at 10:19, Gonzalo Colmenarejo  
wrote:
> I think there are faster things, like chemfp (see for instance 
> https://chemfp.readthedocs.io/en/latest/using-api.html#taylor-butina-clustering).
>  



I've fleshed out that algorithm so it's a command-line program that can be used 
for benchmarking purposes. It's available from 
http://dalkescientific.com/writings/taylor_butina.py .

If anyone uses it for benchmarking, or has improvements, let me know. If I get 
useful feedback about it, I'll include it in an upcoming chemfp 1.3 release.



On my 7-year old desktop (3.2 GHz Intel Core i3), with 4 OpenMP threads (the 
default) I estimate it would take a bit over an hour to cluster 1M 
fingerprints, with 2048 bits/fingerprint, using a threshold of 0.8.

I just ran it now with the following:

% time python taylor_butina.py --profile --threshold 0.8 pubchem_million.fps -o 
pubchem_million.clusters

The profile report says:

#fingerprints: 1000000 #bits/fp: 881 threshold: 0.8 #matches: 667808946
Load time: 4.9 sec memory: 288 MB
Similarity time: 2037.3 sec memory: 8.00 GB
Clustering time: 17.5 sec memory: -405549056 B
Total time: 2061.3 sec
7044.489u 323.528s 34:26.26 356.5%  0+0k 5+13io 233pf+0w

The "load" lines says the data needed 288 MB for 1M 881 bit fingerprints. This 
scales linearly in the number of bits and the number of records. The load time 
is small compared to the similarity time.

The "similarity" search took about 8GB for a sparse matrix with 667,808,946 
terms, or about 13 bytes per term (each hit requires an int (4 bytes) and a 
double (8 bytes), plus overhead.) This does not depend on the number of bits in 
the fingerprint, only the number of matches in the matrix.

The "similarity" time is linear in the number of bits, so 2048 bits would be 
about 2.3x slower. It's quadratic in the number of fingerprints, so 2M 
fingerprints would take about 2 hours.

It took 18 seconds to do the Taylor-Butina clustering, given the sparse matrix. 
On a multi-core machine, if you watch the CPU usage you'll easily see when it 
goes from the multi-threaded similarity search code (in C using OpenMP) to the 
single-threaded clustering code.

The "clustering" line says it took -405549056 bytes. That should be -387 MB. I 
expected this value to be positive. I'm not sure what's happening there to give 
a negative number, but that's not an important term.


Cheers,
Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Clustering

2017-06-05 Thread David Cosgrove
Hi,
I have used this algorithm for many years clustering sets of several
millions of compounds.  Indeed, I am old enough to know it as the Taylor
algorithm.  It is slow but reliable.  A crucial setting is the similarity
threshold for the clusters, which dictates the size of the neighbour lists
and hence the amount of RAM required.  It also, of course, determines the
quality of the clusters.  My implementation is at
https://github.com/OpenEye-Contrib/Flush.git.  This repo has a number of
programs of relevance, the one you want is called cluster.  I have just
confirmed that it compiles on ubuntu 16.  It needs the fingerprints as
ascii bitstrings, I don't have code for turning RDKit fingerprints into
this format, but I would imagine it's quite straightforward.  The program
runs in parallel using OpenMPI.  That's valuable for two reasons.  One is
speed, but the more important one is memory use.  If you can spread the
slave processes over several machines you can cluster much larger sets of
molecules as you are effectively expanding the RAM of the machine.  When I
wrote the original, 64MB was a lot of RAM, it is less of an issue these
days but still matters if clustering millions of fingerprints.  Note that
the program cluster doesn't ever store the distance matrix, just the lists
of neighbours for each molecule within the threshold.  This reduces the
memory footprint substantially if you have a tight-enough cluster threshold.
HTH,
Dave



On Mon, Jun 5, 2017 at 11:22 AM, Nils Weskamp 
wrote:

> Hi Michal,
>
> I have done this a couple of times for compound sets up to 10M+ using a
> simplified variant of the Taylor-Butina algorithm. The overall run time
> was in the range of hours to a few days (which could probably be
> optimized, but was fast enough for me).
>
> As you correctly mentioned, getting the (sparse) similarity matrix is
> fairly simple (and can be done in parallel on a cluster). Unfortunately,
> this matrix gets very large (even the sparse version). Most clustering
> algorithms require random access to the matrix, so you have to keep it
> in main memory (which then has to be huge) or calculate it on-the-fly
> (takes forever).
>
> My implementation (in C++, not sure if I can share it) assumes that the
> similarity matrix has been pre-calculated and is stored in one (or
> multiple) files. It reads these files sequentially and whenever a
> compound pair with a similarity beyond the threshold is found, it checks
> whether one of the cpds. is already a centroid (in which case the other
> is assigned to it). Otherwise, one of the compounds is randomly chosen
> as centroid and the other is assigned to it.
>
> This procedure is highly order-dependent and thus not optimal, but has
> to read the whole similarity matrix only once and has limited memory
> consumption (you only need to keep a list of centroids). If you still
> run into memory issues, you can start by clustering with a high
> similarity threshold and then re-cluster centroids and singletons on a
> lower threshold level.
>
> I also played around with DBSCAN for large compound databases, but (as
> previously mentioned by Samo) found it difficult to find the right
> parameters and ended up with a single huge cluster covering 90 percent
> of the database in many cases.
>
> Hope this helps,
> Nils
>
> On 05.06.2017 at 11:02, Michał Nowotka wrote:
> > Is there anyone who has actually done this: clustered >2M compounds using
> > any well-known clustering algorithm and is willing to share a code and
> > some performance statistics?
>
>
> 
>



-- 
David Cosgrove
Freelance computational chemistry and chemoinformatics developer
http://cozchemix.co.uk


Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Nils Weskamp
Hi Michal,

I have done this a couple of times for compound sets up to 10M+ using a
simplified variant of the Taylor-Butina algorithm. The overall run time
was in the range of hours to a few days (which could probably be
optimized, but was fast enough for me).

As you correctly mentioned, getting the (sparse) similarity matrix is
fairly simple (and can be done in parallel on a cluster). Unfortunately,
this matrix gets very large (even the sparse version). Most clustering
algorithms require random access to the matrix, so you have to keep it
in main memory (which then has to be huge) or calculate it on-the-fly
(takes forever).

My implementation (in C++, not sure if I can share it) assumes that the
similarity matrix has been pre-calculated and is stored in one (or
multiple) files. It reads these files sequentially and whenever a
compound pair with a similarity beyond the threshold is found, it checks
whether one of the compounds is already a centroid (in which case the other
is assigned to it). Otherwise, one of the compounds is randomly chosen
as centroid and the other is assigned to it.
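
In outline, that single pass looks something like this (a rough sketch rather
than the actual code, assuming a plain-text file of pre-filtered
"id1 id2 similarity" lines that are all at or above the threshold):

centroid_of = {}   # molecule id -> id of the centroid it was assigned to

with open("pairs.txt") as pairs:
    for line in pairs:
        id1, id2, sim = line.split()
        c1 = centroid_of.get(id1)
        c2 = centroid_of.get(id2)
        if c1 == id1:                      # id1 is already a centroid
            centroid_of.setdefault(id2, id1)
        elif c2 == id2:                    # id2 is already a centroid
            centroid_of.setdefault(id1, id2)
        elif c1 is None and c2 is None:    # neither seen yet: promote one of them
            centroid_of[id1] = id1
            centroid_of[id2] = id1
        # pairs whose members are both already assigned elsewhere are skipped

# anything that never appears in the file ends up as a singleton;
# in practice you would stream the assignments to disk rather than keep them all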

This procedure is highly order-dependent and thus not optimal, but has
to read the whole similarity matrix only once and has limited memory
consumption (you only need to keep a list of centroids). If you still
run into memory issues, you can start by clustering with a high
similarity threshold and then re-cluster centroids and singletons on a
lower threshold level.

I also played around with DBSCAN for large compound databases, but (as
previously mentioned by Samo) found it difficult to find the right
parameters and ended up with a single huge cluster covering 90 percent
of the database in many cases.

Hope this helps,
Nils

Am 05.06.2017 um 11:02 schrieb Michał Nowotka:
> Is there anyone who actually done this: clustered >2M compounds using
> any well-known clustering algorithm and is willing to share a code and
> some performance statistics?




Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Chris Swain
Hi,

I’m just starting but I can add another example

I tried the clustering as described for the Butina clustering
(http://www.rdkit.org/docs/Cookbook.html) using a Jupyter Notebook.

It worked fine on data sets of < 10,000 molecules, but the kernel crashed when I
tried 150,000 molecules.

Plan to try some other examples this week and will report back findings.

Chris


> On 5 Jun 2017, at 10:02, Michał Nowotka  wrote:
> 
> Hi,
> 
> Is there anyone who actually done this: clustered >2M compounds using
> any well-known clustering algorithm and is willing to share a code and
> some performance statistics?
> 
> It's easy to get a sparse distance matrix using chemfp. But if you
> take this matrix and feed it into any scipy.cluster you want get any
> results in a reasonable time.
> 
> We also tried to extract 10 most significant features from the latent
> representation described in this paper:
> https://arxiv.org/pdf/1610.02415v1.pdf for all compounds in ChEMBL and
> then use this web-based tool to generate visualization
> https://github.com/tensorflow/embedding-projector-standalone but
> obviously we didn't get anything useful from this.
> 
> My last attempt was to use sfdp tool from graphviz package to get some
> sort of primitive clustering. I allocated a lot of RAM memory to the
> process but without any luck as well.
> 
> I would be interested in all kinds of hints related to clustering
> millions of compounds, especially using DBSCAN/OPTICS-based clustering
> algorithms.
> 
> Regards,
> 
> Michał Nowotka
> 
> On Mon, Jun 5, 2017 at 9:19 AM, Gonzalo Colmenarejo
>  wrote:
>> Hi Chris,
>> 
>> as far as I know, Butina's sphere exclusion algorithm is the fastest for
>> very large datasets. But if you have 4 million compounds, using RDKit
>> directly can result in very long runs, even after parallellization. For that
>> number of molecules I think there are faster things, like chemfp (see for
>> instance
>> https://chemfp.readthedocs.io/en/latest/using-api.html#taylor-butina-clustering).
>> 
>> Cheers
>> 
>> Gonzalo
>> 
>> On Sun, Jun 4, 2017 at 3:12 PM, Maciek Wójcikowski 
>> wrote:
>>> 
>>> Is there a big difference in the quality of the final dataset between
>>> K-means and random under-sampling of big database (~20M)?
>>> 
>>> 
>>> Pozdrawiam,  |  Best regards,
>>> Maciek Wójcikowski
>>> mac...@wojcikowski.pl
>>> 
>>> 2017-06-04 12:24 GMT+02:00 Samo Turk :
 
 Hi Chris,
 
 There are other options for clustering. According to this:
 http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
 HDBSCAN and K-means scale well. HDBSCAN will find clusters based on
 density and it also allows for outliers, but can be fiddly to find the 
 right
 parameters. You cannot specify the number of clusters (like in Butina 
 case).
 If you want to specify the number of clusters, you can simply use K-means.
 High dimensionality of fingerprints might be a problem for memory
 consumption. In this case you can use PCA to reduce dimensions to something
 manageable. To avoid memory issues with PCA and speed things up I would fit
 the model on random 100k compounds and then just use fit_transform method 
 on
 the rest.
 http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
 
 Cheers,
 Samo
 
 On Sun, Jun 4, 2017 at 9:08 AM, Chris Swain  wrote:
> 
> Hi,
> 
> I want to do clustering on around 4 million structures
> 
> The Rdkit cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests
> 
> "For large sets of molecules (more than 1000-2000), it’s most efficient
> to use the Butina clustering algorithm”
> 
> However it is quite a step up from a few thousand to several million
> and I wondered if anyone had used this algorithm on larger data sets?
> 
> As far as I can tell it is not possible to define the number of
> clusters, is this correct?
> 
> Cheers,
> 
> Chris
> 
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
> 
 
 
 
 --
 Check out the vibrant tech community on one of the world's most
 engaging tech sites, Slashdot.org! http://sdm.link/slashdot
 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Michał Nowotka
Hi,

Is there anyone who has actually done this: clustered >2M compounds using
any well-known clustering algorithm and is willing to share code and
some performance statistics?

It's easy to get a sparse distance matrix using chemfp. But if you
take this matrix and feed it into any scipy.cluster routine you won't get any
results in a reasonable time.

We also tried to extract the 10 most significant features from the latent
representation described in this paper:
https://arxiv.org/pdf/1610.02415v1.pdf for all compounds in ChEMBL and
then use this web-based tool to generate a visualization:
https://github.com/tensorflow/embedding-projector-standalone but
obviously we didn't get anything useful from this.

My last attempt was to use the sfdp tool from the graphviz package to get some
sort of primitive clustering. I allocated a lot of RAM to the
process, but without any luck either.

I would be interested in all kinds of hints related to clustering
millions of compounds, especially using DBSCAN/OPTICS-based clustering
algorithms.

Regards,

Michał Nowotka

On Mon, Jun 5, 2017 at 9:19 AM, Gonzalo Colmenarejo
 wrote:
> Hi Chris,
>
> as far as I know, Butina's sphere exclusion algorithm is the fastest for
> very large datasets. But if you have 4 million compounds, using RDKit
> directly can result in very long runs, even after parallellization. For that
> number of molecules I think there are faster things, like chemfp (see for
> instance
> https://chemfp.readthedocs.io/en/latest/using-api.html#taylor-butina-clustering).
>
> Cheers
>
> Gonzalo
>
> On Sun, Jun 4, 2017 at 3:12 PM, Maciek Wójcikowski 
> wrote:
>>
>> Is there a big difference in the quality of the final dataset between
>> K-means and random under-sampling of big database (~20M)?
>>
>> 
>> Pozdrawiam,  |  Best regards,
>> Maciek Wójcikowski
>> mac...@wojcikowski.pl
>>
>> 2017-06-04 12:24 GMT+02:00 Samo Turk :
>>>
>>> Hi Chris,
>>>
>>> There are other options for clustering. According to this:
>>> http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
>>> HDBSCAN and K-means scale well. HDBSCAN will find clusters based on
>>> density and it also allows for outliers, but can be fiddly to find the right
>>> parameters. You cannot specify the number of clusters (like in Butina case).
>>> If you want to specify the number of clusters, you can simply use K-means.
>>> High dimensionality of fingerprints might be a problem for memory
>>> consumption. In this case you can use PCA to reduce dimensions to something
>>> manageable. To avoid memory issues with PCA and speed things up I would fit
>>> the model on random 100k compounds and then just use fit_transform method on
>>> the rest.
>>> http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html
>>>
>>> Cheers,
>>> Samo
>>>
>>> On Sun, Jun 4, 2017 at 9:08 AM, Chris Swain  wrote:

 Hi,

 I want to do clustering on around 4 million structures

 The Rdkit cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests

 "For large sets of molecules (more than 1000-2000), it’s most efficient
 to use the Butina clustering algorithm”

  However it is quite a step up from a few thousand to several million
 and I wondered if anyone had used this algorithm on larger data sets?

 As far as I can tell it is not possible to define the number of
 clusters, is this correct?

 Cheers,

 Chris


 --
 Check out the vibrant tech community on one of the world's most
 engaging tech sites, Slashdot.org! http://sdm.link/slashdot
 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss

>>>
>>>
>>>
>>> --
>>> Check out the vibrant tech community on one of the world's most
>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>> ___
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>
>>
>>
>> --
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>
>
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> 

Re: [Rdkit-discuss] Clustering

2017-06-05 Thread Gonzalo Colmenarejo
Hi Chris,

as far as I know, Butina's sphere exclusion algorithm is the fastest for
very large datasets. But if you have 4 million compounds, using RDKit
directly can result in very long runs, even after parallelization. For
that number of molecules I think there are faster things, like chemfp (see
for instance
https://chemfp.readthedocs.io/en/latest/using-api.html#taylor-butina-clustering
).
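
For reference, the pure-RDKit route (essentially what the Cookbook shows) looks
roughly like this; the Morgan fingerprints and the 0.35 distance cutoff below
are arbitrary choices, and the lower-triangle distance list is the O(N^2) part
that becomes painful at 4 million:

from rdkit import Chem, DataStructs
from rdkit.Chem import AllChem
from rdkit.ML.Cluster import Butina

mols = [m for m in Chem.SmilesMolSupplier("compounds.smi") if m is not None]
fps = [AllChem.GetMorganFingerprintAsBitVect(m, 2, nBits=1024) for m in mols]

# flattened lower-triangle distance list -- this is the O(N^2) bottleneck
dists = []
for i in range(1, len(fps)):
    sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
    dists.extend([1.0 - s for s in sims])

clusters = Butina.ClusterData(dists, len(fps), 0.35, isDistData=True)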

Cheers

Gonzalo

On Sun, Jun 4, 2017 at 3:12 PM, Maciek Wójcikowski 
wrote:

> Is there a big difference in the quality of the final dataset between
> K-means and random under-sampling of big database (~20M)?
>
> 
> Pozdrawiam,  |  Best regards,
> Maciek Wójcikowski
> mac...@wojcikowski.pl
>
> 2017-06-04 12:24 GMT+02:00 Samo Turk :
>
>> Hi Chris,
>>
>> There are other options for clustering. According to this:
>> http://hdbscan.readthedocs.io/en/latest/performance_and_scalability.html
>> HDBSCAN and K-means scale well. HDBSCAN will find clusters based on
>> density and it also allows for outliers, but can be fiddly to find the
>> right parameters. You cannot specify the number of clusters (like in Butina
>> case). If you want to specify the number of clusters, you can simply use
>> K-means. High dimensionality of fingerprints might be a problem for memory
>> consumption. In this case you can use PCA to reduce dimensions to something
>> manageable. To avoid memory issues with PCA and speed things up I would fit
>> the model on random 100k compounds and then just use fit_transform method
>> on the rest. http://scikit-learn.org/stable/modules/generated/sklea
>> rn.decomposition.PCA.html
>>
>> Cheers,
>> Samo
>>
>> On Sun, Jun 4, 2017 at 9:08 AM, Chris Swain  wrote:
>>
>>> Hi,
>>>
>>> I want to do clustering on around 4 million structures
>>>
>>> The Rdkit cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests
>>>
>>> "For large sets of molecules (more than 1000-2000), it’s most efficient
>>> to use the Butina clustering algorithm”
>>>
>>>  However it is quite a step up from a few thousand to several million
>>> and I wondered if anyone had used this algorithm on larger data sets?
>>>
>>> As far as I can tell it is not possible to define the number of
>>> clusters, is this correct?
>>>
>>> Cheers,
>>>
>>> Chris
>>>
>>> 
>>> --
>>> Check out the vibrant tech community on one of the world's most
>>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>>> ___
>>> Rdkit-discuss mailing list
>>> Rdkit-discuss@lists.sourceforge.net
>>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>>
>>>
>>
>> 
>> --
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>


Re: [Rdkit-discuss] Clustering

2017-06-04 Thread Maciek Wójcikowski
Is there a big difference in the quality of the final dataset between
K-means and random under-sampling of a big database (~20M)?


Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl

2017-06-04 12:24 GMT+02:00 Samo Turk :

> Hi Chris,
>
> There are other options for clustering. According to this: http://hdbscan.
> readthedocs.io/en/latest/performance_and_scalability.html
> HDBSCAN and K-means scale well. HDBSCAN will find clusters based on
> density and it also allows for outliers, but can be fiddly to find the
> right parameters. You cannot specify the number of clusters (like in Butina
> case). If you want to specify the number of clusters, you can simply use
> K-means. High dimensionality of fingerprints might be a problem for memory
> consumption. In this case you can use PCA to reduce dimensions to something
> manageable. To avoid memory issues with PCA and speed things up I would fit
> the model on random 100k compounds and then just use fit_transform method
> on the rest. http://scikit-learn.org/stable/modules/generated/
> sklearn.decomposition.PCA.html
>
> Cheers,
> Samo
>
> On Sun, Jun 4, 2017 at 9:08 AM, Chris Swain  wrote:
>
>> Hi,
>>
>> I want to do clustering on around 4 million structures
>>
>> The Rdkit cookbook (http://www.rdkit.org/docs/Cookbook.html) suggests
>>
>> "For large sets of molecules (more than 1000-2000), it’s most efficient
>> to use the Butina clustering algorithm”
>>
>>  However it is quite a step up from a few thousand to several million and
>> I wondered if anyone had used this algorithm on larger data sets?
>>
>> As far as I can tell it is not possible to define the number of clusters,
>> is this correct?
>>
>> Cheers,
>>
>> Chris
>>
>> 
>> --
>> Check out the vibrant tech community on one of the world's most
>> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>
> 
> --
> Check out the vibrant tech community on one of the world's most
> engaging tech sites, Slashdot.org! http://sdm.link/slashdot
> ___
> Rdkit-discuss mailing list
> Rdkit-discuss@lists.sourceforge.net
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>


Re: [Rdkit-discuss] Clustering - visualization?

2016-05-14 Thread Robert DeLisle
Thanks, Curt!  I'll give those a look.  It'll give me a very good reason to
start digging into SciPy a bit more and exploit the added functionality
that will bring.

Regarding my original question and for anyone else that might be
interested...

I did indeed find an answer through a lot of code dredging.  I found the
Murtagh.ClusterData() function in RDKit, and was able to generate clusters
from that.  The function returns a single-member list, that single member
being a Cluster object.  I can feed that object to ClusterVis.ClusterToImg
to get the dendrogram I wanted.  Here's a short code snippet showing the
pieces.

...
c_tree = Murtagh.ClusterData(dists,nfps,Murtagh.WARDS,isDistData=True)
...
rdkit.ML.Cluster.ClusterVis.ClusterToImg(c_tree[0], size=(500,500),
fileName='test.png')
...

I can then break the cluster tree into subtrees:

...
rdkit.ML.Cluster.ClusterUtils.SplitIntoNClusters(c_tree[0], 5)
...

And I've written a short function to extract out the individual structure
memberships for each group:

...

groups = ClusterUtils.SplitIntoNClusters(c_tree[0], 5)

def GetGroupMembers(grp, memberlist=None):
    # use None instead of a mutable default argument so repeated calls
    # don't keep appending to the same list
    if memberlist is None:
        memberlist = []
    for child in grp.GetChildren():
        if child.GetData() is None:
            GetGroupMembers(child, memberlist)
        else:
            memberlist.append(child.GetData())

    return memberlist

print GetGroupMembers(groups[0])




On Sat, May 14, 2016 at 11:21 AM, Curt Fischer 
wrote:

> Hi Robert,
>
> For the number of molecules you are interested in, it's viable to use
> SciPy / NumPy clustering functions instead of rdkit's built in C-linked
> functions.  This approach will probably not be as fast as rdkit's built-in
> clustering functionalities, and will probably not scale to tens of
> thousands of molecules as well as rdkit's functions, but if you use SciPy
> or NumPy in other types of technical computing, this approach may be more
> transparent, generalizable, and easier to use.
>
> I have an example Jupyter notebook in GitHub that describes what I mean;
> here are the GitHub and nbviewer links:
>
>
> https://github.com/tentrillion/ipython_notebooks/blob/master/chemical_similarity_in_python.ipynb
>
> https://nbviewer.jupyter.org/github/tentrillion/ipython_notebooks/blob/master/chemical_similarity_in_python.ipynb
>
> Here are some of the most important parts of the code for generating a
> dendrogram.
>
> 1. Generate a numpy fingerprint matrix from a list of rdkit Molecules.
>
> for smiles in smiles_list:
>     mol = Chem.MolFromSmiles(smiles)
>     mols.append(mol)
>
> fingerprint_mat = np.vstack(np.asarray(rdmolops.RDKFingerprint(mol, fpSize=2048), dtype='bool')
>                             for mol in mols)
>
>
> 2. Generate the distance matrix.  *pdist* and *squareform* are from
> *scipy.spatial.distance*.
>
> dist_mat = pdist(fingerprint_mat, 'jaccard')
> dist_df = pd.DataFrame(squareform(dist_mat), index=smiles_list, columns=smiles_list)
>
> As far as I can tell, the Jaccard distance is equivalent to one minus the
> Tanimoto similarity.
>
> 3. Perform hierarchical clustering on the distance matrix and show the
> dendrogram (see the github notebook for the plot). *hc* is
> *scipy.cluster.hierarchy*.
>
> z = hc.linkage(dist_mat)
> dendrogram = hc.dendrogram(z, labels=dist_df.columns, leaf_rotation=90)
> plt.show()
>
>
> A helpful page for dendrograms using SciPy is this one:
> https://joernhees.de/blog/2015/08/26/scipy-hierarchical-clustering-and-dendrogram-tutorial/
>
> Good luck!
>
> Curt
>
> On Sat, May 14, 2016 at 9:11 AM, Robert DeLisle 
> wrote:
>
>> Next up is clustering...
>>
>> I've got about 350 structures to cluster and I've worked through the
>> example code from the RDKit Cookbook (
>> http://www.rdkit.org/docs/Cookbook.html#clustering-molecules).  All
>> seems well and good there, but I would like to see the dendrogram.  I see
>> that there is a ClusterVis module to generate images, PDF, and SVG, but all
>> require a Cluster object as input.  I don't find anywhere a description of
>> acquiring or building that object based upon the results of clustering.
>>
>> Any tips?
>>
>> -Kirk
>>
>>
>>
>>
>> --
>> Mobile security can be enabling, not merely restricting. Employees who
>> bring their own devices (BYOD) to work are irked by the imposition of MDM
>> restrictions. Mobile Device Manager Plus allows you to control only the
>> apps on BYO-devices by containerizing them, leaving personal data
>> untouched!
>> https://ad.doubleclick.net/ddm/clk/304595813;131938128;j
>> ___
>> Rdkit-discuss mailing list
>> Rdkit-discuss@lists.sourceforge.net
>> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>>
>>
>

Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-28 Thread Jing Lu
Thanks, Greg,

Yes, scikit-learn will automatically promote to arrays of float with its
check_array() function. What I am currently doing is


fpa = numpy.zeros((len(fp),),numpy.double)
DataStructs.ConvertToNumpyArray(fp,fpa)
np.sum(np.reshape(fpa, (4, -1)), axis = 0)


Is this the same as FoldFingerprint()?


Best,
Jing



On Fri, Aug 28, 2015 at 5:03 AM, Greg Landrum greg.land...@gmail.com
wrote:

 If that doesn't help (and it may not since some Scikit-Learn functions
 automatically promote their arguments to arrays of doubles), you can always
 just generate a shorter fingerprint from the beginning (all the
 fingerprinting functions take an optional argument for this) or fold the
 existing fingerprints to a new size using the function
 rdkit.DataStructs.FoldFingerprint().

 Best,
 -greg


 On Thu, Aug 27, 2015 at 4:33 PM, Maciek Wójcikowski mac...@wojcikowski.pl
  wrote:

 Hi Jing,

 Most fingerprints are binary, thus can be stored as np.bool_, which
 compared to double should be 64 times more memory efficient.

 Best,
 Maciej

 
 Pozdrawiam,  |  Best regards,
 Maciek Wójcikowski
 mac...@wojcikowski.pl

 2015-08-27 16:15 GMT+02:00 Jing Lu ajin...@gmail.com:

 Hi Greg,

 Thanks! It works! But, is that possible to fold the fingerprint to
 smaller size? np.zeros((100,2048)) still takes a lot of memory...


 Best,
 Jing

 On Wed, Aug 26, 2015 at 11:02 PM, Greg Landrum greg.land...@gmail.com
 wrote:


 On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu ajin...@gmail.com wrote:


 So, I wonder is there any way to convert fingerprint to a numpy vector?


 Indeed there is:

 In [11]: from rdkit import Chem

 In [12]: from rdkit import DataStructs

 In [13]: import numpy

 In [14]: m =Chem.MolFromSmiles('C1CCC1')

 In [15]: fp = Chem.RDKFingerprint(m)

 In [16]: fpa = numpy.zeros((len(fp),),numpy.double)

 In [17]: DataStructs.ConvertToNumpyArray(fp,fpa)


 Best,
 -greg




 --

 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss






Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-28 Thread Maciek Wójcikowski
One small note from me - I would still use a different aggregating function
instead of sum to get a binary FP:
np.reshape(fpa, (4, -1)).any(axis = 0)
I guess it doesn't change a thing with Tanimoto, but if you try other
distances you can get unexpected results (assuming there are bit clashes).
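
For what it's worth, the OR-fold should match what DataStructs.FoldFingerprint()
gives (if I remember right, folding ORs bit i onto position i % newSize); a
quick sketch to check, assuming the default 2048-bit RDKFingerprint and a fold
factor of 4:

import numpy as np
from rdkit import Chem, DataStructs

mol = Chem.MolFromSmiles('C1CCC1')
fp = Chem.RDKFingerprint(mol)                        # 2048 bits by default

fpa = np.zeros((len(fp),), np.double)
DataStructs.ConvertToNumpyArray(fp, fpa)
folded_np = np.reshape(fpa, (4, -1)).any(axis=0)     # OR-fold, length 512

folded_rd = DataStructs.FoldFingerprint(fp, 4)       # 512-bit ExplicitBitVect
folded_rd_np = np.array([folded_rd.GetBit(i) for i in range(len(folded_rd))],
                        dtype=bool)

print((folded_np == folded_rd_np).all())             # expect True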


Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl

2015-08-28 17:17 GMT+02:00 Jing Lu ajin...@gmail.com:

 Thanks, Greg,

 Yes, sciket learn will automatically promote to arrays of float with 
 check_array()
 function. What I am currently doing is


 fpa = numpy.zeros((len(fp),),numpy.double)
 DataStructs.ConvertToNumpyArray(fp,fpa)
 np.sum(np.reshape(fpa, (4, -1)), axis = 0)


 Is this the same as FoldFingerprint()?


 Best,Jing



 On Fri, Aug 28, 2015 at 5:03 AM, Greg Landrum greg.land...@gmail.com
 wrote:

 If that doesn't help (and it may not since some Scikit-Learn functions
 automatically promote their arguments to arrays of doubles), you can always
 just generate a shorter fingerprint from the beginning (all the
 fingerprinting functions take an optional argument for this) or fold the
 existing fingerprints to a new size using the function
 rdkit.DataStructs.FoldFingerprint().

 Best,
 -greg


 On Thu, Aug 27, 2015 at 4:33 PM, Maciek Wójcikowski 
 mac...@wojcikowski.pl wrote:

 Hi Jing,

 Most fingerprints are binary, thus can be stored as np.bool_, which
 compared to double should be 64 times more memory efficient.

 Best,
 Maciej

 
 Pozdrawiam,  |  Best regards,
 Maciek Wójcikowski
 mac...@wojcikowski.pl

 2015-08-27 16:15 GMT+02:00 Jing Lu ajin...@gmail.com:

 Hi Greg,

 Thanks! It works! But, is that possible to fold the fingerprint to
 smaller size? np.zeros((100,2048)) still takes a lot of memory...


 Best,
 Jing

 On Wed, Aug 26, 2015 at 11:02 PM, Greg Landrum greg.land...@gmail.com
 wrote:


 On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu ajin...@gmail.com wrote:


 So, I wonder is there any way to convert fingerprint to a numpy
 vector?


 Indeed there is:

 In [11]: from rdkit import Chem

 In [12]: from rdkit import DataStructs

 In [13]: import numpy

 In [14]: m =Chem.MolFromSmiles('C1CCC1')

 In [15]: fp = Chem.RDKFingerprint(m)

 In [16]: fpa = numpy.zeros((len(fp),),numpy.double)

 In [17]: DataStructs.ConvertToNumpyArray(fp,fpa)


 Best,
 -greg




 --

 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss






 --

 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss




Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-27 Thread Jing Lu
Hi Greg,

Thanks! It works! But is it possible to fold the fingerprint to a smaller
size? np.zeros((100,2048)) still takes a lot of memory...


Best,
Jing

On Wed, Aug 26, 2015 at 11:02 PM, Greg Landrum greg.land...@gmail.com
wrote:


 On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu ajin...@gmail.com wrote:


 So, I wonder is there any way to convert fingerprint to a numpy vector?


 Indeed there is:

 In [11]: from rdkit import Chem

 In [12]: from rdkit import DataStructs

 In [13]: import numpy

 In [14]: m =Chem.MolFromSmiles('C1CCC1')

 In [15]: fp = Chem.RDKFingerprint(m)

 In [16]: fpa = numpy.zeros((len(fp),),numpy.double)

 In [17]: DataStructs.ConvertToNumpyArray(fp,fpa)


 Best,
 -greg




Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-27 Thread Maciek Wójcikowski
Hi Jing,

Most fingerprints are binary, thus can be stored as np.bool_, which
compared to double should be 64 times more memory efficient.
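
For example (a small sketch; note that np.bool_ is one byte per position, so
the full 64x saving over doubles only shows up once you actually pack the bits,
e.g. with np.packbits):

import numpy as np
from rdkit import Chem, DataStructs

mol = Chem.MolFromSmiles('C1CCC1')
fp = Chem.RDKFingerprint(mol)                      # 2048 bits

fpa = np.zeros((len(fp),), np.double)
DataStructs.ConvertToNumpyArray(fp, fpa)

as_bool = fpa.astype(np.bool_)      # 1 byte per bit position: 8x smaller than double
as_packed = np.packbits(as_bool)    # 1 bit per bit position: 64x smaller than double

print(fpa.nbytes, as_bool.nbytes, as_packed.nbytes)   # 16384 2048 256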

Best,
Maciej


Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl

2015-08-27 16:15 GMT+02:00 Jing Lu ajin...@gmail.com:

 Hi Greg,

 Thanks! It works! But, is that possible to fold the fingerprint to smaller
 size? np.zeros((100,2048)) still takes a lot of memory...


 Best,
 Jing

 On Wed, Aug 26, 2015 at 11:02 PM, Greg Landrum greg.land...@gmail.com
 wrote:


 On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu ajin...@gmail.com wrote:


 So, I wonder is there any way to convert fingerprint to a numpy vector?


 Indeed there is:

 In [11]: from rdkit import Chem

 In [12]: from rdkit import DataStructs

 In [13]: import numpy

 In [14]: m =Chem.MolFromSmiles('C1CCC1')

 In [15]: fp = Chem.RDKFingerprint(m)

 In [16]: fpa = numpy.zeros((len(fp),),numpy.double)

 In [17]: DataStructs.ConvertToNumpyArray(fp,fpa)


 Best,
 -greg




 --

 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss




Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-26 Thread Greg Landrum
On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu ajin...@gmail.com wrote:


 So, I wonder is there any way to convert fingerprint to a numpy vector?


Indeed there is:

In [11]: from rdkit import Chem

In [12]: from rdkit import DataStructs

In [13]: import numpy

In [14]: m =Chem.MolFromSmiles('C1CCC1')

In [15]: fp = Chem.RDKFingerprint(m)

In [16]: fpa = numpy.zeros((len(fp),),numpy.double)

In [17]: DataStructs.ConvertToNumpyArray(fp,fpa)


Best,
-greg


Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Dimitri Maziuk
On 08/23/2015 11:38 AM, Jing Lu wrote:
 Thanks, Andrew!
 
 Yes, I was thinking about using scikit-learn also. But I guess I need to
 use a data structure for sparse matrix and define a function for
 connectivity. I hope the memory issue won't be a problem.
 Most AgglomerativeClustering algorithms have time complexity with N^2. Will
 that be a problem?

Usual programming solutions are
- if you don't need the whole matrix in RAM at once, cache it to disk.
Otherwise try to split the job into smaller batches.
- Big-Oh notation is relative complexity. In absolute terms, if it
finishes overnight and you only intend to run it a handful of times, N^2
is not worth worrying about. Otherwise try to split into smaller batches
that you can run in parallel on a cluster of computers.

FWIW
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Jing Lu
Thanks, Takayuki,

Both Repeated Bisection clustering and K-means clustering
need the number of clusters as input, right?


Best,
Jing

On Sun, Aug 23, 2015 at 1:17 AM, Taka Seri serit...@gmail.com wrote:

 Dear Jing,

 How about your trying using bayon ?
 https://code.google.com/p/bayon/
 It's not function of RDKit, but I think the library can cluster molecules
 using ECFP4.

 Unfortunately, input file format of bayon is not distance matrix but easy
 to prepare the format.

 Best regards.

 Takayuki


 On Sunday, 23 August 2015 at 12:03, Jing Lu ajin...@gmail.com wrote:

 Currently, I prefer fingerprint based clustering, because it's hard to
 set the cutoff for scaffold based clustering. Does RDKit have scaffold
 based clustering?

 On Sat, Aug 22, 2015 at 10:56 PM, abhik1...@gmail.com wrote:

 Hi, how about scaffold based clustering . You extract the scaffolds and
 then cluster it and then put the respective scaffold compounds inside the
 cluster .

 Sent from my iPhone

  On Aug 22, 2015, at 8:43 PM, Jing Lu ajin...@gmail.com wrote:
 
  Dear RDKit users,
 
  If I want to cluster more than 1M molecules by ECFP4. How could I do
 it? If I calculate the distance between every pair of molecules, the size
 of distance matrix will be too big. Does RDKit support any heuristic
 clustering algorithm without calculating the distance matrix of the whole
 library?
 
 
 
  Thanks,
  Jing
 
 --
  ___
  Rdkit-discuss mailing list
  Rdkit-discuss@lists.sourceforge.net
  https://lists.sourceforge.net/lists/listinfo/rdkit-discuss



 --
 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss




Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Andrew Dalke
On Aug 23, 2015, at 3:43 AM, Jing Lu wrote:
 If I want to cluster more than 1M molecules by ECFP4. How could I do it? If I 
 calculate the distance between every pair of molecules, the size of distance 
 matrix will be too big. Does RDKit support any heuristic clustering algorithm 
 without calculating the distance matrix of the whole library?

You should look to a third-party package, like scikit-learn from 
http://scikit-learn.org/ , for clustering. That has a very extensive set of 
clustering algorithms, including k-means at 
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html .

Though you may be interested in the note on that page: "For large scale 
learning (say n_samples > 10k) MiniBatchKMeans is probably much faster than 
the default batch implementation." 
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html
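
A minimal sketch of that route, assuming the fingerprints already sit in a
numpy array X of shape (n_molecules, n_bits), and with an arbitrary choice of
1000 clusters:

import numpy as np
from sklearn.cluster import MiniBatchKMeans

X = np.load("fingerprints.npy")      # placeholder: your (n_molecules, n_bits) matrix
km = MiniBatchKMeans(n_clusters=1000, batch_size=10000, random_state=0)
labels = km.fit_predict(X.astype(np.float32))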

My memory is that k-means depends on a Euclidean distance between the records, 
which is different from the usual Tanimoto (or metric-like 1-Tanimoto) in 
cheminformatics. 

If you would rather use Tanimoto, then perhaps try a method like 
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html
 ?

If you go that route, and you want to build the full 1M x 1M distance matrix, 
the usual approach is to ignore similarities below a given threshold T (e.g., 
T = 0.8). This can be thought of as either setting those entries to 0.0, or 
specifying an ignore flag. In either case, the result can be stored in a 
sparse matrix, which is efficient at storing only the data of interest.
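
One way to build such a thresholded sparse matrix with the RDKit plus scipy,
row by row (a sketch; the Python double loop is the slow part at this scale,
which is exactly what chemfp is good at avoiding):

from scipy import sparse
from rdkit import DataStructs

def sparse_similarity(fps, T=0.8):
    # keep only pairwise Tanimoto similarities >= T; everything else is an implicit 0
    rows, cols, vals = [], [], []
    for i in range(1, len(fps)):
        sims = DataStructs.BulkTanimotoSimilarity(fps[i], fps[:i])
        for j, s in enumerate(sims):
            if s >= T:
                rows.append(i)
                cols.append(j)
                vals.append(s)
    n = len(fps)
    return sparse.coo_matrix((vals, (rows, cols)), shape=(n, n))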

Using my package, chemfp, from http://chemfp.com/ you can compute a sparse 
matrix for 1M x 1M fingerprints in about an hour using a laptop or desktop.

The question would then be how to adapt the sparse output format from chemfp to 
the sparse input format for your clustering method of choice.

Best regards,

Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Andrew Dalke
On Aug 23, 2015, at 6:38 PM, Jing Lu wrote:
 I hope the memory issue won't be a problem.

That's up to you and your choice of threshold.

  Most AgglomerativeClustering algorithms have time complexity with N^2. Will 
 that be a problem?

You have to decide for yourself what counts as a problem. If you want to get 
it done in 1 minute with a threshold of 0.2, then you've got a problem. If 
you're willing to take a month, then there's no problem.

With chemfp, Taylor-Butina clustering, at 
http://chemfp.readthedocs.org/en/latest/using-api.html#taylor-butina-clustering 
, took 35 seconds for 100,000 fingerprints. The NxN calculation is also N^2 time, so 
should only take about an hour for 1 million fingerprints.

Best of course is to start with a smaller system first, see if it works, and 
only then try to scale up. Then you'll have experience of which methods are 
appropriate and what your time constraints are.


Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-22 Thread Taka Seri
Dear Jing,

How about trying bayon?
https://code.google.com/p/bayon/
It's not part of RDKit, but I think the library can cluster molecules
using ECFP4.

Unfortunately, bayon's input file format is not a distance matrix, but it is
easy to prepare.

Best regards.

Takayuki


On Sunday, 23 August 2015 at 12:03, Jing Lu ajin...@gmail.com wrote:

 Currently, I prefer fingerprint based clustering, because it's hard to set
 the cutoff for scaffold based clustering. Does RDKit have scaffold based
 clustering?

 On Sat, Aug 22, 2015 at 10:56 PM, abhik1...@gmail.com wrote:

 Hi, how about scaffold based clustering . You extract the scaffolds and
 then cluster it and then put the respective scaffold compounds inside the
 cluster .

 Sent from my iPhone

  On Aug 22, 2015, at 8:43 PM, Jing Lu ajin...@gmail.com wrote:
 
  Dear RDKit users,
 
  If I want to cluster more than 1M molecules by ECFP4. How could I do
 it? If I calculate the distance between every pair of molecules, the size
 of distance matrix will be too big. Does RDKit support any heuristic
 clustering algorithm without calculating the distance matrix of the whole
 library?
 
 
 
  Thanks,
  Jing
 
 --
  ___
  Rdkit-discuss mailing list
  Rdkit-discuss@lists.sourceforge.net
  https://lists.sourceforge.net/lists/listinfo/rdkit-discuss



 --
 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss



Re: [Rdkit-discuss] Clustering in RDKit (take 2 - missing wiki link)

2015-04-14 Thread JP
This is now at:
https://github.com/rdkit/rdkit/blob/master/Docs/Book/Cookbook.rst

-
Jean-Paul Ebejer
Early Stage Researcher

On 11 April 2015 at 10:46, JP jeanpaul.ebe...@inhibox.com wrote:

 Hi RDKitters!

 I have a bit of python RDKit clustering code using Butina which is
 commented with:
 # Ripped off from https://code.google.com/p/rdkit/wiki/ClusteringMolecules

 Sadly that page is gone, and as it happens I need to refer back to it.

 This was written by Greg, I think as a result of this nudge:

 https://www.mail-archive.com/rdkit-discuss@lists.sourceforge.net/msg02449.html

 This page doesn't seem to exist anymore in the Wiki.  Is this because of a
 technical / administrative glitch?  Or has this been removed purposefully
 (perhaps the functionality is not supported anymore)?

 Thanks!

 -
 Jean-Paul Ebejer
 Early Stage Researcher



Re: [Rdkit-discuss] Clustering functions in Java API

2015-02-23 Thread Maciek Wójcikowski
Hello,

If interested in clustering in python I can recommend, as usual, sklearn:
http://scikit-learn.org/stable/modules/clustering.html
It's pretty much all you should need. Have fun!


Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl

2015-02-23 11:43 GMT+01:00 Anthony Bradley anthony.brad...@worc.ox.ac.uk:

   Hi Anthony,



 On Sun, Feb 22, 2015 at 11:03 AM, Anthony Bradley 
 anthony.brad...@worc.ox.ac.uk wrote:

 Hi all,

 I am currently working with RDKit from the Java API (well jython actually).

 As has been discussed most of the documentation for this is found by
 trawling:

 Code/JavaWrappers/gmwrapper/src-test/org/RDKit/
 and
 Code/JavaWrappers/gmwrapper/src/org/RDKit/

 However I'm trying to perform a simple clustering. I can build my distance
 matrix - but I can't see where the actual clustering algorithms live.

 It may well be my grepping skills are not what they should be!



 No need to have any concerns about your skills with grep, the clustering
 functionality is not exposed via the SWIG wrappers. As currently configured
 the code isn't available as a library, it's really only useable from
 python. It's a medium-sized amount of work to convert this to a library, so
 it's doable, but I'm not sure it's worth it.



 That seems fair enough and there are definitely other options out there.
 It was more of method consistency thing – so I could be using the same code
 from the python / jython side.



 I've been assuming that there are high(er) quality replacements available
 for most of the RDKit machine learning functionality. Since it's somewhat
 removed from the cheminformatics focus, I haven't really put any time
 into that code in the past few years. Does this sound wrong to anyone? Any
 arguments that the clustering code is worth investing some time in?



 Unless anybody else is interested – I can see why it would be low
 priority!



 -greg



 Thanks a lot for responding so quickly and effectively!



 Best,



 Anthony




 --
 Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
 from Actuate! Instantly Supercharge Your Business Reports and Dashboards
 with Interactivity, Sharing, Native Excel Exports, App Integration  more
 Get technology previously reserved for billion-dollar corporations, FREE

 http://pubads.g.doubleclick.net/gampad/clk?id=190641631iu=/4140/ostg.clktrk
 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss




Re: [Rdkit-discuss] Clustering functions in Java API

2015-02-23 Thread Patrick Walters
I agree that there are plenty of implementations of clustering, machine
learning, etc.  It would be better for the RDKit developers to focus on
cheminformatics.   This being said, there are some opportunities for domain
specific performance enhancement.  One of the slow steps in many clustering
algorithms is the calculation of a distance matrix and identification of
neighbors.  If you're clustering fingerprints, I'd recommend looking at Andrew
Dalke's ChemFP http://chemfp.com/.  Andrew has applied a multitude of
tricks that can make clustering blazingly fast.   The ChemFP examples
include an implementation of Taylor-Butina clustering.  Even better, ChemFP
works out of the box with the RDKit.

Pat



On Mon, Feb 23, 2015 at 7:02 AM, Maciek Wójcikowski mac...@wojcikowski.pl
wrote:

 Hello,

 If interested in clustering in python I can recommend, as usual, sklearn:
 http://scikit-learn.org/stable/modules/clustering.html
 It's pretty much all you should need. Have fun!

 
 Pozdrawiam,  |  Best regards,
 Maciek Wójcikowski
 mac...@wojcikowski.pl

 2015-02-23 11:43 GMT+01:00 Anthony Bradley anthony.brad...@worc.ox.ac.uk
 :

   Hi Anthony,



 On Sun, Feb 22, 2015 at 11:03 AM, Anthony Bradley 
 anthony.brad...@worc.ox.ac.uk wrote:

 Hi all,

 I am currently working with RDKit from the Java API (well jython
 actually).

 As has been discussed most of the documentation for this is found by
 trawling:

 Code/JavaWrappers/gmwrapper/src-test/org/RDKit/
 and
 Code/JavaWrappers/gmwrapper/src/org/RDKit/

 However I'm trying to perform a simple clustering. I can build my
 distance matrix - but I can't see where the actual clustering algorithms
 live.

 It may well be my grepping skills are not what they should be!



 No need to have any concerns about your skills with grep, the clustering
 functionality is not exposed via the SWIG wrappers. As currently configured
 the code isn't available as a library, it's really only useable from
 python. It's a medium-sized amount of work to convert this to a library, so
 it's doable, but I'm not sure it's worth it.



 That seems fair enough and there are definitely other options out there.
 It was more of method consistency thing – so I could be using the same code
 from the python / jython side.



 I've been assuming that there are high(er) quality replacements available
 for most of the RDKit machine learning functionality. Since it's somewhat
 removed from the cheminformatics focus, I haven't really put any time
 into that code in the past few years. Does this sound wrong to anyone? Any
 arguments that the clustering code is worth investing some time in?



 Unless anybody else is interested – I can see why it would be low
 priority!



 -greg



 Thanks a lot for responding so quickly and effectively!



 Best,



 Anthony




 --
 Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
 from Actuate! Instantly Supercharge Your Business Reports and Dashboards
 with Interactivity, Sharing, Native Excel Exports, App Integration  more
 Get technology previously reserved for billion-dollar corporations, FREE

 http://pubads.g.doubleclick.net/gampad/clk?id=190641631iu=/4140/ostg.clktrk
 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss




 --
 Download BIRT iHub F-Type - The Free Enterprise-Grade BIRT Server
 from Actuate! Instantly Supercharge Your Business Reports and Dashboards
 with Interactivity, Sharing, Native Excel Exports, App Integration  more
 Get technology previously reserved for billion-dollar corporations, FREE

 http://pubads.g.doubleclick.net/gampad/clk?id=190641631iu=/4140/ostg.clktrk
 ___
 Rdkit-discuss mailing list
 Rdkit-discuss@lists.sourceforge.net
 https://lists.sourceforge.net/lists/listinfo/rdkit-discuss




Re: [Rdkit-discuss] Clustering functions in Java API

2015-02-23 Thread Greg Landrum
On Mon, Feb 23, 2015 at 4:24 PM, Patrick Walters wpwalt...@gmail.com
wrote:

 I agree that there are plenty of implementations of clustering, machine
 learning, etc.  It would be better for the RDKit developers to focus on
 cheminformatics.   This being said, there are some opportunities for domain
 specific performance enhancement.  One of the slow steps in many clustering
 algorithms is the calculation of a distance matrix and identification of
 neighbors.  If you're clustering fingerprints, I'd recommend looking at Andrew
 Dalke's ChemFP http://chemfp.com/.  Andrew has applied a multitude of
 tricks that can make clustering blazingly fast.   The ChemFP examples
 include an implementation of Taylor-Butina clustering.  Even better, ChemFP
 works out of the box with the RDKit.


Actually, coupling chemfp to a hierarchical clustering algorithm like the
Murtagh code the RDKit includes would be pretty cool...

-greg


Re: [Rdkit-discuss] Clustering functions in Java API

2015-02-23 Thread Greg Landrum
Hi Anthony,

On Sun, Feb 22, 2015 at 11:03 AM, Anthony Bradley 
anthony.brad...@worc.ox.ac.uk wrote:

 Hi all,

 I am currently working with RDKit from the Java API (well jython actually).

 As has been discussed most of the documentation for this is found by
 trawling:

 Code/JavaWrappers/gmwrapper/src-test/org/RDKit/
 and
 Code/JavaWrappers/gmwrapper/src/org/RDKit/

 However I'm trying to perform a simple clustering. I can build my distance
 matrix - but I can't see where the actual clustering algorithms live.

 It may well be my grepping skills are not what they should be!


No need to have any concerns about your skills with grep, the clustering
functionality is not exposed via the SWIG wrappers. As currently configured
the code isn't available as a library, it's really only useable from
python. It's a medium-sized amount of work to convert this to a library, so
it's doable, but I'm not sure it's worth it.

I've been assuming that there are high(er) quality replacements available
for most of the RDKit machine learning functionality. Since it's somewhat
removed from the cheminformatics focus, I haven't really put any time
into that code in the past few years. Does this sound wrong to anyone? Any
arguments that the clustering code is worth investing some time in?

-greg