Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-28 Thread Jing Lu
Thanks, Greg,

Yes, scikit-learn will automatically promote to arrays of float with its
check_array() function. What I am currently doing is


fpa = numpy.zeros((len(fp),), numpy.double)
DataStructs.ConvertToNumpyArray(fp, fpa)
numpy.sum(numpy.reshape(fpa, (4, -1)), axis=0)


Is this the same as FoldFingerprint()?


Best,
Jing



On Fri, Aug 28, 2015 at 5:03 AM, Greg Landrum greg.land...@gmail.com
wrote:

 If that doesn't help (and it may not since some Scikit-Learn functions
 automatically promote their arguments to arrays of doubles), you can always
 just generate a shorter fingerprint from the beginning (all the
 fingerprinting functions take an optional argument for this) or fold the
 existing fingerprints to a new size using the function
 rdkit.DataStructs.FoldFingerprint().

 Best,
 -greg


--
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-28 Thread Maciek Wójcikowski
One small note from me: I would still use another aggregation function
instead of sum, to get a binary FP:
np.reshape(fpa, (4, -1)).any(axis = 0)
I guess it doesn't change a thing with Tanimoto, but if you try other
distances then you can get unexpected results (assuming there are collisions).
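
To make the difference between the two aggregations concrete, here is a small sketch on a toy 16-bit vector. The helper names are made up for illustration. It assumes that FoldFingerprint() ORs together the bits that land on the same folded position (bit index modulo the new size), which is worth checking against the RDKit documentation before relying on it.

```python
import numpy as np

def fold_counts(bits, factor):
    """Fold by summing: position i counts the on bits at i, i+n, i+2n, ..."""
    return np.reshape(bits, (factor, -1)).sum(axis=0)

def fold_binary(bits, factor):
    """Fold by OR-ing: position i is on if any bit mapped to it is on."""
    return np.reshape(bits, (factor, -1)).any(axis=0)

# Toy 16-bit fingerprint with on bits at 1, 5 and 10, folded by a factor of 4.
fpa = np.zeros(16, dtype=np.double)
fpa[[1, 5, 10]] = 1.0

counts = fold_counts(fpa, 4)   # bits 1 and 5 collide at position 1
folded = fold_binary(fpa, 4)

print(counts)  # [0. 2. 1. 0.]
print(folded)  # [False  True  True False]
```

The sum keeps a count of 2 where two bits collide, while any() gives a plain on/off value, so the two only agree when no collisions happen.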


Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl



Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-27 Thread Jing Lu
Hi Greg,

Thanks! It works! But is it possible to fold the fingerprint to a smaller
size? np.zeros((100,2048)) still takes a lot of memory...


Best,
Jing



Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-27 Thread Maciek Wójcikowski
Hi Jing,

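Most fingerprints are binary, and thus can be stored as np.bool_, which
takes 1 byte per element and so should be 8 times more memory efficient than
double (8 bytes per element). Getting the full 64 times saving needs bit
packing, e.g. with np.packbits.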
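
The sizes are easy to check directly. This sketch uses an all-zero matrix as a stand-in for real fingerprints and compares double, np.bool_ and bit-packed storage; the matrix shape is just the one mentioned earlier in the thread.

```python
import numpy as np

n_mols, n_bits = 100, 2048

as_double = np.zeros((n_mols, n_bits), dtype=np.double)  # 8 bytes per bit
as_bool = as_double.astype(np.bool_)                     # 1 byte per bit
as_packed = np.packbits(as_bool, axis=1)                 # 8 bits per byte

print(as_double.nbytes, as_bool.nbytes, as_packed.nbytes)
# 1638400 204800 25600

# The packed form can be expanded again when a 0/1 array is needed.
restored = np.unpackbits(as_packed, axis=1).astype(np.bool_)
```

Note that the packed form has to be unpacked (or handled with bitwise operations) before most numpy or scikit-learn functions can use it.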

Best,
Maciej


Pozdrawiam,  |  Best regards,
Maciek Wójcikowski
mac...@wojcikowski.pl



Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-26 Thread Greg Landrum
On Thu, Aug 27, 2015 at 3:00 AM, Jing Lu ajin...@gmail.com wrote:


 So, I wonder is there any way to convert fingerprint to a numpy vector?


Indeed there is:

from rdkit import Chem
from rdkit import DataStructs
import numpy

m = Chem.MolFromSmiles('C1CCC1')
fp = Chem.RDKFingerprint(m)
# Destination array with one entry per fingerprint bit
fpa = numpy.zeros((len(fp),), numpy.double)
# Fills fpa in place with the bits of fp
DataStructs.ConvertToNumpyArray(fp, fpa)


Best,
-greg


Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Dimitri Maziuk
On 08/23/2015 11:38 AM, Jing Lu wrote:
 Thanks, Andrew!
 
 Yes, I was thinking about using scikit-learn also. But I guess I need to
 use a data structure for sparse matrix and define a function for
 connectivity. I hope the memory issue won't be a problem.
 Most AgglomerativeClustering algorithms have time complexity with N^2. Will
 that be a problem?

Usual programming solutions are
- if you don't need the whole matrix in RAM at once, cache it to disk.
Otherwise try to split the job into smaller batches.
- Big-Oh notation is relative complexity. In absolute terms, if it
finishes overnight and you only intend to run it a handful of times, N^2
is not worth worrying about. Otherwise try to split into smaller batches
that you can run in parallel on a cluster of computers.
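
As a rough sketch of the first point, numpy's memmap can keep a fingerprint matrix on disk and hand it out one batch at a time. The sizes, the file name and the iter_batches helper are made up for illustration, and random bits stand in for real fingerprints.

```python
import os
import tempfile
import numpy as np

def iter_batches(n_rows, batch_size):
    """Yield (start, stop) row ranges that together cover 0..n_rows."""
    for start in range(0, n_rows, batch_size):
        yield start, min(start + batch_size, n_rows)

# Toy sizes; random bits stand in for real fingerprints.
n_mols, n_bits = 1000, 64
rng = np.random.default_rng(0)
path = os.path.join(tempfile.mkdtemp(), "fps.dat")

# Write the fingerprint matrix to disk one batch at a time.
fps = np.memmap(path, dtype=np.uint8, mode="w+", shape=(n_mols, n_bits))
for start, stop in iter_batches(n_mols, 256):
    fps[start:stop] = rng.integers(0, 2, size=(stop - start, n_bits), dtype=np.uint8)
fps.flush()
del fps

# Reopen read-only and process batch by batch, so that only the rows
# currently being worked on need to be in RAM.
fps = np.memmap(path, dtype=np.uint8, mode="r", shape=(n_mols, n_bits))
on_bits_per_mol = np.concatenate(
    [fps[start:stop].sum(axis=1) for start, stop in iter_batches(n_mols, 256)]
)
```

The same batch ranges could be handed to separate worker processes or cluster jobs, which is the second point.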

FWIW
-- 
Dimitri Maziuk
Programmer/sysadmin
BioMagResBank, UW-Madison -- http://www.bmrb.wisc.edu





Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Jing Lu
Thanks, Takayuki,

Both Repeated Bisection clustering and K-means clustering need the number
of clusters as input, right?


Best,
Jing



Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Andrew Dalke
On Aug 23, 2015, at 3:43 AM, Jing Lu wrote:
 If I want to cluster more than 1M molecules by ECFP4, how could I do it? If I
 calculate the distance between every pair of molecules, the size of the distance
 matrix will be too big. Does RDKit support any heuristic clustering algorithm
 without calculating the distance matrix of the whole library?

You should look to a third-party package, like scikit-learn from 
http://scikit-learn.org/ , for clustering. That has a very extensive set of 
clustering algorithms, including k-means at 
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.KMeans.html .

Though you may be interested in the note on that page: "For large scale
learning (say n_samples > 10k) MiniBatchKMeans is probably much faster than
the default batch implementation."
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.MiniBatchKMeans.html

My memory is that k-means depends on a Euclidean distance between the records, 
which is different from the usual Tanimoto (or metric-like 1-Tanimoto) in 
cheminformatics. 
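
A small illustration of why the two measures can disagree on binary fingerprints: for on-bit counts A and B with c bits in common, the squared Euclidean distance is A + B - 2c while Tanimoto is c / (A + B - c). The helper names below are made up for this sketch, and returning 0.0 for two empty fingerprints is an arbitrary convention chosen here.

```python
import numpy as np

def bits(on, n=8):
    """Build a length-n boolean vector with the given positions set."""
    v = np.zeros(n, dtype=bool)
    v[list(on)] = True
    return v

def tanimoto(a, b):
    """Tanimoto similarity of two boolean vectors (0.0 if both are empty)."""
    either = np.count_nonzero(a | b)
    return np.count_nonzero(a & b) / either if either else 0.0

def euclidean(a, b):
    """Euclidean distance between two boolean vectors treated as 0/1 floats."""
    return float(np.linalg.norm(a.astype(float) - b.astype(float)))

# Two pairs that each differ in exactly two bit positions.
small_a, small_b = bits({0}), bits({1})
large_a, large_b = bits({0, 1, 2, 3, 4}), bits({0, 1, 2, 3, 5})

print(euclidean(small_a, small_b), tanimoto(small_a, small_b))
print(euclidean(large_a, large_b), tanimoto(large_a, large_b))
```

Both pairs have the same Euclidean distance (the square root of 2), but the first has Tanimoto 0 and the second 4/6, so a Euclidean-based method sees them as equally far apart.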

If you would rather use Tanimoto, then perhaps try a method like
http://scikit-learn.org/stable/modules/generated/sklearn.cluster.AgglomerativeClustering.html ?

If you go that route, and you want to build the full 1M x 1M distance matrix,
the usual approach is to ignore similarities below a given threshold T (e.g.,
T = 0.8). This can be thought of as either setting those entries to 0.0, or
specifying an ignore flag. In either case, the result can be stored in a
sparse matrix, which is efficient at storing only the data of interest.
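
As a plain-numpy sketch of the thresholding idea (not how chemfp or any other package does it), the similarities can be computed one block of rows at a time, keeping only the pairs at or above T. The function names and the block size are assumptions for illustration, and pairs of empty fingerprints are given similarity 0 here.

```python
import numpy as np

def tanimoto_block(block, fps, counts_block, counts_all):
    """Tanimoto similarity of every row of `block` against every row of `fps`."""
    common = block.astype(np.int32) @ fps.astype(np.int32).T
    union = counts_block[:, None] + counts_all[None, :] - common
    # Pairs of empty fingerprints have union 0; report similarity 0 for them.
    out = np.zeros(common.shape, dtype=float)
    return np.divide(common, union, out=out, where=union > 0)

def sparse_similarities(fps, threshold, block_size=1024):
    """Return (rows, cols, sims) for all pairs i < j with sim >= threshold."""
    fps = np.asarray(fps, dtype=np.uint8)
    counts = fps.sum(axis=1, dtype=np.int32)
    rows, cols, sims = [], [], []
    for start in range(0, len(fps), block_size):
        stop = min(start + block_size, len(fps))
        sim = tanimoto_block(fps[start:stop], fps, counts[start:stop], counts)
        local_i, j = np.nonzero(sim >= threshold)
        i = local_i + start
        keep = i < j  # upper triangle only, which also drops self pairs
        rows.append(i[keep])
        cols.append(j[keep])
        sims.append(sim[local_i[keep], j[keep]])
    return np.concatenate(rows), np.concatenate(cols), np.concatenate(sims)

toy = np.array([
    [1, 1, 1, 0, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [1, 1, 1, 0, 0, 0],
])
rows, cols, sims = sparse_similarities(toy, threshold=0.8, block_size=2)
print(rows, cols, sims)  # [0] [3] [1.]
```

The three arrays are in coordinate (row, column, value) form, which is what scipy.sparse.coo_matrix accepts if a real sparse matrix object is needed. Only one block of the dense matrix exists in memory at any time.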

Using my package, chemfp, from http://chemfp.com/ you can compute a sparse 
matrix for 1M x 1M fingerprints in about an hour using a laptop or desktop.

The question would then be how to adapt the sparse output format from chemfp to
the sparse input format for your clustering method of choice.

Best regards,

Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-23 Thread Andrew Dalke
On Aug 23, 2015, at 6:38 PM, Jing Lu wrote:
 I hope the memory issue won't be a problem.

That's up to you and your choice of threshold.

  Most AgglomerativeClustering algorithms have time complexity with N^2. Will 
 that be a problem?

You have to decide for yourself what counts as a problem. If you want to get
it done in 1 minute with a threshold of 0.2, then you've got a problem. If
you're willing to take a month, then there's no problem.

With chemfp, Taylor-Butina clustering (see
http://chemfp.readthedocs.org/en/latest/using-api.html#taylor-butina-clustering )
took 35 seconds for 100,000 fingerprints. The NxN calculation is also N^2 time, so
it should only take about an hour for 1 million fingerprints.
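
For readers who have not seen it, here is a generic sketch of the greedy Taylor-Butina idea working from a thresholded neighbor list. It is not chemfp's implementation; the function name and input format are made up for illustration, and real implementations differ in details such as tie-breaking and the handling of singletons.

```python
def butina_clusters(n_items, neighbor_pairs):
    """Greedy Taylor-Butina style clustering from a thresholded neighbor list.

    neighbor_pairs: iterable of (i, j) index pairs whose similarity is at or
    above the chosen threshold. Returns a list of clusters; the first element
    of each cluster is its centroid.
    """
    neighbors = [set() for _ in range(n_items)]
    for i, j in neighbor_pairs:
        if i != j:
            neighbors[i].add(j)
            neighbors[j].add(i)

    # Visit candidates from most to fewest neighbors; ties broken by index.
    order = sorted(range(n_items), key=lambda i: (-len(neighbors[i]), i))
    assigned = [False] * n_items
    clusters = []
    for centroid in order:
        if assigned[centroid]:
            continue
        # The centroid claims every neighbor not already in another cluster.
        members = [centroid] + sorted(
            j for j in neighbors[centroid] if not assigned[j]
        )
        for member in members:
            assigned[member] = True
        clusters.append(members)
    return clusters

print(butina_clusters(6, [(0, 1), (0, 2), (0, 3), (3, 4)]))
# [[0, 1, 2, 3], [4], [5]]
```

Item 3 has two neighbors but is claimed by centroid 0 first, which leaves 4 as a singleton. The clustering step itself only needs the neighbor lists, so the expensive part is the NxN similarity search that produces them.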

Best of course is to start with a smaller system first, see if it works, and 
only then try to scale up. Then you'll have experience of which methods are 
appropriate and what your time constraints are.
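
One way to use a small trial run is to extrapolate its timing under the N^2 assumption. The helper below is hypothetical; the numbers are the 35 seconds for 100,000 fingerprints quoted above.

```python
def extrapolate_quadratic(seconds_small, n_small, n_large):
    """Estimate run time at n_large from a timing at n_small, assuming O(N^2)."""
    return seconds_small * (n_large / n_small) ** 2

estimate = extrapolate_quadratic(35.0, 100_000, 1_000_000)
print(f"{estimate:.0f} s, about {estimate / 60:.0f} minutes")
# 3500 s, about 58 minutes
```

Ten times more fingerprints means a hundred times more pairs, hence the "about an hour" figure. The actual time also depends on the threshold and the hardware, so this is only a rough guide.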


Andrew
da...@dalkescientific.com





Re: [Rdkit-discuss] Clustering 1M molecules

2015-08-22 Thread Taka Seri
Dear Jing,

How about trying bayon?
https://code.google.com/p/bayon/
It's not part of RDKit, but I think the library can cluster molecules
using ECFP4.

Unfortunately, bayon's input file format is not a distance matrix, but the
format is easy to prepare.

Best regards.

Takayuki


On Sun, Aug 23, 2015 at 12:03, Jing Lu ajin...@gmail.com wrote:

 Currently, I prefer fingerprint based clustering, because it's hard to set
 the cutoff for scaffold based clustering. Does RDKit have scaffold based
 clustering?

 On Sat, Aug 22, 2015 at 10:56 PM, abhik1...@gmail.com wrote:

 Hi, how about scaffold based clustering . You extract the scaffolds and
 then cluster it and then put the respective scaffold compounds inside the
 cluster .

 Sent from my iPhone

  On Aug 22, 2015, at 8:43 PM, Jing Lu ajin...@gmail.com wrote:
 
  Dear RDKit users,
 
  If I want to cluster more than 1M molecules by ECFP4. How could I do
 it? If I calculate the distance between every pair of molecules, the size
 of distance matrix will be too big. Does RDKit support any heuristic
 clustering algorithm without calculating the distance matrix of the whole
 library?
 
 
 
  Thanks,
  Jing
 