[Rdkit-discuss] Does rdkit depend on pandas?

2017-06-06 Thread Michał Nowotka
Hi,

I just upgraded rdkit from 2017.03.1 to 2017.03.2 using Conda, and I
noticed that pandas is now installed along with rdkit.
Does rdkit depend on pandas now? Is it safe to remove it? If rdkit works
without pandas, maybe it makes sense to remove the dependency.

Kind regards,

Michał Nowotka

--
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
___
Rdkit-discuss mailing list
Rdkit-discuss@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/rdkit-discuss


Re: [Rdkit-discuss] Does rdkit depend on pandas?

2017-06-06 Thread Brian Kelley
No.  The main reason that the conda recipe includes pandas is for testing
the pandas extension.  We could probably drop it from the run-time
dependencies, however, and let users install it separately.

In any case, feel free to remove pandas from the conda installation.

Cheers,
 Brian
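
For anyone who wants to check this locally, here is a minimal sketch of the
usual optional-dependency pattern. It only tests whether pandas can be found;
the helper name module_available is made up for illustration and is not part
of rdkit.

```python
import importlib.util


def module_available(name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(name) is not None


# pandas is treated as an optional extra: core code keeps working without it.
HAVE_PANDAS = module_available("pandas")

if HAVE_PANDAS:
    print("pandas found: the pandas extension can be used")
else:
    print("pandas not found: core functionality should be unaffected")
```

Code that needs the pandas extension (rdkit.Chem.PandasTools) can then be
guarded with HAVE_PANDAS instead of failing at import time.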



Re: [Rdkit-discuss] Clustering

2017-06-06 Thread Abhik Seal
Hello all,

How about doing some dimensionality reduction first, using PCA or t-SNE, and
then clustering on a selected set of top components (say the top 20)? I think
the clustering would then be fast.

Thanks
Abhik
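
A minimal sketch of the PCA step, assuming the fingerprints have already been
unpacked into a dense 0/1 NumPy array (the random array below is toy data, and
pca_reduce is an illustrative name). Note that distances between the reduced
vectors are Euclidean, so the resulting clusters need not match those obtained
with a Tanimoto threshold.

```python
import numpy as np


def pca_reduce(X: np.ndarray, n_components: int = 20) -> np.ndarray:
    """Project the rows of X onto their top principal components."""
    Xc = X - X.mean(axis=0)  # centre each column
    # Rows of Vt are the principal directions, ordered by decreasing variance.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


rng = np.random.default_rng(0)
fingerprints = rng.integers(0, 2, size=(100, 256)).astype(float)  # toy data
reduced = pca_reduce(fingerprints, n_components=20)
print(reduced.shape)  # (100, 20)
```

A full SVD like this is only practical for modest data sets; for millions of
compounds an incremental or randomized PCA would be needed.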

On Mon, Jun 5, 2017 at 6:11 AM David Cosgrove 
wrote:

> Hi,
> I have used this algorithm for many years clustering sets of several
> millions of compounds.  Indeed, I am old enough to know it as the Taylor
> algorithm.  It is slow but reliable.  A crucial setting is the similarity
> threshold for the clusters, which dictates the size of the neighbour lists
> and hence the amount of RAM required.  It also, of course, determines the
> quality of the clusters.  My implementation is at
> https://github.com/OpenEye-Contrib/Flush.git.  This repo has a number of
> relevant programs; the one you want is called cluster.  I have just
> confirmed that it compiles on Ubuntu 16.  It needs the fingerprints as
> ASCII bitstrings.  I don't have code for turning RDKit fingerprints into
> this format, but I would imagine it's quite straightforward.  The program
> runs in parallel using OpenMPI.  That's valuable for two reasons.  One is
> speed, but the more important one is memory use.  If you can spread the
> slave processes over several machines you can cluster much larger sets of
> molecules, as you are effectively expanding the RAM of the machine.  When I
> wrote the original, 64MB was a lot of RAM.  It is less of an issue these
> days, but it still matters when clustering millions of fingerprints.  Note
> that the program cluster never stores the distance matrix, just the lists
> of neighbours for each molecule within the threshold.  This reduces the
> memory footprint substantially if you have a tight enough cluster threshold.
> HTH,
> Dave
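
The neighbour-list idea described above can be sketched in a few lines of pure
Python. This is not the code from the Flush repository, just a simplified
illustration that ranks molecules once by neighbour count; the function names
are made up, and the all-pairs loop is far too slow for millions of compounds.
It takes fingerprints as ASCII bitstrings (RDKit's ExplicitBitVect has a
ToBitString() method that should produce this kind of string).

```python
def tanimoto(a: str, b: str) -> float:
    """Tanimoto similarity of two equal-length ASCII bitstrings."""
    both = sum(1 for x, y in zip(a, b) if x == "1" and y == "1")
    either = a.count("1") + b.count("1") - both
    return both / either if either else 0.0


def neighbour_lists(fps, threshold):
    """For each fingerprint, list the indices of all others within threshold."""
    nbrs = [[] for _ in fps]
    for i in range(len(fps)):
        for j in range(i):
            if tanimoto(fps[i], fps[j]) >= threshold:
                nbrs[i].append(j)
                nbrs[j].append(i)
    return nbrs


def taylor_butina(fps, threshold):
    """Greedy clustering: the molecule with most neighbours seeds a cluster."""
    nbrs = neighbour_lists(fps, threshold)
    # Only the neighbour lists are kept, never the full distance matrix.
    order = sorted(range(len(fps)), key=lambda i: (-len(nbrs[i]), i))
    assigned = set()
    clusters = []
    for i in order:
        if i in assigned:
            continue
        members = [i] + [j for j in nbrs[i] if j not in assigned]
        assigned.update(members)
        clusters.append(members)  # first member is the centroid
    return clusters


fps = ["1110", "1100", "1111", "0001"]
print(taylor_butina(fps, threshold=0.6))  # [[0, 1, 2], [3]]
```

A tighter threshold shortens every neighbour list, which is where the memory
saving mentioned above comes from.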
>
>
>
> On Mon, Jun 5, 2017 at 11:22 AM, Nils Weskamp 
> wrote:
>
>> Hi Michal,
>>
>> I have done this a couple of times for compound sets up to 10M+ using a
>> simplified variant of the Taylor-Butina algorithm. The overall run time
>> was in the range of hours to a few days (which could probably be
>> optimized, but was fast enough for me).
>>
>> As you correctly mentioned, getting the (sparse) similarity matrix is
>> fairly simple (and can be done in parallel on a cluster). Unfortunately,
>> this matrix gets very large (even the sparse version). Most clustering
>> algorithms require random access to the matrix, so you have to keep it
>> in main memory (which then has to be huge) or calculate it on-the-fly
>> (takes forever).
>>
>> My implementation (in C++, not sure if I can share it) assumes that the
>> similarity matrix has been pre-calculated and is stored in one (or
>> multiple) files. It reads these files sequentially, and whenever a
>> compound pair with a similarity above the threshold is found, it checks
>> whether one of the compounds is already a centroid (in which case the
>> other is assigned to it). Otherwise, one of the compounds is randomly
>> chosen as centroid and the other is assigned to it.
>>
>> This procedure is highly order-dependent and thus not optimal, but has
>> to read the whole similarity matrix only once and has limited memory
>> consumption (you only need to keep a list of centroids). If you still
>> run into memory issues, you can start by clustering with a high
>> similarity threshold and then re-cluster centroids and singletons on a
>> lower threshold level.
>>
>> I also played around with DBSCAN for large compound databases, but (as
>> previously mentioned by Samo) found it difficult to find the right
>> parameters and ended up with a single huge cluster covering 90 percent
>> of the database in many cases.
>>
>> Hope this helps,
>> Nils
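
A rough Python sketch of the single-pass procedure described above, under
stated assumptions: the C++ original is not available, so the input format
(an iterable of (compound_a, compound_b, similarity) tuples) and the function
name are invented here. The handling of compounds that are already assigned
is also a guess: this version simply leaves existing assignments untouched,
and it keeps a full assignment table rather than only the list of centroids.

```python
import random


def single_pass_cluster(pairs, threshold, rng=None):
    """Assign compounds to centroids while streaming over similarity pairs."""
    rng = rng or random.Random(0)
    centroid_of = {}  # compound -> its centroid (a centroid maps to itself)
    for a, b, sim in pairs:
        if sim < threshold:
            continue
        a_is_centroid = centroid_of.get(a) == a
        b_is_centroid = centroid_of.get(b) == b
        if a_is_centroid and b not in centroid_of:
            centroid_of[b] = a
        elif b_is_centroid and a not in centroid_of:
            centroid_of[a] = b
        elif a not in centroid_of and b not in centroid_of:
            # Neither compound has been seen: pick one at random as centroid.
            centre, other = (a, b) if rng.random() < 0.5 else (b, a)
            centroid_of[centre] = centre
            centroid_of[other] = centre
        # Otherwise at least one compound is already assigned: skip the pair.
    return centroid_of


pairs = [("A", "B", 0.9), ("B", "C", 0.8), ("A", "C", 0.2), ("D", "E", 0.95)]
for compound, centroid in sorted(single_pass_cluster(pairs, 0.7).items()):
    print(compound, "->", centroid)
```

Compounds that never appear in the result are singletons. As noted above, the
outcome depends on the order in which the pairs are read.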
>>
>> On 05.06.2017 at 11:02, Michał Nowotka wrote:
>> > Is there anyone who has actually done this: clustered >2M compounds
>> > using any well-known clustering algorithm and is willing to share code
>> > and some performance statistics?
>>
>>
>>
>>
>
>
>
> --
> David Cosgrove
> Freelance computational chemistry and chemoinformatics developer
> http://cozchemix.co.uk
>
>
>
-- 

Cheers,
Abhik Seal  Ph.D. (Cheminformatics)

Re: [Rdkit-discuss] RDkit Molecule Fragmenter

2017-06-06 Thread Greg Landrum
On Tue, Jun 6, 2017 at 1:43 PM, Popov, Maxim (Ext)  wrote:

>
> I have discovered a very useful tool in KNIME, Molecule Fragmenter by
> RDKit, but can't find a corresponding class or function outside of KNIME.
> Can I use the Fragmenter without KNIME?
>
Yep:
http://www.rdkit.org/docs/GettingStartedInPython.html#molecular-fragments

There's also:
http://www.rdkit.org/docs/GettingStartedInPython.html#brics-implementation
and
http://www.rdkit.org/docs/GettingStartedInPython.html#recap-implementation
which use different strategies.

-greg
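
As a quick illustration of the BRICS and RECAP routes linked above, here is a
minimal sketch. The helper name fragment_smiles and the example SMILES are
illustrative only, and the try/except around the import is there just so the
file still loads on a machine without RDKit.

```python
try:
    from rdkit import Chem
    from rdkit.Chem import BRICS, Recap
except ImportError:  # lets the file load on a machine without RDKit
    Chem = None


def fragment_smiles(smiles: str) -> dict:
    """Return BRICS and RECAP fragment SMILES for one molecule."""
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        raise ValueError(f"could not parse SMILES: {smiles!r}")
    # BRICSDecompose returns a set of fragment SMILES with labelled dummies.
    brics = sorted(BRICS.BRICSDecompose(mol))
    # RecapDecompose returns a hierarchy; its leaves are the final fragments.
    recap = sorted(Recap.RecapDecompose(mol).GetLeaves().keys())
    return {"brics": brics, "recap": recap}


if Chem is not None:
    for scheme, frags in fragment_smiles("CCCOCCC(=O)c1ccccc1").items():
        print(scheme, frags)
```

The two schemes cut different bond types, so the fragment lists generally
differ for the same molecule.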


Re: [Rdkit-discuss] Rdkit-discuss Digest, Vol 116, Issue 14

2017-06-06 Thread Schwarze, Manuel
Hi Maxim,

The KNIME node "RDKit Molecule Fragmenter" is indeed quite useful. Since what
it does in KNIME is built on RDKit functionality, you can do the same outside
of KNIME. Usually there is no single method to call. Instead, a code block
performs several operations, some of them controlled by the parameters that
can be configured in the KNIME node. None of it is complicated once you are
familiar with the RDKit and its classes, types and functionality.

Since the RDKit Nodes for KNIME are open source, you can find the source code
of the RDKit Molecule Fragmenter implementation on GitHub:
https://github.com/rdkit/knime-rdkit/blob/master/org.rdkit.knime.nodes/src/org/rdkit/knime/nodes/molfragmenter/RDKitMolFragmenterNodeModel.java
(lines 285-356 are the core algorithm)

This is Java code, but as the Java API of RDKit is a subset of what is
available in C++, you will be able to rewrite the same logic in Python or C++.

Best regards, Mit freundlichen Grüssen, Meilleures salutations, 
Manuel Schwarze
Senior Principal Software Engineer (KNIME, IJC, CSF, CIx)
 T: +41 61 3245330
M: +41 79 7470324
manuel.schwa...@novartis.com

Novartis Pharma AG
NIBR Informatics (NX) - IS SIGMA 
Novartis Campus, WSJ-310.5.18
CH-4002 Basel
Switzerland 




[Rdkit-discuss] RDkit Molecule Fragmenter

2017-06-06 Thread Popov, Maxim (Ext)
Dear All,

I have discovered a very useful tool in KNIME, Molecule Fragmenter by RDKit,
but can't find a corresponding class or function outside of KNIME. Can I use
the Fragmenter without KNIME?

Thanks!

Maxim


Re: [Rdkit-discuss] RDkit Molecule Fragmenter

2017-06-06 Thread Jan Halborg Jensen
I was also searching for this functionality earlier.

For what it's worth, here's some *very* simple code I hacked together to do
fragmentation.  The focus is aromatic heterocycles, but it could be made more
general by, for example, changing '[c,n]-[*]' to '[R]-[*]' and using ring =
Chem.MolFromSmarts('[R]') instead of '[n]'.

Not pretty, but it worked for me.

Best regards, Jan


import os
import re

from rdkit import Chem
from rdkit.Chem import Draw
from rdkit.Chem.Draw import IPythonConsole  # enables notebook rendering

rings_mol = []
rings_smiles = []

substituent_mol = []
substituent_smiles = []

smiles_file_name = "/Users/jan/Dropbox/Lundbeck/big.smiles"

# Single bonds from an aromatic C or N to any neighbour are the cut points.
bond_pattern = Chem.MolFromSmarts('[c,n]-[*]')
# Fragments containing an aromatic nitrogen are counted as rings.
ring = Chem.MolFromSmarts('[n]')

# Each input line is expected to hold a name followed by a SMILES string.
with open(smiles_file_name, "r") as smiles_file:
    for line in smiles_file:
        words = line.split()
        if len(words) < 2:
            continue  # skip blank or malformed lines
        name = words[0]
        smiles = words[1]

        mol = Chem.MolFromSmiles(smiles)
        if mol is None:
            continue  # skip SMILES that RDKit cannot parse

        bis = mol.GetSubstructMatches(bond_pattern)
        bs = [mol.GetBondBetweenAtoms(x, y).GetIdx() for x, y in bis]

        if len(bs) == 0:
            # Nothing to cut: store the whole molecule as a ring.
            if smiles not in rings_smiles:
                rings_smiles.append(smiles)
                rings_mol.append(Chem.MolFromSmiles(smiles))
            continue

        fragments_mol = Chem.FragmentOnBonds(mol, bs, addDummies=True)

        big_fragment = Chem.MolToSmiles(fragments_mol, True)

        # Replace numbered dummy atoms such as [5*] with a plain [*].
        big_fragment = re.sub(r'\[\d+\*\]', r'[*]', big_fragment)

        fragments = big_fragment.split(".")

        for fragment in fragments:
            if Chem.MolFromSmiles(fragment).HasSubstructMatch(ring):
                if fragment not in rings_smiles:
                    rings_mol.append(Chem.MolFromSmiles(fragment))
                    rings_smiles.append(fragment)
            else:
                if fragment not in substituent_smiles:
                    substituent_mol.append(Chem.MolFromSmiles(fragment))
                    substituent_smiles.append(fragment)

img = Draw.MolsToGridImage(rings_mol, molsPerRow=4, subImgSize=(200, 200),
                           useSVG=True)

# In a notebook, img is an SVG object whose .data attribute holds the text.
svg_file_name = "/Users/jan/Dropbox/Lundbeck/rings.svg"
with open(svg_file_name, 'w') as svg_file:
    svg_file.write(img.data)
# Note: 'sed -i ""' is the macOS/BSD form of in-place editing.
os.system('sed -i "" "s/xmlns:svg/xmlns/" ' + svg_file_name)

img

