Re: [Rdkit-discuss] uppercase/lowercase with RDKIT

Greg Landrum Mon, 31 May 2010 21:05:13 -0700

Hi Cedric,

In the example that you include below, I don't see any use of the
RDKit at all, so I'm confused about what you're trying to do.


If you'd like to identify duplicates in a file without using a
SmilesMolSupplier to read the file, you could do something like the
following (untested):

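# ------------------
from rdkit import Chem

uniq = {}  # canonical SMILES -> index of its first occurrence
nStructs = 0
with open('in.smi') as inF, \
     open('uniq.smi', 'w') as uniqF, \
     open('dupe.smi', 'w') as dupeF:
    for line in inF:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        m = Chem.MolFromSmiles(line)
        if not m:
            continue  # skip lines the RDKit cannot parse
        smi = Chem.MolToSmiles(m, True)  # canonical, isomeric SMILES
        if smi in uniq:
            # duplicate: index of first occurrence, canonical SMILES, input line
            print(uniq[smi], smi, line, file=dupeF)
        else:
            nStructs += 1
            uniq[smi] = nStructs
            print(nStructs, smi, file=uniqF)
# ------------------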
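
To show why comparing canonical SMILES rather than the raw text matters
for the uppercase/lowercase question, here is a minimal Python 3 sketch of
the same bookkeeping wrapped in a function. The names rdkit_canonical and
split_duplicates are illustrative and not part of the RDKit; the only
RDKit calls are Chem.MolFromSmiles and Chem.MolToSmiles. It assumes the
SMILES is the first whitespace-separated field on each line.

```python
def rdkit_canonical(smiles):
    """Return the RDKit canonical isomeric SMILES, or None if parsing fails."""
    # Imported lazily so the grouping logic below can be used without the RDKit.
    from rdkit import Chem
    mol = Chem.MolFromSmiles(smiles)
    if mol is None:
        return None
    return Chem.MolToSmiles(mol, isomericSmiles=True)


def split_duplicates(lines, canonicalize=rdkit_canonical):
    """Split SMILES lines into (unique, duplicates).

    unique:     list of (index, canonical_smiles)
    duplicates: list of (index_of_first_occurrence, canonical_smiles, input_line)
    """
    ids = {}  # canonical SMILES -> index of its first occurrence
    unique, dupes = [], []
    for line in lines:
        fields = line.split()
        if not fields:
            continue  # blank line
        smi = canonicalize(fields[0])  # assumption: SMILES is the first field
        if smi is None:
            continue  # unparsable SMILES
        if smi in ids:
            dupes.append((ids[smi], smi, line.strip()))
        else:
            ids[smi] = len(ids) + 1
            unique.append((ids[smi], smi))
    return unique, dupes
```

With the default canonicalizer, calling
split_duplicates(['C1CCCCC1', 'c1ccccc1', 'C1=CC=CC=C1']) should keep
cyclohexane and benzene as two separate entries and report the Kekule form
as a duplicate of benzene, since case carries chemical meaning (aromatic
vs. aliphatic) in SMILES.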

Best regards,
-greg

On Mon, May 31, 2010 at 5:57 PM, Cedric MORETTI
<[email protected]> wrote:
> Hello all,
>
> I am writing because I have a problem with RDKit.
>
> I am trying to remove duplicates from an "SMI" file and to put a numerical
> code in front of each duplicate.
> The first output file should contain the structures without duplicates,
> each preceded by its code (1, CCBC; 2, CCCCCC; 3, CCCCCCCCC; etc.).
> The second file should contain the duplicates, each preceded by the code
> used in the first file (1, CCBC; 1, CCBC; 1, CCBC; 2, CCCCCC; 2, CCCCCC; etc.).
> My problem is that RDKit does not distinguish between:
> C1CCCCC1
> and
> c1cccccc1
> I would like an output file such as:
>  (1, C1CCCCC1; 2, c1cccccc1; etc.)
> Thank you
>
> My code:
>
> # script that takes a .smi file, removes the duplicates, and
> # saves the result to another .smi file
>
> print "hello from RD_remove_duplicate"
>
> from sys import *
>
> from cinfony import rdk
> from rdkit import Chem
>
> # Dictionary storing the canonical codes seen so far
> codes = {}
>
> # Count of total number of structures found
> numStructures = 0
> # Count of duplicate structures found
> numDuplicates = 0
>
> suppl = open("C:\Data\etudecycle/etudecyclebdzei.smi","r")
> output_file = "C:\Data\etudecycle/etudecyclebdzeiv2.smi"
> writer = open(output_file,'w')
> output_filev2 = "C:\Data\etudecycle/etudecyclebdzeiv3.smi"
> wd = open(output_filev2,'w')
>
> # Read the first SMI file
>
> i = 0
> a = 0
> while 1:
>     bdsmi = suppl.readline()
>     if not bdsmi:
>         break
>     pass
>
>     # Check for a duplicate
>     if codes.has_key(bdsmi):
>        numDuplicates += 1
>
>        wd.write(str(a))
>        wd.write(str(","))
>        wd.write(bdsmi)
>
>     else:
>        # Store it in the dictionary so that we can check for duplicates
>        codes[bdsmi] = True
>        # Write the structure
>        a +=1
>        writer.write(str(a))
>        writer.write(str(","))
>        writer.write(bdsmi)
>        numStructures += 1
>        i +=1
>        # count the compounds
>        if i == int((i/1000)*1000):
>           print i
>
> print " initials numbers= " + str(numDuplicates+ numStructures)
> print " duplicates  numbers = " + str(numDuplicates)
> print " final  numbers = " + str(numStructures)
>
>
>
> ------------------------------------------------------------------------------
>
>
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss
>
>
