Hi Cedric,
In the example that you include below, I don't see any use of the
RDKit at all, so I'm confused about what you're trying to do.
If you'd like to identify duplicates in a file without using a
SmilesMolSupplier to read the file, you could do something like the
following (untested):
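# ------------------
from rdkit import Chem

uniq = {}      # canonical SMILES -> number given to its first occurrence
nStructs = 0
with open('in.smi') as inF, \
     open('uniq.smi', 'w') as uniqF, \
     open('dupe.smi', 'w') as dupeF:
    for line in inF:
        m = Chem.MolFromSmiles(line.strip())
        if not m:
            continue  # skip lines that do not parse as SMILES
        smi = Chem.MolToSmiles(m, True)  # True -> isomeric SMILES
        if smi in uniq:
            # duplicate: write the number of the first occurrence
            print(uniq[smi], smi, line.strip(), file=dupeF)
        else:
            nStructs += 1
            uniq[smi] = nStructs
            print(nStructs, smi, file=uniqF)
# ------------------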
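The same bookkeeping can also be pulled out into a small function so that the canonicalization step is pluggable. This is only a sketch: split_duplicates, canonicalize and toy_canonicalize are made-up names, not part of the RDKit, and the toy canonicalizer is a stand-in so that the example runs without the RDKit installed.

```python
def split_duplicates(lines, canonicalize):
    """Split SMILES lines into first occurrences and repeats.

    canonicalize maps a SMILES string to a canonical key, or returns
    None when the input cannot be parsed (such lines are skipped).
    """
    uniq = {}     # canonical key -> number of its first occurrence
    firsts = []   # (number, key)
    dupes = []    # (number of first occurrence, key, original text)
    for line in lines:
        text = line.strip()
        if not text:
            continue
        key = canonicalize(text)
        if key is None:
            continue
        if key in uniq:
            dupes.append((uniq[key], key, text))
        else:
            uniq[key] = len(uniq) + 1
            firsts.append((uniq[key], key))
    return firsts, dupes


# Toy stand-in for a real canonicalizer, for illustration only: it knows
# two alternative spellings of ethanol and passes everything else through.
def toy_canonicalize(smiles):
    return {"OCC": "CCO", "C(O)C": "CCO"}.get(smiles, smiles)


firsts, dupes = split_duplicates(
    ["CCO\n", "C1CCCCC1\n", "OCC\n", "c1ccccc1\n"], toy_canonicalize)
print(firsts)
print(dupes)
```

With the RDKit, canonicalize would wrap the Chem.MolFromSmiles / Chem.MolToSmiles pair used above and return None when parsing fails. Because the comparison is done on the canonical string, C1CCCCC1 and c1ccccc1 stay separate entries.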
Best regards,
-greg
On Mon, May 31, 2010 at 5:57 PM, Cedric MORETTI
<[email protected]> wrote:
> Hello all,
>
> I am writing because I have a concern with RDKIT ...
>
> I am trying to remove duplicates from an SMI file and put a numerical code
> in front of each entry.
> The first file should contain the structures without duplicates, each with
> its code in front (1, CCBC; 2 CCCCCC; 3CCCCCCCCC, etc ...)
> The second file should contain the duplicates, each with the code of the
> matching structure in the first file. (1, CCBC, 1, CCBC, 1, CCBC, 2, CCCCCC;
> 2 CCCCCC; etc ....)
> My problem is that RDKit does not distinguish between:
> C1CCCCC1
> and
> c1cccccc1
> so I would like to have an output file like:
> (1 C1CCCCC1; 2 c1cccccc1, etc ....)
> Thank you
>
> My code:
>
> # Script that, starting from an SMI file, removes the duplicates and
> # saves the result to another SMI file
>
> print "hello from RD_remove_duplicate"
> from sys import *
>
> from cinfony import rdk
> from rdkit import Chem
>
> # Dictionary storing the canonical codes seen so far
> codes = {}
>
> # Count of total number of structures found
> numStructures = 0
> # Count of duplicate structures found
> numDuplicates = 0
>
> suppl = open("C:\Data\etudecycle/etudecyclebdzei.smi","r")
> output_file = "C:\Data\etudecycle/etudecyclebdzeiv2.smi"
> writer = open(output_file,'w')
> output_filev2 = "C:\Data\etudecycle/etudecyclebdzeiv3.smi"
> wd = open(output_filev2,'w')
>
> # Read the first SMI file
> i = 0
> a = 0
> while 1:
>     bdsmi = suppl.readline()
>     if not bdsmi:
>         break
>     pass
>
>     # Check for a duplicate
>     if codes.has_key(bdsmi):
>         numDuplicates += 1
>         wd.write(str(a))
>         wd.write(str(","))
>         wd.write(bdsmi)
>     else:
>         # Store it in the dictionary so that we can check for duplicates
>         codes[bdsmi] = True
>         # Write the structure
>         a +=1
>         writer.write(str(a))
>         writer.write(str(","))
>         writer.write(bdsmi)
>         numStructures += 1
>     i +=1
>     # count the compounds
>     if i == int((i/1000)*1000):
>         print i
>
> print " initials numbers= " + str(numDuplicates+ numStructures)
> print " duplicates numbers = " + str(numDuplicates)
> print " final numbers = " + str(numStructures)
>
> _______________________________________________
> Rdkit-discuss mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/rdkit-discuss