On Sunday 02 February 2003 12:46 am, magnet wrote:
> I have a large text file containing thousands of URLs, one per line, and
> am trying to find a suitable utility that will strip out identical lines
> and leave a condensed file. Can anyone suggest a good solution?
> Thanks :)
---------------------------------------------------------------------------
#!/usr/bin/env python
import sys, os

if len(sys.argv) < 3:
    print "Usage: ./duprem infile outfile"
    sys.exit(1)

HOME = os.path.expanduser("~")
infile = sys.argv[1]
outfile = sys.argv[2]

def userhome(filename):
    # Leave the path alone if it already points into the home
    # directory; otherwise treat it as relative to the home directory.
    if filename.startswith(HOME):
        return filename
    else:
        return os.path.join(HOME, filename)

infile = userhome(infile)
outfile = userhome(outfile)

if not os.path.exists(infile):
    print "input file " + infile + " does not exist"
    sys.exit(2)

input = open(infile, "r")
output = open(outfile, "w")

G = []                          # lines seen so far, in original order
g = input.readline()
while len(g) > 0:
    if g in G:
        print "duplicate " + g.strip() + " removed"
    else:
        G.append(g)
    g = input.readline()
input.close()

for x in G:
    output.write(x)
output.close()
print "complete"
-----------------------------------------------------------------
Well, put everything between the dashed lines into a text file called duprem in
your user space, make it executable with chmod a+x duprem, and then call it like this:
./duprem (fileofurlswithduplicates) (outputfilecleanedofdups)
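
For what it's worth, if you don't need the home-directory handling, the core
job can be done in a few lines. This is just a sketch (the filenames come
straight from the command line, no checks); it uses a dictionary instead of a
list, so the duplicate lookup stays fast even with thousands of URLs:

#!/usr/bin/env python
import sys
seen = {}                         # URLs we have already written out
out = open(sys.argv[2], "w")
for line in open(sys.argv[1], "r").readlines():
    if line not in seen:          # first time we see this URL
        seen[line] = 1
        out.write(line)
out.close()

Same idea as the script above, but each duplicate check is a constant-time
dictionary lookup rather than a scan of the whole list, which matters once
the input gets into the thousands of lines.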
Civileme