Ju Hui wrote: > I want to remove about 50000 elements from a list,which has 100000 > elements. > sample code like below: > >>>> a=range(10) >>>> b=range(4) >>>> for x in b: > ... a.remove(x) > ... >>>> a > [4, 5, 6, 7, 8, 9] > > when a and b is small size, it will finished quickly, but when a and b > have many elements. > such as: >>>> a=range(100000) >>>> b=range(50000) >>>> for x in b: > ... a.remove(x) > ... > it will very slowly. Shall I change to another data structure and choos > a better arithmetic? > any suggestion is welcome. > thanks a lot! > A list isn't going to respond very well to random removals of large numbers of elements like this. Remember that it must do linear search to find the value to be removed and then basically build a completely new list with the remaining values. Depending on the data in question, you might be able to leverage things. Are the two lists sorted? If so you can iterate over both of them and build a third list with the results.
This is not particularly 'elegant' but it is fast for sorted lists: import time a=range(100000) b=range(50000) start_time=time.time() for x in b: a.remove(x) stop_time=time.time() print "brute force elapsed time=%.2f seconds" % (stop_time-start_time) start_time=time.time() n=[] i1=0 i2=0 while 1: try: v1=a[i1] except IndexError: break try: v2=b[i2] except IndexError: n.extend(a[i1:]) break if v1 < v2: n.append(v1) i1+=1 continue elif v1 > v2: i2+=1 continue else: i1+=1 i2+=1 continue stop_time=time.time() print "new elapsed time=%.2f seconds" % (stop_time-start_time) start_time=time.time() a=set(a) b=set(b) a.symmetric_difference(b) stop_time=time.time() print "sets elapsed time=%.2f seconds" % (stop_time-start_time) brute force elapsed time=4.13 seconds new elapsed time=0.05 seconds sets elapsed time=0.03 seconds There may be other ways using bisect or some other module. If data is random, unsorted or has duplicates, I would look at in-memory database like SQLlite instead. If the values are unique you can do something like: a=set(range(100000)) b=set(range(50000)) a.symmetric_difference(b) Which is the fastest way. It all depends on the "real" data which we can't tell from your test using range(). -Larry Bates -- http://mail.python.org/mailman/listinfo/python-list