On Fri, 17 Dec 2004 14:22:34 +0000, rumours say that [EMAIL PROTECTED] might have written:
sf:

>sf wrote:
>> The point is that when you have 100,000s of records, this grep
>> becomes really slow?
>
>There are performance bugs with current versions of grep
>and multibyte characters that are only getting addressed now.
>To work around these do `export LANG=C` first.

You should also use the -F flag that Pádraig suggests, since you don't
have regular expressions in the B file.

>In my experience grep is not scalable since it's O(n^2).
>See below (note A and B are randomized versions of
>/usr/share/dict/words (and therefore worst case for the
>sort method)).
>
>$ wc -l A B
>  45427 A
>  45427 B
>
>$ export LANG=C
>
>$ time grep -Fvf B A
>real    0m0.437s
>
>$ time sort A B B | uniq -u
>real    0m0.262s
>
>$ rpm -q grep coreutils
>grep-2.5.1-16.1
>coreutils-4.5.3-19

sf, you'd better run your own benchmarks on your machine (there is
quick sample code in other posts of mine and Pádraig's), since on my
test machine the numbers are the reverse of Pádraig's (grep takes half
the time).

Package versions (on SuSE 9.1 64-bit):

$ rpm -q grep coreutils
grep-2.5.1-427
coreutils-5.2.1-21

Language:

$ echo $LANG
en_US.UTF-8

Caution: the two solutions are interchangeable only as long as you
don't have duplicate lines in the A file; a line that occurs twice in
A is dropped entirely by `uniq -u`. If you do have duplicates, use the
grep version (small demos at the end of this message).
--
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
--
http://mail.python.org/mailman/listinfo/python-list
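
In case the `sort A B B | uniq -u` trick isn't obvious: B is listed
twice so that every line of B occurs at least twice in the
concatenation, and `uniq -u` prints only the lines that occur exactly
once, i.e. the lines of A that are not in B. A toy run (file names and
contents made up for illustration):

$ printf '%s\n' apple banana cherry > A
$ printf '%s\n' banana date > B
$ sort A B B | uniq -u
apple
cherry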
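
The duplicate-line caveat, demonstrated with equally made-up data: a
line repeated inside A itself also occurs more than once in the
concatenation, so `uniq -u` silently eats it, while grep keeps both
copies:

$ printf '%s\n' apple apple cherry > A
$ printf '%s\n' banana > B
$ sort A B B | uniq -u
cherry
$ grep -Fvf B A
apple
apple
cherry

Note also that without -x, grep -F treats each line of B as a
substring pattern, so a line of A that merely *contains* a line of B
is thrown away too; add -x if you want whole-line matches only.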
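
And if you want to reproduce the timings yourself, a rough harness
along these lines should do (this is a sketch, not the exact code from
the other posts; the perl one-liner is just one convenient way to
shuffle the dictionary, substitute whatever you like):

#!/bin/bash
# Worst case for the sort method: two shuffled copies of the dictionary.
perl -MList::Util=shuffle -e 'print shuffle(<>)' /usr/share/dict/words > A
perl -MList::Util=shuffle -e 'print shuffle(<>)' /usr/share/dict/words > B

export LANG=C    # work around the multibyte slowdown in grep

time grep -Fvf B A > /dev/null
time sort A B B | uniq -u > /dev/null

Output is redirected to /dev/null so that terminal I/O doesn't pollute
the timings.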