On Fri, 17 Dec 2004 12:21:08 +0000, rumours say that [EMAIL PROTECTED] might have written:
[snip some damn lie aka "benchmark"]

[me]
>> (Yes, I cheated by adding the F (for no regular expressions) flag :)
>
>Also you only have 1000 entries in B!
>Try it again with all entries in B also ;-)
>Remember the original poster had 100K entries!

Well, that's the closest I can do:

$ py
Python 2.4c1 (#3, Nov 26 2004, 23:39:44)
[GCC 3.3.3 (SuSE Linux)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>> import sys; sys.ps1='.>>'
.>> alist=[line.strip() for line in open('/usr/share/dict/words')]
.>> words=set()
.>> for word in alist:
...     words.add(word + '\n')
...     words.add(word[::-1] + '\n')
...
.>> len(words)
90525
.>> words=list(words)
.>> open('/tmp/A', 'w').writelines(words)
.>> import random; random.shuffle(words)
.>> open('/tmp/B', 'w').writelines(words[:90000])
.>>
$ time sort A B B | uniq -u >/dev/null

real    0m2.408s
user    0m2.437s
sys     0m0.037s
$ time grep -Fvf B A >/dev/null

real    0m1.208s
user    0m1.161s
sys     0m0.035s

What now?-)

Mind you, I only replied in the first place because you wrote (my emphasis) "...here is *the* unix way..." and it's the bad days of the month (not mine, actually, but I suffer along...)

>>>>and finally destroys original line
>>>>order (should it be important).
>>>
>>>true
>>
>> That's our final agreement :)
>
>Note the order is trivial to restore with a
>"decorate-sort-undecorate" idiom.

Using python or unix tools (eg 'paste -d', 'sort -k', 'cut -d')? Because the python way has been already discussed by Friedrik, John and Tim, and the unix way gets overly complicated (aka non-trivial) if DSU is involved.

BTW, the following occurred to me:

[EMAIL PROTECTED]/tmp $ cat >A
aa
ss
dd
ff
gg
hh
jj
kk
ll
aa
[EMAIL PROTECTED]/tmp $ cat >B
ss
ff
hh
kk
[EMAIL PROTECTED]/tmp $ sort A B B | uniq -u
dd
gg
jj
ll
[EMAIL PROTECTED]/tmp $ grep -Fvf B A
aa
dd
gg
jj
ll
aa

Note that 'aa' is contained twice in the A file (to be filtered by B). So our methods do not produce the same output.
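
For anyone who wants to see why the two pipelines disagree, here is a rough Python sketch of what each one does with the small A and B files above. It is only an emulation under a few assumptions: the function names are made up for illustration, it needs a Python recent enough to have collections.Counter (so not the 2.4 used above), and sorted() orders by code point, whereas sort(1) follows the locale.

```python
from collections import Counter

def sort_uniq_u(a_lines, b_lines):
    """Emulate `sort A B B | uniq -u` (hypothetical helper name).

    uniq -u keeps only the lines that occur exactly once in the sorted
    input, so a line duplicated inside A is dropped even if B lacks it.
    """
    counts = Counter(a_lines)
    for line in b_lines:
        counts[line] += 2  # B appears twice on the command line
    return sorted(line for line, n in counts.items() if n == 1)

def grep_fv(a_lines, b_lines):
    """Emulate `grep -Fvf B A` (hypothetical helper name).

    -F treats each line of B as a fixed string and -v keeps the lines of A
    that contain none of them. Matching is by substring, as grep does
    without -x. Order and duplicates of A are preserved.
    """
    return [line for line in a_lines
            if not any(pattern in line for pattern in b_lines)]

A = ["aa", "ss", "dd", "ff", "gg", "hh", "jj", "kk", "ll", "aa"]
B = ["ss", "ff", "hh", "kk"]

print(sort_uniq_u(A, B))  # ['dd', 'gg', 'jj', 'll']
print(grep_fv(A, B))      # ['aa', 'dd', 'gg', 'jj', 'll', 'aa']
```

The naive any() loop is quadratic and only meant for tiny inputs like these. Since grep -F matches substrings, adding -x (whole-line match) would be the closer fit whenever a line of B could occur inside a longer line of A.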

As the OP wrote:

>Essentially, want to do efficient grep, i.e. from A remove those lines which
>are also present in file B.

grep is the unix way to go for both speed and correctness.

I would call this issue a dead horse.
-- 
TZOTZIOY, I speak England very best.
"Be strict when sending and tolerant when receiving." (from RFC1958)
I really should keep that in mind when talking with people, actually...
-- 
http://mail.python.org/mailman/listinfo/python-list