Re: searching/grepping for words near each other
> That will work, but only if the search terms appear in document in
> the same order as they appear in the query. (This appears to be the
> case with the hipdig solution as well. Correct me if I'm wrong, of
> course.) The search terms I'm looking for could appear in the target
> document in any order. Perhaps I could have made that clearer. Okay,
> I *could* have made that clearer.

Tricky but doable.

    #!/usr/bin/perl
    # author: kevin d. clark

    use warnings;
    use strict;

    my %seen;
    my $r;

    undef $/;
    while (defined($_=<>)) {

      undef $^N;

      print "Filename: $ARGV\n$&\n"
        if (/
              ((??{ $r = "";
                    $seen{$^N}++ if (defined($^N));
                    $r .= "weapons|" if (!defined($seen{weapons}));
                    $r .= "mass|" if (!defined($seen{mass}));
                    $r .= "distraction" if (!defined($seen{distraction}));
                    $r;
                 }))

              .{0,100}?     # 0-100 characters of any random cruft

              ((??{ $r = "";
                    $seen{$^N}++ if (defined($^N));
                    $r .= "weapons|" if (!defined($seen{weapons}));
                    $r .= "mass|" if (!defined($seen{mass}));
                    $r .= "distraction" if (!defined($seen{distraction}));
                    $r;
                 }))

              .{0,100}?     # 0-100 characters of any random cruft

              ((??{ $r = "";
                    $seen{$^N}++ if (defined($^N));
                    $r .= "weapons|" if (!defined($seen{weapons}));
                    $r .= "mass|" if (!defined($seen{mass}));
                    $r .= "distraction" if (!defined($seen{distraction}));
                    $r;
                 }))
            /xs)
    }

    __END__

I could generalize this and all, but I am busy.

Basically, the gist of this code is that it generates the regexp to match while the regexp engine is doing the matching.

Just another Perl hacker,

--kevin
-- 
GnuPG ID: B280F24E      God, I loved that Pontiac.
alumni.unh.edu!kdc      -- Tom Waits
http://kdc-blog.blogspot.com/
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/
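One way the "generalize this" step might look, sketched with a different and simpler technique than the run-time regexp generation above: build the regexp once, up front, as an alternation of every ordering of the terms. The helper names permutations and any_order_regexp are made up for this sketch, and the number of alternatives grows factorially, so it only suits a handful of terms.

```perl
use strict;
use warnings;

# Return every ordering of the given list as array references.
# (Hypothetical helper; fine for three or four terms, not for twenty.)
sub permutations {
    my @items = @_;
    return ([]) unless @items;
    my @result;
    for my $i (0 .. $#items) {
        my @rest = @items;
        my ($head) = splice(@rest, $i, 1);
        push @result, map { [ $head, @$_ ] } permutations(@rest);
    }
    return @result;
}

# Build one regexp that accepts the terms in any order, with at most
# $gap characters (newlines included) between neighbouring terms.
sub any_order_regexp {
    my ($gap, @terms) = @_;
    my @alternatives;
    for my $order (permutations(@terms)) {
        push @alternatives,
            join(".{0,$gap}?", map { quotemeta($_) } @$order);
    }
    my $pattern = join('|', @alternatives);
    return qr/(?:$pattern)/s;
}

my $re   = any_order_regexp(100, qw(weapons mass distraction));
my $text = "UFOs are a distraction for people who enjoy buying energy for "
         . "low-power laser weapons. Taco shells are ETs' preferred food. "
         . "Because of their low mass, they can be carried into orbit.";

print "matched: $1\n" if $text =~ /($re)/;
```

Because every alternative names all of the terms, input such as "distraction FOO mass BAR distraction" cannot match. The terms are matched as plain substrings here; wrapping each one in \b would restrict them to whole words.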
Re: searching/grepping for words near each other
On Fri, May 1, 2009 at 12:20 AM, Dan Jenkins d...@rastech.com wrote:
> I believe this is often called a proximity search.

These days, this would be a job for a search engine. E.g., for Perl:

http://search.cpan.org/~tmtm/Plucene-1.25/lib/Plucene.pm

-- 
Bill
n1...@arrl.net bill.n1...@gmail.com
searching/grepping for words near each other
OK, I know we have a few grep gurus on this list...

I want to search a text file for a few (alphabetic) words which must be near each other, but not necessarily on the same line. Near could be defined however you like... within a certain number of words from each other, a certain number of characters from each other, or some similar constraint.

Is there any way to do this using grep? If not, is there some other tool (short of a desktop search engine) capable of doing this?

This seems like a rather elementary search task, so I figure someone must have figured out a convenient way to do it...

Any suggestions?
Re: searching/grepping for words near each other
> I want to search a text file for a few (alphabetic) words which must
> be near each other, but not necessarily on the same line. Near could
> be defined however you like... within a certain number of words from
> each other, a certain number of characters from each other, or some
> similar constraint.

The following example looks for a certain famous phrase, but does so in a loose manner, accepting anywhere from 1-200 lines of cruft between the two parts of the phrase. As far as the cruft goes, this code doesn't care about linebreaks.

[pull-start the 500cc swiss-army chainsaw]

    perl -0777 -lne 'print "$ARGV:$&" if (/weapons of.{1,200}mass destruction/s)' file1 file2 file3 ... fileN

(0777 causes Perl to undef $/ (go into slurp mode), and the /s regexp modifier causes . to match newlines, which regexp engines usually don't do)

--kevin
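For anyone who would rather read this as a script than as a one-liner, here is a rough equivalent. It is only a sketch: the sub name loose_match is invented, and the files are slurped by localizing $/ rather than with the -0777 switch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Return the first loose "weapons of ... mass destruction" hit in $text,
# allowing 1-200 characters in between. The /s modifier lets "." match
# newlines, so the phrase may span lines. (Hypothetical helper name.)
sub loose_match {
    my ($text) = @_;
    return $text =~ /(weapons of.{1,200}mass destruction)/s ? $1 : undef;
}

# Read each file named on the command line in one gulp, as -0777 would.
for my $file (@ARGV) {
    my $fh;
    unless (open($fh, '<', $file)) {
        warn "$file: $!\n";
        next;
    }
    my $text = do { local $/; <$fh> };    # $/ undefined means slurp mode
    close($fh);

    my $hit = loose_match($text);
    print "$file:$hit\n" if defined $hit;
}
```

Saved under some name such as near.pl (also made up), it would be run as: perl near.pl file1 file2 file3 ... fileN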
Re: searching/grepping for words near each other
> I want to search a text file for a few (alphabetic) words which must
> be near each other, but not necessarily on the same line.

grep is pretty much line oriented, and although it's possible to script elaborate workarounds (in sed) involving transfers back and forth between the pattern space and the hold space, it's icky and slow to work against the grain that way.

I predict that you'll end up using something like Python or Perl.

I thought agrep (the approximate grep that's part of Glimpse) might do the trick, as it's willing to let you specify very sloppy search terms but, alas, it too is line oriented.
Re: searching/grepping for words near each other
On Thu, Apr 30, 2009 at 12:02 PM, Kevin D. Clark kevin_d_cl...@comcast.net wrote:
> (0777 causes Perl to undef $/ (go into slurp mode),

It took me a minute, and a RTFM moment, to figure that out. For those who, like me, didn't get it: That's a capital letter oh, not a zero. The -O switch to Perl specifies the record separator, which is basically the line separator. Normally it's a C newline. You can specify an octal or hex value for the character. But there are some magic values:

  -O00   (Two zeros.) Paragraph mode, separating records by two or more blank lines.
  -O     (Nothing.) ASCII NUL separator. Useful with find -print0.
  -O777  No record separator.

With no record separator, the entire file gets sucked in as the first and only record, newlines and all. So it then becomes useful to match newlines with the /s regexp modifier. (Normally, the newline will only be at the end of the record. Matching that is rather boring. Especially if you use chomp.)

I presume 777 was used because 777 was never a valid character for either hex or octal. But then Unicode happened, and characters could be bigger than one byte. So TFM says Unicode has to be specified in hex, not octal.

In code, you can set $/ to multi-character strings.

-- Ben
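The record-separator modes can also be tried from inside a script by setting $/ directly, which avoids the command-line switch altogether. A small sketch follows; the sub name records and the sample strings are invented, and it reads from an in-memory filehandle so no files are needed.

```perl
use strict;
use warnings;

# Split $text into records the way <FILEHANDLE> would for a given
# value of $/ (the input record separator). Hypothetical helper.
sub records {
    my ($text, $sep) = @_;
    open(my $fh, '<', \$text) or die "cannot open in-memory file: $!";
    local $/ = $sep;
    my @recs = <$fh>;
    close($fh);
    return @recs;
}

my $text = "one\ntwo\n\n\nthree\nfour\n";

my @lines = records($text, "\n");        # the default: one record per line
my @paras = records($text, "");          # paragraph mode: split at blank lines
my @whole = records($text, undef);       # slurp mode: the whole text at once
my @multi = records("a--b--c", "--");    # a multi-character separator

printf "%-10s %d record(s)\n", 'lines:', scalar @lines;
printf "%-10s %d record(s)\n", 'paragraph:', scalar @paras;
printf "%-10s %d record(s)\n", 'slurp:', scalar @whole;
printf "%-10s %d record(s)\n", 'multi:', scalar @multi;
```

Only in the slurp case does a single record contain newlines anywhere but its end, which is where the /s modifier starts to matter.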
Re: searching/grepping for words near each other
One way to do what you want is to use hipdig.pl, which is a utility in the FTimes suite. You can download FTimes and read more information at the following URL.

http://ftimes.sourceforge.net/FTimes/index.shtml

The hipdig utility is a Perl script that digs (searches) for hosts, IPs, passwords, and custom regular expressions. The online man page for hipdig.pl is located at the URL below.

http://ftimes.sourceforge.net/FTimes/Man+Pages/hipdig.shtml

You can use the hipdig custom type to specify a regex that returns characters around target strings, as shown in the following example. My test file is shown below.

    $ cat /tmp/test.1
    abc
    foobar
    def
    uvw
    barfoo
    xyz

The command below specifies a custom type (-t), which is a regex that searches for the strings foobar and barfoo that are 0-20 characters from each other. Notice that special characters are URL-encoded in the output, so %0a is the newline character.

    $ hipdig.pl -h -t 'custom=(?i)foobar.{0,20}barfoo' /tmp/test.1
    name|type|tag|offset|string
    /tmp/test.1|regexp||4|foobar%0adef%0auvw%0abarfoo

Hope that helps.

Andy
KoreLogic Security
603.465.3236 (Office)
603.340.2498 (Mobile)
http://www.korelogic.com
GnuPG Fingerprint: 688A 79EC B1E5 5748 CE87 1F20 2C45 60E7 0583 23B6

On Thu, Apr 30, 2009 at 03:35:55PM +, virgins...@vfemail.net wrote:
> OK, I know we have a few grep gurus on this list...
>
> I want to search a text file for a few (alphabetic) words which must
> be near each other, but not necessarily on the same line. Near could
> be defined however you like... within a certain number of words from
> each other, a certain number of characters from each other, or some
> similar constraint.
>
> Is there any way to do this using grep? If not, is there some other
> tool (short of a desktop search engine) capable of doing this?
>
> This seems like a rather elementary search task, so I figure someone
> must have figured a convenient way to do it...
>
> Any suggestions?
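The same regex can be tried in plain Perl without installing FTimes. The sketch below is not how hipdig.pl is implemented: the subs dig and encode are invented names, the encoding rule is a guess that happens to reproduce the sample output, and the /s modifier is added on the assumption that "." has to match newlines for the match to span lines.

```perl
use strict;
use warnings;

# The contents of /tmp/test.1 from the example above, one word per line.
my $text = "abc\nfoobar\ndef\nuvw\nbarfoo\nxyz\n";

# Return [offset, matched string] for every match of $re in $data.
sub dig {
    my ($data, $re) = @_;
    my @hits;
    while ($data =~ /($re)/g) {
        push @hits, [ $-[1], $1 ];    # $-[1] is where group 1 starts
    }
    return @hits;
}

# Percent-encode whitespace, control characters, "%" and "|".
# This rule is an assumption, not taken from the hipdig source.
sub encode {
    my ($s) = @_;
    $s =~ s/([^\x21-\x7e]|[%|])/sprintf("%%%02x", ord($1))/ge;
    return $s;
}

for my $hit (dig($text, qr/foobar.{0,20}barfoo/is)) {
    printf "%d|%s\n", $hit->[0], encode($hit->[1]);
}
```

For this input it prints 4|foobar%0adef%0auvw%0abarfoo, the same offset and string as the last two fields of the hipdig output. Like the hipdig example, it only finds the terms in the order given.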
Re: searching/grepping for words near each other
Ben Scott writes:

> On Thu, Apr 30, 2009 at 12:02 PM, Kevin D. Clark
> > (0777 causes Perl to undef $/ (go into slurp mode),
>
> It took me a minute, and a RTFM moment, to figure that out. For
> those who, like me, didn't get it: That's a capital letter oh, not a
> zero.

Err, no, it's the other way around: that most certainly is a zero and not an oh.

One other small correction: when I wrote

  accepting anywhere from 1-200 *lines* of cruft between the two parts of the phrase

I meant to write

  accepting anywhere from 1-200 *characters* of cruft between the two parts of the phrase.

Regards,

--kevin
-- 
And don't tell me there isn't one bit of difference between null and space, because that's exactly how much difference there is. :-)
    -- Larry Wall
Re: searching/grepping for words near each other
On Thu, Apr 30, 2009 at 4:45 PM, Kevin D. Clark kevin_d_cl...@comcast.net wrote:
> Err, no, it's the other way around: that most certainly is a zero
> and not an oh.

Hrmm. I couldn't find it at first, and then I switched to the other way, and that found it. But trying it with the actual perl interpreter confirms Kevin is (of course) correct. Phooey.

-- Ben
Re: searching/grepping for words near each other
Ben Scott writes:

> Hrmm. I couldn't find it at first, and then I switched to the other
> way, and that found it. But trying it with the actual perl
> interpreter confirms Kevin is (of course) correct. Phooey.

Ha ha ha. Of course.

Part of being a competent software engineer involves learning that you're probably wrong, oh, at least a dozen times a day.

Regards,

--kevin
Re: searching/grepping for words near each other
> From: kevin_d_cl...@comcast.net (Kevin D. Clark)
> Date: 30 Apr 2009 12:02:19 -0400
>
> > I want to search a text file for a few (alphabetic) words which must
> > be near each other, but not necessarily on the same line. Near could
> > be defined however you like... within a certain number of words from
> > each other, a certain number of characters from each other, or some
> > similar constraint.
>
> perl -0777 -lne 'print "$ARGV:$&" if (/weapons of.{1,200}mass destruction/s)' file1 file2 file3 ... fileN

That will work, but only if the search terms appear in the document in the same order as they appear in the query. (This appears to be the case with the hipdig solution as well. Correct me if I'm wrong, of course.) The search terms I'm looking for could appear in the target document in any order. Perhaps I could have made that clearer. Okay, I *could* have made that clearer.

SEARCH:  weapons mass distraction

TEXT:    UFOs are a distraction for people who enjoy buying energy
         for low-power laser weapons. Taco shells are ETs' preferred
         food. Because of their low mass, they can be carried into
         orbit with minimal distraction.

MATCHES: distraction (for...laser) weapons (Taco...low) mass
         weapons (Taco...low) mass (they...minimal) distraction

You could do this kind of matching with Perl regexps, but they'd have to be nested, with one level of nesting for each term... which would quickly become both ugly and inefficient.

I was thinking along the lines of sorting the search terms, along with blocks of the text, so that the terms would be rearranged into a canonical order... but it's not clear how to choose blocks of text for an efficient search.

I could use a Perl regexp like

    (.{0,200}(weapons|mass|distraction).{0,200}){3}

but that doesn't require that each alternative appear at least once. Things like:

    distraction FOO mass BAR distraction

would match that regexp, but would return false positive results for my search.
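One possible way around both the nesting and the false positives is to stop asking a single regexp to do everything: collect the position of every term first, then slide a window over those positions and keep the windows that hold each term at least once. The sketch below assumes that "near" means the whole window, from the start of its first term to the end of its last, fits in $span characters. The sub name proximity_windows is made up.

```perl
use strict;
use warnings;

# Return [start, end] character offsets of each minimal window that
# contains every term at least once, in any order, and is at most
# $span characters long. Terms are matched as whole words, ignoring case.
sub proximity_windows {
    my ($text, $span, @terms) = @_;
    my %want = map { lc($_) => 1 } @terms;
    my $need = keys %want;                  # number of distinct terms
    return unless $need;

    # Every occurrence of every term, left to right: [start, end, term].
    my $alternation = join('|', map { quotemeta($_) } sort keys %want);
    my @hits;
    while ($text =~ /\b($alternation)\b/gi) {
        push @hits, [ $-[1], $+[1], lc $1 ];
    }

    my (%count, @windows);
    my ($have, $left) = (0, 0);
    for my $right (0 .. $#hits) {
        $have++ if $count{ $hits[$right][2] }++ == 0;
        next unless $have == $need;

        # Drop leading hits whose term shows up again inside the window.
        while ($count{ $hits[$left][2] } > 1) {
            $count{ $hits[$left][2] }--;
            $left++;
        }
        my ($start, $end) = ($hits[$left][0], $hits[$right][1]);
        push @windows, [ $start, $end ] if $end - $start <= $span;

        # Give up the leftmost hit so the next window starts further right.
        $count{ $hits[$left][2] }--;
        $have--;
        $left++;
    }
    return @windows;
}

my $text = "UFOs are a distraction for people who enjoy buying energy for "
         . "low-power laser weapons. Taco shells are ETs' preferred food. "
         . "Because of their low mass, they can be carried into orbit with "
         . "minimal distraction.";

for my $w (proximity_windows($text, 200, qw(weapons mass distraction))) {
    my ($start, $end) = @$w;
    printf "%d-%d: %s\n", $start, $end, substr($text, $start, $end - $start);
}
```

On the sample text this reports the two matches listed above (distraction...weapons...mass and weapons...mass...distraction), while "distraction FOO mass BAR distraction" yields nothing because weapons never appears. The text is scanned once, and adding terms lengthens the alternation instead of adding levels of nesting.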
Re: searching/grepping for words near each other
virgins...@vfemail.net wrote:
> From: kevin_d_cl...@comcast.net (Kevin D. Clark)
> Date: 30 Apr 2009 12:02:19 -0400
>
> > I want to search a text file for a few (alphabetic) words which must
> > be near each other, but not necessarily on the same line. Near could
> > be defined however you like... within a certain number of words from
> > each other, a certain number of characters from each other, or some
> > similar constraint.

I believe this is often called a proximity search. I did write some code to do it three decades ago in Lisp. I don't recollect where I gleaned the algorithm, though.