searching/grepping for words near each other

2009-04-30 Thread VirginSnow
OK, I know we have a few grep gurus on this list...

I want to search a text file for a few (alphabetic) words which must
be near each other, but not necessarily on the same line.  "Near"
could be defined however you like... within a certain number of words
from each other, a certain number of characters from each other, or
some similar constraint.

Is there any way to do this using grep?  If not, is there some other
tool (short of a desktop search engine) capable of doing this?

This seems like a rather elementary search task, so I figure someone
must have figured out a convenient way to do it...

Any suggestions?
___
gnhlug-discuss mailing list
gnhlug-discuss@mail.gnhlug.org
http://mail.gnhlug.org/mailman/listinfo/gnhlug-discuss/


Re: searching/grepping for words near each other

2009-04-30 Thread Kevin D. Clark

> I want to search a text file for a few (alphabetic) words which must
> be near each other, but not necessarily on the same line.  "Near"
> could be defined however you like... within a certain number of words
> from each other, a certain number of characters from each other, or
> some similar constraint.

The following example looks for a certain famous phrase, but does so
in a loose manner, accepting anywhere from 1-200 lines of cruft
between the two parts of the phrase.  As far as the cruft goes, this
code doesn't care about linebreaks.

[pull-start the 500cc swiss-army chainsaw]

perl -0777 -lne 'print "$ARGV:$&" if (/weapons of.{1,200}mass destruction/s)' \
    file1 file2 file3 ... fileN


(0777 causes Perl to undef $/ (go into slurp mode), and the /s
regexp modifier causes . to match newlines, which regexp engines
usually don't do)
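
For readers who would rather try this in Python, here is a rough
equivalent of the one-liner above.  It is only a sketch: the function
name and the sample text are made up for illustration, holding the
whole input in one string stands in for -0777, and re.DOTALL plays the
role of the /s modifier.

```python
import re

# Same idea as the Perl one-liner: treat the whole text as one string
# ("slurp" it), then let "." cross line breaks via re.DOTALL.
PHRASE = re.compile(r"weapons of.{1,200}mass destruction", re.DOTALL)

def find_loose_phrase(text):
    """Return the first loose match in text, or None if there is none."""
    match = PHRASE.search(text)
    return match.group(0) if match else None

# Illustrative sample; the two halves of the phrase sit on different lines.
sample = "they spoke of weapons of\nso-called\nmass destruction at length"
print(repr(find_loose_phrase(sample)))
```

Without re.DOTALL the same pattern finds nothing in that sample,
because "." stops at the first newline.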

--kevin
-- 
GnuPG ID: B280F24E        God, I loved that Pontiac.
alumni.unh.edu!kdc        -- Tom Waits
http://kdc-blog.blogspot.com/ 


Re: searching/grepping for words near each other

2009-04-30 Thread Michael ODonnell


> I want to search a text file for a few (alphabetic) words which
> must be near each other, but not necessarily on the same line.

grep is pretty much line oriented, and although it's possible to script
elaborate workarounds in sed involving transfers back and forth between
the pattern space and the hold space, it's icky and slow to work against
the grain that way.  I predict that you'll end up using something like
Python or Perl.  I thought agrep (the approximate grep that's part
of Glimpse) might do the trick, as it's willing to let you specify very
sloppy search terms, but, alas, it too is line oriented.
 


Manning books, 50% until May 1st

2009-04-30 Thread Ted Roche
Around my office, we have radio or TV playing most of the day, and we
often hear an exclamation from one or another of the inmates, "Enough
with the Twitter already!"

Well, here's a good deal, and you don't even have to go on Twitter to
get it. The Manning publishers posted the offer:

"Thanks all! 50% off any Manning book until Friday, May 1 with code
twtr0501. Valid at Manning.com only."

There's a wxPython book that Ric Werme reviewed at PySIG, and The 
Well-Grounded Rubyist has gotten some good reviews.

And you twitterites (twitterheads?) might want to follow 
http://twitter.com/ManningBooks for future good deals.

-- 
Ted Roche
Ted Roche & Associates, LLC
http://www.tedroche.com



Re: searching/grepping for words near each other

2009-04-30 Thread Ben Scott
On Thu, Apr 30, 2009 at 12:02 PM, Kevin D. Clark
<kevin_d_cl...@comcast.net> wrote:
> (0777 causes Perl to undef $/ (go into slurp mode),

  It took me a minute, and a RTFM moment, to figure that out.  For
those who, like me, didn't get it: That's a capital letter oh, not a
zero.  The -O switch to Perl specifies the record separator, which
is basically the line separator.  Normally it's a C newline.  You can
specify an octal or hex value for the character.  But there are some
magic values:

-O00    (Two zeros.)  Paragraph mode, separating records by one or
        more blank lines.
-O      (Nothing.)  ASCII NUL separator.  Useful with find -print0.
-O777   No record separator.

  With no record separator, the entire file gets sucked in as the
first and only record, newlines and all.  So it then becomes useful to
match newlines with the /s regexp modifier.  (Normally, the newline
will only be at the end of the record.  Matching that is rather
boring.  Especially if you use chomp.)

  I presume 777 was used because 777 was never a valid character for
either hex or octal.  But then Unicode happened, and characters could
be bigger than one byte.  So TFM says Unicode has to be specified in
hex, not octal.

  In code, you can set $/ to multi-character strings.

-- Ben


Re: searching/grepping for words near each other

2009-04-30 Thread Andy Bair
One way to do what you want is to use hipdig.pl which is a utility in
the FTimes suite.  You can download FTimes and read more information
at the following URL.

  http://ftimes.sourceforge.net/FTimes/index.shtml

The hipdig utility is a Perl script that digs (searches) for hosts,
IPs, passwords, and custom regular expressions.  The online man page
for hipdig.pl is located at the URL below.

  http://ftimes.sourceforge.net/FTimes/Man+Pages/hipdig.shtml

You can use the hipdig custom type to specify a regex that returns
characters around target strings, as shown in the following example.
My test file is shown below.

  $ cat /tmp/test.1 

abc
foobar
def
uvw
barfoo
xyz

The command below specifies a custom type (-t) which is a regex that
searches for the strings foobar and barfoo separated by 0-20
characters.  Notice that special characters are URL-encoded in
the output, so %0a is the newline character.

  $ hipdig.pl -h -t 'custom=(?i)foobar.{0,20}barfoo' /tmp/test.1 

name|type|tag|offset|string
/tmp/test.1|regexp||4|foobar%0adef%0auvw%0abarfoo
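
The offset and the %0a escapes in that output can be reproduced with a
few lines of Python.  This is an illustrative sketch, not hipdig's
implementation: dig() is a hypothetical helper, re.DOTALL is an
assumption made so that "." can cross newlines (which the sample output
suggests is allowed), and urllib.parse.quote writes its hex digits in
upper case, so it gives %0A where hipdig prints %0a.

```python
import re
from urllib.parse import quote

# Contents of /tmp/test.1 from the example above.
data = "abc\nfoobar\ndef\nuvw\nbarfoo\nxyz\n"

# Same custom regex; re.DOTALL is an assumption so "." may match newlines.
pattern = re.compile(r"(?i)foobar.{0,20}barfoo", re.DOTALL)

def dig(text):
    """Return (offset, url_encoded_match) pairs for every match in text."""
    return [(m.start(), quote(m.group(0))) for m in pattern.finditer(text)]

for offset, encoded in dig(data):
    print(offset, encoded)
```

The match starts at offset 4 because "abc" plus its newline occupy
offsets 0 through 3.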

Hope that helps.

Andy

KoreLogic Security
603.465.3236 (Office)
603.340.2498 (Mobile)
http://www.korelogic.com
GnuPG Fingerprint: 688A 79EC B1E5 5748 CE87  1F20 2C45 60E7 0583 23B6

On Thu, Apr 30, 2009 at 03:35:55PM +, virgins...@vfemail.net wrote:
> OK, I know we have a few grep gurus on this list...
>
> I want to search a text file for a few (alphabetic) words which must
> be near each other, but not necessarily on the same line.  "Near"
> could be defined however you like... within a certain number of words
> from each other, a certain number of characters from each other, or
> some similar constraint.
>
> Is there any way to do this using grep?  If not, is there some other
> tool (short of a desktop search engine) capable of doing this?
>
> This seems like a rather elementary search task, so I figure someone
> must have figured a convenient way to do it...
>
> Any suggestions?


Re: searching/grepping for words near each other

2009-04-30 Thread Kevin D. Clark

Ben Scott writes:

> On Thu, Apr 30, 2009 at 12:02 PM, Kevin D. Clark

> > (0777 causes Perl to undef $/ (go into slurp mode),
>
>   It took me a minute, and a RTFM moment, to figure that out.  For
> those who, like me, didn't get it: That's a capital letter oh, not a
> zero.

Err, no, it's the other way around:  that most certainly is a zero and
not an oh.

One other small correction: when I wrote "accepting anywhere from
1-200 *lines* of cruft between the two parts of the phrase" I meant to
write "accepting anywhere from 1-200 *characters* of cruft between the
two parts of the phrase".

Regards,

--kevin
-- 
And don't tell me there isn't one bit of difference between null and
 space, because that's exactly how much difference there is.  :-)
-- Larry Wall


Re: searching/grepping for words near each other

2009-04-30 Thread Ben Scott
On Thu, Apr 30, 2009 at 4:45 PM, Kevin D. Clark
<kevin_d_cl...@comcast.net> wrote:
> Err, no, it's the other way around:  that most certainly is a zero and
> not an oh.

  Hrmm.  I couldn't find it at first, and then I switched to the other
way, and that found it.  But trying it with the actual perl
interpreter confirms Kevin is (of course) correct.  Phooey.

-- Ben



Re: searching/grepping for words near each other

2009-04-30 Thread Kevin D. Clark

Ben Scott writes:

>   Hrmm.  I couldn't find it at first, and then I switched to the other
> way, and that found it.  But trying it with the actual perl
> interpreter confirms Kevin is (of course) correct.  Phooey.

Ha ha ha.  Of course.

Part of being a competent software engineer involves learning that
you're probably wrong, oh, at least a dozen times a day.

Regards,

--kevin
-- 
GnuPG ID: B280F24E        God, I loved that Pontiac.
alumni.unh.edu!kdc        -- Tom Waits
http://kdc-blog.blogspot.com/ 


Re: searching/grepping for words near each other

2009-04-30 Thread VirginSnow
> From: kevin_d_cl...@comcast.net (Kevin D. Clark)
> Date: 30 Apr 2009 12:02:19 -0400

> > I want to search a text file for a few (alphabetic) words which must
> > be near each other, but not necessarily on the same line.  "Near"
> > could be defined however you like... within a certain number of words
> > from each other, a certain number of characters from each other, or
> > some similar constraint.

> perl -0777 -lne 'print "$ARGV:$&" if (/weapons of.{1,200}mass destruction/s)' \
>     file1 file2 file3 ... fileN

That will work, but only if the search terms appear in the document in
the same order as they appear in the query.  (This appears to be the
case with the hipdig solution as well.  Correct me if I'm wrong, of
course.)  The search terms I'm looking for could appear in the target
document in any order.  Perhaps I could have made that clearer.  Okay,
I *could* have made that clearer.

SEARCH:  weapons mass distraction

TEXT:    UFOs are a distraction for people
         who enjoy buying energy for
         low-power laser weapons.  Taco
         shells are ETs' preferred food.
         Because of their low mass, they
         can be carried into orbit
         with minimal distraction.

MATCHES: distraction (for...laser) weapons (Taco...low) mass
         weapons (Taco...low) mass (they...minimal) distraction

You could do this kind of matching with Perl regexps, but they'd have
to be nested, with one level of nesting for each term... which would
quickly become both ugly and inefficient.
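
To make the cost concrete, here is a sketch in Python (a Perl regexp
would look much the same) of one regexp-only formulation: an
alternation with one branch per ordering of the terms.  The helper name
and the 200-character gap are assumptions for illustration.  The number
of branches grows as the factorial of the number of terms.

```python
import re
from itertools import permutations

def any_order_pattern(terms, max_gap=200):
    """Compile a regexp accepting the terms in any order, with at most
    max_gap characters between one term and the next."""
    gap = ".{0,%d}?" % max_gap
    branches = [
        gap.join(re.escape(term) for term in order)
        for order in permutations(terms)
    ]
    return re.compile("|".join(branches), re.DOTALL)

pattern = any_order_pattern(["weapons", "mass", "distraction"])
# One branch per ordering: 3 terms give 6 branches, 5 terms give 120.
print(pattern.pattern.count("|") + 1, "branches for 3 terms")
```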

I was thinking along the lines of sorting the search terms, along with
blocks of the text, so that the terms would be rearranged into a
canonical order... but it's not clear how to choose blocks of text for
an efficient search.

I could use a Perl regexp like

  (.{0,200}(weapons|mass|distraction).{0,200}){3}

but that doesn't require that each alternative appear at least once.
Things like:

  distraction FOO mass BAR distraction

would match that regexp, but would return false positive results for
my search.
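
One regexp formulation that does require every term, sketched in Python
as an illustration: a run of lookaheads, one per term, all anchored at
the same position.  "Near" here means that every term starts within a
fixed number of characters of some common point, in any order.  The
function name and the window size are assumptions, and the counting
regexp from above is included only for comparison.

```python
import re

def all_terms_within(terms, window=200):
    """Compile a regexp that matches (with zero width) at any position
    from which every term starts within `window` characters."""
    lookaheads = ["(?=.{0,%d}%s)" % (window, re.escape(t)) for t in terms]
    return re.compile("".join(lookaheads), re.DOTALL)

# The counting regexp from above: three hits, but not three distinct terms.
counting = re.compile(
    r"(.{0,200}(weapons|mass|distraction).{0,200}){3}", re.DOTALL
)
strict = all_terms_within(["weapons", "mass", "distraction"])

false_positive = "distraction FOO mass BAR distraction"
print(bool(counting.search(false_positive)))  # the counting regexp accepts it
print(bool(strict.search(false_positive)))    # the lookahead version does not
```

The pattern grows by one lookahead per term rather than one branch per
ordering, at the price of a cruder definition of "near".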


Re: searching/grepping for words near each other

2009-04-30 Thread Dan Jenkins
virgins...@vfemail.net wrote:
> > From: kevin_d_cl...@comcast.net (Kevin D. Clark) Date: 30 Apr 2009
> > 12:02:19 -0400
> > I want to search a text file for a few (alphabetic) words which
> > must be near each other, but not necessarily on the same line.
> > "Near" could be defined however you like... within a certain
> > number of words from each other, a certain number of characters
> > from each other, or some similar constraint.

I believe this is often called a "proximity search". I did write some
code to do it three decades ago in Lisp. I don't recollect where I
gleaned the algorithm, though.
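
A minimal sketch of one way to do a proximity search, in Python.  This
is not the Lisp algorithm mentioned above, which is not described here;
it is a plain single-pass scan that remembers where each term was last
seen.  The function name, the tokenising rule (runs of ASCII letters,
lowercased) and the max_words limit are all assumptions made for
illustration.

```python
import re

def proximity_matches(text, terms, max_words=25):
    """Return (start, end) word-index spans that contain every term, in
    any order, and cover at most max_words words."""
    wanted = {t.lower() for t in terms}
    words = re.findall(r"[A-Za-z]+", text.lower())
    last_seen = {}   # term -> index of its most recent occurrence
    spans = []
    for i, word in enumerate(words):
        if word not in wanted:
            continue
        last_seen[word] = i
        if len(last_seen) == len(wanted):
            # Tightest span ending at this word that holds every term.
            start = min(last_seen.values())
            if i - start + 1 <= max_words:
                spans.append((start, i))
    return spans

print(proximity_matches("a mass of weapons, what a distraction",
                        ["weapons", "mass", "distraction"]))
```

Run over the UFO sample text from earlier in the thread with
max_words=25, it reports two spans, corresponding to the two MATCHES
listed there.  The scan is a single pass over the words, and term order
never matters.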
