Re: [htdig] using perl/cron to find badwords on site

2001-01-11 Thread Gilles Detillieux

According to Jerry Preeper:
> I don't know if anyone else has run across this yet, but I have a number
> of guestbooks and things like that where people can post, and I would love
> to find a way to set up a daily cron job with a perl script that basically
> runs a set of badwords through htsearch and then emails me a list of just
> the URLs it finds with those words in them... I don't really need things
> like the page title or description or stuff like that...  I'm assuming
> I'll need to use a system call in the script to some sort of command-line
> invocation and loop it for each word...  Any input would be greatly
> appreciated.

I assume that you want your htdig database updated through this same
cron job, before running htsearch, so that the database you search will
contain any new postings to the guestbooks.  The simplest way I can
think of, assuming the correct settings are already made in htdig.conf,
would be a shell script with these commands...

  htdig
  htmerge
  /path/to/cgi-bin/htsearch "words=badword1+badword2+badword3+badword4"

Of course, if you want to write it in Perl, especially if you need more
processing than simply running these programs, you can run the above
commands through one or more calls to Perl's system() function, as in
the sketch below.
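
For instance, here's a minimal sketch, using system() for the indexing
passes and backticks to capture the htsearch output for mailing.  The
htsearch path, the query string and the recipient address are
placeholders you'd replace with your own; it also assumes htdig and
htmerge are on the PATH and sendmail is at the usual location:

  #!/usr/bin/perl -w
  use strict;

  my $htsearch = '/path/to/cgi-bin/htsearch';    # assumed location
  my $to       = 'you@example.com';              # hypothetical recipient

  # Rebuild and merge the database so new postings are searchable.
  system('htdig')   == 0 or die "htdig failed: $?";
  system('htmerge') == 0 or die "htmerge failed: $?";

  # Run one query for all the bad words and capture the HTML output.
  my $results = `$htsearch "words=badword1+badword2+badword3+badword4"`;

  # Mail the raw results through the local sendmail.
  open(my $mail, '|/usr/sbin/sendmail -t') or die "can't run sendmail: $!";
  print $mail "To: $to\nSubject: bad-word check\n\n$results";
  close($mail) or die "sendmail failed: $?";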

You may want to customise the htsearch templates to get just the URL,
if that's all you want (see template_map, search_results_header and
search_results_footer in http://www.htdig.org/attrs.html).
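
For example, a fragment like this in htdig.conf might do it.  The
attribute names are real (they're all in attrs.html), but the template
file, its one-line contents and the $(URL) variable are my assumptions
from the htsearch template docs, so treat this as an untested sketch:

  # hypothetical htdig.conf fragment: map in and select a bare template
  template_map: UrlsOnly urls ${common_dir}/urls-only.html
  template_name: urls
  search_results_header: ${common_dir}/empty.html
  search_results_footer: ${common_dir}/empty.html

Here ${common_dir}/urls-only.html would contain just the line $(URL),
and ${common_dir}/empty.html would be an empty file.
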
If you want to search for each word separately, rather than one query
for all words, then you'd need to call htsearch once for each individual
word.  E.g. in a shell script, you could do:

  htdig; htmerge
  for word in badword1 badword2 badword3 badword4
  do
    echo "${word}:"
    /path/to/cgi-bin/htsearch "words=${word}"
  done

or:

  htdig; htmerge
  while read word
  do
    echo "${word}:"
    /path/to/cgi-bin/htsearch "words=${word}"
  done < /path/to/bad-word-file

However, it seems to me it would be better to search for all the words
at once, unless you need a word-by-word summary of URLs.
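
If you do want that word-by-word breakdown from Perl rather than from
the shell, a rough equivalent of the loops above might be (again, the
htsearch path and the bad-word file are placeholders):

  #!/usr/bin/perl -w
  use strict;

  my $htsearch = '/path/to/cgi-bin/htsearch';    # assumed location

  # Update the database first, as in the shell versions.
  system('htdig')   == 0 or die "htdig failed: $?";
  system('htmerge') == 0 or die "htmerge failed: $?";

  open(my $words, '<', '/path/to/bad-word-file')
      or die "can't open word list: $!";
  while (my $word = <$words>) {
      chomp $word;
      next unless $word;               # skip blank lines
      print "$word:\n";
      print `$htsearch "words=$word"`;
  }
  close($words);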

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930






[htdig] using perl/cron to find badwords on site

2001-01-11 Thread Jerry Preeper

Hi all, 
I don't know if anyone else has run across this yet, but I have a number
of guestbooks and things like that where people can post, and I would love
to find a way to set up a daily cron job with a perl script that basically
runs a set of badwords through htsearch and then emails me a list of just
the URLs it finds with those words in them... I don't really need things
like the page title or description or stuff like that...  I'm assuming
I'll need to use a system call in the script to some sort of command-line
invocation and loop it for each word...  Any input would be greatly
appreciated.
Jerry


