[htdig] using perl/cron to find badwords on site

2001-01-11 Thread Jerry Preeper

Hi all, 
I don't know if anyone else has run across this yet, but I have a number of
guestbooks and things like that where people can post, and I would love to
be able to find a way to set up a daily cron job with a Perl script that
basically runs a set of badwords through htsearch and then emails me a list
of just the URLs it finds with those words in them... I don't really need
things like the page title or description or stuff like that..  I'm
assuming I'll need to use a system call in the script to some sort of
command-line option and loop it for each word...  Any input would be
greatly appreciated.
Jerry



To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




[htdig] External Converter Prob

2001-01-11 Thread Reich, Stefan

Hi all,

all my descriptions are starting with "content-type: text/html".

Is this normal behavior, or is it because I'm using an external converter to
do some modifications on the spidered HTML files? I registered my converter
for text/html -> text/myhtml conversion, and I've patched the HTML parser to
recognize text/myhtml in addition to text/html.

I'm sure my external converter doesn't write text/html to the output stream.

Any ideas?

Tnx

  Stefan


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




Re: [htdig] keep temp files while running indexer? How to...

2001-01-11 Thread Gilles Detillieux

According to Stephen Murray:
 Hi Gilles,

 When you wrote:

 "Only the one on the contrib section of the FTP site and web site is
 current."

 You were referring to rundig.sh at http://www.htdig.org/contrib/ --
 right? That's the one I should use? (As Geoff suggested?)

I answered this yesterday evening, after the first time you asked, but
here goes again...

  Yes, it's the Scripts sub-section of that part of the web site, which
  actually takes you to the http://www.htdig.org/files/contrib/scripts/
  directory.

To clarify further, the one you should NOT use is the one in the contrib
directory of the htdig-3.1.5.tar.gz source distribution, or any other
source distribution, as these are the ones that are outdated.

Using the URL you mentioned in your e-mail above will get you to the
correct script in two clicks.  First, the link "Scripts" in the left
frame, then the link "rundig.sh" in the right frame.  Using the URL
I mentioned in my reply above will get you there in one click.  Either
way, it's the same directory and the same file, presented either with
or without the frame structure on the web page.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




RE: [htdig] htdig

2001-01-11 Thread Geoff Hutchison

At 9:00 AM -0500 1/11/01, Chuck Umeh wrote:
One of the db is about 1.5GB

OK, so does this seem reasonable? At the least, you're not likely 
running up against any OS restrictions on file size (normally the 
first one is 2GB for some OSes).

Have you tried running htdig -i -v like I suggested? I suspect you 
have an infinite loop or close to that.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




Re: [htdig] using perl/cron to find badwords on site

2001-01-11 Thread Gilles Detillieux

According to Jerry Preeper:
 I don't know if anyone else has run across this yet, but I have a number of
 guestbooks and things like that where people can post, and I would love to
 be able to find a way to set up a daily cron job with a Perl script that
 basically runs a set of badwords through htsearch and then emails me a list
 of just the URLs it finds with those words in them... I don't really need
 things like the page title or description or stuff like that..  I'm
 assuming I'll need to use a system call in the script to some sort of
 command-line option and loop it for each word...  Any input would be
 greatly appreciated.

I assume that you want your htdig database updated through this same
cron job, before running htsearch, so that the database you search will
contain any new postings to the guestbooks.  The simplest way I can
think of, assuming the correct settings are already made in htdig.conf,
would be a shell script with these commands...

  htdig
  htmerge
  /path/to/cgi-bin/htsearch "words=badword1+badword2+badword3+badword4"

Of course, if you want to write it in Perl, especially if you need more
processing than simply running these programs, you can call the above
commands in one or more calls to the system("...") function in Perl.
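
For what it's worth, here's a minimal Perl sketch of that approach.  The
bad-word list, the htsearch path and the mail address are placeholders to
adjust for your own setup, and the grep for URLs assumes you've trimmed
the htsearch templates down to bare URLs as described in the next
paragraph:

  #!/usr/bin/perl -w
  # Sketch only: refresh the index, run one htsearch query for all the
  # bad words, and mail any URLs found.  Paths, words and the mail
  # address are placeholders.
  use strict;

  my @badwords = qw(badword1 badword2 badword3 badword4);
  my $htsearch = '/path/to/cgi-bin/htsearch';

  system('htdig')   == 0 or die "htdig failed\n";
  system('htmerge') == 0 or die "htmerge failed\n";

  # htsearch takes the query string as an argument, as in the commands above
  my $query  = 'words=' . join('+', @badwords);
  my @output = `$htsearch "$query"`;

  # keep only the lines that contain a URL
  my @urls = grep { m{https?://}i } @output;

  if (@urls) {
      open(MAIL, '| mail -s "badword report" webmaster@example.com')
          or die "cannot run mail: $!\n";
      print MAIL @urls;
      close(MAIL);
  }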

You may want to customise the htsearch templates to get just the URL,
if that's all you want (see template_map, search_results_header and
search_results_footer in http://www.htdig.org/attrs.html).  If you want
to search for each word separately, rather than one query for all words,
then you'd need to call htsearch once for each individual word.  E.g. in
a shell script, you could do:

  htdig; htmerge
  for word in badword1 badword2 badword3 badword4
  do
    echo "${word}:"
    /path/to/cgi-bin/htsearch "words=${word}"
  done

or:

  htdig; htmerge
  while read word
  do
    echo "${word}:"
    /path/to/cgi-bin/htsearch "words=${word}"
  done < /path/to/bad-word-file

However, it seems to me it would be better to search for all at once,
unless you need a word by word summary of URLs.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




Re: [htdig] External converters - two questions

2001-01-11 Thread David Adams

Thanks Gilles, that is useful information.  I had thought that perhaps
pdftotext actually *added* hyphenation to a document.  If the problem is
removing the hyphenation that is actually written into the document, then I
can see that not everybody will wish to do this.  It is easily switched off
in doc2html.pl, but only if you know where to look.  The next version will
definitely be better in this respect.
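
For anyone following the thread who hasn't seen the code, the de-hyphenation
in question boils down to roughly this (a sketch of the idea only, not the
actual parse_doc.pl/doc2html.pl code):

  #!/usr/bin/perl -w
  # Sketch of the de-hyphenation idea: join a word that was split across
  # lines with a trailing hyphen, so "conv-" + "erter" comes back out as
  # "converter".  Reads the converter's text on stdin, writes to stdout.
  use strict;

  my $dehyphenate = 1;          # PDFs often have hyphenated lines

  my $text = do { local $/; <STDIN> };
  $text =~ s/([a-z])-[ \t]*\n[ \t]*([a-z])/$1$2/gi if $dehyphenate;
  print $text;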

As for magic numbers, I'll wait and see if anybody else can offer some
additional observations.

--
David Adams
Computing Services
Southampton University


- Original Message -
From: "Gilles Detillieux" [EMAIL PROTECTED]
To: "David Adams" [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Thursday, January 11, 2001 1:12 AM
Subject: Re: [htdig] External converters - two questions


 According to David Adams:
  I hope to find time for a further revision of the external converter
  script doc2html.pl and possibly simplify it a little.
 
  The existing code includes de-hyphenation (which is buggy) taken
  originally from parsedoc.pl.  The question is: is this necessary?
  Does pdftotext (or any other utility) actually break up words across
  lines with the addition of hyphens?  Is the hyphenation code of any
  use?  Information and opinions are requested.

 I added this code for dealing with a lot of the PDFs I needed to index
 on my site, and for the Manitoba Unix User Group web site as well (for
 their newsletters).  Unlike HTML documents, I've found a lot of PDF
 files make pretty heavy use of hyphenation.  Without the dehyphenation
 code, hyphenated words appeared as two separate words in the resulting
 text.  E.g. "conv-
 erter" was taken as "conv" and "erter", so a search for "converter" may
 not turn up this document if the word didn't appear unbroken elsewhere
 in the document.

 Sorry about the EOF bug in this code.  It was a quick hack, and I don't
 know Perl all that well.  There was a patch to fix this, though.  Are
 there any other bugs?

 In any case, in parse_doc.pl and conv_doc.pl, I wrote it to be optional,
 enabled by this line:

 $dehyphenate = 1;   # PDFs often have hyphenated lines

 which only applied to PDFs.  The ps2ascii utility already does its own
 dehyphenation, but pdftotext doesn't.  Other document types are less
 likely to need this.  If dehyphenation of PDFs is not desired, it's easy
 enough to change the 1 to a 0 above when configuring the script.  I don't
 recall if your doc2html.pl has the same sort of option.

  Also inherited from parsedoc.pl is extra code to cope with files which
  may be an "HP Print job" or contain a "MacBinary header".  Are such
  files really encountered?  If so, what type of files are they: Word,
  PDF or what?  Does the magic number code need to take account of them?

 Another hack of mine.  The MUUG web site had some pretty odd-ball
 PostScript files on it that were causing error messages while indexing
 their site.  Instead of simple and pure PS in these files, some had a
 MacBinary wrapper or HP PJL codes in them, which ps2ascii happily would
 skip over, but the Perl code wasn't accepting these files.  These hacks
 were to allow these files through.  Dunno if anyone else has found they
 help or hurt them, but I'm keeping them in my own copies of the scripts.
 I know they're kind of ugly, so if you want to get rid of them in your
 code for the sake of simplicity, I'd certainly understand.

 --
 Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
 Spinal Cord Research Centre   WWW:    http://www.scrc.umanitoba.ca/~grdetil
 Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
 Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930

 





To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




Re: [htdig] External Converter Prob

2001-01-11 Thread Gilles Detillieux

According to Reich, Stefan:
 all my descriptions are starting with "content-type: text/html".
 
 Is this normal behavior, or is it because I'm using an external converter to
 do some modifications on the spidered HTML files? I registered my converter
 for text/html -> text/myhtml conversion, and I've patched the HTML parser to
 recognize text/myhtml in addition to text/html.
 
 I'm sure my external converter doesn't write text/html to the output stream.
 
 Any ideas?

No, this is not normal behaviour.  If you're certain that your external
converter doesn't write this out, then we'd have to assume it comes
from elsewhere.  It may be a stupid question, but are you sure the pages
you're indexing don't contain this extra header?  I've seen defective
CGI scripts, for example, that inadvertently output two such headers in
some situations.  Ditto for SSI pages that call CGI scripts incorrectly.
Finally, it's hard to be sure it isn't a problem with your patches
to htdig, or to your particular configuration, without being able to
see them.  I don't know if this helps or not, but it may give you a few
more places to look.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




[htdig] Unable to contact server-revisited

2001-01-11 Thread Roger Weiss

Hi,

I'm running htdig v3.1.5 and my digging seems to be running out of steam
after it runs for anywhere from 20 minutes to an hour or so. The initial msg
was "Unable to connect to server". So, I ran it again with -v v v   to get
the error message below.

pick: ponderingjudd.xxx.com, # servers = 550
3213:3622:2:http://ponderingjudd.xxx.com/ponderingjudd/id6.html: Unable to
build connection with ponderingjudd.xxx.com:80
 no server running

I've replaced part of the URL with xxx to protect the innocent. The server
certainly is running and I had no trouble finding the mentioned url. Is
there some parm I need to set or limit I need to raise?
We're running an apache server with startservers =25 and minspace=10.

Thanks for your help,
Roger

Roger Weiss
[EMAIL PROTECTED]
(978) 318-7301
http://www.trellix.com



To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




RE: [htdig] htdig

2001-01-11 Thread Geoff Hutchison


No regular expressions needed. You can limit URLs based on query patterns
already. See the bad_querystr attribute:
http://www.htdig.org/attrs.html#bad_querystr

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

On Thu, 11 Jan 2001, Richard Bethany wrote:

 Geoff,
 
 I'm the SysAdmin for our web servers and I'm working with Chuck (who does
 the development work) on this problem.  Here's the "nuts & bolts" of the
 problem.  Our entire web server is set up with a menuing system being run
 through PHP3.  This menuing system basically allows local documents/links to
 be reached via a URL off of the PHP3 file.  In other words, if I try to
 access a particular page it will be accessed as
 http://ourweb.com/DEPT/index.php3?i=1&e=3&p=2:3:4:.
 
 In this scenario the only relevant piece of info is the "i" value; the
 remainder of the info simply describes which portions of the menu should be
 displayed.  What ends up happening is that, for a page with eight(8) main
 menu items, 40,320 (8*7*6*5*4*3*2*1) different "hits" show up in htDig for
 each link!!  I essentially need to exclude any URL where "p" has more than
 one value (i.e. - p=1: is okay, p=1:2: is not).
 
 I've looked through the mailing list archives and found a great deal of
 discussion on the topic of regular expressions with exclusions and also some
 talk of stripping parts of the URL, but I've seen nothing to indicate that
 any of this has actually been implemented.  Do you know if there is any
 implementation of this?  If not, I saw a reply to a different problem from
 Gilles indicating that the URL::normalizePath() function would be the best
 place to start hacking so I guess I'll try that.
 
 Thanks for your time!!!
 Richard



To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




Re: [htdig] Unable to contact server-revisited

2001-01-11 Thread Gilles Detillieux

According to Roger Weiss:
 I'm running htdig v3.1.5 and my digging seems to be running out of steam
 after it runs for anywhere from 20 minutes to an hour or so. The initial msg
 was "Unable to connect to server". So, I ran it again with -v v v   to get
 the error message below.
 
 pick: ponderingjudd.xxx.com, # servers = 550
 3213:3622:2:http://ponderingjudd.xxx.com/ponderingjudd/id6.html: Unable to
 build connection with ponderingjudd.xxx.com:80
  no server running
 
 I've replaced part of the URL with xxx to protect the innocent. The server
 certainly is running and I had no trouble finding the mentioned url. Is
 there some parm I need to set or limit I need to raise?
 We're running an apache server with startservers =25 and minspace=10.

I guess the next question, if you're sure the server is running, is can
you access it from a client?  More specifically, can you access it using a
different web client on the same system as the one on which you're running
htdig (e.g. from lynx, Netscape, kfm, or some other Linux/Unix-based web
browser)?  If you can, then the problem will be to figure out why htdig
can't build the connection while other programs on the same system can.
If you can't access the server from any client program on the same
system, then the problem isn't with htdig, but with your network setup
(e.g. firewall, packet filtering, or a bad connection from that system).
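
If it helps, here's a quick way to test that last point from the machine
running htdig, using a few lines of Perl (the host name below is just the
one from your message; substitute the real one):

  #!/usr/bin/perl -w
  # Quick check: can this machine open a TCP connection to the web server
  # on port 80 and get a status line back?
  use strict;
  use IO::Socket::INET;

  my $host = 'ponderingjudd.xxx.com';    # substitute the real host name
  my $sock = IO::Socket::INET->new(PeerAddr => $host,
                                   PeerPort => 80,
                                   Proto    => 'tcp',
                                   Timeout  => 10)
      or die "cannot connect to $host:80 ($!)\n";

  print $sock "HEAD / HTTP/1.0\r\nHost: $host\r\n\r\n";
  print scalar <$sock>;                  # e.g. "HTTP/1.1 200 OK"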

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




Re: [htdig] htdig

2001-01-11 Thread Gilles Detillieux

According to Geoff Hutchison:
 No regular expressions needed. You can limit URLs based on query patterns
 already. See the bad_querystr attribute:
 http://www.htdig.org/attrs.html#bad_querystr
...
 On Thu, 11 Jan 2001, Richard Bethany wrote:
  I'm the SysAdmin for our web servers and I'm working with Chuck (who does
  the development work) on this problem.  Here's the "nuts & bolts" of the
  problem.  Our entire web server is set up with a menuing system being run
  through PHP3.  This menuing system basically allows local documents/links to
  be reached via a URL off of the PHP3 file.  In other words, if I try to
  access a particular page it will be accessed as
  http://ourweb.com/DEPT/index.php3?i=1&e=3&p=2:3:4:.
  
  In this scenario the only relevant piece of info is the "i" value; the
  remainder of the info simply describes which portions of the menu should be
  displayed.  What ends up happening is that, for a page with eight(8) main
  menu items, 40,320 (8*7*6*5*4*3*2*1) different "hits" show up in htDig for
  each link!!  I essentially need to exclude any URL where "p" has more than
  one value (i.e. - p=1: is okay, p=1:2: is not).
  
  I've looked through the mailing list archives and found a great deal of
  discussion on the topic of regular expressions with exclusions and also some
  talk of stripping parts of the URL, but I've seen nothing to indicate that
  any of this has actually been implemented.  Do you know if there is any
  implementation of this?  If not, I saw a reply to a different problem from
  Gilles indicating that the URL::normalizePath() function would be the best
  place to start hacking so I guess I'll try that.

I guess the problem, though, is that without regular expressions it
could mean a large list of possible values that need to be specified
explicitly.  The same problem exists for exclude_urls as for bad_querystr,
as they're handled essentially the same way, the only difference being
that bad_querystr is limited to patterns occurring on or after the last
"?" in the URL.

So, if p=1: is valid, but p=[2-9].* and p=1:[2-9].* are not, then
the explicit list in bad_querystr would need to be:

bad_querystr:   p=2 p=3 p=4 p=5 p=6 p=7 p=8 p=9 \
p=1:2 p=1:3 p=1:4 p=1:5 p=1:6 p=1:7 p=1:8 p=1:9

It gets a bit more complicated if you need to deal with numbers of two
or more digits too, because then you can allow p=1: but not p=1[0-9]:,
so you'd need to include these patterns in the list too:

p=10 p=11 p=12 p=13 p=14 p=15 p=16 p=17 p=18 p=19 p=1:1

So, while it's not pretty, it is feasible provided the range of
possibilities doesn't get overly complex.  This will be easier in 3.2,
which will allow regular expressions.

I think my suggestion for hacking URL::normalizePath() involved much more
complicated patterns, and search-and-replace style substitutions based
on those patterns.  That may still be the way to go if you want to do
normalisations of patterns rather than simple exclusions, e.g. if you're
not guaranteed to hit a link to each page using a non-excluded pattern.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




RE: [htdig] htdig

2001-01-11 Thread Richard Bethany

Gilles,


That was my fear as well. For the one link below with eight menu items, I need to accept p=1: through p=8: to pick up any/all links in the submenus, but I would have to reject the other 40,312 possible combinations of values that "p" can have. As you stated, that would be a mite cumbersome and, if we had pages with more menu items (we do), it would become exponentially more impossible (-- can something be "more" impossible? How about more improbable?) to limit the accepted values.

Does the 3.2 beta release seem pretty stable? Does the regex functionality work properly? If so, perhaps I'll give that a shot. If not, I suppose I'll just dig around in the code to see if I can find a way to get it to do what we need.

Thanks for your input, Gilles!! Thanks to you too, Geoff!!
Richard Bethany
S1 Corporation


-Original Message-
From: Gilles Detillieux [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 11, 2001 12:13 PM
To: [EMAIL PROTECTED]
Cc: Richard Bethany; [EMAIL PROTECTED]
Subject: Re: [htdig] htdig



According to Geoff Hutchison:
 No regular expressions needed. You can limit URLs based on query patterns
 already. See the bad_querystr attribute:
 http://www.htdig.org/attrs.html#bad_querystr
...
 On Thu, 11 Jan 2001, Richard Bethany wrote:
  I'm the SysAdmin for our web servers and I'm working with Chuck (who does
  the development work) on this problem. Here's the nuts & bolts of the
  problem. Our entire web server is set up with a menuing system being run
  through PHP3. This menuing system basically allows local documents/links to
  be reached via a URL off of the PHP3 file. In other words, if I try to
  access a particular page it will be accessed as
  http://ourweb.com/DEPT/index.php3?i=1&e=3&p=2:3:4:.
  
  In this scenario the only relevant piece of info is the i value; the
  remainder of the info simply describes which portions of the menu should be
  displayed. What ends up happening is that, for a page with eight(8) main
  menu items, 40,320 (8*7*6*5*4*3*2*1) different hits show up in htDig for
  each link!! I essentially need to exclude any URL where p has more than
  one value (i.e. - p=1: is okay, p=1:2: is not).
  
  I've looked through the mailing list archives and found a great deal of
  discussion on the topic of regular expressions with exclusions and also some
  talk of stripping parts of the URL, but I've seen nothing to indicate that
  any of this has actually been implemented. Do you know if there is any
  implementation of this? If not, I saw a reply to a different problem from
  Gilles indicating that the URL::normalizePath() function would be the best
  place to start hacking so I guess I'll try that.


I guess the problem, though, is that without regular expressions it
could mean a large list of possible values that need to be specified
explicitly. The same problem exists for exclude_urls as for bad_querystr,
as they're handled essentially the same way, the only difference being
that bad_querystr is limited to patterns occurring on or after the last
? in the URL.


So, if p=1: is valid, but p=[2-9].* and p=1:[2-9].* are not, then
the explicit list in bad_querystr would need to be:


bad_querystr: p=2 p=3 p=4 p=5 p=6 p=7 p=8 p=9 \
  p=1:2 p=1:3 p=1:4 p=1:5 p=1:6 p=1:7 p=1:8 p=1:9


It gets a bit more complicated if you need to deal with numbers of two
or more digits too, because then you can allow p=1: but not p=1[0-9]:,
so you'd need to include these patterns in the list too:


 p=10 p=11 p=12 p=13 p=14 p=15 p=16 p=17 p=18 p=19 p=1:1


So, while it's not pretty, it is feasible provided the range of
possibilities doesn't get overly complex. This will be easier in 3.2,
which will allow regular expressions.


I think my suggestion for hacking URL::normalizePath() involved much more
complicated patterns, and search-and-replace style substitutions based
on those patterns. That may still be the way to go if you want to do
normalisations of patterns rather than simple exclusions, e.g. if you're
not guaranteed to hit a link to each page using a non-excluded pattern.


-- 
Gilles R. Detillieux E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930





[htdig] (Off Topic) - use of #!/usr/bin/perl in Windows environment.

2001-01-11 Thread Sphboc
I have a number of cgi scripts running in a Linux environment. First line of all such, for apache/unix, is:
 #!/usr/bin/perl 
Under Apache/Linux, this works as expected.

Just recently installed Apache (1.13) on a Windows 98 machine. 
In this environment, perl.exe (5.005_03) resides in /perl/bin. 

When I specify #!perl instead of #!/usr/bin/perl, the module receives control and operates as expected. (Which makes sense, since my Win98 path statement includes /perl/bin.)

That is minimally workable, but requires that the first line change back and forth between the above two values every time I move the module between Unix and Windows. (Other than that, the combination of bit-identical source modules and Perl 5.005_03 produces entirely consistent results on both platforms.)


=== dusr.html ===
<html><head><title>Dummy Cgi Driver</title></head><body>
<form action = "q_time.cgi">
<input type="submit" value = "runnit"></form></body></html>
=== q_time.cgi ===
#!perl
use strict; use Time::Local; my $t = time();
print "Content-type: text/html\n\n";
print "<html><head><title>hello world (from q_time.cgi) </title></head>";
print "<body><h2> body time = $t</h2></body></html>";
my $q = localtime($t); print "<br>$q<br>";




[htdig] Problem with PDF files....

2001-01-11 Thread Elijah Kagan

Dear Everyone

Hope this is the correct list to send such questions. If not, accept my
apologies.

When I run htdig on my files I get the following message when it comes to
a PDF document:

41:41:3:http://myserver/~elijah/document.pdf: PDF::parse: cannot find pdf
parser /usr/local/bin/acroread  size = 1965732 

For some reason htdig looks for an Acrobat while its config file clearly
states:

external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
  application/postscript->text/html /usr/local/bin/conv_doc.pl \
  application/pdf->text/html /usr/local/bin/conv_doc.pl

The conv_doc.pl exists and is working, and the content type received from the
server is application/pdf.

Any ideas?


Thanks,

Elijah Kagan


P.S.  I am running htdig 3.1.5 on a Debian system.






To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




Re: [htdig] Problem with PDF files....

2001-01-11 Thread Gilles Detillieux

According to Elijah Kagan:
 
 Dear Everyone
 
 Hope this is the correct list to send such questions. If not, accept my
 apologies.
 
 When I run htdig on my files I get the following message when it comes to
 a PDF document:
 
 41:41:3:http://myserver/~elijah/document.pdf: PDF::parse: cannot find pdf
 parser /usr/local/bin/acroread  size = 1965732 
 
 For some reason htdig looks for an Acrobat while its config file clearly
 states:
 
 external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
   application/postscript->text/html /usr/local/bin/conv_doc.pl \
   application/pdf->text/html /usr/local/bin/conv_doc.pl
 
 The conv_doc.pl exists and is working, and the content type received from the
 server is application/pdf.
 
 Any ideas?
...
 P.S.  I am running htdig 3.1.5 on a Debian system.

There are a few possibilities:

1) htdig isn't looking at this config file, but another one, without
the external_parsers definition;
2) there's a typo in the external_parsers definition that isn't showing up 
in the text you e-mailed above, e.g. a misspelled word or a space after
one of the backslashes at the end of the first two lines; or
3) there's a definition right above your external_parsers definition that
mistakenly ends with a backslash at the end of the line, causing your
external_parsers definition to be swallowed up by the previous line.

That htdig is attempting to invoke acroread confirms two things:  a)
the PDF file is correctly being tagged by the server as application/pdf,
and b) htdig is not seeing a usable definition of an external parser
for that content-type, for any of the reasons outlined above.
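
One more sanity check, if none of those turn anything up, is to run the
converter by hand on a saved copy of the PDF and make sure it produces
text.  The sketch below assumes the usual external-parser calling
convention (input file, content-type, URL, config file); double-check the
argument order against the external_parsers description in attrs.html for
your version, and substitute your own paths:

  #!/usr/bin/perl -w
  # Run conv_doc.pl by hand on a saved PDF and show the start of its
  # output.  The argument order (file, content-type, URL, config file)
  # is an assumption -- check attrs.html for your htdig version.
  use strict;

  my $conv = '/usr/local/bin/conv_doc.pl';
  my $file = '/tmp/document.pdf';                 # saved copy of the PDF
  my $type = 'application/pdf';
  my $url  = 'http://myserver/~elijah/document.pdf';
  my $conf = '/etc/htdig/htdig.conf';             # your config file

  open(CONV, "$conv $file $type $url $conf |")
      or die "cannot run $conv: $!\n";
  while (<CONV>) {
      print;
      last if $. >= 20;          # the first 20 lines are plenty
  }
  close(CONV);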

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




Re: [htdig] htdig

2001-01-11 Thread Gilles Detillieux

According to Richard Bethany:
 That was my fear as well.  For the one link below with eight menu items, I
 need to accept p=1: through p=8: to pick up any/all links in the submenus,
 but I would have to reject the other 40,312 possible combinations of values
 that "p" can have.  As you stated, that would be a mite cumbersome and, if
 we had pages with more menu items (we do), it would become exponentially
 more impossible (-- can something be "more" impossible?  How about more
 improbable?) to limit the accepted values.
 
 Does the 3.2 beta release seem pretty stable?  Does the regex functionality
 work properly?  If so, perhaps I'll give that a shot.  If not, I suppose
 I'll just dig around in the code to see if I can find a way to get it to do
 what we need.

The current 3.2 beta release (b2) isn't stable.  The latest development
snapshot for 3.2.0b3 is much more so, but IMHO still not quite ready
for prime-time.  Ironically, one of the remaining problems is that long,
complex regular expressions seem to be silently failing right now,
so we still need to get to the bottom of that.

However, even if you need to reject 40,312 possible combinations of
values, it doesn't mean you'd need to explicitly list each of those,
as many of them could be covered by the same substring.  The current
handling of exclude_urls and bad_querystr does substring matching, so
there's an implied .* on either side of each string you give for these
two attributes.  Because any of 1 though 8 can be used as the intial p=
value, it makes the problem more complicated than I assumed, but not
by a huge amount.  If I understand correctly, as long as there's only
one menu value specified, it's OK, but if there are two or more, it's
not OK, and only 1 through 8 will appear as possible menu values.  Now,
a string of more than two menu values will be matched by a substring of
only two values, so all you need are all possible series of two values,
or 8 x 8 = 64 patterns, p=1:1 through to p=8:8.  Correct?
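
If typing out those 64 patterns by hand is too tedious, a throwaway Perl
snippet along these lines (purely illustrative) will print a bad_querystr
line you can paste into your config file:

  #!/usr/bin/perl -w
  # Throwaway helper: print a bad_querystr line covering every pair of
  # menu values, p=1:1 through p=8:8, as described above.
  use strict;

  my @patterns;
  for my $first (1 .. 8) {
      for my $second (1 .. 8) {
          push @patterns, "p=$first:$second";
      }
  }
  print "bad_querystr: ", join(" \\\n\t", @patterns), "\n";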

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




Re: [htdig] 3.1.3 engine on 3.1.5 db

2001-01-11 Thread Dave Salisbury

Thanks for the reply..

 If
 you created your database with htdig 3.1.5, and want to search it with
 htsearch 3.1.3, that's a bad idea.  The most glaring bug in releases
 before 3.1.5 is in htsearch, so you really should upgrade it.

I take it one of the worst things is the security hole which allows
a user to view any file with read permissions (ouch!).

Is there any way to correct for this with a wrapper around htsearch?
Reading the indices using 3.1.3 that were created by a 3.1.5 engine
seems to work just fine.

Anyone out there want to bash Glimpse before I look into it?
I'm hoping to get it at least to compile on an SGI.

Thanks for any info.

Dave

 On the other hand, if you have an existing database built with version
 3.1.3, and want to use it with the latest htsearch, that should work
 without any difficulty.  However, you'll lose out on several benefits
 in the latest htdig (better parsing of meta tags, parsing img alt text,
 fixed parsing of URL parameters, etc.), 

Couldn't find what "fixed parsing of URL parameters" means.
The query string is part of what's indexed??

 which you'll only get if you
 reindex with htdig 3.1.5.  Maybe none of these matter for your site,
 though.  See the release notes and ChangeLog for details.

I don't think they're essential.

DS




To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




RE: [htdig] htdig

2001-01-11 Thread Richard Bethany

Gilles, your suggestion below worked to perfection. I didn't think about the fact that I only needed a snippet of the whole string to eliminate it. I ended up using 441 (21 x 21) bad_querystr entries. This will allow the use of up to 21 menu headings on a page. The whole `rundig` process finished in about five minutes!

Thanks!!!
Richard Bethany
S1 Corporation


-Original Message-
From: Gilles Detillieux [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 11, 2001 4:01 PM
To: Richard Bethany
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED];
[EMAIL PROTECTED]; Chuck Umeh
Subject: Re: [htdig] htdig



According to Richard Bethany:
 That was my fear as well. For the one link below with eight menu items, I
 need to accept p=1: through p=8: to pick up any/all links in the submenus,
 but I would have to reject the other 40,312 possible combinations of values
 that p can have. As you stated, that would be a mite cumbersome and, if
 we had pages with more menu items (we do), it would become exponentially
 more impossible (-- can something be more impossible? How about more
 improbable?) to limit the accepted values.
 
 Does the 3.2 beta release seem pretty stable? Does the regex functionality
 work properly? If so, perhaps I'll give that a shot. If not, I suppose
 I'll just dig around in the code to see if I can find a way to get it to do
 what we need.


The current 3.2 beta release (b2) isn't stable. The latest development
snapshot for 3.2.0b3 is much more so, but IMHO still not quite ready
for prime-time. Ironically, one of the remaining problems is that long,
complex regular expressions seem to be silently failing right now,
so we still need to get to the bottom of that.


However, even if you need to reject 40,312 possible combinations of
values, it doesn't mean you'd need to explicitly list each of those,
as many of them could be covered by the same substring. The current
handling of exclude_urls and bad_querystr does substring matching, so
there's an implied .* on either side of each string you give for these
two attributes. Because any of 1 through 8 can be used as the initial p=
value, it makes the problem more complicated than I assumed, but not
by a huge amount. If I understand correctly, as long as there's only
one menu value specified, it's OK, but if there are two or more, it's
not OK, and only 1 through 8 will appear as possible menu values. Now,
a string of more than two menu values will be matched by a substring of
only two values, so all you need are all possible series of two values,
or 8 x 8 = 64 patterns, p=1:1 through to p=8:8. Correct?


-- 
Gilles R. Detillieux E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930





[htdig] dig taking forever

2001-01-11 Thread Archive User

Hiya,

I am having some major performance problems with htdig
and I am looking for a bit of guidance. 

I have an email archive within my company which has 
about 300 mailing lists being archived to it via 
mhonarc. We have approximately 110k emails in the 
archive, each being its own html file. Probably 
99% of these emails are just standard size 4-10k
emails. This archive has been running for 4 months
now. htdig was working great for the first month
or so.. 1-2 hours tops.. After 4 months it's now 
up to 52 hours to do either an updatedig or a 
rundig. The files it is generating are only ~300 meg.
Searches work fine (when the dig finally finishes). 
Something seems quite wrong in my mind:) 
The archive and htdig both run on the same 
system which is a dual proc sun ultra 2 using solaris 2.7 
with 1.5 gig of ram. I have apache running on the same 
host with 20+ (up to 256 max) servers started by 
default. When watching the apache logs while it 
digs.. it seems to be going awfully slow.. one 
or two queries every 30 seconds. I am running 
version 3.1.5 of htdig. The system is not taxed 
at all while the dig is going on. 

Should 110k web pages really take this long? I am 
thinking a few hours tops is more like it should 
be. 

Anyone have any ideas? I am really stumped and 
need to get this dig well below 24 hours. 

Thanks.. Mike



To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




Re: [htdig] dig taking forever

2001-01-11 Thread Geoff Hutchison

At 11:32 PM -0500 1/11/01, Archive User wrote:
I have an email archive within my company which has
about 300 mailing lists being archived to it via
mhonarc. We have approximately 110k emails in the

Do you mean 300 lists * 110,000 messages, or do you mean 110,000 
messages total?

Should 110k web pages really take this long? I am
thinking a few hours tops is more like it should
be.

Yes, that's about what I'd expect myself. I'm assuming you're 
updating alternate files when you talk about an "update dig." Have 
you tried blowing away the alternates and reindexing them from 
scratch? Basically what I'm saying is to keep your existing databases 
in place and generate a new set from scratch. Are they faster? (in 
which case there's something wrong with the databases perhaps) Or is 
it the same (in which case there's a general slowdown you're seeing 
now)

Also, have you set the server_wait_time attribute or anything of that sort?

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




Re: [htdig] dig taking forever

2001-01-11 Thread Archive User

 Do you mean 300 lists * 110,000 messages, or do you mean 110,000 
 messages total?

110k total

 Yes, that's about what I'd expect myself. I'm assuming you're 
 updating alternate files when you talk about an "update dig." Have 
 you tried blowing away the alternates and reindexing them from 
 scratch? Basically what I'm saying is to keep your existing databases 
 in place and generate a new set from scratch. Are they faster? (in 
 which case there's something wrong with the databases perhaps) Or is 
 it the same (in which case there's a general slowdown you're seeing 
 now)

whether I do a rundig from scratch (destroy all the files)
or do an updatedig on an already established version, it 
takes equally long. 

 Also, have you set the server_wait_time attribute or anything of that
 sort?

I don't think so.. I am basically running a default apache setup 
with php4 compiled in. 

Mike



To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html




Re: [htdig] dig taking forever

2001-01-11 Thread Archive User

Guys,

Well, I found my problem.. one of the email lists at my company 
had a bunch of binary data going to it that kept getting 
encoded into the html page for some reason. I will have to
figure out why they are doing this, but it looks like
htdig was choking on it. I just reran everything from scratch..
took about 1 1/2 hours to do everything :)  A lot better than 
52 hours! heheh.. Hopefully an update dig should be well 
under an hour.. I will find out tomorrow night.

Thanks for your assistance.. Mike


On Fri, 12 Jan 2001, Archive User wrote:

  Do you mean 300 lists * 110,000 messages, or do you mean 110,000 
  messages total?
 
 110k total
 
  Yes, that's about what I'd expect myself. I'm assuming you're 
  updating alternate files when you talk about an "update dig." Have 
  you tried blowing away the alternates and reindexing them from 
  scratch? Basically what I'm saying is to keep your existing databases 
  in place and generate a new set from scratch. Are they faster? (in 
  which case there's something wrong with the databases perhaps) Or is 
  it the same (in which case there's a general slowdown you're seeing 
  now)
 
 whether I do a rundig from scratch (destroy all the files)
 or do an updatedig on an already established version, it 
 takes equally long. 
 
 Also, have you set the server_wait_time attribute or anything of that
 sort?
 
 I don't think so.. I am basically running a default apache setup 
 with php4 compiled in. 
 
 Mike
 
 
 
 



To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html