[htdig] using perl/cron to find badwords on site
Hi all,

I don't know if anyone else has run across this yet, but I have a number of guestbooks and things like that where people can post, and I would love to find a way to set up a daily cron job with a Perl script that basically runs a set of badwords through htsearch and then emails me a list of just the URLs it finds with those words in them... I don't really need things like the page title or description or stuff like that. I'm assuming I'll need to use a system call in the script to some sort of command-line invocation and loop it for each word... Any input would be greatly appreciated.

Jerry
[htdig] External Converter Prob
Hi all,

All my descriptions are starting with "content-type: text/html". Is this normal behavior, or is it because I'm using an external converter to do some modifications on the spidered HTML files? I registered my converter for text/html -> text/myhtml conversion, and I've patched the HTML parser to recognize text/myhtml in addition to text/html. I'm sure my external converter doesn't write "content-type: text/html" to the output stream. Any ideas?

Tnx, Stefan
Re: [htdig] keep temp files while running indexer? How to...
According to Stephen Murray:
> Hi Gilles. When you wrote "Only the one on the contrib section of the FTP site and web site is current," you were referring to rundig.sh at http://www.htdig.org/contrib/ -- right? That's the one I should use? (As Geoff suggested?)

I answered this yesterday evening, after the first time you asked, but here goes again... Yes, it's the Scripts sub-section of that part of the web site, which actually takes you to the http://www.htdig.org/files/contrib/scripts/ directory.

To clarify further, the one you should NOT use is the one in the contrib directory of the htdig-3.1.5.tar.gz source distribution, or any other source distribution, as these are the ones that are outdated.

Using the URL you mentioned in your e-mail above will get you to the correct script in two clicks: first the link "Scripts" in the left frame, then the link "rundig.sh" in the right frame. Using the URL I mentioned in my reply above will get you there in one click. Either way, it's the same directory and the same file, presented either with or without the frame structure on the web page.

--
Gilles R. Detillieux                E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre         WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba    Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)       Fax: (204)789-3930
RE: [htdig] htdig
At 9:00 AM -0500 1/11/01, Chuck Umeh wrote:
> One of the dbs is about 1.5GB

OK, so does this seem reasonable? At the least, you're not likely running up against any OS restrictions on file size (normally the first one is 2GB for some OSes).

Have you tried running htdig -i -v like I suggested? I suspect you have an infinite loop, or close to that.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
Re: [htdig] using perl/cron to find badwords on site
According to Jerry Preeper:
> I don't know if anyone else has run across this yet, but I have a number of guestbooks and things like that where people can post, and I would love to find a way to set up a daily cron job with a Perl script that basically runs a set of badwords through htsearch and then emails me a list of just the URLs it finds with those words in them... I'm assuming I'll need to use a system call in the script to some sort of command-line invocation and loop it for each word...

I assume that you want your htdig database updated through this same cron job, before running htsearch, so that the database you search will contain any new postings to the guestbooks. The simplest way I can think of, assuming the correct settings are already made in htdig.conf, would be a shell script with these commands...

    htdig
    htmerge
    /path/to/cgi-bin/htsearch "words=badword1+badword2+badword3+badword4"

Of course, if you want to write it in Perl, especially if you need more processing than simply running these programs, you can call the above commands in one or more calls to the system("...") function in Perl. You may want to customise the htsearch templates to get just the URL, if that's all you want (see template_map, search_results_header and search_results_footer in http://www.htdig.org/attrs.html).

If you want to search for each word separately, rather than one query for all words, then you'd need to call htsearch once for each individual word. E.g. in a shell script, you could do:

    htdig; htmerge
    for word in badword1 badword2 badword3 badword4
    do
        echo "${word}:"
        /path/to/cgi-bin/htsearch "words=${word}"
    done

or:

    htdig; htmerge
    while read word
    do
        echo "${word}:"
        /path/to/cgi-bin/htsearch "words=${word}"
    done < /path/to/bad-word-file

However, it seems to me it would be better to search for all at once, unless you need a word-by-word summary of URLs.

--
Gilles R. Detillieux                E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre         WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba    Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)       Fax: (204)789-3930
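Putting those pieces together in Perl, a minimal cron-able sketch might look like the following. The htsearch and sendmail paths, the recipient address, and the URL-grep heuristic are all assumptions to adjust for your installation; with templates customised to print one bare URL per match, the grep becomes unnecessary.

    #!/usr/bin/perl -w
    # Minimal sketch of the badword cron job described above.
    # Assumptions: htdig/htmerge are on PATH, htsearch lives at the path
    # below, the result templates print matching URLs, and sendmail
    # accepts -t to read recipients from the headers.
    use strict;

    my $htsearch = '/path/to/cgi-bin/htsearch';    # assumed path
    my $sendmail = '/usr/lib/sendmail -t';         # assumed path
    my @badwords = qw(badword1 badword2 badword3 badword4);

    # Re-index first so new guestbook postings are searchable.
    system('htdig')   == 0 or die "htdig failed\n";
    system('htmerge') == 0 or die "htmerge failed\n";

    my $report = '';
    foreach my $word (@badwords) {
        my @out  = `$htsearch "words=$word"`;
        # Keep only lines containing a URL; with customised templates
        # that emit one bare URL per match, this grep is unneeded.
        my @urls = grep { m{http://\S+} } @out;
        $report .= "$word:\n" . join('', @urls) . "\n" if @urls;
    }

    exit 0 if $report eq '';
    open(MAIL, "| $sendmail") or die "can't run sendmail: $!\n";
    print MAIL "To: webmaster\@example.com\n";     # assumed recipient
    print MAIL "Subject: badwords found on site\n\n";
    print MAIL $report;
    close(MAIL);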
Re: [htdig] External converters - two questions
Thanks Gilles, that is useful information. I had thought that perhaps pdftotext actually *added* hyphenation to a document. If the problem is removing hyphenation that is actually written into the document, then I can see that not everybody will wish to do this. It is easily switched off in doc2html.pl, but only if you know where to look. The next version will definitely be better in this respect. As for magic numbers, I'll wait and see if anybody else can offer some additional observations.

--
David Adams
Computing Services
Southampton University

----- Original Message -----
From: "Gilles Detillieux" [EMAIL PROTECTED]
To: "David Adams" [EMAIL PROTECTED]
Cc: [EMAIL PROTECTED]
Sent: Thursday, January 11, 2001 1:12 AM
Subject: Re: [htdig] External converters - two questions

> According to David Adams:
> > I hope to find time for a further revision of the external converter script doc2html.pl and possibly simplify it a little. The existing code includes de-hyphenation (which is buggy) taken originally from parse_doc.pl. The question is: is this necessary? Does pdftotext (or any other utility) actually break up words across lines with the addition of hyphens? Is the hyphenation code of any use? Information and opinions are requested.
>
> I added this code for dealing with a lot of the PDFs I needed to index on my site, and for the Manitoba Unix User Group web site as well (for their newsletters). Unlike HTML documents, I've found a lot of PDF files make pretty heavy use of hyphenation. Without the dehyphenation code, hyphenated words appeared as two separate words in the resulting text. E.g. "conv-" at the end of one line followed by "erter" on the next was taken as "conv" and "erter", so a search for "converter" may not turn up this document if the word didn't appear unbroken elsewhere in the document.
>
> Sorry about the EOF bug in this code. It was a quick hack, and I don't know Perl all that well. There was a patch to fix this, though. Are there any other bugs? In any case, in parse_doc.pl and conv_doc.pl, I wrote it to be optional, enabled by this line:
>
>     $dehyphenate = 1;        # PDFs often have hyphenated lines
>
> which only applied to PDFs. The ps2ascii utility already does its own dehyphenation, but pdftotext doesn't. Other document types are less likely to need this. If dehyphenation of PDFs is not desired, it's easy enough to change the 1 to a 0 above when configuring the script. I don't recall if your doc2html.pl has the same sort of option.
>
> > Also inherited from parse_doc.pl is extra code to cope with files which may be an "HP Print job" or contain a "MacBinary header". Are such files really encountered? If so, what type of files are they: Word, PDF or what? Does the magic number code need to take account of them?
>
> Another hack of mine. The MUUG web site had some pretty odd-ball PostScript files on it that were causing error messages while indexing their site. Instead of simple and pure PS in these files, some had a MacBinary wrapper or HP PJL codes in them, which ps2ascii happily would skip over, but the Perl code wasn't accepting these files. These hacks were to allow these files through. Dunno if anyone else has found they help or hurt them, but I'm keeping them in my own copies of the scripts. I know they're kind of ugly, so if you want to get rid of them in your code for the sake of simplicity, I'd certainly understand.
>
> --
> Gilles R. Detillieux                E-mail: [EMAIL PROTECTED]
> Spinal Cord Research Centre         WWW: http://www.scrc.umanitoba.ca/~grdetil
> Dept. Physiology, U. of Manitoba    Phone: (204)789-3766
> Winnipeg, MB R3E 3J7 (Canada)       Fax: (204)789-3930
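For readers following along, the $dehyphenate switch mentioned above guards logic along these lines. This is a minimal sketch of the technique, not the scripts' actual code: it rejoins a word that pdftotext split across lines with a trailing hyphen, so "conv-" plus "erter" indexes as "converter".

    #!/usr/bin/perl -w
    # De-hyphenation sketch: buffer each line and, when the buffered
    # line ends in a hyphenated word fragment, glue on the first word
    # of the next line before printing.
    use strict;

    my $dehyphenate = 1;    # PDFs often have hyphenated lines

    my $pending = '';       # previous line, possibly ending in a hyphen
    while (my $line = <>) {
        chomp $line;
        if ($dehyphenate && $pending =~ /\w-$/ && $line =~ s/^(\w+)//) {
            # Strip the hyphen and append the continuation fragment.
            $pending =~ s/-$//;
            $pending .= $1;
        }
        print "$pending\n" if $pending ne '';
        $pending = $line;
    }
    print "$pending\n" if $pending ne '';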
Re: [htdig] External Converter Prob
According to Reich, Stefan:
> All my descriptions are starting with "content-type: text/html". Is this normal behavior, or is it because I'm using an external converter to do some modifications on the spidered HTML files? I registered my converter for text/html -> text/myhtml conversion. I've patched the HTML parser to recognize text/myhtml in addition to text/html. I'm sure my external converter doesn't write text/html to the output stream. Any ideas?

No, this is not normal behaviour. If you're certain that your external converter doesn't write this out, then we'd have to assume it comes from elsewhere. It may be a stupid question, but are you sure the pages you're indexing don't contain this extra header? I've seen defective CGI scripts, for example, that inadvertently output two such headers in some situations. Ditto for SSI pages that call CGI scripts incorrectly.

Finally, it's hard to be sure it isn't a problem with your patches to htdig, or with your particular configuration, without being able to see them. I don't know if this helps or not, but it may give you a few more places to look.

--
Gilles R. Detillieux                E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre         WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba    Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)       Fax: (204)789-3930
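One way to check for that stray header at the source is to look at the raw HTTP response the server sends. Here is a quick sketch; the host and path are placeholders, not from the thread. A second "Content-type:" line in the header block (or one at the very top of the body) means the page or CGI script is at fault, not the external converter.

    #!/usr/bin/perl -w
    # Fetch a page's raw HTTP response headers so a duplicated
    # Content-type line is visible. Host/path are placeholders.
    use strict;
    use IO::Socket::INET;

    my ($host, $path) = ('www.example.com', '/guestbook.html');
    my $sock = IO::Socket::INET->new(PeerAddr => $host, PeerPort => 80)
        or die "connect: $!\n";
    print $sock "GET $path HTTP/1.0\r\nHost: $host\r\n\r\n";
    # Print the header block only (up to the first blank line).
    while (my $line = <$sock>) {
        last if $line =~ /^\r?$/;
        print $line;
    }
    close($sock);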
[htdig] Unable to contact server - revisited
Hi,

I'm running htdig v3.1.5 and my digging seems to be running out of steam after it runs for anywhere from 20 minutes to an hour or so. The initial message was "Unable to connect to server". So, I ran it again with -vvv to get the error message below:

    pick: ponderingjudd.xxx.com, # servers = 550
    3213:3622:2:http://ponderingjudd.xxx.com/ponderingjudd/id6.html: Unable to build connection with ponderingjudd.xxx.com:80
    no server running

I've replaced part of the URL with xxx to protect the innocent. The server certainly is running, and I had no trouble finding the mentioned URL myself. Is there some parameter I need to set or limit I need to raise? We're running an Apache server with startservers=25 and minspace=10.

Thanks for your help,
Roger

Roger Weiss
[EMAIL PROTECTED]
(978) 318-7301
http://www.trellix.com
RE: [htdig] htdig
No regular expressions needed. You can limit URLs based on query patterns already. See the bad_querystr attribute: http://www.htdig.org/attrs.html#bad_querystr

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/

On Thu, 11 Jan 2001, Richard Bethany wrote:
> Geoff, I'm the SysAdmin for our web servers and I'm working with Chuck (who does the development work) on this problem. Here's the "nuts & bolts" of the problem. Our entire web server is set up with a menuing system being run through PHP3. This menuing system basically allows local documents/links to be reached via a URL off of the PHP3 file. In other words, if I try to access a particular page, it will be accessed as http://ourweb.com/DEPT/index.php3?i=1&e=3&p=2:3:4:. In this scenario the only relevant piece of info is the "i" value; the remainder of the info simply describes which portions of the menu should be displayed. What ends up happening is that, for a page with eight (8) main menu items, 40,320 (8*7*6*5*4*3*2*1) different "hits" show up in htdig for each link!! I essentially need to exclude any URL where "p" has more than one value (i.e. p=1: is okay, p=1:2: is not). I've looked through the mailing list archives and found a great deal of discussion on the topic of regular expressions with exclusions, and also some talk of stripping parts of the URL, but I've seen nothing to indicate that any of this has actually been implemented. Do you know if there is any implementation of this? If not, I saw a reply to a different problem from Gilles indicating that the URL::normalizePath() function would be the best place to start hacking, so I guess I'll try that. Thanks for your time!!!
> Richard
Re: [htdig] Unable to contact server - revisited
According to Roger Weiss:
> I'm running htdig v3.1.5 and my digging seems to be running out of steam after it runs for anywhere from 20 minutes to an hour or so. The initial message was "Unable to connect to server". So, I ran it again with -vvv to get the error message below:
>
>     pick: ponderingjudd.xxx.com, # servers = 550
>     3213:3622:2:http://ponderingjudd.xxx.com/ponderingjudd/id6.html: Unable to build connection with ponderingjudd.xxx.com:80
>     no server running
>
> I've replaced part of the URL with xxx to protect the innocent. The server certainly is running, and I had no trouble finding the mentioned URL myself. Is there some parameter I need to set or limit I need to raise? We're running an Apache server with startservers=25 and minspace=10.

I guess the next question, if you're sure the server is running, is: can you access it from a client? More specifically, can you access it using a different web client on the same system as the one on which you're running htdig (e.g. from lynx, Netscape, kfm, or some other Linux/Unix-based web browser)? If you can, then the problem will be to figure out why htdig can't build the connection while other programs on the same system can. If you can't access the server from any client program on the same system, then the problem isn't with htdig, but with your network setup (e.g. firewall, packet filtering, or a bad connection from that system).

--
Gilles R. Detillieux                E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre         WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba    Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)       Fax: (204)789-3930
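For example, either of these, run on the machine that runs htdig, would exercise the same connection (the URL is the one from Roger's message; plain telnet works if no text browser is installed):

    lynx -head -dump http://ponderingjudd.xxx.com/ponderingjudd/id6.html

    telnet ponderingjudd.xxx.com 80
    GET /ponderingjudd/id6.html HTTP/1.0
    (then press Enter on a blank line to send the request)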
Re: [htdig] htdig
According to Geoff Hutchison:
> No regular expressions needed. You can limit URLs based on query patterns already. See the bad_querystr attribute: http://www.htdig.org/attrs.html#bad_querystr
> ...
> On Thu, 11 Jan 2001, Richard Bethany wrote:
> > [...] if I try to access a particular page, it will be accessed as http://ourweb.com/DEPT/index.php3?i=1&e=3&p=2:3:4:. In this scenario the only relevant piece of info is the "i" value; the remainder of the info simply describes which portions of the menu should be displayed. What ends up happening is that, for a page with eight (8) main menu items, 40,320 (8*7*6*5*4*3*2*1) different "hits" show up in htdig for each link!! I essentially need to exclude any URL where "p" has more than one value (i.e. p=1: is okay, p=1:2: is not). [...]

I guess the problem, though, is that without regular expressions it could mean a large list of possible values that need to be specified explicitly. The same problem exists for exclude_urls as for bad_querystr, as they're handled essentially the same way, the only difference being that bad_querystr is limited to patterns occurring on or after the last "?" in the URL. So, if p=1: is valid, but p=[2-9].* and p=1:[2-9].* are not, then the explicit list in bad_querystr would need to be:

    bad_querystr: p=2 p=3 p=4 p=5 p=6 p=7 p=8 p=9 \
                  p=1:2 p=1:3 p=1:4 p=1:5 p=1:6 p=1:7 p=1:8 p=1:9

It gets a bit more complicated if you need to deal with numbers of two or more digits too, because then you can allow p=1: but not p=1[0-9]:, so you'd need to include these patterns in the list too:

    p=10 p=11 p=12 p=13 p=14 p=15 p=16 p=17 p=18 p=19 p=1:1

So, while it's not pretty, it is feasible provided the range of possibilities doesn't get overly complex. This will be easier in 3.2, which will allow regular expressions.

I think my suggestion for hacking URL::normalizePath() involved much more complicated patterns, and search-and-replace style substitutions based on those patterns. That may still be the way to go if you want to do normalisations of patterns rather than simple exclusions, e.g. if you're not guaranteed to hit a link to each page using a non-excluded pattern.

--
Gilles R. Detillieux                E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre         WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba    Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)       Fax: (204)789-3930
RE: [htdig] htdig
Gilles,

That was my fear as well. For the one link below with eight menu items, I need to accept p=1: through p=8: to pick up any/all links in the submenus, but I would have to reject the other 40,312 possible combinations of values that "p" can have. As you stated, that would be a mite cumbersome and, if we had pages with more menu items (we do), it would become exponentially more impossible (<-- can something be "more" impossible? How about more improbable?) to limit the accepted values.

Does the 3.2 beta release seem pretty stable? Does the regex functionality work properly? If so, perhaps I'll give that a shot. If not, I suppose I'll just dig around in the code to see if I can find a way to get it to do what we need.

Thanks for your input, Gilles!! Thanks to you too, Geoff!!

Richard Bethany
S1 Corporation

-----Original Message-----
From: Gilles Detillieux [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 11, 2001 12:13 PM
To: [EMAIL PROTECTED]
Cc: Richard Bethany; [EMAIL PROTECTED]
Subject: Re: [htdig] htdig
[htdig] (Off Topic) - use of #!/usr/bin/perl in Windows environment.
I have a number of CGI scripts running in a Linux environment. The first line of all such scripts, for Apache/Unix, is:

    #!/usr/bin/perl

Under Apache/Linux, this works as expected. I just recently installed Apache (1.13) on a Windows 98 machine. In this environment, perl.exe (5.005_03) resides in /perl/bin. When I specify #!perl, vice #!/usr/bin/perl, the module receives control and operates as expected. (Which makes sense, since my Win98 path statement includes /perl/bin.)

That is minimally workable, but requires that the first line change back and forth, between the above two values, every time I move the module between Unix and Windows. (Other than that, the combination of bit-identical source modules and Perl 5.005_03 produces entirely consistent results on both platforms.)

=== dusr.html ===

    <html><head><title>Dummy Cgi Driver</title></head><body>
    <form action = "q_time.cgi">
    <input type="submit" value = "runnit">
    </form></body></html>

=== q_time.cgi ===

    #!perl
    use strict;
    use Time::Local;
    my $t = time();
    print "Content-type: text/html\n\n";
    print "<html><head><title>hello world (from q_time.cgi)</title></head>";
    print "<body><h2>body time = $t</h2></body></html>";
    my $q = localtime($t);
    print "<br>$q<br>";
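One possible way out, offered here as a suggestion to verify against your Apache version's documentation rather than an answer from the thread: the Win32 builds of Apache 1.3 have a directive that makes Apache find the script interpreter through the Windows registry file association instead of the #! line, which would let the same #!/usr/bin/perl scripts run unchanged on both platforms.

    # httpd.conf on the Windows machine (assumption: your Apache 1.3
    # Win32 build supports this directive; check its docs first)
    ScriptInterpreterSource registry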
[htdig] Problem with PDF files....
Dear Everyone,

Hope this is the correct list to send such questions to. If not, accept my apologies. When I run htdig on my files, I get the following message when it comes to a PDF document:

    41:41:3:http://myserver/~elijah/document.pdf: PDF::parse: cannot find pdf parser /usr/local/bin/acroread
    size = 1965732

For some reason htdig looks for Acrobat (acroread) while its config file clearly states:

    external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
                      application/postscript->text/html /usr/local/bin/conv_doc.pl \
                      application/pdf->text/html /usr/local/bin/conv_doc.pl

conv_doc.pl exists and is working, and the content type received from the server is application/pdf. Any ideas?

Thanks,
Elijah Kagan

P.S. I am running htdig 3.1.5 on a Debian system.
Re: [htdig] Problem with PDF files....
According to Elijah Kagan:
> When I run htdig on my files, I get the following message when it comes to a PDF document:
>
>     41:41:3:http://myserver/~elijah/document.pdf: PDF::parse: cannot find pdf parser /usr/local/bin/acroread
>     size = 1965732
>
> For some reason htdig looks for Acrobat (acroread) while its config file clearly states:
>
>     external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
>                       application/postscript->text/html /usr/local/bin/conv_doc.pl \
>                       application/pdf->text/html /usr/local/bin/conv_doc.pl
>
> conv_doc.pl exists and is working, and the content type received from the server is application/pdf. Any ideas?
> ...
> P.S. I am running htdig 3.1.5 on a Debian system.

There are a few possibilities:

1) htdig isn't looking at this config file, but another one, without the external_parsers definition;

2) there's a typo in the external_parsers definition that isn't showing up in the text you e-mailed above, e.g. a misspelled word or a space after one of the backslashes at the end of the first two lines; or

3) there's a definition right above your external_parsers definition that mistakenly ends with a backslash at the end of the line, causing your external_parsers definition to be swallowed up by the previous line.

That htdig is attempting to invoke acroread confirms two things: a) the PDF file is correctly being tagged by the server as application/pdf, and b) htdig is not seeing a usable definition of an external parser for that content-type, for any of the reasons outlined above.

--
Gilles R. Detillieux                E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre         WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba    Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)       Fax: (204)789-3930
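Two quick checks for possibilities 1) and 2), sketched here with a placeholder config path; substitute whatever you actually pass to htdig:

    # 1) Run htdig against the config file explicitly, so there's no
    #    doubt which file it reads:
    htdig -v -c /usr/local/etc/htdig/htdig.conf

    # 2) Make trailing whitespace after the continuation backslashes
    #    visible (GNU cat's -A marks each line end with '$'):
    cat -A /usr/local/etc/htdig/htdig.conf | grep -A 3 external_parsers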
Re: [htdig] htdig
According to Richard Bethany:
> That was my fear as well. For the one link below with eight menu items, I need to accept p=1: through p=8: to pick up any/all links in the submenus, but I would have to reject the other 40,312 possible combinations of values that "p" can have. As you stated, that would be a mite cumbersome and, if we had pages with more menu items (we do), it would become exponentially more impossible (<-- can something be "more" impossible? How about more improbable?) to limit the accepted values. Does the 3.2 beta release seem pretty stable? Does the regex functionality work properly? If so, perhaps I'll give that a shot. If not, I suppose I'll just dig around in the code to see if I can find a way to get it to do what we need.

The current 3.2 beta release (b2) isn't stable. The latest development snapshot for 3.2.0b3 is much more so, but IMHO still not quite ready for prime time. Ironically, one of the remaining problems is that long, complex regular expressions seem to be silently failing right now, so we still need to get to the bottom of that.

However, even if you need to reject 40,312 possible combinations of values, it doesn't mean you'd need to explicitly list each of those, as many of them could be covered by the same substring. The current handling of exclude_urls and bad_querystr does substring matching, so there's an implied .* on either side of each string you give for these two attributes. Because any of 1 through 8 can be used as the initial p= value, it makes the problem more complicated than I assumed, but not by a huge amount. If I understand correctly, as long as there's only one menu value specified, it's OK, but if there are two or more, it's not OK, and only 1 through 8 will appear as possible menu values. Now, a string of more than two menu values will be matched by a substring of only two values, so all you need are all possible series of two values, or 8 x 8 = 64 patterns, p=1:1 through to p=8:8. Correct?

--
Gilles R. Detillieux                E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre         WWW: http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba    Phone: (204)789-3766
Winnipeg, MB R3E 3J7 (Canada)       Fax: (204)789-3930
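Generating that list mechanically avoids typos. A throwaway sketch; the 1..8 range matches this site's eight menu items, so widen it for pages with more (Richard's eventual 21 x 21 = 441 entries would use 1..21):

    #!/usr/bin/perl -w
    # Hypothetical helper: emit a bad_querystr line covering every pair
    # of menu values, per the 8 x 8 = 64 substring idea above.
    use strict;

    my @pats;
    foreach my $a (1 .. 8) {
        foreach my $b (1 .. 8) {
            push @pats, "p=$a:$b";
        }
    }
    # Continuation backslashes keep the attribute on readable lines.
    print "bad_querystr: ", join(" \\\n\t", @pats), "\n";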
Re: [htdig] 3.1.3 engine on 3.1.5 db
Thanks for the reply..

> If you created your database with htdig 3.1.5, and want to search it with htsearch 3.1.3, that's a bad idea. The most glaring bug in releases before 3.1.5 is in htsearch, so you really should upgrade it.

I take it one of the worst things is the security hole which allows a user to view any file with read permissions (ouch!). Is there any way to correct for this with a wrapper around htsearch? Reading the indices using 3.1.3 that were created by a 3.1.5 engine seems to work just fine. Anyone out there want to bash Glimpse before I look into it? I'm hoping to get it at least to compile on an SGI. Thanks for any info.

Dave

> On the other hand, if you have an existing database built with version 3.1.3, and want to use it with the latest htsearch, that should work without any difficulty. However, you'll lose out on several benefits in the latest htdig (better parsing of meta tags, parsing img alt text, fixed parsing of URL parameters, etc.),

Couldn't find what "fixed parsing of URL parameters" means. The query string is part of what's indexed??

> which you'll only get if you reindex with htdig 3.1.5. Maybe none of these matter for your site, though. See the release notes and ChangeLog for details.

I don't think they're essential.

DS
RE: [htdig] htdig
Gilles, your suggestion below worked to perfection. I didn't think about the fact that I only needed a snippet of the whole string to eliminate it. I ended up using 441 (21 x 21) bad_querystr entries. This will allow the use of up to 21 menu headings on a page. The whole `rundig` process finished in about five minutes! Thanks!!!

Richard Bethany
S1 Corporation

-----Original Message-----
From: Gilles Detillieux [mailto:[EMAIL PROTECTED]]
Sent: Thursday, January 11, 2001 4:01 PM
To: Richard Bethany
Cc: [EMAIL PROTECTED]; [EMAIL PROTECTED]; [EMAIL PROTECTED]; Chuck Umeh
Subject: Re: [htdig] htdig
[htdig] dig taking forever
Hiya,

I am having some major performance problems with htdig and I am looking for a bit of guidance. I have an email archive within my company which has about 300 mailing lists being archived to it via MHonArc. We have approximately 110k emails in the archive, each being its own HTML file. Probably 99% of these emails are just standard-size 4-10k emails. This archive has been running for 4 months now. htdig was working great for the first month or so.. 1-2 hours tops.. After 4 months it's now up to 52 hours to do either an update dig or a rundig. The files it is generating are only ~300 meg. Searches work fine (when the dig finally finishes). Something seems quite wrong in my mind :)

The archive and htdig both run on the same system, which is a dual-proc Sun Ultra 2 using Solaris 2.7 with 1.5 gig of RAM. I have Apache running on the same host with 20+ (up to 256 max) servers started by default. When watching the Apache logs while it digs, it seems to be going awfully slow.. one or two queries every 30 seconds. I am running version 3.1.5 of htdig. The system is not taxed at all while the dig is going on.

Should 110k web pages really take this long? I am thinking a few hours tops is more like it should be. Anyone have any ideas? I am really stumped and need to get this dig well below 24 hours.

Thanks..
Mike
Re: [htdig] dig taking forever
At 11:32 PM -0500 1/11/01, Archive User wrote:
> I have an email archive within my company which has about 300 mailing lists being archived to it via MHonArc. We have approximately 110k emails in the

Do you mean 300 lists * 110,000 messages, or do you mean 110,000 messages total?

> Should 110k web pages really take this long? I am thinking a few hours tops is more like it should be.

Yes, that's about what I'd expect myself. I'm assuming you're updating alternate files when you talk about an "update dig." Have you tried blowing away the alternates and reindexing them from scratch? Basically what I'm saying is to keep your existing databases in place and generate a new set from scratch. Are they faster? (In which case there's something wrong with the databases, perhaps.) Or is it the same? (In which case there's a general slowdown you're seeing now.)

Also, have you set the server_wait_time attribute or anything of that sort?

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/
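For reference, server_wait_time is an htdig.conf attribute that makes htdig pause between successive requests to the same server, so a setting like the one below (an illustrative value, not a recommendation from the thread) would by itself slow a large dig dramatically:

    # htdig.conf -- illustrative only: htdig would wait this many
    # seconds between requests to one server
    server_wait_time: 20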
Re: [htdig] dig taking forever
> Do you mean 300 lists * 110,000 messages, or do you mean 110,000 messages total?

110k total.

> Yes, that's about what I'd expect myself. I'm assuming you're updating alternate files when you talk about an "update dig." Have you tried blowing away the alternates and reindexing them from scratch? Basically what I'm saying is to keep your existing databases in place and generate a new set from scratch. Are they faster? (In which case there's something wrong with the databases, perhaps.) Or is it the same? (In which case there's a general slowdown you're seeing now.)

Whether I do a rundig from scratch (destroy all the files) or an update dig on an already established version, it takes equally long.

> Also, have you set the server_wait_time attribute or anything of that sort?

I don't think so.. I am basically running a default Apache setup with PHP4 compiled in.

Mike
Re: [htdig] dig taking forever
Guys,

Well, I found my problem.. one of the email lists at my company had a bunch of binary data going to it that kept getting encoded into the HTML pages for some reason. I will have to figure out why they are doing this, but it looks like htdig was choking on it. I just reran everything from scratch.. it took 1 1/2 hours to do everything :) A lot better than 52 hours! Hopefully an update dig should be well under an hour.. I will find out tomorrow night.

Thanks for your assistance..
Mike