[htdig] Fw: doc2html
When I wrote doc2html I copied this without change from conv_doc, and I think it is the same in the original parse_doc parser script. Is Leong correct? -- David Adams Computing Services Southampton University - Original Message - From: "Leong Peck Yoke" [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Sunday, January 21, 2001 1:18 PM Subject: doc2html Hi, I am looking at your code doc2html.pl for a project. I notice that in function try_text at line 366, the following code s/\255/-/g; # replace dashes with hyphens seems to be wrong. Shouldn't it be "s/\055/-/g" instead? Regards, Peck Yoke To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
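A quick way to settle the question is to check which characters those Perl octal escapes actually name. The character codes below are plain ASCII/Latin-1 facts; the snippet is illustrative Python, not the doc2html.pl code itself:

```python
# Perl's \255 in a regex is octal: 0o255 = 173 = 0xAD, the Latin-1
# soft hyphen / dash character, so s/\255/-/g really does "replace
# dashes with hyphens" as the comment says.
# Octal 055 (0x2D) is already the ASCII hyphen, so s/\055/-/g would
# replace "-" with "-", a no-op.
dash = chr(0o255)      # '\xad', the Latin-1 soft-hyphen "dash"
hyphen = chr(0o055)    # '-', the plain ASCII hyphen

print(hex(ord(dash)))      # 0xad
print(dash == hyphen)      # False: the substitution is not redundant
```

So the original code appears correct, and changing it to \055 would make the substitution do nothing.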
[htdig] Spelling Help
I am trying to do what I can to help those with spelling difficulties perform searches on our web pages. This was triggered by seeing in the htsearch log that attempts to find "accomodation" were finding some pages, but not the important ones (where it is spelt correctly)! Also this University has a commitment to supporting disabled students, including those with dyslexia. I would like to ask: 1) What have other sites done to address this problem? (Spell checking and correcting our own pages is not possible at present, and may never be.) 2) Can anybody recommend a _good_ (UK English) spell checker for IRIX 6.5? (The IRIX spell command does not know a lot of important words, such as "midwifery", and I can't figure out how to add words to the dictionary.) A spell checker that could suggest words (as do the spell checkers in word processors, etc.) would be wonderful. -- David Adams Computing Services Southampton University To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
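For anyone wanting word-processor-style suggestions without a dedicated spell checker, approximate matching against a word list is enough to catch cases like "accomodation". A minimal sketch using Python's standard difflib; the in-memory word list is a made-up stand-in for a real dictionary file:

```python
import difflib

# Hypothetical in-memory dictionary; in practice this would be read
# from a word list such as /usr/dict/words plus local additions
# ("midwifery" and friends).
dictionary = ["accommodation", "information", "midwifery", "university"]

def suggest(word, n=3):
    """Return up to n dictionary words closest to the (mis)spelling."""
    return difflib.get_close_matches(word, dictionary, n=n)

print(suggest("accomodation")[0])   # accommodation
```

The same idea could be applied on the search side, offering "did you mean" corrections before running htsearch.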
Re: [htdig] Unable to contact server - revisited
Is this server in your local network or remote? It might be worth trying to index it via a proxy cache. I found that this cured the problem for us, though it hasn't helped everybody. Take a look at the http_proxy configuration file attribute. -- David Adams Computing Services Southampton University - Original Message - From: "Roger Weiss" [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, January 11, 2001 4:18 PM Subject: [htdig] Unable to contact server - revisited Hi, I'm running htdig v3.1.5 and my digging seems to be running out of steam after it runs for anywhere from 20 minutes to an hour or so. The initial message was "Unable to connect to server". So, I ran it again with -vvv to get the error message below. pick: ponderingjudd.xxx.com, # servers = 550 3213:3622:2:http://ponderingjudd.xxx.com/ponderingjudd/id6.html: Unable to build connection with ponderingjudd.xxx.com:80 no server running I've replaced part of the URL with xxx to protect the innocent. The server certainly is running and I had no trouble finding the mentioned URL. Is there some parameter I need to set or limit I need to raise? We're running an Apache server with startservers=25 and minspace=10. Thanks for your help, Roger Roger Weiss [EMAIL PROTECTED] (978) 318-7301 http://www.trellix.com To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
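For reference, the http_proxy attribute goes in the htdig configuration file like any other attribute; the proxy host and port below are placeholders for your own cache:

```
# Send all of htdig's page requests through a local proxy cache
# (hostname and port are examples only):
http_proxy: http://proxy.example.ac.uk:3128
```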
Re: [htdig] External converters - two questions
Thanks Gilles, that is useful information. I had thought that perhaps pdftotext actually *added* hyphenation to a document. If the problem is removing the hyphenation that is actually written into the document then I can see that not everybody will wish to do this. It is easily switched off in doc2html.pl, but only if you know where to look. The next version will definitely be better in this respect. As for magic numbers, I'll wait and see if anybody else can offer some additional observations. -- David Adams Computing Services Southampton University - Original Message - From: "Gilles Detillieux" [EMAIL PROTECTED] To: "David Adams" [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Thursday, January 11, 2001 1:12 AM Subject: Re: [htdig] External converters - two questions According to David Adams: I hope to find time for a further revision of the external converter script doc2html.pl and possibly simplify it a little. The existing code includes de-hyphenation (which is buggy) taken originally from parsedoc.pl. The question is: is this necessary, does pdftotext (or any other utility) actually break up words across lines with the addition of hyphens? Is the hyphenation code of any use? Information and opinions are requested. I added this code for dealing with a lot of the PDFs I needed to index on my site, and for the Manitoba Unix User Group web site as well (for their newsletters). Unlike HTML documents, I've found a lot of PDF files make pretty heavy use of hyphenation. Without the dehyphenation code, hyphenated words appeared as two separate words in the resulting text. E.g. "conv- erter" was taken as "conv" and "erter", so a search for "converter" may not turn up this document if the word didn't appear unbroken elsewhere in the document. Sorry about the EOF bug in this code. It was a quick hack, and I don't know Perl all that well. There was a patch to fix this, though. Are there any other bugs?
In any case, in parse_doc.pl and conv_doc.pl, I wrote it to be optional, enabled by this line: $dehyphenate = 1; # PDFs often have hyphenated lines which only applied to PDFs. The ps2ascii utility already does its own dehyphenation, but pdftotext doesn't. Other document types are less likely to need this. If dehyphenation of PDFs is not desired, it's easy enough to change the 1 to a 0 above when configuring the script. I don't recall if your doc2html.pl has the same sort of option. Also inherited from parsedoc.pl is extra code to cope with files which may be an "HP Print job" or contain a "MacBinary header". Are such files really encountered? If so what type of files are they, Word, PDF or what? Does the magic number code need to take account of them? Another hack of mine. The MUUG web site had some pretty odd-ball PostScript files on it that were causing error messages while indexing their site. Instead of simple and pure PS in these files, some had a MacBinary wrapper or HP PJL codes in them, which ps2ascii happily would skip over, but the Perl code wasn't accepting these files. These hacks were to allow these files through. Dunno if anyone else has found they help or hurt them, but I'm keeping them in my own copies of the scripts. I know they're kind of ugly, so if you want to get rid of them in your code for the sake of simplicity, I'd certainly understand. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
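The de-hyphenation Gilles describes can be sketched in a few lines. This is an illustrative Python rendering of the substitution used in the Perl scripts, not the scripts themselves:

```python
import re

# Join a word that pdftotext broke across lines with a hyphen,
# e.g. "conv-\nerter" -> "converter". The character class mirrors
# the Perl one ([A-Za-z\300-\377]): ASCII letters plus Latin-1
# accented letters, so accented words are joined too.
LETTER = r'[A-Za-z\xc0-\xff]'

def dehyphenate(text):
    """Remove a hyphen-plus-linebreak that splits a word in two."""
    return re.sub(rf'({LETTER})-\s*\n\s*({LETTER})', r'\1\2', text)

print(dehyphenate("a useful conv-\nerter for PDFs"))
# a useful converter for PDFs
```

Note that a genuinely hyphenated compound that happens to fall at a line break is joined as well; that is the trade-off the thread is discussing.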
Re: [htdig] PDFs, numbers, and percent signs
At this stage it is not so much you being given ideas as you supplying enough information. What parser are you using? Are you using it directly or via a script such as parsedoc or doc2html? You say "25%" occurs in the parser output. Is that the output direct from pdftotext, or from doc2html, or what? -- David Adams Computing Services Southampton University - Original Message - From: "Philip E. Varner" [EMAIL PROTECTED] To: "Geoff Hutchison" [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Wednesday, January 10, 2001 3:07 PM Subject: Re: [htdig] PDFs, numbers, and percent signs Yes, "25%" shows up in the output of the parser. I searched for a word near an instance of it in a document, and the long results print out the "25%" too. Any other ideas? Phil On Tue, 9 Jan 2001, Geoff Hutchison wrote: : At 1:52 PM -0500 1/9/01, Philip E. Varner wrote: : So, I'm guessing this is either a problem with the percent sign (25%, etc.), : or not having _all_ words indexed. : : I'd run a PDF through your external parser/converter and take a look : at the output. Are you seeing 25% (or whatever) showing up there? : : -- : -Geoff Hutchison : Williams Students Online : http://wso.williams.edu/ : -- A distributed system is one in which the failure of a computer you didn't even know existed can render your own computer unusable. -- Leslie Lamport To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
[htdig] IRIX compile fix
This may help the query about compiling htdig under IRIX: Forwarded message: From [EMAIL PROTECTED] Thu Aug 31 14:13:11 2000 Date: Thu, 31 Aug 2000 14:09:47 +0100 (BST) From: Bob MacCallum [EMAIL PROTECTED] To: [EMAIL PROTECTED] Subject: [htdig] IRIX compile fix Hello, I just managed to compile htdig for IRIX 6.5 without the o32/n32 error and also without too many warnings, using cc (not gcc). For the record, here is what I had to do, it differs a little from what it says in: http://www.mail-archive.com/htdig@htdig.org/msg00832.html ./configure edit Makefile.config to change these two lines # LIBDIRS=-L../htlib -L../htcommon -L../db/dist -L/usr/lib LIBDIRS= -L../htlib -L../htcommon -L../db/dist # LIBS= $(HTLIBS) -lz -lnsl -lsocket LIBS= $(HTLIBS) -lz -lsocket make That's it. It works, and as usual, I don't really know why... ;-) our /etc/compiler.defaults are: -DEFAULT:abi=n32:isa=mips3 I'm not subscribed to the list (yet), so any replies to me please. Bob. -- David J Adams [EMAIL PROTECTED] Computing Services University of Southampton To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
[htdig] Re: ht://dig on IRIX 6.5 (fwd)
Here is part of an old email describing another way of compiling htdig under IRIX, it works for us. I used the following script to run the configure command. I try to always run configure from a script rather than by hand - that way I don't have to remember what options I had to use to get it working. #!/bin/sh CFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CFLAGS CPPFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CPPFLAGS LDFLAGS="-mips4"; export LDFLAGS ./configure --prefix=/opt/local/htdig-3.1.2 --with-cgi-bin-dir=/opt/local/htdig-3.1.2/cgi-bin --with-image-dir=/opt/local/htdig-3.1.2/graphics --with-search-dir=/opt/local/htdig-3.1.2/htdocs/sample Clearly the important bits are the FLAG settings for C, C++ and the linker. We are using MipsPro 7.3 compilers both for the C and C++ compilers. -- David J Adams [EMAIL PROTECTED] Computing Services University of Southampton To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
[htdig] External converters - two questions
I hope to find time for a further revision of the external converter script doc2html.pl and possibly simplify it a little. The existing code includes de-hyphenation (which is buggy) taken originally from parsedoc.pl. The question is: is this necessary, does pdftotext (or any other utility) actually break up words across lines with the addition of hyphens? Is the hyphenation code of any use? Information and opinions are requested. Also inherited from parsedoc.pl is extra code to cope with files which may be an "HP Print job" or contain a "MacBinary header". Are such files really encountered? If so what type of files are they, Word, PDF or what? Does the magic number code need to take account of them? -- David Adams Computing Services Southampton University To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
[htdig] Fw: [htdig] - Question for start_url and exclude_urls
Mohai, Your colleague Aditya got into the habit of emailing his Ht://Dig problems to me rather than to the htdig mailing list. As this latest query is not something I can immediately answer I am forwarding it to the list. For authoritative answers to all queries please always email [EMAIL PROTECTED] and not me personally. - Original Message - From: "Mohai Wang" [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Thursday, January 04, 2001 7:39 PM Subject: [htdig] - Question for start_url and exclude_urls David, Aditya has been taking 3 weeks' vacation since yesterday. I am going to take over this "htdig" search engine project. Question: 1. start_url: as long as start_url = "http://stagsite.coreon.com/download/", when I run "rundig -vvv log" I get the error message "DB2 problem...: missing or empty key value specified" on screen. I also attached the debug-mode "log" and "htdig.conf" files, please take a look. Did I set a wrong option? If start_url = "http://stagsite.coreon.com/" then it will go through and write the index, but I only need to index everything under "download", nothing else. 2. exclude_urls: I tried something different, start_url = "http://stagsite.coreon.com/", then I added exclude_urls = "/cgi-bin/ /calendar/ /coreonlib/". When I run "rundig -vvv log3", it reads /coreonlib/ first then stops. After I took "coreonlib" out of exclude_urls and reran "rundig -vvv log2", everything was indexed and "cgi-bin" and "calendar" were rejected. Could you tell me why? Please take a look at the log3 file. Mohai Wang Coreon Inc., -- David Adams Computing Services University of Southampton log.dat htdig.conf log3.dat To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
Re: [htdig] doc2html hangs while parsing PDFs
On Wed, 03 Jan 2001 15:59:57 +0100 Berthold Cogel [EMAIL PROTECTED] wrote: Hello! I just tried to index our site with htdig-3.1.5 on a Sun UltraSparc with SunOS 5.7. To parse PDF documents I used doc2html and pdftotext. My first mistake was to leave max_doc_size at the default value. But I don't think that this was the reason for my problem: Sometimes doc2html hangs and eats resources and produces an unknown child process with defunct signature in the top list (perhaps pdftotext?). There is a known bug in the hyphenation code in doc2html.pl which causes it to loop indefinitely when parsing a .PDF file when the last character is a hyphen. This seems unlikely, but I have seen it. In sub try_text change: while (<CAT>) { while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) { ($_ .= <CAT>) || last; s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s; } s/\255/-/g; # replace dashes with hyphens To: while (<CAT>) { while ( m/[A-Za-z\300-\377]-\s*$/ && $set->{'hyph'}) { $_ .= <CAT>; last if eof; s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/s; } s/\255/-/g; # replace dashes with hyphens I don't think that the document size is a reason for this effect, because some of the files that caused the trouble (last line in htdig.log) had a size of only 10 to 40 KByte. Some bigger files (up to 34 MByte) didn't stop doc2html. By the way: Where do I have to set $Verbose? sub init { # set = 1 for output on stderr if successful $Verbose = 1; Is it possible to write the messages of pdftotext and doc2html to a separate logfile? Perhaps in the next version of doc2html. Why doesn't htdig/doc2html take the complete document for parsing? You only have to take max_doc_size into account when you take the parsed documents for indexing. This might reduce the problems with doctypes other than html or plain text. max_doc_size affects all documents fetched by htdig. It is a safety device to prevent the downloading of extremely large (or infinitely long!) documents.
Thanks in advance Berthold Cogel -- David Adams [EMAIL PROTECTED] To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
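The failure mode in the Perl fragment above is that at end of file `$_ .= <CAT>` appends nothing, so the trailing hyphen never goes away and the inner loop spins forever. The fix is simply to stop reading at EOF. A Python sketch of the same read-and-join loop, with the guard marked; this is an illustration of the logic, not the doc2html.pl code:

```python
import io
import re

# Letters plus Latin-1 accented letters, as in the Perl class
# [A-Za-z\300-\377].
LETTER = r'[A-Za-z\xc0-\xff]'

def read_dehyphenated(stream):
    """Read a text stream, joining words hyphenated across lines."""
    out = []
    line = stream.readline()
    while line:
        # While the line ends in letter-hyphen, pull in the next line.
        while re.search(rf'{LETTER}-\s*$', line):
            nxt = stream.readline()
            if not nxt:          # EOF guard: without this, a file whose
                break            # last character is a hyphen loops forever
            line = re.sub(rf'({LETTER})-\s*\n\s*({LETTER})',
                          r'\1\2', line + nxt)
        out.append(line)
        line = stream.readline()
    return ''.join(out)

print(read_dehyphenated(io.StringIO("conv-\nerter works\n")))
# converter works
```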
Re: [htdig] Question about parsing word, pdf, ppt etc.
Try executing the parsers at the command line to see what happens. I don't know, but it seems quite possible that the current version of ppt2html is not able to cope with the Powerpoint 2000 format. If that is the case you could try contacting the author directly. I have noticed that ppt2html can require a lot of memory (several hundred megabytes) to convert some .ppt files, could you have a problem with a shortage of memory? Are you using catdoc or wp2html to convert Word files? Wp2html extracts the 'subject' from the document summary and puts it in the header, which might be the problem. Catdoc does often include gibberish in its output, and you could find removing the -b option in the call of catdoc an improvement. Doc2html.pl uses pdfinfo to extract the title of the .PDF file, and I have seen .PDF documents where the title is 'þÿ ' for some reason. You might need to modify doc2html.pl to suppress such titles. - Original Message - From: "Aditya Shah" [EMAIL PROTECTED] To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Sent: Tuesday, December 19, 2000 2:28 AM Subject: [htdig] Question about parsing word, pdf, ppt etc. Hello, We are evaluating the use of htDig for an intranet site. Our users publish a lot of Word, Excel, Powerpoint and PDF Documents that we want to be able to search through. We have been able to get all the external parsers required. We have run into the following issues: 1) Unable to parse powerpoint Documents. The documents are MS-Powerpoint 2000 Documents. We got the ppt2html parser from www.xlHtml.org . The statements in htdig.conf are something like this: application/msexcel->text/html /app/doc2html/doc2html.pl \ application/mspowerpoint->text/html /app/doc2html/doc2html.pl \ application/vnd.ms-excel->text/html /app/doc2html/doc2html.pl \ application/vnd.ms-powerpoint->text/html /app/doc2html/doc2html.pl Excel works great, but for powerpoint, when I run the 'rundig' program, it just kind of hangs there.
2) Getting gibberish in the headers for some word and pdf documents. For example, for a word document: In doc 2 html ; ; ; ; ; ; ; ; Fax Fax Please Recycle Comments: `Þ"Û? gP?]...øu-OwPÄ?+`É?0|?(ÜÐ oè?UYÆìÌ{èO?ãôrsÊ-| ?]ç* ú! Ý^mÀB?t 5?z+¿Hc-Ð#*ÄgÔ"C?ò,mÎ?Púss (_ûÛ~$Û+-V Sö?ýô?_+ywì?lt;?-? ?\...Y ... when the search results are returned. This does not happen for all word documents, only for some of them. And for a PDF document, we always get the 'þÿ ' character before any file name in the search results section. Also, do you know if there is a parser for MS-Visio? Any help would be appreciated. Thanks. Aditya Shah To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
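The 'þÿ ' prefix is recognisable: it is the two bytes FE FF, the UTF-16 byte-order mark that PDF uses for non-Latin-1 text strings, displayed as if it were Latin-1. A hedged sketch of how a converter script could clean such a title; the function name is made up for illustration, this is not doc2html.pl code:

```python
def clean_pdf_title(raw: bytes) -> str:
    """Decode a PDF document-info title.

    PDF text strings are either roughly Latin-1 (PDFDocEncoding)
    or UTF-16 with a leading byte-order mark; viewed as Latin-1,
    that BOM (0xFE 0xFF) renders as the gibberish prefix 'þÿ'.
    """
    if raw.startswith(b'\xfe\xff') or raw.startswith(b'\xff\xfe'):
        # Python's utf-16 codec uses the BOM to pick the byte order
        # and strips it from the decoded text.
        return raw.decode('utf-16')
    return raw.decode('latin-1')

print(clean_pdf_title(b'\xfe\xff\x00H\x00i'))   # Hi
print(b'\xfe\xff'.decode('latin-1'))            # þÿ
```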
Re: [htdig] Problem with virtual server-names
- Original Message - From: [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Tuesday, November 28, 2000 1:21 PM Subject: [htdig] Problem with virtual server-names Hi, we have 2 different aliases and 1 IP address on one webserver: 1.) http://www.abc.de 2.) http://www2.abc.de In the config file we set allow_virtual_hosts: true. The start_url is set to http://www2.abc.de/map1 and limit_urls is the same as start_url. When we run htdig with -c configfile, there is only one message: New server: www2.abc.de, 80 What to do? Greetings Uli If you want to index both servers then you should have: limit_urls_to: http://www2.abc.de/ http://www.abc.de/ You have set it to index one page on one server. -- David Adams University of Southampton To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
[htdig] Antwort: Re: [htdig] Problem with virtual server-names
On Tue, 28 Nov 2000 14:43:35 +0100 [EMAIL PROTECTED] wrote: Hi David, it is a little bit different: We have 2 aliases but I want to index only one. So I write in the configfile start_url: www2.abc.de/map1. But htdig said: unable to build connection with www2.abc.de, 80. Greetings Uli Then you need limit_urls_to: www2.abc.de/ otherwise htdig will only index the page www2.abc.de/map1 You did not mention the "unable to build connection" error message in your first email. Perhaps your server is down or you have a network problem? -- David Adams [EMAIL PROTECTED] To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
Re: [htdig] Does htmerge remove URL from database ?
I found that the extra runs of htmerge were necessary when I was merging two runs of htdig. Unless I ran both databases through htmerge before merging them I was getting Deleted, invalid: against some pages in the htmerge run. Compared to the time required to run htdig, the extra htmerge runs are trivial, so you have little to lose by including them. Use the -v option with both htdig and htmerge and see if you get any message re the pages that don't appear in the final index. - Original Message - From: "Geoff Hutchison" [EMAIL PROTECTED] To: "Olivier Korn" [EMAIL PROTECTED] Cc: "Gilles Detillieux" [EMAIL PROTECTED]; [EMAIL PROTECTED] Sent: Sunday, November 26, 2000 4:07 AM Subject: Re: [htdig] Does htmerge remove URL from database ? At 2:21 PM +0100 11/23/00, Olivier Korn wrote: I tried it and it didn't solve the problem. BTW, I don't think that these extra merges are necessary either. No, they should not be at all necessary unless there's truly something horrific wrong with the merging code--it only uses the files directly output from htdig. (My idea was that it would be faster if you didn't need to run htmerge on intermediate DBs.) Now, I run: htmerge -c site#.conf then htmerge -c site1.conf -m site#.conf (with # > 1) If I then run htsearch -c site5.conf with words="rénovation tourisme", it finds the document (in first place.) But if I do htsearch -c site1.conf with the same words, it returns the "nomatch" document. Some of the web hosts are case sensitive and some are not. Could it be the source of my problem? I wouldn't think so. But you have to be pretty careful that the URL encodings are shared between your site.conf files. Personally, I make up a "main.conf", include that in the other files and only set the start_url and a minimal number of things in the individual site.conf files. In particular, it makes it easy to change something in all config files at once.
-- -Geoff Hutchison Williams Students Online http://wso.williams.edu/ To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
Re: [htdig] different search results
Gilles R. Detillieux wrote: According to gkalter: Hope this mailing-list is the right one..;-) Today I got htdig to work pretty well on a site containing many PDF-Files. Cobalt Raq2 microserver (mips) with RedHat based Linux After updating the C++ Compiler (see mailing list) I got rid of the segmentation error messages and htdig worked well. Cryptic outputs of the search form were solved by adding a ".cgi" extension to htsearch in the local cgi-bin folder. Solution also found in the list - thanks to all those helpful people! I think the FAQ also has some pointers on getting the CGI to work. Because I wanted to get direct links to single PDF Pages out of the found excerpts I got the pdftodig.py script for external parsing of PDF-Files. (Do I have to mention that python IS NOT installed on Cobalt Raqs?) O.K. this problem could also be solved. It would also be a fairly trivial change to the perl scripts conv_doc.pl or doc2html.pl to make them replace form feeds in pdftotext output with the correct HTML <a name="..."> tags for the anchors. You'd then be using an external converter, rather than an external parser, and possibly avoiding parser-related problems. Now everything works pretty well with one little exception. Using a complete search string e.g. "Sensor" lists all matching documents and the text contains the search word (bold typeface) with a link to the specific single Page of the found PDF file. (Great!) I think I may be missing something here, perhaps somebody can explain for me. Am I right in thinking that the whole and only point of this is to produce, in the lists produced by htsearch, excerpts from the first page of .PDF documents containing a search word? Or does one really get a link which when followed brings up the .PDF document open at the relevant page? If so, that would be quite something, especially if it worked for a range of browsers. What would be the correct HTML <a name="..."> tags for the anchors?
-- David Adams [EMAIL PROTECTED] Computing Services University of Southampton To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
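On the page-anchor question: pdftotext emits a form feed between pages, so a converter can turn each page boundary into a named anchor in the generated HTML. An illustrative sketch; the "pageN" naming is an assumption for the example, not necessarily what pdftodig.py emits:

```python
def add_page_anchors(text: str) -> str:
    """Replace pdftotext's form-feed page separators with HTML anchors.

    pdftotext puts a form feed (\f) between pages; emitting
    <a name="pageN"></a> at each boundary lets search results link
    into the converted document at the right page.
    """
    pages = text.split('\f')
    return ''.join(
        f'<a name="page{i}"></a>\n{page}'
        for i, page in enumerate(pages, 1)
    )

html = add_page_anchors("first page\ftext on page two\f")
print(html.count('<a name='))   # 3
```

Whether following such a link brings the reader to the right place depends on the page serving the converted text, not on htdig itself.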
Re: [htdig] htdig and MSWord
Hello, I have been using htdig for a short time and now have the following question: is it possible to search the content of MSWORD documents (Version 6.0, 7.0, WinWord 2000) using HTDIG? Or is there another search mechanism which could do it? Markus Fabritius -- Sent through GMX FreeMail - http://www.gmx.net Yes, using an external parser, specified by an external_parsers: statement in the configuration file. On the htdig web site click on "Contributed work" and then "External Parsers". You should use either doc2html.pl or conv_doc.pl; they are both Perl scripts which call various utility programs to do the actual conversion. Do not use the old parse_doc script. Doc2html.pl gives you a choice of either wp2html (very cheap commercial product) or catdoc (public domain) to convert Word files. -- David Adams [EMAIL PROTECTED] Computing Services University of Southampton To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
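For concreteness, the external_parsers: statement maps each content type to a converter script, in the same style shown elsewhere in this thread; the install path below is a placeholder for your own:

```
# Hand Word and PDF documents to the doc2html.pl converter
# (adjust the script path for your installation):
external_parsers: application/msword->text/html /usr/local/htdig/doc2html.pl \
                  application/pdf->text/html /usr/local/htdig/doc2html.pl
```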
[htdig] Antw: Re: [htdig] Problems with parse_doc.pl and German Umlaute
I'm glad that doc2html works OK for you. In Perl $WP2HTML = ""; and $WP2HTML = ''; are equivalent. On Thu, 26 Oct 2000 9:23:37 +0200 [EMAIL PROTECTED] wrote: Thanks for your help! Your tool works perfectly, especially with German Umlaute. The description in the Details file was very helpful, so it was no problem for someone who has no experience with Perl to use doc2html. But there is one little annotation for the Details file. In the install description you write: If you don't have a particular utility then set its location as a null string. For example: $WP2HTML = ''; I don't know, but I think you mean $WP2HTML = ""; or? Christian Huhn [EMAIL PROTECTED] 25.10.2000 15.41 Uhr Hi, I want to index PDF files with German Umlaute (ä, ö, ü, ß). Some tests had shown me that htdig (v. 3.1.5) and xpdf (v. 0.91) are working pretty well with German Umlaute, but the external parser parse_doc.pl has problems with them. It splits words with Umlaute into two words without the Umlaut. For example: w beim 41 0 w diesj 45 0 w hrigen 50 0 w den 58 0 w Platz 62 0 In this case the German word "diesjährigen" is split into "diesj" and "hrigen" and I can find both with htsearch. Does anyone know how to solve this problem, for example with a modified version of parse_doc.pl? Thanks, Christian Huhn You could try the doc2html parser. I think that the latest version, available from the Ht://Dig web site, will not split words this way, but I have not tested it thoroughly. If doc2html does not parse your .PDF files properly, then email an example to me personally, and I'll make sure that the next version of doc2html works correctly. -- David J Adams [EMAIL PROTECTED] Computing Services University of Southampton -- David Adams [EMAIL PROTECTED] To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html
Re: [htdig] Problems with parse_doc.pl and German Umlaute
Hi, I want to index PDF-Files with German Umlaute (ä, ö, ü, ß). Some tests had shown me that htdig (v. 3.1.5) and xpdf (v. 0.91) are working pretty good with German Umlaute, but the external parser parse_doc.pl has problems with them. It splits words with Umlaute in two words without the Umlaut. For example: w beim41 0 w diesj 45 0 w hrigen 50 0 w den 58 0 w Platz 62 0 In this case the German word "diesjährigen" is split in "diesj" and "hrigen" and I can find both with htsearch. Does anyone know how to solve this problem for example with a modified version of parse_doc.pl? Thanks, Christian Huhn You could try the doc2html parser. I think that the latest version, available from the Ht://Dig web site, will not split words this way, but I have not tested it thoroughly. If doc2html does not parse your .PDF files properly, then email an example to me personally, and I'll make sure that the next version of doc2html works correctly. -- David J Adams [EMAIL PROTECTED] Computing Services University of Southampton To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
Re: [htdig] Can't get my search to update correctly..
According to Rivera, Tony: However, that's not working...about 5 days ago I added a new directory /www/itss and have made numerous links to it from my index page and various other pages on the server and it is still not getting picked up when I do a search for it. I assume these are HTML links and not JavaScript ones. One possibility is that the pages you modified are actually dynamic content (SSI, PHP, etc.) and so the server isn't returning a Last-Modified header. If this is the case, htdig won't realize the pages have been modified. You can set the modification_time_is_now attribute to true, but then htdig will reindex all dynamic pages every time it runs. This seems to depend on the server. The Southampton University server (Apache) does SSI and allows <!--#include virtual=... --> and <!--#echo var="LAST_MODIFIED" --> but little else. It does not put Last-Modified into the header when serving an SSI page. However, even with modification_time_is_now true it still returns "not changed" unless the file *has* been modified since the last run of htdig. But a departmental server we also index only gives "not changed" on its .ps and .pdf files. -- David Adams [EMAIL PROTECTED] Computing Services University of Southampton
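For reference, the attribute discussed above is set in the htdig configuration file; a minimal fragment (the attribute name is as documented for the 3.1.x releases, the comment is mine):

```
# Treat documents served without a Last-Modified header (SSI, PHP,
# CGI output, etc.) as modified now, so update digs re-fetch them.
modification_time_is_now: true
```

The trade-off stated in the thread applies: with this set, every page lacking a Last-Modified header is reindexed on every run.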
Re: [htdig] ... but not changed
According to Geoff Hutchison: On Wed, 4 Oct 2000, David Adams wrote: It had not occurred to me that an SSI file was "dynamic", I live and learn! Yes, if you think about it for a second, you can realize that there's no way for the server to be entirely sure of the modification date for SSI files. It *could* send the date of the file itself, but what if it includes a file that has changed, or an actual CGI? Well, if the server were REALLY smart about it, it could keep track of the most recently modified include file or main file, and use that as the last modified date. It would only need to suppress the header if CGI output is included in the mix. Of course, the latter case would probably account for about 90% of SSI usage. :) -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 We certainly don't allow CGI output in SSI on our server, but I've no way of knowing if that is unusual. -- David Adams [EMAIL PROTECTED] Computing Services University of Southampton
[htdig] ... but not changed
A simple query (I hope). When, during an update run, htdig says of a page: "retrieved but not changed", how does htdig decide that the page is the same as the last time? An author is maintaining that she added a link to a page and that an update run of htdig failed to follow the new link(s) she had added. -- David Adams [EMAIL PROTECTED] Computing Services University of Southampton
Re: [htdig] Different domains?
Quoting Ken Convery [EMAIL PROTECTED]: ht://dig looks like a great tool for maintainers of intranets and/or several internet web servers. I have a question about its application to something we are trying to do here. We are developing relationships with a few other online companies and want to make content from their sites available by link on our site. We are thinking we can use ht://dig to index those other sites so we can search out and display the pertinent information on our site in summary form and provide the link to a specific page on their site for more information. In a nutshell: can ht://dig index other web servers specified outside my domain or network? Yes. I maintain a "local community" index which now covers almost a thousand servers (real and virtual), most of them commercial. I would recommend that you access them through a proxy, specified by the http_proxy: statement in the configuration file. If so would we need other than http to these other servers or any special access such as file system privileges? No, but https servers are a special case, I can't answer for them. Secondly, are there any problems with sites that generate content dynamically? Or will ht://dig simply look at static HTML pages or other static documents? There are usually no difficulties with dynamic pages, but problems can occur. The exclude_urls: statement is intended to trap them. In my case I only have exclude_urls: referer= I suggest caution, adding sites one by one to your search list, and keeping max_hop_count and server_max_docs low at first. Thank you very much Ken Convery Avian Pilot Systems Inc. David Adams [EMAIL PROTECTED] Computing Services Southampton University
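The advice in this reply can be collected into a configuration sketch (the proxy hostname, port and limits below are illustrative placeholders, not values from the thread):

```
# Fetch remote sites through a local caching proxy.
http_proxy:       http://proxy.example.ac.uk:3128/
# Trap looping dynamic URLs, as suggested above.
exclude_urls:     referer=
# Start cautiously when adding unfamiliar external sites.
max_hop_count:    4
server_max_docs:  200
```

Once a new site has been indexed cleanly a few times, the hop count and per-server document limit can be raised.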
Re: [htdig] same page, different ranking?
Quoting Mike Lewis [EMAIL PROTECTED]: Hi, I've installed Htdig to test it for use on our site (we currently run Netscape Enterprise server and I don't like the built-in search). I have a problem. If I search for the word 'john' (boss' name) at http://kmi.open.ac.uk/search/ the top two pages found are boss' home page - but one gets 4 'stars' while the other gets only 1. The same result for a considerable number of other searches ('marc', 'paul', 'simon'). I've had a look through the list archive but can't find an answer. Can anyone suggest why this might be happening? Thanks, Mike -- Systems Administrator, Knowledge Media Institute (KMi) The Open University, Walton Hall, Milton Keynes MK7 6AA UK http://kmi.open.ac.uk/ Work: +44 (0) 1908 652832 Mobile: +44 (0) 7990 536490 The one page with two URLs, yes? Then the answer must be in the "description_factor". To quote the manual: "Plain old "descriptions" are the text of a link pointing to a document. This factor gives weight to the words of these descriptions of the document. Not surprisingly, these can be pretty accurate summaries of a document's content." The word "john" probably occurs in links to http://kmi.open.ac.uk/people/domingue/, but not in links to http://kmi.open.ac.uk/people/domingue/john.html To test this theory add to your configuration file: description_factor: 0 and rebuild your index from scratch. You might wish to consider whether to keep description_factor: 0 permanently. It's what we do. Also I would suggest you attempt to sort out the mess of having one page with two URLs, though perhaps that is easier said than done. David Adams [EMAIL PROTECTED] Computing Services Southampton University
Re: [htdig] Htmerge: Deleted, invalid
Quoting Gilles Detillieux [EMAIL PROTECTED]: According to David Adams: I use the standard MIPSpro compiler. The script I use (thanks to my former colleague James Hammick) to setup the Makefile is: #!/bin/sh CFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CFLAGS CPPFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CPPFLAGS LDFLAGS="-mips4 -L/usr/lib32 -rpath /opt/local/htdig-3.1.5/lib"; export LDFLAGS ./configure --prefix=/opt/local/htdig-3.1.5 \ --with-cgi-bin-dir=/opt/local/htdig-3.1.5/cgi-bin \ --with-image-dir=/opt/local/htdig-3.1.5/graphics \ --with-search-dir=/opt/local/htdig-3.1.5/htdocs/sample A lot of that is site-specific, and the "-rpath directory" option is only needed because the compression library is not in a standard place on the machine on which htdig is run. The "-woff all" option suppresses most warning messages. I will remove it, recompile htdig and send the result directly to Gilles, it might contain a clue. As Sinclair mentioned, 'you need to have the 2.95.2 gcc and the latest gnu "make".' I don't know that anyone has ever gotten ht://Dig to work with SGI's own compiler. In fact, we got a lot of reports from folks who couldn't even get it to compile. If you're really determined to get to the bottom of this and make it work with the SGI compiler, I wish you well, but I doubt I can help much. I looked at the output you sent me, and didn't really see any red flags pointing to an obvious problem. I know that the Serialize and Deserialize functions for the db.docdb records can be a tad finicky, so that would probably be a place to look. There could also be problems with incorrect assumptions about word sizes, e.g. if the SGI compiler has 64-bit long ints. I'd also look at the db.wordlist records (they're ASCII text) before and after htmerge, to see if htdig is actually telling htmerge to remove some of these documents, or if htmerge is deciding to do so on its own. 
For the time being, the ht://Dig code hasn't had much of a workout on non-GNU compilers, so it doesn't seem to do well on them. If you can help remedy that, great. If you want to get the package working as quickly and easily as possible, I'd suggest trying the GNU C and C++ compilers. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a year and I have been very pleased with it. I would say that we've given it a good workout here. The problem with the "Deleted, invalid" messages only occurs with a second, relatively new search index. The first index is made from a single run of htdig covering 33 servers, all in the local domain, and on this week's initial dig htmerge reports 49,233 documents and not a single "Deleted, invalid". The second index is made from two runs of htdig covering a total of 969 (yes, 969!) servers using a proxy. Htmerge reports a mere 3,096 documents and 86 "Deleted, invalid". I have looked at the db.wordlist files (which are written to only by htdig - is that right?) and it would appear that htdig is flagging the pages for htmerge to delete and is not finding any words in them. I can advance these theories: It is not a bug, but is due to the use of a proxy. (I use a proxy because without one, a portion of the sites on any run of htdig were found to be not responding or even unknown. With a proxy, htdig appears to have no such problems.) It is a bug due to the use of a proxy. It is a bug which only shows when compiled under IRIX. It is a bug which only occurs when there are many different servers. I intend to re-build the second index using htdig -vvv and perhaps learn something. 
-- David Adams [EMAIL PROTECTED] Computing Services Southampton University
Re: [htdig] Htmerge: Deleted, invalid
Quoting Gilles Detillieux [EMAIL PROTECTED]: IRIX 6.5, Htdig 3.1.5 One of the symptoms is that there is no consistency. Today's re-index reported 84 pages to be invalid. Of these only one was from the http://www.tregalic.co.uk/sacred-heart/ site, and this time it was churchpage7.html. And that page is *NOT* found by any search on my index, though I can follow links to it from other pages and browse it. I don't see how you can investigate this yet, but unless people put in reports like mine you will always be able to claim that "no-one else is having this problem". I will continue to look for a pattern which might give a clue. I'm inclined to think this is a platform-specific problem. Most of the trouble reports we've seen about IRIX systems are from users who can't even get htdig compiled, let alone running, so I don't think the package has had a thorough workout under IRIX. Which compiler did you use to build it? -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 That is a possibility worth pursuing. I use the standard MIPSpro compiler. The script I use (thanks to my former colleague James Hammick) to setup the Makefile is: #!/bin/sh CFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CFLAGS CPPFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CPPFLAGS LDFLAGS="-mips4 -L/usr/lib32 -rpath /opt/local/htdig-3.1.5/lib"; export LDFLAGS ./configure --prefix=/opt/local/htdig-3.1.5 \ --with-cgi-bin-dir=/opt/local/htdig-3.1.5/cgi-bin \ --with-image-dir=/opt/local/htdig-3.1.5/graphics \ --with-search-dir=/opt/local/htdig-3.1.5/htdocs/sample A lot of that is site-specific, and the "-rpath directory" option is only needed because the compression library is not in a standard place on the machine on which htdig is run. The "-woff all" option suppresses most warning messages. 
I will remove it, recompile htdig and send the result directly to Gilles, it might contain a clue. -- David Adams [EMAIL PROTECTED] Computing Services Southampton University
[htdig] Htmerge: Deleted, invalid
Why does htmerge 3.1.5 flag some pages, which look OK to me, as "Deleted, invalid" and not index them? This is happening not just with .html pages but also .doc and .pdf files. It happens with a simple merge following a run of htdig -i -a and also when two htdig runs are merged using the htdig -m option. David Adams [EMAIL PROTECTED] Computing Services Southampton University
Re: [htdig] .pdf and .doc-files
On Thu, 8 Jun 2000 09:12:32 -0500 (CDT) Gilles Detillieux [EMAIL PROTECTED] wrote: According to Andre Reuber: I am a beginner in operating with htdig. Is there any possibility to make an index of .doc, .pdf, .xls, ... files? Do I need any extra source? Where can I get this source? See http://www.htdig.org/FAQ.html#q4.8 and http://www.htdig.org/FAQ.html#q4.9 The .xls files may be a bit more of a challenge. I'd recommend using doc2html for .doc and .pdf, and if you find and install the Excel to HTML converter, xlHtml, you could probably add it to doc2html as an extra converter fairly easily (if you have at least a minor understanding of Perl). I don't think it is quite so simple: doc2html.pl (and parse_doc and conv_doc) only use the "magic number" of the file to decide which utility to use for conversion. MS Word and Excel files can have the same magic number. The easy solution is a separate conversion script for Excel files. The sophisticated solution is a more advanced script which uses the information on MIME type passed to it. I hadn't heard of xlHtml and would like to know more. As an alternative, there is a simple .xls to .csv conversion program available from the same site as catdoc. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 ------ David Adams [EMAIL PROTECTED]
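The magic-number point is easy to verify: both Word and Excel files begin with the same OLE2 compound-document signature, D0 CF 11 E0, so the leading bytes alone cannot tell the two formats apart. A small sketch (the sample file is fabricated here from the known signature, not a real document):

```shell
# Write the 8-byte OLE2 signature shared by .doc and .xls files,
# then dump the first four bytes as hex, as a converter script might.
printf '\320\317\021\340\241\261\032\341' > sample-ole2.bin
od -An -tx1 -N4 sample-ole2.bin
# shows: d0 cf 11 e0
```

This is why a magic-number dispatcher needs either separate scripts per extension or the MIME type that htdig passes to the external parser.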
Re: [htdig] local_user_urls not working
On Thu, 25 May 2000 19:04:06 +0800 [EMAIL PROTECTED] wrote: Dear Sir, I installed the htdig on a Red Hat Linux 6.1 system. Basically the htdig is working fine with the Apache server, but the local_user_urls setting never works. As the attrs.html suggests, I have the following directive specified in my /etc/htdig/htdig.conf: local_user_urls: http://host.mydomain/=/home/,/public_html/ I ran rundig several times but those files under /home/user/public_html never got indexed. It seems that htdig just skipped that part since I could not find anything related in the output of "rundig -vvv". Any suggestions/comments? Could you help? Thanks in advance. Kind regards, Brian Chiang mailto:[EMAIL PROTECTED] Philips Research East Asia - Taipei 24FA, 66, Sec. 1 Chung Hsiao W. Rd Tel: +886 2 2382 4593 PO Box 22978, Taipei 100, Taiwan Fax: +886 2 2382 4598 I suggest that you run htdig -i again with the local_user_urls: statement commented out. That should reveal whether the problem really is with local file access or is somewhere else. ------ David Adams [EMAIL PROTECTED]
[htdig] Ampersand in URL
I have found two problems with using htdig where the URL contains an '&'. I am using htdig version 3.1.2, so perhaps these are fixed problems? The first is when the URL contains a bare '&' and has to be passed to an external parser. For example, I get in the htdig log: 6:6:2:http://www.soton.ac.uk/~dja/time&line.ps: sh: line.ps: not found size = 70146 The problem does not appear to be in parse_doc.pl. The second is when a page's author has mistakenly marked up a bare '&' in the URL as '&amp;'. This is - of course - wrong, and htdig does not find the page. For example: 11:11:2:http://www.soton.ac.uk/~dja/test&amp;test2.html: not found However, Netscape Navigator 4.05 (and probably other browsers) fixes this up and presents a link to http://www.soton.ac.uk/~dja/test&test2.html I tried setting translate_amp: true in the configuration file in the vain hope that this would produce a similar fix. Is there an alternative to trying to persuade authors that their URLs are wrong even though they work with the usual browsers? Thanks. -- David J Adams [EMAIL PROTECTED] Computing Services University of Southampton
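The first failure mode is ordinary shell word-splitting: if the URL reaches sh unquoted, '&' backgrounds everything before it and runs the rest as a command, which is exactly the shape of the "sh: line.ps: not found" line in the log. A sketch (the example.org URL is hypothetical):

```shell
url='http://example.org/time&line.ps'
# Unquoted: sh sees '&', backgrounds 'echo ...time', then tries to
# run 'line.ps' as a command - hence a "line.ps: not found" error.
sh -c "echo $url" 2>/dev/null
# Quoted: the URL survives intact as a single word.
sh -c "echo \"$url\""
```

Any program that builds a shell command line from a URL needs to quote (or escape) it before handing it to sh.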
Re: [htdig] Custom factors?
On Thu, 16 Dec 1999, Simon Blake wrote: <select name=url> and </select>. Is this a straightforward way to achieve this? Looking at the factor system, it struck me that a neat way to do This is a reasonable way to do this. Try the noindex_start attribute. See http://www.htdig.org/attrs.html#noindex_start this would be with a custom factor - you define the start and end tags, maybe with a regexp, and everything in between gets the relevant weight. Right now custom factors aren't supported, but we're looking at how to do this sort of thing in the 3.2 code. -Geoff Hutchison Williams Students Online http://wso.williams.edu/ Is it possible to have more than one pair of noindex_start and noindex_end? If so, what is the syntax? -- David J Adams [EMAIL PROTECTED] Computing Services University of Southampton
Re: [htdig] parse_doc.pl alterations
According to David Adams: I have downloaded the parse_doc.pl script, and the xpdf and catdoc utilities, and I am now using them to extend our search index to include Word and PDF files. It all works well and with a bit of alteration to the Perl script does exactly what I want. My thanks to the developers! I forgot to ask before, what were your alterations? Something very specific to your needs, or something worth sharing with others? -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 Well, since you ask, I noticed two problems with PDF files on our site: 1. the titles were often meaningless, having no connection with the contents. 2. pdftotext outputs some spurious non-ASCII gibberish that is then indexed. I modified the code which outputs the title to always include the type, and to put any extracted title in double quotes or the filename in square brackets: # if no title use filename from URL if (not length($title)) { $title = $ARGV[2]; $title =~ s#^.*/##; $title = '[' . $title . ']'; } else { $title = '"' . $title . '"'; } print "t\t$title ($type Document)\n"; To throw away the spurious "words" I simplified the code to replace all non-alphanumerics with spaces. I appreciate that many people would think that too drastic: while (<CAT>) { while (/[A-Za-z\300-\377]-\s*$/ && $dehyphenate) { $_ .= <CAT> || last; s/([A-Za-z\300-\377])-\s*\n\s*([A-Za-z\300-\377])/$1$2/ } $head .= " " . $_; #s/\s+[\(\)\[\]\\\/\^\;\:\"\'\`\.\,\?!\*]+|[\(\)\[\]\\\/\^\;\:\"\'\`\.\$ #s/[\255]/-/g; # replace dashes with $ s/\W/ /g; # replace non-alphanumeric characters with spaces s/\s+/ /g; # replace multiple spaces, etc. 
with a single space @fields = split; # split up line next if (@fields == 0); # skip if no fields (do$ for ($x=0; $x < @fields; $x++) { # check each field if s$ if (length($fields[$x]) >= $minimum_word_length) { push @allwords, $fields[$x]; # add to list } } } The spurious output is no longer indexed, but it does remain in the head, so there is further room for improvement. -- David J Adams [EMAIL PROTECTED] Computing Services University of Southampton
Re: [htdig] Reducing the importance of pages.
Is it possible to reduce the importance of certain pages? We have some pages on our site that are directories and contain thousands of entries. As a result they always seem to come up as top results whenever we search for anything. I don't really want to remove these pages from a search but I would like them to appear lower down the list. Is this at all possible (perhaps by using negative weighting or similar?)? Thanks! -- -- Jason Carvalho Web Analyst Cranfield University [EMAIL PROTECTED] You could increase the weighting of other pages by encouraging the use of <META NAME="keywords" CONTENT="...list of keywords..."> and <META NAME="description" CONTENT="...relevant text..."> in their headers. On our site we have increased the weighting of keywords to 200. You might consider not indexing the directory pages at all by placing <META NAME="robots" CONTENT="noindex"> in their headers. Links in them will still be followed, but htdig will not index the words in them. -- David J Adams [EMAIL PROTECTED] Computing Services University of Southampton
[htdig] META name=robots
I am using ht://Dig version 3.1.2 and I have been trying to prevent some sets of pages from being indexed by inserting in the head of certain index pages: <META NAME="robots" CONTENT="noindex, nofollow"> This only seemed to work in one case and not in others. In the one case where it _did_ work I found I had written: <META NAME="robots" CONTENTS="noindex, nofollow"> Can anyone independently confirm or deny this? I would be happy to learn I am mistaken and that ht://Dig does conform to the HTML standard for META. -- David J Adams [EMAIL PROTECTED] Computing Services University of Southampton
Re: [htdig] Comparisons with SWISH++
Does anyone know of any benchmark tests comparing performance and functionality of SWISH++ vs ht://Dig? I only ask because a client has mentioned SWISH++ as a possible alternative. I guess from the Web stats (I have heard very little mention of SWISH++ in the outside world) there is the suggestion that it's a tool of choice. I had a look at it and it didn't seem to be terribly well documented (yeah, source-code documentation, but I want to USE the product, not mess around inside its code...). Just wondered really whether anyone had any OBJECTIVE comments they could add, or perhaps knew of somewhere I could maybe find out some of this? Regards, Phil Coates. I've no experience of SWISH++ but we have been using SWISH, and more recently SWISH-E, at this site for a few years. The original SWISH was only capable of indexing filestore, and given a top directory would index all files, descending into all subdirectories. We use SWISH-E in this mode, which is complementary to ht://Dig and most other search engines which follow links. SWISH-E has one facility which ht://Dig does not have and is very attractive for some specialist applications: it allows a search which will _only_ find pages which contain a given search word or words in a particular META tag. Thus you can have, for example, pages which contain METAs such as: <META NAME="Author" Content="Ransome, A."> <META NAME="Title" Content="Swallows and Amazons"> and then allow a search on Author or on Title, etc. Librarians and others doing systematic cataloguing find this a welcome feature. -- David J Adams [EMAIL PROTECTED] Computing Services University of Southampton
Re: [htdig] Leading reasons for htdig not finding known matches?
Does anyone know what the leading reasons are for htdig not returning results for known matches? For example: If I query the database and then get a result that has "champion" in the title and then try to search on "champion" it returns a "not found" result. This is a new database that just completed "rundig" so I wouldn't think there is a problem. Any ideas? Charlie Four possibilities: 1) "champion" is in the "bad word" list. 2) The score for "Title" has been set to zero. 3) The page has more than one <title>...</title>. 4) You have hit a bug in htdig 3.1.2 which results in punctuation in the page head not being stripped out. If you have, for example: <title>"Champion" says Ray!</title> This may cause the "words": "champion" says ray! to be indexed. Can anyone positively confirm that that bug is fixed in 3.1.3 ? -- David J Adams [EMAIL PROTECTED] Computing Services University of Southampton
[htdig] Bad Words in the Search String
The University of Southampton's main web server is now using ht://Dig to provide a search facility, http://www.search.soton.ac.uk/soton/, and we are very pleased with it. It is much better than the Harvest search engine we had before. Looking through the search log I've noticed a small problem which I would like advice on. It appears that an "All" search on a search string containing a word in the bad_word_list will fail. For example, a search on "The Staff Club" finds nothing, while a search on "Staff Club" finds the Staff Club page. More seriously for us, a search for "New College" fails, while a search on "College" finds the pages on The University of Southampton New College. Does this mean I have to prune the bad_word_list right down? Is there something else I could do? It is interesting that a two-letter word in a search string does not cause failure, even though the minimum word length is set at three. -- David Adams Computing Services University of Southampton
[htdig] Logo design
I gather that work is still going on to produce a new logo for ht://dig. Could I appeal to designers to include a very small image as well as a main logo? I have been asked to remove the current ht://dig logo from our new search pages as it "dominates the page". I think I could get away with a thumb-nail sized image provided it was an official ht://dig graphic. -- David Adams Computing Services University of Southampton
Re: [htdig] What is a word?
According to David Adams: I am using htdig 3.1.2, and my config file includes: extra_word_characters: _ valid_punctuation: !@#$%^*()-+|~=`{}[]:";'?,./ I find that the word database built by htdig includes many words that contain or end in a comma or other punctuation. For example: arts, i:2514 l:1 w:49950 assessed, i:2523 l:1 w:49950 atmospheric, i:2529 l:1 w:49950 b.sc, i:120 l:1 w:49950 b.sc, i:16406 l:1 w:49950 b.sc, i:16409 l:1 w:49950 b.sc, i:3039 l:1 w:49950 b.sc, i:3040 l:1 w:49950 b.sc, i:3041 l:1 w:49950 ba, i:17 l:1 w:49950 I believe part of the problem may be the left quote (`) character in the list above, which is taken as the start of a file expansion (e.g. `filename`). As there's no file called "{}[]:";'?,./", the left quote and everything after it is lost from the valid_punctuation list. You'd need to escape the left quote with a backslash (\). The same thing goes for the dollar sign ($), only in this case it's just that one character that's lost. Still, that wouldn't explain why the comma and period get entered into the database. This would suggest that those characters were in the extra_word_characters list, or were erroneously treated as alphanumeric by your locale's LC_CTYPE tables. Am I misunderstanding the documentation on "valid_punctuation"? I can't figure out how the configuration file attributes extra_word_characters and valid_punctuation work together. What happens when the same character is in both? The lists should not overlap, but if they do, I believe valid_punctuation overrides, so the overlapping characters do get stripped out of the word. Essentially, both lists indicate which punctuation marks or other characters can be used within a word, but the valid_punctuation characters get stripped out before the word is put in the database. E.g. 
words like post-doctoral and nuts&bolts go into the database as postdoctoral and nutsbolts, unless you move the hyphen or ampersand from valid_punctuation to extra_word_characters, in which case the characters stay in the word. Additionally, with the compound word patch I posted last week, and which will be in future releases, the word will be split up at places that have a non-alphanumeric character that's in valid_punctuation, but not in extra_word_characters. Thus, a word like post-doctoral will go into the database as postdoctoral, post and doctoral. Why doesn't the documented list of default characters for valid_punctuation include the question mark (?) and the doublequote (")? This is because these characters aren't commonly used within words, unlike apostrophes, ampersands, hyphens and slashes. Also, when you set allow_numbers to index numbers as words, these numbers may contain some of these characters: .-/#$% , and that's why they're in the default list. I don't know why _!^ are in the default list, but I suspect they may be used for indexing source code. If a given punctuation mark should ALWAYS separate words, it should not be added to this list. What separates words, is it whitespace only? White space or any punctuation character (actually, any non-alphanumeric character) not listed in extra_word_characters or valid_punctuation. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax: (204)789-3930 Thanks for your swift and full response, Gilles, I was certainly mistaken as to the use of valid_punctuation. 
I am left with four points: 1) Documentation The ht://dig documentation is excellent, but could I suggest the following text to replace the "description" of valid_punctuation in the online documentation: Any punctuation character (that is, any non-alphanumeric character, see allow_numbers) not either in extra_word_characters or valid_punctuation is treated the same as a space - it merely acts as a word separator. However, when a valid_punctuation character occurs within a word it is removed, leaving a single word. For example, if the minus sign is in valid_punctuation, then the word "post-war" will be indexed as "postwar", and a search for either "post-war" or "postwar" will find it. However, if the minus sign is not in valid_punctuation then "post-war" will result in "post" and "war" being indexed instead. 2) Characters in valid_punctuation Not only should I have had \` and \$ in valid_punctuation but I should not have included the star (*) at all. This is the default prefix_match_charact
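Putting Gilles's escaping advice together, a hypothetical corrected attribute (the exact character list is this site's choice, not a recommendation) would backslash-escape the backquote and the dollar sign, and drop the star since it is the default prefix_match_character:

```
valid_punctuation:  !@#\$%^()-+|~=\`{}[]:";'?,./
```

Without the backslashes, everything from the unescaped backquote onwards is silently lost from the list, which matches the symptoms reported in this thread.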
[htdig] What is a word?
I am using htdig 3.1.2, and my config file includes: extra_word_characters: _ valid_punctuation: !@#$%^*()-+|~=`{}[]:";'?,./ I find that the word database built by htdig includes many words that contain or end in a comma or other punctuation. For example: arts, i:2514 l:1 w:49950 assessed, i:2523 l:1 w:49950 atmospheric, i:2529 l:1 w:49950 b.sc, i:120 l:1 w:49950 b.sc, i:16406 l:1 w:49950 b.sc, i:16409 l:1 w:49950 b.sc, i:3039 l:1 w:49950 b.sc, i:3040 l:1 w:49950 b.sc, i:3041 l:1 w:49950 ba, i:17 l:1 w:49950 Am I misunderstanding the documentation on "valid_punctuation"? I can't figure out how the configuration file attributes extra_word_characters and valid_punctuation work together. What happens when the same character is in both? Why doesn't the documented list of default characters for valid_punctuation include the question mark (?) and the doublequote (")? What separates words, is it whitespace only? Thanks -- David Adams Computing Services University of Southampton
[htdig] htmerge -v output
I have just started using ht://Dig and I have lots of questions, but here is just one that is bothering me. When I run htmerge with the -v option I get lots of lines beginning: Deleted, no excerpt: what does this mean? Is that page not indexed? For what reason(s) would there be no excerpt for a page, and should it bother me? -- David Adams Computing Services University of Southampton