Re: [htdig] db.docdb db.wordlist
Check the FAQ: http://www.htdig.org/FAQ.html#q5.16

You don't need to pre-create any of the database files. They're created as needed. The problem is that htdig isn't finding anything it can index, and you need to find out why. The FAQ should give you a starting point.

According to Cormac Robinson:
I'm running 3.1.5. Unfortunately, I'm building the search engine on a server which was set up and is maintained over in England. Knowing them, they probably installed Red Hat 6. from source. Can anybody send me a blank db.docdb and db.wordlist? I'll try running it again if I get these files. If not, I might try to get an earlier version of the search engine. Cormac.

What version are you running? Did it come in one of Red Hat's .RPM packages, or did you download it? I'm using 3.1.5 (I think that's the right version number...), which came with RH 6.2 Pro. I installed the RPM, modified the config to point to the server's homepage (you may add other URLs, it's up to you), then ran htdig. After that, I ran htmerge, and that was it. I had a little problem with user accounts, which was promptly fixed (these guys really know their stuff), but otherwise I didn't have much trouble.

same thing here

On Sat, 20 Jan 2001, Cormac Robinson wrote:
I've just installed htdig, and after running rundig I get the following error:

htmerge: Unable to open word list file '/home/httpd/docs/search/test/db/db.wordlist DB" problem ...: /home/httpd/docs/search/test/db/db.docdb no such file or directory.

I've created a file db.docdb in the db directory, but I then get an error that it is not the correct file format. Is there an initial startup file I need to run on a first run? As far as I know, the server I am running on is Red Hat Linux 6.0. Thanks

To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this.
List archives: http://www.htdig.org/mail/menu.html FAQ: http://www.htdig.org/FAQ.html

-- Noel Vargas Baltodano [EMAIL PROTECTED] Gerente de Sistemas Nicatechnologies, S.A. http://www.nicatech.com.ni

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] Re: Fw: doc2html
According to Leong Peck Yoke:
Actually I was wrong: the code replaces the soft hyphen \255 with \055. I didn't read it carefully. Sorry for the inconvenience caused. Regards, Peck Yoke

No problem. The octal code 055 is the ASCII hyphen (-), while 255 octal is the ISO-8859-1 code for the soft hyphen, which oddly enough is often used to encode a long dash rather than a soft hyphen. This little substitution was something I added to parse_doc.pl, and kept in conv_doc.pl, because it solved a problem for me in dealing with some of my PDFs. I don't think it should pose a problem for anyone else, but if it ever does, it's easily removed from the script.

David Adams wrote:
When I wrote doc2html I copied this without change from conv_doc, and I think it is the same in the original parse_doc parser script. Is Leong correct? -- David Adams Computing Services Southampton University

- Original Message - From: "Leong Peck Yoke" [EMAIL PROTECTED] To: [EMAIL PROTECTED] Sent: Sunday, January 21, 2001 1:18 PM Subject: doc2html

Hi, I am looking at your code doc2html.pl for a project. I notice that in function try_text at line 366, the following code

s/\255/-/g; # replace dashes with hyphens

seems to be wrong. Shouldn't it be "s/\055/-/g" instead? Regards, Peck Yoke

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
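The substitution discussed in this thread can be illustrated outside Perl. What follows is a minimal Python sketch, not the doc2html.pl code itself; the function name `replace_soft_hyphens` is made up for illustration, and the input is assumed to be ISO-8859-1 bytes, where octal 255 (0xAD) is the soft hyphen and octal 055 (0x2D) is the ASCII hyphen.

```python
SOFT_HYPHEN = 0o255   # 0xAD, ISO-8859-1 soft hyphen
ASCII_HYPHEN = 0o055  # 0x2D, ASCII '-'


def replace_soft_hyphens(data: bytes) -> bytes:
    # Hypothetical helper: same effect as the Perl substitution quoted above,
    # i.e. every soft-hyphen byte becomes an ASCII hyphen.
    return data.replace(bytes([SOFT_HYPHEN]), bytes([ASCII_HYPHEN]))


if __name__ == "__main__":
    sample = b"1999\xad2000 co\xadoperate"
    print(replace_soft_hyphens(sample))
```

Working on bytes rather than decoded text keeps the sketch close to what the one-line Perl substitution does on raw converter output.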
Re: [htdig] Avoiding search on file name
According to Loys Masquelier:
In fact, it seems that htsearch results include directories and files where the searched word appears in the directory or file name.

Ex: /foo/foo.html
Searched word: foo
Result: /foo /foo/foo.html

Is there a way to keep htsearch from finding those directories and files?

That's exactly what I thought the problem was. Setting description_factor to 0 and reindexing should prevent the foo.html file from coming up in a search for foo, but suppressing the foo directory is a little more tricky. For that, you should look into the suggestions in the "new ask" thread from this past September, at http://www.htdig.org/mail/2000/09/index.html#111

Thanks. Loys.

Gilles Detillieux wrote:
Or perhaps, if I understand correctly, setting description_factor to 0 and reindexing would be the way to avoid this. If you point htdig to a directory that doesn't contain an index.html or equivalent file, and the web server automatically generates the directory index, then the file names will be used as link description text for the links to those files. If that's what is happening here, then you want to tell htdig not to put any weight on the words appearing in link description text, as above.

According to Peterman, Timothy P:
I think setting "title_factor" to 0 in the config file will do that. You'll probably need to reindex for that change to take effect.

Loys Masquelier wrote:
Hello, I have a problem in indexing a file hierarchy. Htdig by default indexes the names of all the files. When I search for a word, if that word is found in a file name, htsearch returns the file path. But I only want files which contain the given word. Is there a way to avoid that file name indexing? Thanks in advance. Best regards. Loys.

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
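For anyone scripting the description_factor change suggested in this thread, here is a rough Python sketch. It assumes the config file uses one `name: value` line per attribute, as in the start_url examples elsewhere on this list; the helper `set_attribute` is hypothetical, and it does not handle values continued over several lines.

```python
import re


def set_attribute(conf_text: str, name: str, value: str) -> str:
    # Hypothetical helper: replace the first single-line "name: ..." entry,
    # or append one if the attribute is not set yet.
    pattern = re.compile(r"^[ \t]*" + re.escape(name) + r"[ \t]*:.*$", re.MULTILINE)
    new_line = f"{name}: {value}"
    if pattern.search(conf_text):
        # A function replacement avoids backslash interpretation in the value.
        return pattern.sub(lambda _match: new_line, conf_text, count=1)
    separator = "" if conf_text.endswith("\n") or not conf_text else "\n"
    return conf_text + separator + new_line + "\n"


if __name__ == "__main__":
    conf = "start_url: http://www.example.com/\ndescription_factor: 150\n"
    print(set_attribute(conf, "description_factor", "0"))
```

As the replies point out, the change only takes effect after reindexing.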
Re: [htdig] Spelling Help
According to Geoff Hutchison:
At 1:34 PM + 1/18/01, David Adams wrote:
1) What have other sites done to address this problem? (Spell checking and correcting our own

Use good fuzzy methods, including the synonym file. We are working on additional fuzzy matching code, but of course if anyone can come up with sample code that produces a list of suggested words from an input, we can probably port it.

2) Can anybody recommend a _good_ (UK English) spell checker for IRIX 6.5?

Yes. Try ispell with the UK dictionaries. Back in October, Greg Holmes posted a Python wrapper script for htsearch, which used ispell to suggest alternative spellings. The thread that ensued is at http://www.htdig.org/mail/2000/10/index.html#295 The ispell package is GNU software, so it should port to IRIX easily enough, I'd think, and the dictionaries are very customisable.

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] Indexing a given list of file
According to Loys Masquelier:
I want to check that it is not possible to index a list of changed files without reindexing all the data. In fact, the situation is that I know that list of files needs to be reindexed, and I want to do that as fast as possible.

You may be out of luck with the current version. However, doing an update dig is usually much, much faster than reindexing from scratch. htdig will ask the server for each document in the database, but only if it's been modified since the last indexing run, so it can do this pretty quickly. There was talk of adding to the 3.2 code a feature whereby you can tell htdig not to recheck all the indexed documents, but only check a given list of URLs. I don't remember if this feature is already in the current development snapshots.

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
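The "only if it's been modified since the last indexing run" behaviour described in this thread corresponds to a conditional HTTP request. The sketch below shows that general HTTP mechanism in Python; it is an assumption that this is how an update dig stays cheap, not a description of htdig's code. The function name `conditional_request` and the example URL are made up.

```python
from email.utils import formatdate
from urllib.request import Request


def conditional_request(url: str, last_indexed: float) -> Request:
    # Build (but do not send) a GET request that asks the server for the
    # document only if it changed after `last_indexed` (seconds since epoch).
    stamp = formatdate(last_indexed, usegmt=True)  # HTTP-date in GMT
    return Request(url, headers={"If-Modified-Since": stamp})


if __name__ == "__main__":
    req = conditional_request("http://www.example.com/index.html", 0)
    print(req.header_items())
```

A server that honours the header answers 304 Not Modified with no body for unchanged documents, so only changed pages cost a full transfer.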
Re: [htdig] Memory requriements
According to Pat Lennon:
I have a Linux box with approx. 1 gig of HTML and PDF books. I want to use htdig for the search engine. I don't want to assume too much, but will 1 additional gig of hard disk cover the size of the index database? I figure double may be a safe starting point. Also, what memory requirements should I consider at a minimum? The hardware is a Cyrix 150 with 64 meg of RAM, running Red Hat 6.2 and the Apache web server. I know this is a vague question... I would just like some reasonable starting points.

I don't know for sure, but my gut reaction is that 64 MB of RAM is pretty small for a 1 GB web site. I'd think you'd at least want to double that. However, it may work with what you have, although probably quite a bit more slowly. The web site size to database size ratio is a little hard to predict. It depends a lot on how much of your web site is indexable text vs. unindexable content (e.g. images), and what your config attribute settings are (max_doc_size, max_head_length). A 1:1 ratio is probably safe enough for htdig 3.1.5, but databases tend to be bigger in 3.2.0bx.

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] solaris 2.6 and htdig 3.1.5
According to Ronald Edward Petty:
I think it's 3.1.5 (whatever the latest stable is). Anyway, I emailed yesterday about this:

ares:/export/netapp/user/rpy/htdig-3.1.5/htfuzzy/ make
c++ -o htfuzzy -L../htlib -L../htcommon -L../db/dist -L/usr/lib Endings.o EndingsDB.o Exact.o Fuzzy.o Metaphone.o Soundex.o SuffixEntry.o Synonym.o htfuzzy.o Substring.o Prefix.o ../htcommon/libcommon.a ../htlib/libht.a ../db/dist/libdb.a -lz -lnsl -lsocket
/usr/local/lib/gcc-lib/sparc-sun-solaris2.6/2.95.2/libgcc.a: could not read symbols: Bad value
collect2: ld returned 1 exit status
make: *** [htfuzzy] Error 1

Everywhere I search on the net, I get the impression that gcc is calling the wrong linker. I type as -version and it's the GNU assembler in my path, and the same for ld. So I am assuming that there is a version of the Solaris as or ld that is messing me up. 2 questions:

1) Is it possible there is another problem that could be generating this? I ask so I don't have to manually link all this. I have never done that before, so maybe I should, to learn... gee

2) If no one thinks it is another problem... how can I "watch" the makefile call the linker, etc.? If I use top it doesn't show. I do where ld and get 5 choices, and if I do /asdf/asdf/asdf/ld - version, on 2 of them I get GNU and on the other 3 I get invalid option. Could these maybe be the Solaris versions? I can't tell; there is no option listed to tell.

HELP (whiny voice) Thanks, Ron Petty

Try compiling some different C++ code on your system. I'm almost certain that this is a problem with the setup of GNU C++ on your particular machine, and not an ht://Dig problem. If so, then this is not the best place to get help, and you'd probably have better luck on a GNU C++ related mailing list or newsgroup.

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
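To see every copy of a tool that the shell could pick up, rather than only the first one, something like the following Python sketch can help. It only walks a PATH-style string in search order; the function name `find_all` is made up, and the copy found first on the PATH is not necessarily the one the compiler front-end runs.

```python
import os


def find_all(program: str, path: str) -> list:
    # Hypothetical helper: return every executable file named `program`
    # found in the directories of `path`, in search order.
    hits = []
    for directory in path.split(os.pathsep):
        candidate = os.path.join(directory or ".", program)
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            hits.append(candidate)
    return hits


if __name__ == "__main__":
    for tool in ("ld", "as"):
        print(tool, find_all(tool, os.environ.get("PATH", "")))
```

Each path printed can then be checked by hand to tell the GNU and vendor versions apart.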
[htdig] Re: Reindex
According to Elsa Chan:
We just launched a new site, but the search engine is indexing pages that don't exist anymore. I think I just need to restart htdig, except I don't know how. I tried searching for info on the htdig web site but I couldn't find anything. Would you be able to help me?

Running the standard "rundig" script will rebuild your database from scratch. You can also manually run "htdig -i" and "htmerge" to do this.

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
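The manual rebuild mentioned in this reply ("htdig -i" followed by "htmerge") could be wrapped in a small script. This is a sketch only: the function names are made up, the config path is an assumed example, and the -c option is used on the assumption that both programs should read the same config file.

```python
import subprocess


def reindex_commands(conf: str) -> list:
    # The two steps of a from-scratch rebuild, in order.
    return [
        ["htdig", "-i", "-c", conf],  # -i: start from an empty database
        ["htmerge", "-c", conf],      # build the searchable word index
    ]


def reindex(conf: str) -> None:
    for command in reindex_commands(conf):
        # check=True stops the rebuild if a step fails.
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Assumed location; adjust to wherever your htdig.conf lives.
    for command in reindex_commands("/usr/local/htdig/conf/htdig.conf"):
        print(" ".join(command))
```

Stopping after a failed htdig run avoids handing htmerge a database with nothing in it.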
Re: [htdig] /usr/include/netinet/in.h:837: syntax error before `struct' on
According to Ramesh Veema:
While doing a make on my application ported to Solaris 8, I get the following error in the middle of the make, when a C file tries to include netinet/in.h. I have doubts because this header file supports IPv6 as well, so I am not clear how to correct this error. Please help me if anyone has come across this error.

/usr/include/netinet/in.h:837: syntax error before `struct'
/usr/include/netinet/in.h:838: syntax error before `struct'

What application are you porting here? This is a mailing list for the ht://Dig search engine only. If that's the application in question, please send in the complete output from ./configure and make, as well as information on which version you're compiling, and what patches, if any, you've made. If you're talking about some other application, I'm afraid you have the wrong list. In any case, it sounds like something is overriding a header file definition in a way that's incompatible with what /usr/include/netinet/in.h expects.

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] Can't exclude a directory in results
Last I heard, Mindspring was still using a rather ancient beta release of htdig-3.0.8b2, which had numerous bugs. The exclude parameter handling didn't work correctly until version 3.1.0b2. Also, we had problems with the StringMatch class used to implement the restrict and exclude parameters to htsearch, among other things, until 3.1.0b4. That may be the cause of these problems. Also, you must make absolutely sure that you only have one definition of the input parameters "restrict" and "exclude" in your search form, as versions before 3.1.0b4 didn't handle multiple parameter definitions for these. The current stable release is 3.1.5, which has been out for almost a year now, and fixes these and many, many other bugs.

According to Dudley Jane:
TKO, This isn't exactly what you're doing, but we have a form to restrict, and we couldn't get it to work until we said:

<input type=hidden name=config value=htdig>
<input type=hidden name=restrict value="www.co.henrico.va.us/hr">

It didn't work if we just said value="/hr"; we had to add the www.etc. in front. JD

Carrot-Top Creative wrote:
I've searched and double-checked the instructions on how to exclude a directory from results and can't get it to work. I'm using htdig on MindSpring, which means I can't do any custom configuration to the server or have more than one htdig install. I need to have 2 different search pages that return different results. I've successfully created a search form that uses restrict:

<input type=hidden name=config value="www57080">
<input type=hidden name=restrict value="/98study/">
<input type=hidden name=exclude value="">

But I can't get exclude to work using this code to not include a particular directory:

<input type=hidden name=config value="www57080">
<input type=hidden name=restrict value="">
<input type=hidden name=exclude value="/98study/">

Help, what can I do? tko

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
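To see exactly what a form like the ones in this thread submits, it can help to build the query string by hand. A minimal Python sketch follows; the field names config, restrict, exclude and words come from the thread, while the function name `htsearch_query` and the search word are made up. Each parameter is emitted exactly once, in line with the warning about multiple definitions.

```python
from urllib.parse import urlencode


def htsearch_query(words: str, config: str, restrict: str = "", exclude: str = "") -> str:
    # One (name, value) pair per parameter, so "restrict" and "exclude"
    # can never be defined twice in the resulting query string.
    params = [
        ("config", config),
        ("restrict", restrict),
        ("exclude", exclude),
        ("words", words),
    ]
    return urlencode(params)


if __name__ == "__main__":
    print(htsearch_query("budget", "www57080", exclude="/98study/"))
```

Pasting the resulting string after the htsearch CGI URL makes it easy to test restrict and exclude values without editing the form each time.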
Re: [htdig] hidden keywords
According to Stephen L Arnold:
I'm trying to achieve some requested behavior with htdig; i.e., I've set up some selections in the search page (using restrict and exclude), and the desired behavior is to have the user enter nothing in the search field and have htdig serve up a list of all documents in the directories specified by restrict/exclude. All documents are Word docs (at the moment), and previously I had added the keyword "doc" to the bottom of the search form (in the hidden keyword field) and it worked. However, that was when Apache was configured to allow directory indexing (and the indexes would show up in the search results, along with the documents). I turned off the Apache auto-index stuff, and built a single HTML file with URLs for all the documents for htdig to do the actual dig. However, and here's the rub, now I get a boolean search error when I submit a search with no keywords, even if I put more hidden keywords in the search form (that are guaranteed to be in the documents). The only thing that changed was the Apache auto-index stuff; is there anything I can do to get the behavior I want back again?

I'm not sure how you had it working in the first place, if this was with 3.1.5. You must have had some value in the "words" input parameter, because htsearch 3.1.5 (and earlier) doesn't like it when you have "keywords" but no "words". Here's a patch that fixes this: ftp://ftp.ccsf.org/htdig-patches/3.1.5/any_keywords.0

The Apache auto-index stuff made the keyword "doc" match any .doc file, because the index uses the file names as link description text for the link to the file, and with a non-zero description_factor, these words have weight in the search.

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
[htdig] Re: Reindex
According to Elsa Chan:
I tried doing that, but only one file gets updated by htdig: /usr/local/htdig/db/db.docdb. db.docs.index is still old, and db.wordlist.new is created but it has 0 bytes. When I try to run htmerge it gives me:

htmerge: Unable to open word list file '/usr/local/htdig/db/db.wordlist'

As FAQ 5.16 explains, this happens because htdig didn't index any documents.

I also tried running htdig -vvv, but I get this:

1:0:http://www.site.net/
New server: www.site.net, 80

I specified in the config file to use a different port, and I put the URL in quotes, but it doesn't seem to work properly. Any ideas?

You can't use quotes in the start_url, because htdig doesn't parse it as a quoted string list. See http://www.htdig.org/attrs.html The port number should be tacked right on to the end of the URL with a colon, e.g.

start_url: http://www.site.net:8001

As for figuring out why it's hanging, and what constitutes a long while, please see Geoff's response.

-Original Message- From: Gilles Detillieux [mailto:[EMAIL PROTECTED]] Sent: Wednesday, January 17, 2001 10:18 AM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: Reindex

According to Elsa Chan:
We just launched a new site, but the search engine is indexing pages that don't exist anymore. I think I just need to restart htdig, except I don't know how. I tried searching for info on the htdig web site but I couldn't find anything. Would you be able to help me?

Running the standard "rundig" script will rebuild your database from scratch. You can also manually run "htdig -i" and "htmerge" to do this.

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
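The two start_url mistakes discussed in this thread (quotes around the URL, and a port that is not attached to the host with a colon) are easy to check for before running a dig. The following Python sketch is a hypothetical pre-flight check, not part of htdig; the function name `check_start_url` and the wording of the messages are made up.

```python
from urllib.parse import urlsplit


def check_start_url(value: str) -> list:
    # Return a list of problems found in a start_url value (empty if none).
    problems = []
    if '"' in value or "'" in value:
        problems.append("contains quote characters")
    parts = urlsplit(value.strip("\"'"))
    if parts.scheme != "http":
        problems.append("scheme is not http")
    if not parts.hostname:
        problems.append("no host name")
    try:
        parts.port  # raises ValueError if the port is not numeric
    except ValueError:
        problems.append("port is not a number")
    return problems


if __name__ == "__main__":
    for candidate in ('"http://www.site.net:8001"', "http://www.site.net:8001"):
        print(candidate, check_start_url(candidate))
```

With the port written as in the reply, `urlsplit("http://www.site.net:8001").port` comes back as 8001, which confirms the colon form is parsed as intended.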
[htdig] Re: Problem with exclude_url
According to [EMAIL PROTECTED]:
We have htdig 3.1.5. We wanted to index a big directory of the SAP documentation. The structure is as follows:

directory1/directory2/directory3/directory4/content.html
directory1/directory2/directory3/directory4/frameset.html
directory1/directory2/directory3/other_directory4/content.html
directory1/directory2/directory3/other_directory4/frameset.html

and so on... We want to exclude all (!) files named frameset.htm in all directories. When I set

exclude_url: frameset.htm

nothing happened. I think that you must give the qualified path, but there are so many different paths in this case. I need something like:

exclude_url: /directory1/ directory2/directory3/*/frameset.htm

(the asterisk is important). Is this possible?

First of all, please see http://www.htdig.org/FAQ.html#q1.16 Such questions should go to the list, not to me personally. This isn't a one-man show.

Secondly, could you elaborate on what you mean by "nothing happened"? Do you mean that htdig didn't index anything, or that the frameset.htm or frameset.html files were not excluded? Also, is the above a typo, or did you really omit the "s" from exclude_urls? See http://www.htdig.org/attrs.html for correct spellings of attribute names.

Thirdly, there is no wildcard support for exclude_urls. In version 3.2, we're adding support for regular expressions to exclude_urls and other attributes, which will be like wildcards only more powerful, but with a somewhat more complicated syntax. This is still a work in progress, however. You shouldn't need wildcards for this case, though, because it's a pretty simple exclusion you're trying to do here.

However, if the only links to some of your files, such as the content.html files, are in the frameset.html, then you may not want to exclude them, or you'll end up missing a whole lot more besides. This is why I asked what you mean by "nothing happened". If none of the files were indexed, this may be why. Remember that htdig only follows HTML links from one document to the next.

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
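Why no wildcard is needed for the frameset case can be shown with a small Python sketch. It assumes that exclude_urls rejects a URL when the URL contains any of the space-separated patterns as a plain substring; that is my reading of the attribute documentation rather than something stated in this thread, and the function name `is_excluded` and the host name are made up.

```python
def is_excluded(url: str, exclude_urls: str) -> bool:
    # Assumed semantics: plain substring match against each
    # space-separated pattern, with no wildcard expansion.
    return any(pattern in url for pattern in exclude_urls.split())


if __name__ == "__main__":
    base = "http://host.example/directory1/directory2/directory3/"
    for url in (base + "directory4/frameset.html",
                base + "other_directory4/frameset.html",
                base + "directory4/content.html"):
        print(url, is_excluded(url, "frameset.htm"))
```

Under that assumption, the single pattern frameset.htm covers frameset.html at any depth, while a pattern containing a literal asterisk would simply never match.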
Re: [htdig] how do you index local pages in 3.1.5?
According to Jon Beyer:
This is probably a really easy thing, but I can't get htdig to index HTML from my hard drive. I tried setting start_url to file:/, but that didn't work, and I played around with local_urls_only and local_urls but couldn't get it to work. Any advice is greatly appreciated. Thanks.

htdig 3.1.5 doesn't handle file:/ URLs, only http://... URLs. You can make local_urls work with this style of URL, if the documents are on the same system as the one on which you run htdig, using a syntax similar to this example from my system:

start_url: http://www.scrc.umanitoba.ca/
local_urls: http://www.scrc.umanitoba.ca/=/home/httpd/html/
local_user_urls: http://www.scrc.umanitoba.ca/=/home/,/public_html/

where /home/httpd/html corresponds to my Apache DocumentRoot setting. Note that local_urls only indexes a certain limited set of file types, determined by file extension. For any other file type, or for directory URLs where there's no index.html, it falls back to the HTTP server. See http://www.htdig.org/attrs.html#local_urls

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
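The prefix=path form of local_urls shown in this reply amounts to a simple prefix rewrite. Here is a rough Python sketch of that idea; the function name `map_local` is made up, and the sketch deliberately ignores the file-extension restriction and the fallback to the HTTP server described above.

```python
def map_local(url: str, local_urls: str):
    # Translate `url` to a local path using space-separated "prefix=path"
    # pairs; return None when no prefix matches, in which case the document
    # would have to be fetched over HTTP instead.
    for pair in local_urls.split():
        prefix, _, path = pair.partition("=")
        if path and url.startswith(prefix):
            return path + url[len(prefix):]
    return None


if __name__ == "__main__":
    rule = "http://www.scrc.umanitoba.ca/=/home/httpd/html/"
    print(map_local("http://www.scrc.umanitoba.ca/docs/a.html", rule))
```

Writing the mapping out this way also shows why the trailing slashes on both sides of the "=" need to agree.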
Re: [htdig] solaris 2.6 and htdig 3.1.5
According to Ronald Edward Petty:
Sorry for the questions. I just thought someone said that there is a shared memory problem on Solaris with htdig, and that you should use a certain linker (namely GNU) instead of the Solaris one.

I think you're confusing two unrelated responses to other problems. There's a problem with shared library support, not shared memory, and it affects Solaris systems running the 3.2.0 betas of htdig only. It is resolved by using the --disable-shared option to ./configure. This doesn't affect 3.1.5, because it's not a problem with the standard C++ libraries, but only the libraries built in the htdig package. 3.1.5 doesn't build any shared libraries.

I doubt anyone recommended using a GNU linker. We often recommend using GNU make if there are problems with some of htdig's Makefiles on some platforms. If GNU makes a linker for Solaris, it's the first I hear about it. Usually, the GNU compilers will use the linker that comes with the target operating system, if I understand things correctly.

However, from the makefile that htdig comes with, I cannot really figure out if there is a certain linker to use; this is an htdig thing, not gcc. If I use the proper compiler, linker, etc., then it's a GNU problem. That is the question, and I have not found an answer on the htdig site about what the proper setup of compiler, linker and assembler is. I will ask the GNU people and see if this makes any sense to them. Thanks for the help.

It's very, very rare to call the linker or assembler directly from a Makefile for standard C or C++ code. The htdig Makefiles certainly don't attempt to do this! Generally, for linking C programs, it's the C or C++ front-end (e.g., cc, gcc, c++, g++) that gets called, and this front-end is preconfigured to call the correct linker and pass it all the required libraries. Problems such as you reported are a symptom of a mis-configured front-end to the C or C++ compilers, or incompatible libraries, or both. So, this is indeed a gcc thing. The same goes for the assembler: gcc will typically call the first and second stage back-end compilers for a .c file, to create a temporary .s file, and then call the assembler to assemble it into a .o file.

If you can compile and link other C++ programs, it may be a problem with the libraries your htdig Makefiles are telling the compiler to use, but it could also be that your other programs are simple enough that they don't run into similar compatibility issues. In any case, the htdig Makefiles don't call the linker directly. If they call the wrong front-end compiler, or use the wrong libraries or library directories, you may need to change that in your Makefile.config, but this would be a problem specific to your installation. Lots of users have built htdig successfully on Solaris, with nothing like the sort of errors you reported occurring.

It might help to compare the options to gcc or g++, or whatever front-end is used for linking the .o's and .a's in htdig, to the options used for C++ programs you were able to link successfully, especially the -l and -L options. That might point the way to the problem you're having. E.g., if you have different versions of libstdc++ or libg++ in different directories, don't point g++ to an incompatible version with one of your -L options!

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
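Comparing the -l and -L options of two link commands, as suggested in this reply, is easier once they are pulled out of the command line. The Python sketch below does only that; the function name `library_options` is made up, and it handles the attached forms (-Ldir, -lname) seen in the make output earlier in this thread, not the separated "-L dir" form.

```python
import shlex


def library_options(command: str) -> dict:
    # Split a compiler/linker command line and collect library search
    # directories (-L) and library names (-l) in the order given.
    dirs, libs = [], []
    for arg in shlex.split(command):
        if arg.startswith("-L") and len(arg) > 2:
            dirs.append(arg[2:])
        elif arg.startswith("-l") and len(arg) > 2:
            libs.append(arg[2:])
    return {"dirs": dirs, "libs": libs}


if __name__ == "__main__":
    htdig_link = ("c++ -o htfuzzy -L../htlib -L../htcommon -L../db/dist "
                  "-L/usr/lib htfuzzy.o ../htlib/libht.a -lz -lnsl -lsocket")
    print(library_options(htdig_link))
```

Running it on the failing htdig link line and on a link line from a program that builds cleanly gives two short lists that can be compared side by side.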
Re: [htdig] Problems compiling 3.20b2
According to richard:
Compiling went fine, after I installed zlib-1.1.3. But running rundig -c my.conf gives: Arithmetic Exception - core dumped, with a core file from htfuzzy.

Oh, right. On Solaris, you must use the --disable-shared option on ./configure to avoid this problem. We still haven't gotten to the bottom of this, but C++ objects in shared libraries don't seem to get initialized properly on Solaris, causing this error. Avoiding shared libraries for ht://Dig's C++ classes avoids this problem.

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] Avoiding search on file name
Or perhaps, if I understand correctly, setting description_factor to 0 and reindexing would be the way to avoid this. If you point htdig to a directory that doesn't contain an index.html or equivalent file, and the web server automatically generates the directory index, then the file names will be used as link description text for the links to those files. If that's what is happening here, then you want to tell htdig not to put any weight on the words appearing in link description text, as above.

According to Peterman, Timothy P:
I think setting "title_factor" to 0 in the config file will do that. You'll probably need to reindex for that change to take effect.

Loys Masquelier wrote:
Hello, I have a problem in indexing a file hierarchy. Htdig by default indexes the names of all the files. When I search for a word, if that word is found in a file name, htsearch returns the file path. But I only want files which contain the given word. Is there a way to avoid that file name indexing? Thanks in advance. Best regards. Loys.

-- Tim Peterman - Web Master, ITP Unix Support Group Technical Lead Lockheed Martin EIS/NESS, Moorestown, NJ

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] Problem with PDF files....
According to Elijah Kagan:

Gilles, I greatly appreciate your help! Thanks! There are two parameters in the Apache config file that tell it to add a charset field by default. They are: AddDefaultCharset and AddDefaultCharsetName. The first one should be set to off to prevent Apache from replying with a charset field set after the content type. After disabling AddDefaultCharset htdig worked as expected. Thanks again, Elijah

That's good to know. However, I'd like to know if the 2nd patch I sent you yesterday fixes the problem, even with AddDefaultCharset enabled. If it's not too much bother, would you mind giving it a try and letting me know? Thanks.

On Mon, 15 Jan 2001, Gilles Detillieux wrote:

According to Elijah Kagan:

I run htdig 3.1.5. I tried both the Debian package and a compiled one with the same result. I am absolutely sure there is something stupid I forgot to put into the configuration. Attached is the config file. Thanks for your help. Elijah

On Fri, 12 Jan 2001, Gilles Detillieux wrote:

According to Elijah Kagan:

1. I run htdig with an explicit -c option, so it uses the correct conf file.
2. I rewrote the external_parsers so it includes only one line...
3. ..and it is the first line in the file

Results are the same! It is still looking for an acroread! Please, help. I am getting desperate...

Hmm. You're sure you're running version 3.1.5 of htdig, and you don't have a pre-3.1.4 binary of htdig kicking around that you might be unknowingly running instead? External converter support was added to the external_parsers attribute only in version 3.1.4 and above. If you're sure this isn't the problem either, please send me a copy of your conf file as it stands now (preferably uuencoded right on your htdig box to prevent e-mail mangling of it), and I'll have a look and try a test or two.

Oh, another thing. You mentioned this was on a Debian system. Did you compile htdig yourself, or did you use a pre-compiled binary? If the latter, which one?

OK, it took a while, but the light finally came on! If you look up the following thread on the mailing list archives:

http://www.htdig.org/mail/2000/09/index.html#75

you'll see that the bug has come up before. I think there's something about the Debian configuration for Apache that causes it to add the "; charset=..." string to the Content-Type header, which is the source of the problem here. At least I strongly suspect it must be the same problem, as I can't see anything else that would explain the behaviour you're reporting. If you run htdig -vvv -i -c ..., you can then look at the header lines returned by your server for the PDF files, and see if the Content-Type header does indeed have something on the line after the application/pdf string.

Geoff and I made some hacks to ExternalParser.cc in the 3.2.0b3 development code to address this, but none of this has been backported to 3.1.5 yet. I'll see if I can backport some or all of the external parser patches to 3.1.5 in the next day or two. In the meantime, you can try working around this either by using local_urls, if you're running htdig on the same machine as your Apache server, or by using the same hack that Klaus used, i.e. add a line like the following to your external_parsers definition.

    "application/pdf; charset=iso-8859-1->text/html" /usr/share/htdig/conv_doc.pl

-- Gilles R. Detillieux
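To check by hand whether a server's Content-Type value would match a plain application/pdf entry, the header can be reduced to its bare type. The sketch below lowercases the value and cuts it at the first ';', which is roughly what the backported patch posted later in this archive does inside htdig. The function name mime_type is hypothetical and the script is not part of ht://Dig.

```shell
#!/bin/sh
# mime_type HEADER_VALUE
# Hypothetical helper: prints the bare MIME type, lowercased, with any
# parameters such as "; charset=..." removed.
mime_type() {
    printf '%s\n' "$1" \
        | tr '[:upper:]' '[:lower:]' \
        | sed 's/[[:space:]]*;.*$//'
}

# Example: a value as a server with a default charset might send it.
#   mime_type 'application/pdf; charset=iso-8859-1'
```

If the reduced value and the raw value differ, the server is adding parameters, and an unpatched 3.1.5 would need the quoted workaround entry shown above.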
Re: [htdig] Phrases
According to Bill Vick:

We have tried both the current and beta versions and are having problems getting the phrase search to work correctly and consistently. Any patches or should we hang tight for the next version?

What do you mean by current? If you mean the current stable release, 3.1.5, it does not support phrase searching, as explained in FAQ 1.9. The 3.2.0b2 beta, the last one released, has a number of known bugs. The upcoming 3.2.0b3 beta should be much more reliable than the last beta. You can wait for it, or you can try the latest development snapshot of it...

http://www.htdig.org/files/snapshots/htdig-3.2.0b3-011401.tar.gz

-- Gilles R. Detillieux
Re: [htdig] Problems compiling 3.20b2
According to Richard van Drimmelen:

I'm trying to compile 3.20b2 on a Sparc Solaris 7 machine with gcc 2.95.2. During 'make':

    ld: warning: symbol `Object type_info node' has differing alignments:
        (file Endings.o value=0x8; file ../htlib/libht.a(StringMatch.o) value=0x4);
        largest value applied
    Undefined                       first referenced
     symbol                             in file
    __eh_pc                             Endings.o
    ld: fatal: Symbol referencing errors. No output written to htfuzzy
    collect2: ld returned 1 exit status
    make[1]: *** [htfuzzy] Error 1

Any suggestions?

I can't say for sure that the next beta will solve this problem, but could you please try the latest development snapshot of it to see if it does? The 3.2.0b2 beta has a number of known bugs and many compilation problems that are fixed in the upcoming 3.2.0b3 beta. You can try the latest development snapshot of it at...

http://www.htdig.org/files/snapshots/htdig-3.2.0b3-011401.tar.gz

In either case, let us know whether or not it solves this problem, so we can know if it still needs fixing before releasing it.

-- Gilles R. Detillieux
[htdig] PATCH: backport ExternalParser.cc from 3.2.0b3 to 3.1.5
According to Elijah Kagan:

I run htdig 3.1.5. I tried both the Debian package and a compiled one with the same result. I am absolutely sure there is something stupid I forgot to put into the configuration.

OK, after getting to the bottom of this (I think!), I have backported the 3.2.0b3 development code for htdig/ExternalParser.cc to version 3.1.5, to fix this and other problems. Please give this patch file a try and let me know if it works.

You will probably get a warning about the wait() function being implicitly declared, unless you manually define HAVE_WAIT_H or HAVE_SYS_WAIT_H (depending on whether your system has wait.h or sys/wait.h). Also, if your system has the mkstemp() function, you may want to define HAVE_MKSTEMP manually as well, as this will enhance security. I didn't have time to figure out how to patch aclocal.m4 and configure to add tests for all of these.

The patch fixes the following problems in external_parsers support in 3.1.5:

- it got confused by "; charset=..." in the Content-Type header, as described in "http://www.htdig.org/mail/2000/09/index.html#75".
- security problems with using popen(), and therefore the shell, to parse URL and content-type strings from untrusted sources (now uses pipe/fork/exec instead of popen) - PR#542, PR#951.
- used predictable temporary file name, which could be exploited via symlinks - fixed if mkstemp() exists and HAVE_MKSTEMP is defined.
- binary output from an external converter could get mangled.
- error messages were sometimes ambiguous or missing altogether.
- didn't open temporary file in binary mode for non-Unix systems (attempts were made to fix this, but it's not clear yet whether the security fixes and pipe/fork/exec will port well to Cygwin).
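Since configure does not test for these headers, one way to pick the define by hand is to look for the header files directly. This is only a sketch: the function name wait_defines is hypothetical, the default of /usr/include is an assumption about where your system keeps its headers, and how the resulting flag reaches the compiler (for example through the compiler flags in the Makefile) is left to you.

```shell
#!/bin/sh
# wait_defines [INCLUDE_DIR]
# Hypothetical helper: prints the -D option matching whichever wait header
# exists. wait.h is checked first, mirroring the #ifdef order in the patch.
wait_defines() {
    inc=${1:-/usr/include}   # assumption: system headers live here
    if [ -f "$inc/wait.h" ]; then
        echo "-DHAVE_WAIT_H"
    elif [ -f "$inc/sys/wait.h" ]; then
        echo "-DHAVE_SYS_WAIT_H"
    fi
}

# Possible use: print the flag, then add it to the compiler flags by hand.
#   wait_defines
```

HAVE_MKSTEMP is not detected here, because checking for a library function needs a compile test rather than a file test.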
Here's the patch, which you can apply in the main source directory for htdig-3.1.5 using "patch -p0 < this-file":

--- htdig/ExternalParser.cc.orig	Thu Feb 24 20:29:10 2000
+++ htdig/ExternalParser.cc	Mon Jan 15 13:18:47 2001
@@ -1,14 +1,24 @@
 //
 // ExternalParser.cc
 //
-// Implementation of ExternalParser
-// Allows external programs to parse unknown document formats.
-// The parser is expected to return the document in a specific format.
-// The format is documented in http://www.htdig.org/attrs.html#external_parser
+// ExternalParser: Implementation of ExternalParser
+//                 Allows external programs to parse unknown document formats.
+//                 The parser is expected to return the document in a
+//                 specific format. The format is documented
+//                 in http://www.htdig.org/attrs.html#external_parser
 //
-#if RELEASE
-static char RCSid[] = "$Id: ExternalParser.cc,v 1.9.2.3 1999/11/24 02:14:09 grdetil Exp $";
-#endif
+// Part of the ht://Dig package http://www.htdig.org/
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+// http://www.gnu.org/copyleft/gpl.html
+//
+// $Id: ExternalParser.cc,v 1.9.2.4 2001/01/15 13:18:47 grdetil Exp $
+//
+
+#ifdef HAVE_CONFIG_H
+#include "htconfig.h"
+#endif /* HAVE_CONFIG_H */
 
 #include "ExternalParser.h"
 #include "HTML.h"
@@ -19,9 +29,18 @@ static char RCSid[] = "$Id: ExternalPars
 #include "QuotedStringList.h"
 #include "URL.h"
 #include "Dictionary.h"
+#include "good_strtok.h"
+
 #include <ctype.h>
 #include <stdio.h>
-#include "good_strtok.h"
+#include <unistd.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#ifdef HAVE_WAIT_H
+#include <wait.h>
+#elif HAVE_SYS_WAIT_H
+#include <sys/wait.h>
+#endif
 
 static Dictionary *parsers = 0;
 static Dictionary *toTypes = 0;
@@ -32,9 +51,18 @@ extern String configFile;
 //
 ExternalParser::ExternalParser(char *contentType)
 {
+    String mime;
+    int sep;
+
     if (canParse(contentType))
     {
-        currentParser = ((String *)parsers->Find(contentType))->get();
+        String mime = contentType;
+        mime.lowercase();
+        sep = mime.indexOf(';');
+        if (sep != -1)
+            mime = mime.sub(0, sep).get();
+
+        currentParser = ((String *)parsers->Find(mime))->get();
     }
     ExternalParser::contentType = contentType;
 }
@@ -89,6 +117,8 @@ ExternalParser::readLine(FILE *in, Strin
 int
 ExternalParser::canParse(char *contentType)
 {
+    int sep;
+
     if (!parsers)
     {
         parsers = new Dictionary();
@@ -97,7 +127,6 @@ ExternalParser::canParse(char *contentTy
         QuotedStringList qsl(config["external_parsers"], " \t");
         String from, to;
         int i;
-        int sep;
 
         for (i = 0; qsl[i]; i += 2)
         {
@@ -109,11 +138,22 @@ ExternalParser::canParse(char *contentTy
                 to = from.sub(sep+2).get();
                 from = from.sub(0, sep).get();
             }
+            from.lowercase();
+            sep = from.indexOf(';');
+            if (sep != -1)
Re: [htdig] make error on solaris 2.6
According to Ronald Edward Petty:

When I was doing make I got this error for DocumentDB.cc and I did a workaround doing this, but then I type make again and it gets past DocumentDB.cc and does this for the next file... Is there something wrong with my shell or something... I don't feel like typing

    #!/usr/bin/tcsh
    setenv BIN_DIR /export/netapp/user/rpy/htdig/bin
    setenv DCOMMON_DIR "/export/netapp/user/rpy/htdig/common"
    setenv DCONFIG_DIR "/export/netapp/user/rpy/htdig/conf"
    setenv DATABASE_DIR "/export/netapp/user/rpy/htdig/db"
    setenv IMAGE_URL_PREFIX "/export/netapp/user/rpy/htdig/images"
    setenv PDF_PARSER "/usr/local/bin/acroread"
    setenv SORT_PROG "/bin/sort"
    setenv DEFAULT_CONFIG_FILE "/export/netapp/user/rpy/htdig/conf/htdig.conf"
    c++ -c -DBIN_DIR -DCOMMON_DIR -DCONFIG_DIR -DDATABASE_DIR -DIMAGE_URL_PREFIX -DPDF_PARSER -DSORT_PROG -DDEFAULT_CONFIG_FILE -I../htlib -I../htcommon -I../db/dist -I../include -g -O2 DocumentDB.cc

Any idea why this top thing worked but the other doesn't?

    ares:/export/netapp/user/rpy/htdig-3.1.5/ make
    make[1]: Entering directory `/export/netapp/user/rpy/htdig-3.1.5/db/dist'
    make[1]: Nothing to be done for `all'.
    make[1]: Leaving directory `/export/netapp/user/rpy/htdig-3.1.5/db/dist'
    make[1]: Entering directory `/export/netapp/user/rpy/htdig-3.1.5/htlib'
    make[1]: Nothing to be done for `all'.
    make[1]: Leaving directory `/export/netapp/user/rpy/htdig-3.1.5/htlib'
    make[1]: Entering directory `/export/netapp/user/rpy/htdig-3.1.5/htcommon'
    c++ -c -DBIN_DIR=\"/export/netapp/user/rpy/htdig/bin\" -DCOMMON_DIR=\"/export/netapp/user/rpy/htdig/common\" -DCONFIG_DIR=\"/export/netapp/user/rpy/htdig/conf\" -DDATABASE_DIR=\"/export/netapp/user/rpy/htdig/db\" -DIMAGE_URL_PREFIX=\"/export/netapp/user/rpy/htdig/images \"

I think the problem is right here, at the end of the IMAGE_URL_PREFIX definition. There seems to be a space (or maybe a control character) in your definition for the IMAGE_URL_PREFIX, which is messing things up.

    -DPDF_PARSER=\"/usr/local/bin/acroread\" -DSORT_PROG=\"/bin/sort\" -DDEFAULT_CONFIG_FILE=\"/export/netapp/user/rpy/htdig/conf/htdig.conf\" -I../htlib -I../htcommon -I../db/dist -I../include -g -O2 DocumentRef.cc
    c++: ": No such file or directory
    DocumentRef.cc:0: unterminated string or character constant
    DocumentRef.cc:0: possible real start of unterminated constant
    make[1]: *** [DocumentRef.o] Error 1
    make[1]: Leaving directory `/export/netapp/user/rpy/htdig-3.1.5/htcommon'
    make: *** [all] Error 1
    ares:/export/netapp/user/rpy/htdig-3.1.5/

By the way, I think you may be misunderstanding what the IMAGE_URL_PREFIX is supposed to be. It's supposed to be a URL path, relative to the DocumentRoot of your web server, not relative to your system's root directory. It's the IMAGE_DIR that is relative to the system's root directory, but it must point to a directory that will be somewhere under the DocumentRoot, so that the installed image files can be accessed by web clients. E.g., on my system, IMAGE_DIR is set to /home/httpd/html/htdig, and my Apache configuration sets DocumentRoot to /home/httpd/html, so my IMAGE_URL_PREFIX is simply "/htdig".

-- Gilles R. Detillieux
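The relationship between the three settings can be sketched as a small calculation: strip the DocumentRoot from the front of IMAGE_DIR and what remains is the URL prefix. The function name image_url_prefix is hypothetical; the example paths are the ones from the message above.

```shell
#!/bin/sh
# image_url_prefix IMAGE_DIR DOCUMENT_ROOT
# Hypothetical helper: prints the URL path under which IMAGE_DIR is
# reachable, or fails if IMAGE_DIR is not below DOCUMENT_ROOT.
image_url_prefix() {
    image_dir=${1%/}   # drop one trailing slash, if any
    doc_root=${2%/}
    case "$image_dir" in
        "$doc_root"/*) printf '%s\n' "${image_dir#"$doc_root"}" ;;
        "$doc_root")   echo "/" ;;
        *) echo "IMAGE_DIR is not under DocumentRoot" >&2
           return 1 ;;
    esac
}

# With the values from the message:
#   image_url_prefix /home/httpd/html/htdig /home/httpd/html
```

The failure branch catches the mistake described above, where a filesystem path outside the DocumentRoot is used as if it were a URL.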
Re: [htdig] NEED HELP with indexing
On Mon, 15 Jan 2001, George Roberts wrote:

I'm completely new to this software, but inherited a large site which uses it. I made a simple change to some javascript on one of the indexed pages, and I have NO CLUE how to reindex the whole site. Could someone please help?

According to Geoff Hutchison:

It depends a lot on how the original person installed it and your system. But usually there's a program "rundig" that creates the databases. Many people just hack this script to fit local needs, others create local versions (e.g. mine is "rundig.sh"). Of course the best thing to do is to write a version that can be run through the cron program which ensures the indexes are updated on a regular basis automatically. But I digress.

So first, I'd suggest finding the directory containing the databases, e.g.

    locate db.wordlist

Next, make a backup of the files in there. Then see if you can find the rundig script. If so, look for any evidence of a local version with a possibly newer date. If the rundig script you have mentions "alt" somewhere in it, try running "rundig -a" which will update the databases using alternate .work files.

That should get you started in the right direction. But bear in mind that htdig does not index JavaScript, so your changes to the JavaScript on one of the indexed pages may not have any effect at all on searches even after you reindex. See http://www.htdig.org/FAQ.html#q5.18

-- Gilles R. Detillieux
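The backup step could be sketched like this. The function name backup_db and the timestamped backup directory are hypothetical choices, and the sketch assumes the database files are all named db.* in a single directory, as in the db.wordlist example above.

```shell
#!/bin/sh
# backup_db DB_DIR
# Hypothetical helper: copies every db.* file in DB_DIR into a new
# timestamped subdirectory and prints that subdirectory's path.
backup_db() {
    db_dir=$1
    backup_dir="$db_dir/backup-$(date +%Y%m%d-%H%M%S)"
    mkdir -p "$backup_dir" || return 1
    for f in "$db_dir"/db.*; do
        # Skip the unexpanded pattern when no db.* files exist.
        [ -f "$f" ] && cp -p "$f" "$backup_dir"/
    done
    echo "$backup_dir"
}

# Possible use before reindexing (rundig -a only if your script supports it):
#   backup_db /path/to/db && rundig -a
```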
[htdig] PATCH correction: backport ExternalParser.cc from 3.2.0b3 to 3.1.5
I discovered some problems with the argument handling in the patch I posted earlier today. Please ignore that one and apply this one instead...

According to Elijah Kagan:

I run htdig 3.1.5. I tried both the Debian package and a compiled one with the same result. I am absolutely sure there is something stupid I forgot to put into the configuration.

OK, after getting to the bottom of this (I think!), I have backported the 3.2.0b3 development code for htdig/ExternalParser.cc to version 3.1.5, to fix this and other problems. Please give this patch file a try and let me know if it works.

You will probably get a warning about the wait() function being implicitly declared, unless you manually define HAVE_WAIT_H or HAVE_SYS_WAIT_H (depending on whether your system has wait.h or sys/wait.h). Also, if your system has the mkstemp() function, you may want to define HAVE_MKSTEMP manually as well, as this will enhance security. I didn't have time to figure out how to patch aclocal.m4 and configure to add tests for all of these.

The patch fixes the following problems in external_parsers support in 3.1.5:

- it got confused by "; charset=..." in the Content-Type header, as described in "http://www.htdig.org/mail/2000/09/index.html#75".
- security problems with using popen(), and therefore the shell, to parse URL and content-type strings from untrusted sources (now uses pipe/fork/exec instead of popen) - PR#542, PR#951.
- used predictable temporary file name, which could be exploited via symlinks - fixed if mkstemp() exists and HAVE_MKSTEMP is defined.
- binary output from an external converter could get mangled.
- error messages were sometimes ambiguous or missing altogether.
- didn't open temporary file in binary mode for non-Unix systems (attempts were made to fix this, but it's not clear yet whether the security fixes and pipe/fork/exec will port well to Cygwin).
Here's the patch, which you can apply in the main source directory for htdig-3.1.5 using "patch -p0 < this-file":

--- htdig/ExternalParser.cc.orig	Thu Feb 24 20:29:10 2000
+++ htdig/ExternalParser.cc	Mon Jan 15 17:16:50 2001
@@ -1,14 +1,24 @@
 //
 // ExternalParser.cc
 //
-// Implementation of ExternalParser
-// Allows external programs to parse unknown document formats.
-// The parser is expected to return the document in a specific format.
-// The format is documented in http://www.htdig.org/attrs.html#external_parser
+// ExternalParser: Implementation of ExternalParser
+//                 Allows external programs to parse unknown document formats.
+//                 The parser is expected to return the document in a
+//                 specific format. The format is documented
+//                 in http://www.htdig.org/attrs.html#external_parser
 //
-#if RELEASE
-static char RCSid[] = "$Id: ExternalParser.cc,v 1.9.2.3 1999/11/24 02:14:09 grdetil Exp $";
-#endif
+// Part of the ht://Dig package http://www.htdig.org/
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+// http://www.gnu.org/copyleft/gpl.html
+//
+// $Id: ExternalParser.cc,v 1.9.2.4 2001/01/15 17:16:50 grdetil Exp $
+//
+
+#ifdef HAVE_CONFIG_H
+#include "htconfig.h"
+#endif /* HAVE_CONFIG_H */
 
 #include "ExternalParser.h"
 #include "HTML.h"
@@ -19,9 +29,18 @@ static char RCSid[] = "$Id: ExternalPars
 #include "QuotedStringList.h"
 #include "URL.h"
 #include "Dictionary.h"
+#include "good_strtok.h"
+
 #include <ctype.h>
 #include <stdio.h>
-#include "good_strtok.h"
+#include <unistd.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#ifdef HAVE_WAIT_H
+#include <wait.h>
+#elif HAVE_SYS_WAIT_H
+#include <sys/wait.h>
+#endif
 
 static Dictionary *parsers = 0;
 static Dictionary *toTypes = 0;
@@ -32,9 +51,18 @@ extern String configFile;
 //
 ExternalParser::ExternalParser(char *contentType)
 {
+    String mime;
+    int sep;
+
     if (canParse(contentType))
     {
-        currentParser = ((String *)parsers->Find(contentType))->get();
+        String mime = contentType;
+        mime.lowercase();
+        sep = mime.indexOf(';');
+        if (sep != -1)
+            mime = mime.sub(0, sep).get();
+
+        currentParser = ((String *)parsers->Find(mime))->get();
     }
     ExternalParser::contentType = contentType;
 }
@@ -89,6 +117,8 @@ ExternalParser::readLine(FILE *in, Strin
 int
 ExternalParser::canParse(char *contentType)
 {
+    int sep;
+
     if (!parsers)
     {
         parsers = new Dictionary();
@@ -97,7 +127,6 @@ ExternalParser::canParse(char *contentTy
         QuotedStringList qsl(config["external_parsers"], " \t");
         String from, to;
         int i;
-        int sep;
 
         for (i = 0; qsl[i]; i += 2)
         {
@@ -109,11 +138,22 @@ ExternalParser::canParse(char *contentTy
                 to = from.sub(sep+2).get();
Re: [htdig] Problem with PDF files....
According to Elijah Kagan:

1. I run htdig with an explicit -c option, so it uses the correct conf file.
2. I rewrote the external_parsers so it includes only one line...
3. ..and it is the first line in the file

Results are the same! It is still looking for an acroread! Please, help. I am getting desperate...

Hmm. You're sure you're running version 3.1.5 of htdig, and you don't have a pre-3.1.4 binary of htdig kicking around that you might be unknowingly running instead? External converter support was added to the external_parsers attribute only in version 3.1.4 and above. If you're sure this isn't the problem either, please send me a copy of your conf file as it stands now (preferably uuencoded right on your htdig box to prevent e-mail mangling of it), and I'll have a look and try a test or two.

Oh, another thing. You mentioned this was on a Debian system. Did you compile htdig yourself, or did you use a pre-compiled binary? If the latter, which one?

-- Gilles R. Detillieux
Re: [htdig] 3.1.3 engine on 3.1.5 db
According to Dave Salisbury:

If you created your database with htdig 3.1.5, and want to search it with htsearch 3.1.3, that's a bad idea. The most glaring bug in releases before 3.1.5 is in htsearch, so you really should upgrade it.

I take it one of the worst things is the security hole which allows a user to view any file with read permissions (ouch!)

That's the one!

Is there any way to correct for this with a wrapper around htsearch? Reading the indices using 3.1.3 that were created by a 3.1.5 engine seems to work just fine.

There would be, but it might be a tad tricky. The idea is to use a backslash to quote any left quote (`), dollar sign ($) or backslash (\) in the query string that is part of an input parameter value that will get added to the config object as an internal attribute setting. The lines in htsearch/htsearch.cc that do this are (from a grep):

    config.Add("match_method", input["method"]);
    config.Add("template_name", input["format"]);
    config.Add("matches_per_page", input["matchesperpage"]);
    config.Add("config", input["config"]);
    config.Add("restrict", input["restrict"]);
    config.Add("exclude", input["exclude"]);
    config.Add("keywords", input["keywords"]);
    config.Add("sort", input["sort"]);
    config.Add(form_vars[i], input[form_vars[i]]);

The last one above is the tricky one, as it can be any input parameter name that you use in allow_in_form. Rather than limiting the backslash escaping of special characters to only the values of these parameters, it might be better to do the whole query string, but exclude a few parameters where this might be undesirable. I'd recommend NOT doing this for the "words" input parameter, for instance, but I can't think of any others right off-hand where you would not want to do this.

Anyone out there want to bash Glimpse before I look into it. I'm hoping to get it at least to compile on an SGI.

I won't do any bashing, but if htdig is your preference, I'd suggest not giving up on it too quickly. Did you have a look at David Adams' recent post about an "IRIX compile fix"? In it, he forwarded a message from Bob MacCallum that explains a workaround to some problems on IRIX 6.5, using cc, not gcc. If you haven't already, you ought to try that before abandoning htdig.

On the other hand, if you have an existing database built with version 3.1.3, and want to use it with the latest htsearch, that should work without any difficulty. However, you'll lose out on several benefits in the latest htdig (better parsing of meta tags, parsing img alt text, fixed parsing of URL parameters, etc.),

Couldn't find what "fixed parsing of URL parameters" means. The query string is part of what's indexed??

The query string isn't indexed, but it's part of the URL. 3.1.3 mangled bare ampersands (&) in the query string in a URL, and versions before that didn't decode sequences like &eacute; within a URL. I think the ChangeLog explains it better than the release notes.

    Tue Nov 23 19:52:27 1999 Gilles Detillieux [EMAIL PROTECTED]
    * htdig/HTML.cc(transSGML), htdig/SGMLEntities.cc(translateAndUpdate): Fix the infamous problem in htdig 3.1.3 of mangling URL parameters that contain bare ampersands (&), and not converting &amp; entities in URLs.
    ...
    Wed Sep 1 15:39:41 1999 Gilles Detillieux [EMAIL PROTECTED]
    * htdig/HTML.h, htdig/HTML.cc(do_tag, transSGML): Fix the HTML parser to decode SGML entities within tag attributes.

which you'll only get if you reindex with htdig 3.1.5. Maybe none of these matter for your site, though. See the release notes and ChangeLog for details.

I don't think they're essential.

Except for the URL parameter mangling fix, of course.

-- Gilles R. Detillieux
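The escaping step at the heart of such a wrapper might look like the sketch below, which backslash-quotes the three characters named above in a single parameter value. The function name escape_value is hypothetical. A real wrapper would also have to split the query string into parameters, leave "words" alone, and deal with URL-encoded forms of these characters, none of which is shown here.

```shell
#!/bin/sh
# escape_value VALUE
# Hypothetical helper: puts a backslash in front of every backslash,
# backquote and dollar sign in VALUE. Backslashes are handled first so
# that the backslashes added for the other two are not doubled.
escape_value() {
    printf '%s\n' "$1" | sed \
        -e 's/\\/\\\\/g' \
        -e 's/`/\\`/g' \
        -e 's/\$/\\$/g'
}

# Example: escape one parameter value before passing it on.
#   escape_value 'some$value'
```

This is a stopgap sketch only; as the thread says, upgrading htsearch to 3.1.5 is the actual fix.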
Re: [htdig] Performance problems with htdig 3.2.0b2
According to Mathias Rohland:

I have a problem with the performance of htdig 3.2.0b2. I'm indexing about +25.500 HTML-docs at the moment and it takes several (+8) hours to index them on a machine that's not too busy with other tasks (PII 233 w/ 512K Cache and 128MB RAM). ... I need to use htdig 3.2.0b2 as we need phrase searching and a second machine in another location that runs with solaris won't like 3.2.0b3.

3.2.0b3 is still a work in progress, but it already fixes a large number of bugs in 3.2.0b2. Try the latest snapshot of b3, and if you still can't compile it on Solaris, please e-mail us at this list the output of the configure and make runs. I don't think it makes sense for us to take time debugging an old beta version when the real problem here is you can't build the latest beta pre-release.

-- Gilles R. Detillieux
Re: [htdig] security hole (was: how to set the $(PERCENT)? -it always show 1%)
According to Edward Lu:

Geoff, What is the security hole in version 3.1.5? It sounds scary.

The security hole is in versions BEFORE 3.1.5, and is fixed in 3.1.5. It allowed a user to snoop through any file on your web server's file system, as long as it was readable by the user ID under which the web server process runs, just by passing it a special query string in the htsearch URL.

-- Gilles R. Detillieux
[htdig] Re: any suggestions for using 3.1.5 or 3.2.0b2?
According to Edward Lu:

According to the release notes for htdig-3.2.0b2, it added more functionality and fixed all known bugs after 3.1.5. But apparently it still has the relevance ($(PERCENT)) bug and is not stable enough. I am asking for any suggestions about which version (3.1.5 or 3.2.0b2) should be used for our company web site. Any experience about the advantages and disadvantages of both versions? Any suggestions will be greatly appreciated. -Edward

It's correct that 3.2.0b2 fixed many known bugs in 3.1.5, but none of these were earth-shattering problems. There were many limitations, though, in the 3.1.x series that required a pretty radical redesign of many components. While 3.2.0b2 did fix some bugs, it introduced a whole lot because of the large number of redesigned/rewritten components. That's why 3.2 is still in beta. The latest 3.2.0b3 pre-release source snapshot fixes a lot of the 3.2.0b2 bugs, but there are still some that remain.

If you need the features of 3.2, then use the b3 snapshots, not the b2 release. If you don't need these features, and the limitations of 3.1 aren't a problem for you, then you'd be wise to stick to 3.1.5 for a production system until 3.2 gets a bit more of a shakeout.

-- Gilles R. Detillieux
Re: [htdig] keep temp files while running indexer? How to...
According to Stephen Murray: Hi Gilles, When you wrote: "Only the one on the contrib section of the FTP site and web site is current." You were referring to rundig.sh at http://www.htdig.org/contrib/ - -- right? That's the one I should use? (As Geoff suggested?) I answered this yesterday evening, after the first time you asked, but here goes again... Yes, it's the Scripts sub-section of that part of the web site, which actually takes you to the http://www.htdig.org/files/contrib/scripts/ directory. To clarify further, the one you should NOT use is the one in the contrib directory of the htdig-3.1.5.tar.gz source distribution, or any other source distribution, as these are the ones that are outdated. Using the URL you mentioned in your e-mail above will get you to the correct script in two clicks. First, the link "Scripts" in the left frame, then the link "rundig.sh" in the right frame. Using the URL I mentioned in my reply above will get you there in one click. Either way, it's the same directory and the same file, presented either with or without the frame structure on the web page. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
Re: [htdig] using perl/cron to find badwords on site
According to Jerry Preeper:

I don't know if anyone else has run across this yet, but I have a number of guestbooks and things like that where people can post and I would love to be able to find a way to set up a daily cron job with a Perl script that basically runs a set of badwords through htsearch and then emails me a list of just the URLs it finds with those words in them. I don't really need things like the page title or description or stuff like that. I'm assuming I'll need to use a system call in the script to some sort of command line option and loop it for each word. Any input would be greatly appreciated.

I assume that you want your htdig database updated through this same cron job, before running htsearch, so that the database you search will contain any new postings to the guestbooks. The simplest way I can think of, assuming the correct settings are already made in htdig.conf, would be a shell script with these commands:

    htdig
    htmerge
    /path/to/cgi-bin/htsearch "words=badword1+badword2+badword3+badword4"

Of course, if you want to write it in Perl, especially if you need more processing than simply running these programs, you can call the above commands in one or more calls to the system("...") function in Perl. You may want to customise the htsearch templates to get just the URL, if that's all you want (see template_map, search_results_header and search_results_footer in http://www.htdig.org/attrs.html).

If you want to search for each word separately, rather than one query for all words, then you'd need to call htsearch once for each individual word. E.g. in a shell script, you could do:

    htdig; htmerge
    for word in badword1 badword2 badword3 badword4
    do
        echo "${word}:"
        /path/to/cgi-bin/htsearch "words=${word}"
    done

or:

    htdig; htmerge
    while read word
    do
        echo "${word}:"
        /path/to/cgi-bin/htsearch "words=${word}"
    done < /path/to/bad-word-file

However, it seems to me it would be better to search for all at once, unless you need a word by word summary of URLs.

-- Gilles R. Detillieux
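For the single-query variant, the query string can be built mechanically from a word list. The following is only a minimal sketch: the helper name build_query is made up for illustration, and it assumes one word per line with no characters that would need URL encoding. Adding method=or is also an assumption about what is wanted here, since htsearch's match method normally defaults to "and", which would report only pages containing every word on the list.

```shell
#!/bin/sh
# build_query (hypothetical helper): read one word per line on stdin and
# print an htsearch query string matching pages that contain ANY word.
# Assumes the words need no URL encoding (plain letters and digits).
build_query() {
    # Drop blank lines, then join the remaining words with '+'.
    words=$(grep -v '^[[:space:]]*$' | paste -s -d '+' -)
    printf 'words=%s&method=or\n' "$words"
}

# Example with an inline list; a real cron job would read a file instead.
printf 'badword1\nbadword2\nbadword3\n' | build_query
```

The resulting string would be passed as the single argument to /path/to/cgi-bin/htsearch, after htdig and htmerge have run, in the same way as in the scripts above.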
Re: [htdig] External Converter Prob
According to Reich, Stefan:

all my descriptions are starting with "content-type: text/html". Is this normal behavior or is it, because I'm using an external converter to do some modifications on the spidered html files. I registered my converter for text/html - text/myhtml conversion. I've patched the html parser to recognize this in addition to text/html. I'm sure my external converter doesn't write text/html to the output stream. Any ideas?

No, this is not normal behaviour. If you're certain that your external converter doesn't write this out, then we'd have to assume it comes from elsewhere. It may be a stupid question, but are you sure the pages you're indexing don't contain this extra header? I've seen defective CGI scripts, for example, that inadvertently output two such headers in some situations. Ditto for SSI pages that call CGI scripts incorrectly. Finally, it's hard to be sure it isn't a problem with your patches to htdig, or with your particular configuration, without being able to see them. I don't know if this helps or not, but it may give you a few more places to look.

-- Gilles R. Detillieux
Re: [htdig] Unable to contact server-revisisted
According to Roger Weiss: I'm running htdig v3.1.5 and my digging seems to be running out of steam after it runs for anywhere from 20 minutes to an hour or so. The initial msg was "Unable to connect to server". So, I ran it again with -v v v to get the error message below. pick: ponderingjudd.xxx.com, # servers = 550 3213:3622:2:http://ponderingjudd.xxx.com/ponderingjudd/id6.html: Unable to build connection with ponderingjudd.xxx.com:80 no server running I've replaced part of the URL with xxx to protect the innocent. The server certainly is running and I had no trouble finding the mentioned url. Is there some parm I need to set or limit I need to raise? We're running an apache server with startservers =25 and minspace=10. I guess the next question, if you're sure the server is running, is can you access it from a client? More specifically, can you access it using a different web client on the same system as the one on which you're running htdig (e.g. from lynx, Netscape, kfm, or some other Linux/Unix-based web browser)? If you can, then the problem will be to figure out why htdig can't build the connection while other programs on the same system can. If you can't access the server from any client program on the same system, then the problem isn't with htdig, but with your network setup (e.g. firewall, packet filtering, or a bad connection from that system). -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
Re: [htdig] htdig
According to Geoff Hutchison:

No regular expressions needed. You can limit URLs based on query patterns already. See the bad_querystr attribute: http://www.htdig.org/attrs.html#bad_querystr ...

On Thu, 11 Jan 2001, Richard Bethany wrote:

I'm the SysAdmin for our web servers and I'm working with Chuck (who does the development work) on this problem. Here's the "nuts & bolts" of the problem. Our entire web server is set up with a menuing system being run through PHP3. This menuing system basically allows local documents/links to be reached via a URL off of the PHP3 file. In other words, if I try to access a particular page it will be accessed as http://ourweb.com/DEPT/index.php3?i=1&e=3&p=2:3:4:. In this scenario the only relevant piece of info is the "i" value; the remainder of the info simply describes which portions of the menu should be displayed. What ends up happening is that, for a page with eight (8) main menu items, 40,320 (8*7*6*5*4*3*2*1) different "hits" show up in htDig for each link!! I essentially need to exclude any URL where "p" has more than one value (i.e. p=1: is okay, p=1:2: is not).

I've looked through the mailing list archives and found a great deal of discussion on the topic of regular expressions with exclusions and also some talk of stripping parts of the URL, but I've seen nothing to indicate that any of this has actually been implemented. Do you know if there is any implementation of this? If not, I saw a reply to a different problem from Gilles indicating that the URL::normalizePath() function would be the best place to start hacking so I guess I'll try that.

I guess the problem, though, is that without regular expressions it could mean a large list of possible values that need to be specified explicitly. The same problem exists for exclude_urls as for bad_querystr, as they're handled essentially the same way, the only difference being that bad_querystr is limited to patterns occurring on or after the last "?" in the URL.

So, if p=1: is valid, but p=[2-9].* and p=1:[2-9].* are not, then the explicit list in bad_querystr would need to be:

    bad_querystr: p=2 p=3 p=4 p=5 p=6 p=7 p=8 p=9 \
        p=1:2 p=1:3 p=1:4 p=1:5 p=1:6 p=1:7 p=1:8 p=1:9

It gets a bit more complicated if you need to deal with numbers of two or more digits too, because then you can allow p=1: but not p=1[0-9]:, so you'd need to include these patterns in the list too:

    p=10 p=11 p=12 p=13 p=14 p=15 p=16 p=17 p=18 p=19 p=1:1

So, while it's not pretty, it is feasible provided the range of possibilities doesn't get overly complex. This will be easier in 3.2, which will allow regular expressions.

I think my suggestion for hacking URL::normalizePath() involved much more complicated patterns, and search-and-replace style substitutions based on those patterns. That may still be the way to go if you want to do normalisations of patterns rather than simple exclusions, e.g. if you're not guaranteed to hit a link to each page using a non-excluded pattern.

-- Gilles R. Detillieux
Re: [htdig] Problem with PDF files....
According to Elijah Kagan:

Dear Everyone

Hope this is the correct list to send such questions. If not, accept my apologies. When I run htdig on my files I get the following message when it comes to a PDF document:

    41:41:3:http://myserver/~elijah/document.pdf: PDF::parse: cannot find pdf parser /usr/local/bin/acroread size = 1965732

For some reason htdig looks for an Acrobat while its config file clearly states:

    external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
                      application/postscript->text/html /usr/local/bin/conv_doc.pl \
                      application/pdf->text/html /usr/local/bin/conv_doc.pl

The conv_doc.pl exists and working and the content type received from the server is application/pdf. Any ideas? ... P.S. I am running htdig 3.1.5 on a Debian system.

There are a few possibilities:

1) htdig isn't looking at this config file, but another one, without the external_parsers definition;
2) there's a typo in the external_parsers definition that isn't showing up in the text you e-mailed above, e.g. a misspelled word or a space after one of the backslashes at the end of the first two lines; or
3) there's a definition right above your external_parsers definition that mistakenly ends with a backslash at the end of the line, causing your external_parsers definition to be swallowed up by the previous line.

That htdig is attempting to invoke acroread confirms two things:

a) the PDF file is correctly being tagged by the server as application/pdf, and
b) htdig is not seeing a usable definition of an external parser for that content-type, for any of the reasons outlined above.

-- Gilles R. Detillieux
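Possibilities 2) and 3) above can be checked mechanically rather than by eye. What follows is a rough sketch only, assuming the config file path is passed as the first argument; the two function names are made up for illustration and are not part of ht://Dig.

```shell
#!/bin/sh
# Print (with line numbers) any line where a continuation backslash is
# followed by trailing whitespace, which stops it acting as a continuation.
find_broken_continuations() {
    grep -n '\\[[:space:]][[:space:]]*$' "$1"
}

# Print any line that ends in a backslash and is immediately followed by a
# line starting an external_parsers definition (possibility 3 above).
find_swallowed_parsers() {
    awk 'prev ~ /\\$/ && /^[[:space:]]*external_parsers:/ { print NR - 1 ": " prev }
         { prev = $0 }' "$1"
}
```

Running both against the config file that htdig actually reads (see possibility 1) should print nothing if neither problem is present.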
Re: [htdig] htdig
According to Richard Bethany:

That was my fear as well. For the one link below with eight menu items, I need to accept p=1: through p=8: to pick up any/all links in the submenus, but I would have to reject the other 40,312 possible combinations of values that "p" can have. As you stated, that would be a mite cumbersome and, if we had pages with more menu items (we do), it would become exponentially more impossible (-- can something be "more" impossible? How about more improbable?) to limit the accepted values. Does the 3.2 beta release seem pretty stable? Does the regex functionality work properly? If so, perhaps I'll give that a shot. If not, I suppose I'll just dig around in the code to see if I can find a way to get it to do what we need.

The current 3.2 beta release (b2) isn't stable. The latest development snapshot for 3.2.0b3 is much more so, but IMHO still not quite ready for prime-time. Ironically, one of the remaining problems is that long, complex regular expressions seem to be silently failing right now, so we still need to get to the bottom of that.

However, even if you need to reject 40,312 possible combinations of values, it doesn't mean you'd need to explicitly list each of those, as many of them could be covered by the same substring. The current handling of exclude_urls and bad_querystr does substring matching, so there's an implied .* on either side of each string you give for these two attributes.

Because any of 1 through 8 can be used as the initial p= value, it makes the problem more complicated than I assumed, but not by a huge amount. If I understand correctly, as long as there's only one menu value specified, it's OK, but if there are two or more, it's not OK, and only 1 through 8 will appear as possible menu values. Now, a string of more than two menu values will be matched by a substring of only two values, so all you need are all possible series of two values, or 8 x 8 = 64 patterns, p=1:1 through to p=8:8. Correct?

-- Gilles R. Detillieux
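The 64 two-value patterns described above need not be typed by hand. Here is a small sketch that prints them as a single bad_querystr line; the function name gen_bad_querystr is made up for illustration, and it assumes single-digit menu values as in the example (two-digit values would need the extra care discussed earlier in the thread).

```shell
#!/bin/sh
# gen_bad_querystr [N]: print a bad_querystr line listing every two-value
# prefix p=A:B for A and B in 1..N (N defaults to 8 main menu items).
# Substring matching then rejects any URL whose p has two or more values.
gen_bad_querystr() {
    n=${1:-8}
    printf 'bad_querystr:'
    a=1
    while [ "$a" -le "$n" ]; do
        b=1
        while [ "$b" -le "$n" ]; do
            printf ' p=%s:%s' "$a" "$b"
            b=$((b + 1))
        done
        a=$((a + 1))
    done
    printf '\n'
}

gen_bad_querystr 8
```

The output can be pasted into htdig.conf as is, or wrapped across lines with trailing backslashes for readability.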
Re: [htdig] PDFs, numbers, and percent signs
According to Philip E. Varner:

1) The directive minimum_word_length defaults to 3, but when dealing with two-digit numbers, this should be set to two. The default would catch "25%", but not other numbers. This needs to be set in htdig.conf, AND in parse_doc.pl, if using it. parse_doc.pl should probably be changed to read variables from htdig.conf at some point in time, but that's not my call.

2) In addition to minimum_word_length, I added these attributes to htdig.conf:

    allow_numbers: true
    extra_word_characters: %$#
    valid_punctuation: .-_/!^'

By default, htdig ignores numbers, so I set it to count them. It also ignores most punctuation, so I allow the characters %$# since they are common pre/suffixes for numbers. valid_punctuation then says what to ignore. Also, these need to be accounted for in parse_doc.pl.

3) The default for parse_doc.pl is to strip all punctuation, with the command

    tr{-\255._/!#$%^'}{}d;

I changed this to

    tr{-\255._/!^'}{}d;

to leave the punctuation I wanted. However, this punctuation was still deleted because of the way the text is split() into an array. I changed the command

    push @allwords, grep { length >= $minimum_word_length } split/\W+/;

to

    push @allwords, grep { length >= $minimum_word_length } split /\s+/;

\W matches anything that's not a word character, which includes punctuation. So, punctuation was still getting stripped out. \s matches all whitespace, which is what I really want, since all "offending" punctuation was removed earlier. This works for me, but might not work for everyone.

4) I increased the limit on these two attributes, since PDFs are larger, I only had a few dozen, and I wanted good matches. This is probably not a good idea if you have a lot of files, though.

    max_head_length:50
    max_doc_size: 5000

If anyone has any other suggestions, I'd like to hear about them.
Most of the problems you ran into could have easily been avoided if you tossed parse_doc.pl into the bit bucket and used an external converter like doc2html.pl or conv_doc.pl instead. As you realised, external parsers don't read your config file attributes, and it would mean making them extremely big and complicated, with a lot of duplication of code, to get them to do this properly. That's why external parsers, in most cases, are a bad idea. That's also why I added external converter support back in version 3.1.4. That way, you just need a simple conversion to plain text or HTML, and all the gory details of parsing the document in accordance with the users wishes are handled internally by the text or HTML parser. So, no, parse_doc.pl should not be changed to read the htdig.conf attributes. It should be given a decent burial and forgotten. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
Re: [htdig] keep temp files while running indexer? How to...
According to Geoff Hutchison: At 10:43 AM -0800 1/9/01, [EMAIL PROTECTED] wrote: 1) Are we right in the assumptions we're making above (the temp files are being destroyed and are thus not available during indexing) and If you are not specifying the -a flag to htdig/htmerge then it will modify the filenames specified in your htdig.conf. This would probably not be what you want. 2) if so, how do I change the conf file so that the temporary files will be available to the search engine during indexing - - so that the search engine will still work during indexing? You might want to take a look at the rundig.sh script in the contrib/ section (I'm pretty sure it's in the releases, but it's definitely on the FTP server.) The version in the release distributions, and even in the current snapshots, is out of date, and still refers to a .gdbm database. Only the one on the contrib section of the FTP site and web site is current. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
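The idea behind such a script is that htdig -a and htmerge -a build a second set of database files alongside the live ones, and the new set is swapped in only once indexing has finished, so htsearch keeps working in the meantime. Below is a minimal sketch of the swap step only, assuming the work copies carry a .work suffix in the database directory; the function name and directory argument are made up for illustration, and the current rundig.sh from the contrib site should be preferred over this.

```shell
#!/bin/sh
# promote_work_files DIR (hypothetical helper): rename every DIR/*.work
# file to the same name without the suffix, replacing the live copy.
promote_work_files() {
    for f in "$1"/*.work; do
        [ -e "$f" ] || continue   # glob did not match: nothing to do
        mv -f "$f" "${f%.work}"
    done
}
```

It would be called only after both htdig -a and htmerge -a have exited successfully, e.g. promote_work_files /path/to/db, so that a failed run never replaces a working index.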
Re: [htdig] keep temp files while running indexer? How to...
According to [EMAIL PROTECTED]: A couple questions: 1) The file you're referring to is rundig.sh on http://www.htdig.org/contrib/ (right?) Yes, it's the Scripts sub-section of that part of the web site, which actually takes you to the http://www.htdig.org/files/contrib/scripts/ directory. 2) Does the file have to be modified for my system or can I use it as is? (I know, dumb question) Well, shell scripts don't read the htdig.conf file, so it may be that some changes there will require corresponding changes in the scripts, especially with regard to file and directory names. Take a close look at it. 3) Does the file go in bin/rundig or in cgi-bin/rundig? Only htsearch should go in cgi-bin. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
Re: [htdig] Multiple domain names pointing on the same site
According to Malcolm Austen: On Mon, 8 Jan 2001 [EMAIL PROTECTED] wrote: + If a site can be reached via different domain names, + is there a trick to make htsearch generate result + links pointing to the domain name the user reached + the site with ? Check out the server_aliases: options. It does just what you want. server_aliases: a.com:80=b.com:80 will result in references to a.com being treated as if they were references to b.com Well, this is a start, but it's only part of the solution. What this will do is ensure that only the canonical server name, b.com in this example, is used for entries in the database. However... Greg also wrote: + A user reaches the site a.com, makes a search, the result + would be www.a.com/searchedpage.html + If another user reaches the site b.com (wich is the same + document as a.com), the result would link to + www.b.com/searchedpage.html This is tricker, as what you want is for a given, presumably static database for all domains, to alter the search results' domain names to match the domain name used in the URL that called htsearch. I think this would require a combination of server_aliases as above for canonicalising the domain name, and url_part_aliases to encode the canonical domain in the database. Then, the search wrapper would figure out the domain name used in the CGI URL, and pass that to the real htsearch which would use it in its own url_part_aliases to decode the encoded canonical domain into the desired domain name. 
For example, in htdig and htmerge's htdig.conf:

    server_aliases: www.a.com:80=www.real.com:80 \
                    www.b.com:80=www.real.com:80
    url_part_aliases: www.real.com *site

Then in htsearch's config file:

    url_part_aliases: ${searchdomain} *site
    searchdomain: www.real.com
    allow_in_form: searchdomain

Then, the search form would set the "config" input parameter to select this particular search config file, and set the action to call a wrapper script like this one, using the "GET" method:

    #!/bin/sh
    case "$QUERY_STRING" in
    *searchdomain=*) ;;   # searchdomain is already set, so leave it
    *)  # set searchdomain to HTTP host name used in request
        QUERY_STRING="${QUERY_STRING}&searchdomain=$HTTP_HOST"
        export QUERY_STRING
        ;;
    esac
    exec /some/path/to/real/htsearch

I'm pretty sure this should work, because htsearch seems to parse allow_in_form's value, and make its input parameters override the corresponding config attributes, before the url_part_aliases value is parsed by the HtURLCodec class.

-- Gilles R. Detillieux
[htdig] Re: Enhancement request (PR#991)
According to [EMAIL PROTECTED]:

Is it possible for you to add a feature to the config file to allow custom information in anchor urls in excerpts. e.g. I would like to add the "target" attribute to the anchor urls so that I can direct the matching url to another frame on the page.

A pretty quick and easy way of doing this would be to change this line in htsearch/Display.cc's Display::hilight() method (line 1215 in unpatched 3.1.5 code):

    result << "<a href=\"" << urlanchor << "\">";

to something like:

    result << "<a href=\"" << urlanchor << "\" " << config["urlanchor_parameters"] << ">";

Then, you could set this in your htsearch config file:

    urlanchor_parameters: target="body"

I can think of other, more powerful and flexible ways of doing this but they'd involve much more complicated changes to the code. This one should do the job for you.

-- Gilles R. Detillieux
Re: [htdig] 3.1.5 engine on 3.1.3 db
According to Dave Salisbury: From: Geoff Hutchison [EMAIL PROTECTED] But the root question is: Why are you having problems compiling 3.1.5 on IRIX? I posted this a day or two ago, with no responses. Any help would be appreciated! IRIX 6.5 things go well until these errors show up. Unfortunately, there aren't a lot of people on the list with IRIX experience, so it's hard to make rapid headway on that front. make[1]: Entering directory `/home/salisbur/htdig-3.1.5/htfuzzy' g++ -o htfuzzy -L../htlib -L../htcommon -L../db/dist -L/usr/lib32 Endings.o EndingsDB.o Exact.o Fuzzy.o Metaphone.o Soundex.o SuffixEntry.o Synonym.o htfuzzy.o Substring.o Prefix.o ../htcommon/libcommon.a ../htlib/libht.a ../db/dist/libdb.a -lnsl -lsocket ld32: WARNING 131: Multiply defined weak symbol:(Deserialize__6ObjectR6StringRi) in Endings.o and EndingsDB.o (2nd definition ignored). ... It's hard to say for sure what's causing these warnings. It seems perhaps the override of virtual methods in the Object class with non-virtual ones in the String class is causing this. Maybe it's just because SGI's ld32 doesn't like the way g++ builds these objects. and on till a warning message limit is reached and then many errors like: ... ld32: Giving up after printing 50 warnings. Use -wall to print all warnings. ld32: ERROR 33 : Unresolved text symbol "cout" -- 1st referenced by EndingsDB.o. Use linker option -v to see when and which objects, archives and dsos are loaded. ld32: ERROR 33 : Unresolved text symbol "__ls__7ostreamPCc" -- 1st referenced by EndingsDB.o. Use linker option -v to see when and which objects, archives and dsos are loaded. ... Now these errors seem to be the result of ld32 not finding the required C++ classes in the C++ system library. Either it's not finding the library at all (or not told where to find it), or the library is somehow incompatible with the g++ compiler you have installed. 
Given these errors, I doubt your system could even compile and link a simple "Hello, World" program in C++. You'd need either to get to the bottom of this and fix it, or you'd need to get 3.1.5 built on the system from which you got the 3.1.3 binaries. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
Re: [htdig] Re: Enhancement request (PR#991)
According to Kapil Biyani: Instead of editing the .cc files from the source, just to add the target param, I guess you can even change the long.html file in the $commondir All one has to do is enable the long file in the source and then edit the particular file and add parameters required to it. See below what to add in config file. You can see an working example of it at http://www.indiainfoline.com/search/ I have infact added a complete line of code to it which open the results in a new frame which also has another frame where the search box is again displayed...(sounds confusing, check the site :-( That fix works for the main search result URL shown by the result template, but Stephen was asking about the anchor URLs that pop up in the excerpt, when the first matched word appears after an anchor tag in the source document, and when add_anchors_to_excerpt is true. There's nothing you can do about these links in an unpatched htsearch, because the HTML that generates them is hardcoded in the hilight() method. If you want to use a frames-based setup for htsearch, where the text from matched pages is displayed in a different frame than the htsearch results (and their links to these matched pages), then you must use target specifications for the main URL in the template as well as the URLs in the excerpt, unless you disable add_anchors_to_excerpt. As Stephen was asking only about the excerpts, I assumed he had figured out about the templates, and didn't want to disable the anchors in excerpts. Infact if you add the long.html file in your config file you can make it configurable to any extent you want...Here is what to add.. - template_map: Long long ${common_dir}/long.html \ Short short ${common_dir}/short.html template_name: long -- for more info. check the http://www.indiainfoline.com/search/ page... (*Hope I am not wrong, if I am then SORRY*) No apologies necessary. 
If you search for "name" on your site above, you'll see a couple excerpts where the highlighted word is hyperlinked to an anchor within the document. That will illustrate what I'm talking about. Your template changes don't affect these links. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
Re: [htdig] [PB] reference count overflow
According to heddy Boubaker: "Gilles" == Gilles Detillieux [EMAIL PROTECTED] writes: I sometimes have errors like that when searching: DB2 problem...: ...intranet-db.words.db: page 0: reference count overflow Gilles My guess would be a corrupt database. Try rebuilding it from Gilles scratch. Thanks, I solved the pb by doing that. But what could corrupt the db ? The only actions I did on the concerned db was: htdig init then htmerge from scratch and then, once a week, htdig htmerge are run again with the same config... If something corrupted my db it should be a bug somewhere no? Potentially, yes. We've received scattered reports of inexplicable database corruption in the 3.1.x series, but never anything solid or consistent enough that we could nail down to a specific bug. We don't know for sure even if it is a bug, but we suspect that it is, albeit an obscure and infrequent one. If the problem happens frequently and/or consistently, please let us know and we can try to track it down. Otherwise, all we can recommend is to rebuild the index from time to time to correct this. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
Re: [htdig] $(NEXTPAGE) does not work in header
According to SMantscheff:

--- start of header file ---
<p>Die Suche nach <i>${WORDS}</i> ergab folgende Resultate [Seite ${PAGE}/${PAGES}]:</p>
$(PREVPAGE) $(NEXTPAGE)
--- end of header file ---

It seems that the $(NEXTPAGE) variable does not work in the header files, while $(PREVPAGE) does. I've got a search result with 4 pages, so both items should appear. What am I missing?

Hard to say for sure. Which version are you running, and what is the value of maximum_pages in your config? The logic htsearch uses is that if the current page number is less than maximum_pages, it will create a link for the next page, using the value of next_page_text as the link description text. If this string is empty, it could result in an "invisible" link, i.e. the <a href...> tag and the </a> tag with nothing in between. If the current page number is equal to maximum_pages, it will set NEXTPAGE to the value of no_next_page_text, which commonly is an empty string. In other words, NEXTPAGE will normally be empty for the last page of search results, while PREVPAGE will normally be empty for the first page.

Check your values for all of the attributes above, and take a good look at the resulting HTML output of htsearch. Also, take a close look at your config file and templates for any typos.

-- Gilles R. Detillieux
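The paging logic described in the reply above can be sketched roughly as follows. This is only an illustration of the description in the message, not htsearch's actual code; the function name and the next_url argument are made up for the example, while maximum_pages, next_page_text and no_next_page_text are the config attributes mentioned above.

```python
def next_page_value(page, maximum_pages, next_page_text, no_next_page_text, next_url):
    """Return what a NEXTPAGE-style variable would expand to (hypothetical helper)."""
    if page < maximum_pages:
        # A link is wrapped around next_page_text. If that text is empty,
        # the result is an anchor with nothing visible inside it.
        return '<a href="{}">{}</a>'.format(next_url, next_page_text)
    # On the last page the no_next_page_text value is used instead,
    # which is commonly an empty string.
    return no_next_page_text


if __name__ == "__main__":
    # Page 1 of 4 with an empty next_page_text: an "invisible" link.
    print(repr(next_page_value(1, 4, "", "", "htsearch?page=2")))
    # Page 4 of 4: the no_next_page_text value, here empty.
    print(repr(next_page_value(4, 4, "[next]", "", "htsearch?page=5")))
```

If the first case matches what shows up in the HTML source of the results page, the likely culprit is an empty next_page_text rather than a problem in the header template.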
Re: [htdig] Fw: [htdig] - Question for start_url and exclude_urls
According to "Mohai Wang" [EMAIL PROTECTED]:

1. start_url: With start_url = "http://stagsite.coreon.com/download/", when I run "rundig -vvv log", I get this error message on screen: "DB2 problem...: missing or empty key value specified". I have also attached the debug mode "log" and "htdig.conf" files, please take a look. Did I set a wrong option? With start_url = "http://stagsite.coreon.com/" it goes through and writes the index, but I only need to index everything under "download", nothing else.

The "missing or empty key value specified" error happens when the one and only entry in the db.docdb database is deleted because the document could not be fetched. I.e. this is a symptom, and not the root cause of the problem. The root cause is very clearly indicated in your attached "log.dat" file:

0:0:0:http://stagesite.coreon.com/download/:
Retrieval command for http://stagesite.coreon.com/download/:
GET /download/ HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Host: stagesite.coreon.com
Header line: HTTP/1.1 403 Forbidden

The 403 Forbidden error means htdig could not fetch the only document specified in your start_url, i.e. the /download/ directory. 403 errors are almost always the result of file permission problems. The web server's user ID does not have read permission (or search/execute permission) on that directory, so no web client can access it from your web server. You'd almost certainly get the same error from your web browser if you attempted to look at that directory from there using this same URL.

2. exclude_urls: I tried something different: start_url = "http://stagsite.coreon.com/", and I added exclude_urls = "/cgi-bin/ /calendar/ /coreonlib/". When I run "rundig -vvv log3", it reads /coreonlib/ first and then stops. After I removed "coreonlib" from exclude_urls and reran "rundig -vvv log2", everything was indexed, and "cgi-bin" and "calendar" were rejected. Could you tell me why? Please take a look at the log3 file. ...
0:0:0:http://stagesite.coreon.com/: Retrieval command for http://stagesite.coreon.com/: GET / HTTP/1.0 User-Agent: htdig/3.1.5 ([EMAIL PROTECTED]) Host: stagesite.coreon.com Header line: HTTP/1.1 200 OK Header line: Date: Thu, 04 Jan 2001 16:27:48 GMT Header line: Server: Apache/1.3.12 (Unix) tomcat/1.0 mod_perl/1.24 mod_ssl/2.6.6 OpenSSL/0.9.4 Header line: Last-Modified: Tue, 12 Dec 2000 02:14:53 GMT Translated Tue, 12 Dec 2000 02:14:53 GMT to 2000-12-12 02:14:53 (100) And converted to Tue, 12 Dec 2000 02:14:53 Header line: ETag: "48890-dc0-3a358a1d" Header line: Accept-Ranges: bytes Header line: Content-Length: 3520 Header line: Connection: close Header line: Content-Type: text/html Header line: returnStatus = 0 Read 3520 from document Read a total of 3520 bytes title: Insite href: http://stagesite.coreon.com/coreonlib/html/top_index.htm () Rejected: Item in the exclude list: item # 1 length: 11 url rejected: (level 1)http://stagesite.coreon.com/coreonlib/html/top_index.htm href: http://stagesite.coreon.com/coreonlib/html/main.html () Rejected: Item in the exclude list: item # 1 length: 11 url rejected: (level 1)http://stagesite.coreon.com/coreonlib/html/main.html size = 3520 pick: stagesite.coreon.com, # servers = 1 htmerge: Sorting... htmerge: Merging... 0/http://stagesite.coreon.com/ This log3.dat file doesn't look complete to me. With the third level of verbosity that you'd need to get detailed rejection messages like above, I think you should be getting much more detail than that. Is this just an excerpt of the full log? From what I can see above, it seems that htdig is only picking up two links from your main index page, and both are rejected. This is what you want, according to your comments above, because log3 is the result of running htdig with /coreonlib/ in exclude_urls. The question is why does htdig not pick up and use any other links, and I can't answer that if I don't have the complete log. 
Does the complete log indicate more links than that, and if so, what are the reasons for rejection? If htdig doesn't see any links other than those two, you need to find out why. Are you expecting it to see JavaScript links? It won't! See the FAQ (http://www.htdig.org/FAQ.html), especially questions 5.25 and 5.27.

Perhaps htdig doesn't see any links to the rest of your site on the main index page, but does find them somewhere in coreonlib when you allow it to look there. In this case, you'd need to add something on your main index page that htdig can follow to get to the rest of the site.

Also, please try to examine your logs more thoroughly, as errors like the 403 error above shouldn't be dismissed so easily as inconsequential.

-- Gilles R. Detillieux
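Since the advice above is to go through the verbose log for errors like the 403, here is a small sketch of a script that could pull them out. It assumes the log contains "Retrieval command for URL:" and "Header line: HTTP/x.y NNN" lines like the excerpts quoted in this thread; the function name is made up, and other htdig versions or verbosity levels may print something different.

```python
import re

# Patterns modelled on the log excerpts quoted in this thread (an assumption).
RETRIEVAL_RE = re.compile(r"Retrieval command for (\S+?):(?:\s|$)")
STATUS_RE = re.compile(r"Header line: HTTP/\d\.\d (\d{3})")


def failed_fetches(log_text):
    """Return (url, status) pairs for every non-2xx response found in the log."""
    failures = []
    current_url = None
    for line in log_text.splitlines():
        retrieval = RETRIEVAL_RE.search(line)
        if retrieval:
            # Remember which URL the following header lines belong to.
            current_url = retrieval.group(1)
        status = STATUS_RE.search(line)
        if status and not status.group(1).startswith("2"):
            failures.append((current_url, int(status.group(1))))
    return failures


if __name__ == "__main__":
    sample = (
        "Retrieval command for http://stagesite.coreon.com/download/:\n"
        "GET /download/ HTTP/1.0\n"
        "Header line: HTTP/1.1 403 Forbidden\n"
    )
    for url, status in failed_fetches(sample):
        print(status, url)
```

Anything this reports is worth checking by hand in a browser before looking at the htdig configuration.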
Re: [htdig] 3.1.5 engine on 3.1.3 db
According to Dave Salisbury:

Does anyone know if I can read the database created using 3.1.5 with a 3.1.3 engine? (Just hoping to perhaps save some time before setting things up.) I don't see anything in the release notes to indicate it can't be done.

The subject of your message and the question above seem to contradict each other, so it's not clear in which direction you want to go. If you created your database with htdig 3.1.5, and want to search it with htsearch 3.1.3, that's a bad idea. The most glaring bug in releases before 3.1.5 is in htsearch, so you really should upgrade it.

On the other hand, if you have an existing database built with version 3.1.3, and want to use it with the latest htsearch, that should work without any difficulty. However, you'll lose out on several benefits in the latest htdig (better parsing of meta tags, parsing img alt text, fixed parsing of URL parameters, etc.), which you'll only get if you reindex with htdig 3.1.5. Maybe none of these matter for your site, though. See the release notes and ChangeLog for details.

-- Gilles R. Detillieux
Re: [htdig] Stats
According to htdighelp:

Is there any way of generating stats of what users are searching for? I assume just the standard web logs. Can anyone suggest something that is good at giving not just stats but a clear picture of what users are doing?

There are a couple of techniques you can use. One is described at http://www.htdig.org/attrs.html#logging which uses the syslog facility. The other is to set up your search forms to use the GET method, rather than POST, so that the query strings always appear in the web server logs. Either way, you'll get raw data for every query htsearch processes, and you can develop some scripts to summarise it any which way you want. I don't know of any canned scripts to do this for you.

-- Gilles R. Detillieux
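As a starting point for the "develop some scripts" suggestion, here is a minimal sketch that tallies search terms from web server access logs when the search form uses GET. It assumes log lines in the usual common log format with the request quoted as "GET /path?query HTTP/x.y", and that the search terms arrive in the "words" form field; the function name and the sample lines are invented for illustration.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Matches the quoted request part of a common-log-format line (an assumption
# about the server's log format).
REQUEST_RE = re.compile(r'"GET (\S+) HTTP/[\d.]+"')


def count_queries(log_lines, script_name="htsearch"):
    """Count how often each search string was submitted to the search CGI."""
    counts = Counter()
    for line in log_lines:
        match = REQUEST_RE.search(line)
        if not match:
            continue
        url = urlsplit(match.group(1))
        if not url.path.endswith("/" + script_name):
            continue
        # parse_qs decodes '+' and %xx escapes in the query string.
        for words in parse_qs(url.query).get("words", []):
            counts[words.strip().lower()] += 1
    return counts


if __name__ == "__main__":
    sample = [
        '10.0.0.1 - - [04/Jan/2001:16:27:48 +0000] '
        '"GET /cgi-bin/htsearch?words=aix+manual&config=htdig HTTP/1.0" 200 3520',
        '10.0.0.2 - - [04/Jan/2001:16:30:02 +0000] '
        '"GET /index.html HTTP/1.0" 200 1024',
    ]
    for query, hits in count_queries(sample).most_common():
        print(hits, query)
```

Feeding it the lines of an access log file gives the most frequent queries first; searches submitted with POST will not show up this way.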
Re: [htdig] no data
According to Paco Martinez:

Without my making any changes to my Linux system, when I execute /cgi-bin/htsearch, this message appears: "The document contained no data." "Try again later, or contact the server's administrator." How can I solve it?

This sounds like a problem with unreadable templates. Make sure the template files (common/*.html) are readable by the user ID under which your web server runs, and that all directories leading up to and including the common directory are searchable (executable) by this same user ID.

If that doesn't help, try running htsearch directly from the command line to see if you can get results that way. If you can, it's some sort of web server configuration problem. If you can't, then you'll need to look a little deeper into why htsearch is failing.

-- Gilles R. Detillieux
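The permission check described above (file readable, every directory above it searchable) can be sketched like this. Note the big assumption: os.access tests the permissions of whoever runs the script, so it only says something useful when run under the web server's user ID. The function name and the example path are hypothetical.

```python
import os


def access_problems(template_path):
    """List reasons the current user could not read template_path."""
    problems = []
    path = os.path.abspath(template_path)

    # Collect every directory from the file's directory up to the root.
    directories = []
    directory = os.path.dirname(path)
    while True:
        directories.append(directory)
        parent = os.path.dirname(directory)
        if parent == directory:
            break
        directory = parent

    # Each directory on the way down must be searchable (executable).
    for directory in reversed(directories):
        if not os.access(directory, os.X_OK):
            problems.append("not searchable: " + directory)

    # The template file itself must be readable.
    if not os.access(path, os.R_OK):
        problems.append("not readable: " + path)
    return problems


if __name__ == "__main__":
    # Hypothetical location; substitute your own common directory.
    for problem in access_problems("/opt/www/htdig/common/header.html"):
        print(problem)
```

An empty list means the current user can reach and read the file; it says nothing about other users.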
Re: [htdig] what does --host in configure do?
According to Foskett Roger:

Hi, according to this post, http://www.htdig.org/mail/1998/12/0169.html using ./configure --host=POSIX.2 sorts out a compile problem I am having on HPUX11. However, configure complains that 'POSIX.2' is invalid when used. Can anyone please tell me what options can be specified for '--host', as I have been unable to find anything explaining it?

Nor have I. I'm not a configure expert, but as far as I can tell, the --host option doesn't seem to do a whole lot. If you're experiencing the same sort of errors as in the message above, I'd suggest trying to set the CFLAGS and CPPFLAGS environment variables before calling ./configure, to tell the compiler or preprocessor where to find the files it needs. See http://www.htdig.org/mail/2000/09/0206.html

-- Gilles R. Detillieux
Re: [htdig] htdig refuses to compile on HPUX11
According to Foskett Roger:

Hi, I am having immense problems getting htdig 3.1.5 to build on HPUX-11. I've tried this:

CC='cc' \
./configure \
--prefix=/opt/www/htdig \
--host=POSIX.2

with these compilers (tried them all in various combinations):

CC='gcc' or CC='cc'
CPP='g++' or CPP='aCC'
CXX='g++' or CXX='aCC'

But I keep getting the sort of errors below when running make (the C stuff builds OK, but not the C++):

gcc -c -DDEFAULT_CONFIG_FILE=\"/opt/www/htdig/conf/htdig.conf\" -I../htlib -I../htcommon -I../db/dist -I../include -O2 Substring.cc
In file included from ../htlib/lib.h:22,
from ../htlib/Object.h:23,
from Fuzzy.h:24,
from Substring.h:14,
from Substring.cc:22:
/usr/include/string.h:29: warning: declaration of `int memcmp(const void *, const void *, long unsigned int)'
/usr/include/string.h:29: warning: conflicts with built-in declaration `int memcmp(const void *, const void *, unsigned int)'
/usr/include/string.h:85: warning: declaration of `void * memcpy(void *, const void *, long unsigned int)'
/usr/include/string.h:85: warning: conflicts with built-in declaration `void * memcpy(void *, const void *, unsigned int)'
/usr/include/string.h:93: warning: declaration of `size_t strlen(const char *)'
/usr/include/string.h:93: warning: conflicts with built-in declaration `unsigned int strlen(const char *)'
as: "/var/tmp/ccWQJsCl.s", line 53: warning 36: Use of %fr21 is incorrect for the current LEVEL of 1.0
as: "/var/tmp/ccWQJsCl.s", line 75: warning 36: Use of %fr20 is incorrect for the current LEVEL of 1.0
as: "/var/tmp/ccWQJsCl.s", line 76: warning 36: Use of %fr19 is incorrect for the current LEVEL of 1.0

Eventually, the whole thing falls over when it comes to the link stage.

This isn't the link stage, but the assembler stage. Evidently your C++ compiler is generating incorrect code for your assembler, "as". Do you get the same error when you set CXX to g++ or aCC?
I find it a bit odd that you're using gcc to compile C++ code, but it would be interesting to see how this changes when using a different front-end compiler than gcc.

I have tried using '--host=POSIX.2' as suggested in this post http://www.htdig.org/mail/1998/12/0169.html but that doesn't seem to do anything (configure complains that it is invalid!?). I have also tried using g++ and specifying CXXFLAGS='-lstdc++' but still no luck. Can anyone please help me on this? I am completely stuck. The weirdest thing is that I somehow managed to get it working once before (but accidentally wrote over the executables).

This raises the obvious question of what has changed since you last got it working. Did you change any of your compilers at all? Did you use a different set of configure options before, and forget what they were?

-- Gilles R. Detillieux
Re: [htdig] Lost words
According to Tuomas Jormola:

At my company we're trying to migrate from a clumsy self-implemented search engine to htdig, but it's not quite painless. The scenario is this: we have two databases on separate servers, one for public sites and one for intranet sites. The public database is only 4.2M and it has 7 sites indexed in it. The intra database is 690M with 4 sites/vhosts. Public search is accurate, fast and working great, but intra search is causing trouble.

For example, if you index a single site that contains lots of on-line manuals, the database is about 380M and the word "aix" returns over 18000 hits. But when this site is indexed with the other intra sites, "aix" returns only 27 hits, most of which point to the on-line manual server, as expected. But where have the thousands of other hits gone? So if these sites with gigabytes of content are indexed separately, the search is accurate, but when the index is one big db, a great number of correct links are missed. Any guesses whether this is due to 1) a feature in htdig/htmerge, and if so, is there a way to disable it? 2) a bug in htdig? 3) a bug in Berkeley db? 4) bad configuration?

Hard to say for sure. As you're not using htmerge to build the one big db from the separate, smaller dbs, that rules out problems in the merging code causing this problem. Is the size of the big db roughly equal to the sum of the sizes of the separate ones? It could be an obscure htdig or htmerge bug, or an AIX-specific problem. This sure isn't ringing any familiar bells, if that's what you're wondering.

We're using htdig-3.1.5 and the Berkeley db that was included in the htdig archive, running on AIX 4.3. htdig was compiled using IBM VisualAge C++ Pro for AIX Version 5.
And here's the list of configuration options that were changed from the default config (excluding options that are solely used to control the layout of htsearch):

# to make searching of words with umlauts work
locale: fi_FI
# everything is valid :)
valid_punctuation:
# to be able to search weird chars used in example scripts etc.
extra_word_characters: @.-_/!#$%^'

OK, that's a pretty unusual use of the above two attributes. Are you aware that with these settings, the following 3 words will be treated as separate and distinct words, and a search for one of them will not find the other two?

aix-based
aix
aix.

However, I don't think that's the cause of the problem you're reporting, if you're using the same settings for these attributes in all your databases.

# numbers, too, of course
allow_numbers: true
# exact matches only
search_algorithm: exact:1

BTW, every test mentioned above was performed using a db built from scratch with htdig/htmerge. Also, it isn't due to erroneous restrict or exclude values. When talking about the size of the db, I mean the total size of all files in the db directory. No support for optional algorithms was built using htfuzzy. The same config file was used in every test (well, database_dir and start_url were included from a site-specific config file if only indexing a single site). htdig/htmerge reported no errors while creating each db and there's plenty of disk space. ... Oh, I forgot to mention that one reason for this could be that frigging AIX, right? But I don't want to test htdig on my Linux desktop machine before everything else is tried on the actual server side.

We sure haven't tested htdig very thoroughly on AIX, so I would be inclined to suspect a system-specific problem is at work here. I think testing your configurations on a Linux system would be a very good idea. If the problem occurs there too, then it would point more surely to a configuration problem or a bug.
If the problem doesn't occur on Linux, then it's almost certainly an AIX-specific thing. Either way, we'd need more data to narrow it down.

-- Gilles R. Detillieux
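To illustrate the point made in this thread about extra_word_characters, here is a deliberately simplified sketch of word splitting in which letters, digits and the extra characters all count as part of a word. It is not htdig's parser (it ignores valid_punctuation, locale and minimum word length entirely), and the function name is made up; it only shows why "aix-based", "aix" and "aix." end up as three different index entries with the settings quoted above.

```python
def split_words(text, extra_word_characters):
    """Split text into lowercase words; extra characters count as word characters."""
    words = []
    current = []
    for ch in text:
        if ch.isalnum() or ch in extra_word_characters:
            current.append(ch.lower())
        elif current:
            # Any other character ends the current word.
            words.append("".join(current))
            current = []
    if current:
        words.append("".join(current))
    return words


if __name__ == "__main__":
    extra = "@.-_/!#$%^'"
    # With the extra characters, the three spellings stay distinct.
    print(split_words("AIX-based systems run AIX. See aix docs", extra))
```

With exact matching only (search_algorithm: exact:1), a search for one of those spellings would not be expected to match the other two.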
Re: [htdig] [PB] reference count overflow
According to heddy Boubaker:

I sometimes have errors like that when searching: DB2 problem...: ...intranet-db.words.db: page 0: reference count overflow And of course it will generate no matches ... Any idea how to solve that? htdig-3.1.5

My guess would be a corrupt database. Try rebuilding it from scratch.

-- Gilles R. Detillieux
Re: [htdig] questions
According to John Lunstroth:

Hi - I am a beginner here and have some more questions. Apologies for taking up time on some of this.

1. Ultimately I am interested in phrase and proximity search capabilities. I have been working at installing 3.1.5. I have read the release notes and see that the beta of 3.2 should be installed separately etc., since it uses different protocols, etc. I am wondering if I should just go ahead and start working with 3.2 - will it be difficult to upgrade between fixes?

That depends on how important phrase searching is to you. There are still a lot of bugs in the query parser in 3.2, and a rewrite of it is slated for 3.2.0b4, so I don't know if the current phrase searching will be adequate. As far as upgrading between fixes, there's commonly a bit more effort involved in working with beta releases, although for some people they install quite smoothly. There may be database format changes coming down the road, so upgrading htdig may mean having to reindex from scratch. Switching between 3.1.5 and 3.2.0bX will certainly require reindexing.

2. Being new to the Linux environment, I am just getting acquainted with the File Hierarchy Standard ideas, and am uncertain how important it is to follow, based on the following. I have noticed that the FHS applies to administrators setting up systems, but I also notice that my web host has another protocol in place for the section of the server I have access to. For example, the root of my server (I only have telnet/ftp access) has a /www directory that contains all of the websites, and a /home directory that is my home directory and contains all other home directories. I have real space in each subdirectory under my domain name - /www/myname/ and /home/myname. There is a link from /home to /www. This at first caused me some difficulty, but I got it figured out.
The htdig configuration program/file assumes that the website will be located under the server's /opt subdirectory - so configure by default produces files with this location: /opt/www. "opt" is the name of the subdirectories in which the user should put non-system programs - their applications, if I understand correctly. There are also the "var" and "bin" subdirectories. Is there a recommended file hierarchy I should use in the directory I have available? I am building my site in the /www/myname/ subdirectory. That is where cgi-bin is located (/www/myname/cgi-bin). Will it be easier, in the long run, to use a certain file system layout - I assume it will be, since htdig, and probably other apps, use a common base of standards I am unfamiliar with.

Don't worry too much about the FHS. It's meant for people putting together packages for distribution. End users are not bound by it, and individual system installations may go with something very different in many cases. Go with what works. If your web hosting company imposes a different hierarchy, the easiest thing in the long run is to go with their setup as much as possible. It's easy enough to configure htdig to use any set of directories you want. I think the whole /opt thing is a Sun-ism that may have been adopted by some (but certainly not all) Linux distributions. On Red Hat systems, I go with something more FHS-like.

I am asking a narrow question - what subdirectories would it be best to use in setting up htdig:

/www/myname/opt (put htdig here as a separate subdirectory)
/cgi-bin (htdig automatically puts stuff here)
/var (? not sure what to put here)
/bin (? not sure how to use this one - or even where it should be vis-a-vis htdig)
/htdocs (? is the name important or standard?)

The subdirectory "htdocs" - I assume this means hypertext docs - should be where the content is. Is "htdocs" a standard name, or an abbreviation used by the htdocs people?
htdocs is commonly used as the name for the DocumentRoot, but I don't think there's any standard involved here. Go with whatever your web host uses as its document root directory, and put the "htdig" subdirectory that contains the image files right in that directory. Put htsearch in your web host's cgi-bin directory if at all possible (and if one is provided), to avoid having to specify a new ScriptAlias directory for CGI programs. The rest of the files (executables, common/* files, database directory) can go wherever you see fit, but make sure the common and database directories are accessible by the web server's user ID.

-- Gilles R. Detillieux
Re: [htdig] HTDig indexes virtual host homepage only
According to Paul Broome:

I have a problem with htdig indexing a virtual host on one of my web servers - the other virtual hosts work fine. The server is a Sparc running Debian Slink and HTDig 3.1.5. Under the virtual domain http://www.ltbp.org, only the main page gets indexed. We have tried everything we can think of, but without any joy.

See http://www.htdig.org/FAQ.html#q5.25 and try with one or two more -v options to see why the links are not being followed.

-- Gilles R. Detillieux
Re: [htdig] A little automation
You may want to have a look at ConfigDig: http://configdig.sourceforge.net/

According to htdighelp:

I agree, and the fact is, if it takes a separate database, so be it; much simpler than trying to mod the code for this. Maybe in the future, it might be a good idea. Maybe some sort of web-based config interface for all these things. Fact is, the net is not getting any smaller and I suspect that those using this seriously will need more functionality. Mike

At least IMO, true operational requirements for any such system would be quite user-specific. The (full) set of user requirements would tend to include:

Scheduling capabilities.
Varying frequencies, perhaps even within the same URL.
Inclusion and/or exclusion of specified nodes within a URL.
Statistical-recording capabilities.
Varying underlying-database formats.

-- Gilles R. Detillieux
Re: [htdig] PDF problems
According to The Melia Family:

I am using HTDIG 3.1.5 on Redhat 7.0, and am having problems indexing PDF files. I have included my config -vv output below. I have no robots.txt file, and my max_doc_size is now 10M (one test .pdf file is only 27K and it also fails), as well as not rejecting pdf as an extension. I am using the latest xpdf with pdftotext, as well as the latest parse_doc and conv_doc scripts. I can manually pdftotext the pdf files and they do contain real text, not just images. I can also run parse_doc and conv_doc.pl; they produce proper text. When I do a rundig, I get a 'URL rejected' message, and I do not know why. This (I presume) leads to a Deleted No Excerpt message, and the file (or any pdf file) is not indexed. Any suggestions?

The output from htdig isn't verbose enough to pinpoint the problems, but there is more than one problem here. First of all, I always strongly recommend conv_doc.pl or doc2html.pl over parse_doc.pl. The latter has been the source of too many problems in the past.

Secondly, the rejected URLs and the "Deleted, no excerpt:" messages are two unrelated issues. URLs that are rejected by htdig at this stage (level 1 or level 2) will not even be seen by htmerge. For the rejection of URLs, see http://www.htdig.org/FAQ.html#q5.27 for how to deal with this. There isn't enough information in the htdig output or the excerpts of your htdig.conf you sent to be certain of what the reason for rejection is. However, the htdig output you sent seems to suggest a different start_url value than the one in your htdig.conf excerpt, so I suspect that the reason for the rejection is that the parent directory of the one you're indexing is not within the limits of limit_urls_to, which is a reasonable thing for a test case such as this.
The "Deleted, no excerpt:" messages are usually a result of documents that contain no indexable text, or external parsers that don't emit a usable "h" record (one more reason to use an external converter rather than an external parser). The challenge is to get to the bottom of why this happens in each individual case. You did run the scripts manually, which is what I usually recommend, but are you sure parse_doc.pl put out a proper "h" record and not just "w" records? Did you try htdig with conv_doc.pl instead, using the correct syntax for external_parsers as shown in conv_doc.pl's comments?

Finally, I noticed you're getting the directory indexed multiple times due to Apache's fancy indexing feature. You can avoid this by adding "?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D" to exclude_urls (without the quotes) to suppress the alternately sorted views of the directory.

-- Gilles R. Detillieux
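For anyone wondering why those short "?D=A"-style entries are enough, exclude_urls is treated as a list of substrings, and a URL containing any of them is rejected. The sketch below mimics that check; it is not htdig's code, the function name is made up, and example.com is a placeholder host.

```python
def is_excluded(url, exclude_urls):
    """Return True if the URL contains any whitespace-separated exclude pattern."""
    patterns = exclude_urls.split()
    return any(pattern in url for pattern in patterns)


if __name__ == "__main__":
    # The patterns suggested above for Apache's sorted directory views.
    exclude = "/cgi-bin/ ?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D"
    for url in (
        "http://example.com/docs/",
        "http://example.com/docs/?N=D",
        "http://example.com/docs/?M=A",
    ):
        print(url, "rejected" if is_excluded(url, exclude) else "kept")
```

The plain directory URL is kept, so the listing is still indexed once, while each re-sorted view of it is skipped.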
Re: [htdig] Indexing german pages
According to Radoy Pavlov:

I have some questions regarding the German language. Following the example in the FAQ, I've made my htdig.conf, extracted GermanWords.zip in $COMMON_DIR/german and edited htdig.conf. I've done this:

rerun of rundig
rerun of htfuzzy endings

Still htdig can't find any words with umlauts (äöü etc.), although I have nearly 30 MB of databases. The search page shows that it is searching for the word .. with no effect. My search algorithm: search_algorithm: exact:1 endings:0.5 prefix:0.4 Perhaps I need to optimize the algorithm in order to get some matches? What is a "correct" algorithm?

No, the search algorithms are not likely the problem. If you can't even get an exact match, the problem lies elsewhere, and in this case I'd bet it's a problem with locales. You didn't mention what system you are running htdig on, and what your locale setting is. Some systems don't have properly functioning locale support at all (e.g. many libc-5 based Linux systems), and many don't have complete locale tables installed. See the thread entitled "Portuguese" from this past May for more pointers on locale-related problems: http://www.htdig.org/mail/2000/05/index.html#61

-- Gilles R. Detillieux
Re: [htdig] Going for the big dig
According to Terry Collins:

Geoff Hutchison wrote:

At 10:14 AM +1100 12/19/00, Terry Collins wrote: And make sure you don't ignore robots.txt

Yes, though someone would need to alter the code to do this.

If you are doing an external site, it shouldn't be too much effort to just read this and set the excludes. Courtesy thing.

I think you misunderstood. htdig already does read the robots.txt file and skips all disallowed documents. You don't need to do this manually. Geoff was saying you'd need to alter the code in order to ignore robots.txt, which definitely would be a bad thing if you then use the hacked htdig to index sites that are not your own.

Actually, on my site I don't bother with exclude_urls at all, and use the robots.txt file instead. This way, anything that I don't want indexed by htdig won't be indexed by any other search engine either.

-- Gilles R. Detillieux
Re: [htdig] Hi, need help with searching database.
According to Akshay Guleria: thanx Gilles, However my problem is now fixed. I am using the following. htdig-3.2.0-0.b2.i386.rpm Ah, I wasn't aware that this version was out in RPM. There are many known bugs in 3.2.0b2, so don't be surprised if other problems occur. The scoring bug in htsearch is very likely to turn up, unless this RPM included a patch for this bug. There were 2 problems (in case you are interested): 1. htaccess files not allowing the rundig to connect to the server. There's not much htdig can do about this. If the .htaccess file sets up Basic authorization, then you can use the -u option to provide the user name and password to the server. If the .htaccess file set up some other restrictions, you're out of luck, but then these pages would also be inaccessible from a standard web browser coming in from the same address. 2. This file in /var/lib/htdig needed webserver owner's ownership. db.words.db_weakcmpr As soon as I owned it by "apache", it worked. I dont know but I think the rpm packager should have have taken care of this. Anyway, it works now. Thanx a lot for getting back. This point was covered in FAQ 5.17. I didn't realize you were running a 3.2 beta before. It's very important to mention which version you're running, because many, if not most, of the bugs and problems that come up are version-specific. There's not much the RPM packager can do about this particular bug in htdig, because the db.words.db_weakcmpr file is not normally part of the RPM distribution - it's only created after installation, when you run the rundig script. -Original Message----- From: Gilles Detillieux [mailto:[EMAIL PROTECTED]] Sent: Wednesday, December 13, 2000 11:20 PM To: [EMAIL PROTECTED] Cc: [EMAIL PROTECTED] Subject: Re: [htdig] Hi, need help with searching database. According to Akshay Guleria: I just installed Redhat7.0 on my machine. And then installed htdig rpm. I can see the page http://myhost/htdig/ which is the search page. 
Which htdig rpm did you install? For Red Hat 7.0, you should use the RPM for htdig-3.1.5-6 that comes with the 7.0 PowerTools. I make a search and for any search I make, it returns a page saying "No matches found for ... " Now, I ran rundig and it increased the file sizes in /var/lib/htdig. So, I presume the database was created. And then I ran htmerge. But I still get the "No matches found .." page. If you run rundig, you don't need to run htmerge separately. The rundig script will run htdig followed by htmerge. You should try running your /var/www/cgi-bin/htsearch program right from the command line first, to see if that works. If it does, it may be an Apache server configuration problem, or a problem with your search form. Did you make any changes to the /var/www/html/htdig/search.html search form? If so, see http://www.htdig.org/FAQ.html#q5.17 -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] Words and files not being found or indexed
According to crosstar: This is a message for Gilles or anyone who is "senior" enough with the program to answer. I had written to Gilles, earlier, and he had said to post the questions here. I've been away from work and e-mail since last Wednesday, so I didn't get caught up on this thread until today. You see, there's a reason why I always redirect people to the list! I'm rather glad I missed this thread, actually, as the whole thing seems to have been an exercise in frustration. From the outset, I referred you to FAQ 5.25, but I didn't see any evidence from this whole, very long thread of discussion that anyone had looked at or followed the suggestions there. Was the language used in that question so indecipherable that no one could get anything from it? I realize it's written in technical language, but setting up a search engine correctly is a pretty technical problem, so if you don't understand the basics of Unix or Linux and how web servers work, you really should read up on that before attempting something like this. Anyway, if anyone can contribute suggestions as to how this FAQ entry can be better written, I'd be glad to hear them. If the problem was simply that no one bothered to look at the FAQ, then why am I wasting my time trying to update it? -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] Q
According to ellenliu: Dear Sir: First, I send my great gratitude to Gilles R. Detillieux and Daniel Naber for their warmhearted help. 3.2.0b3 has been installed on my system successfully. Which snapshot did you use? 3.2.0b3 is still a work in progress, and is slowly but surely getting closer to being ready for beta release. The 121700 snapshot is the most stable one so far. Here I have another question: I had read through the source code before installing, but now I also want to trace some of the code. Would you please tell me which development tool is good for debugging and/or tracing C/C++ programs on the Red Hat Linux platform? I think most Red Hat Linux users would suggest gdb, or perhaps xxgdb. If the C++ program you're debugging is htdig, I'd also suggest using the debugging output already programmed into it, and activated with multiple -v options, as you get a lot of feedback that way. (I'm a big believer in debugging trace prints in general, and do most of my C/C++ debugging that way.) Moreover, I had run it on my LAN, but when I search for some words, it always gives me a "not found" page (I run it with a command line like this: htsearch word). I'd like to know whether this problem is caused by the way I am running it. You should run htsearch from the command line either with no arguments at all, and let it prompt you for the search words, or you should give it a full CGI-style query string as an argument, e.g.: /opt/www/cgi-bin/htsearch 'words=butterfly+valve&method=and' Be sure to quote the query string if it contains any shell metacharacters such as "&", ";", "*", etc. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
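For the command-line advice in the reply above, here is a minimal sketch of how a CGI-style query string can be built and safely quoted for the shell. It uses only Python's standard library; the htsearch path is the one from the example in the message, and the helper name htsearch_command is made up for illustration.

```python
import shlex
from urllib.parse import urlencode


def htsearch_command(htsearch_path, words, method="and"):
    """Build a shell command line that passes a CGI-style query to htsearch."""
    # urlencode joins the parameters with "&" and turns spaces into "+".
    query = urlencode({"words": words, "method": method})
    # shlex.quote wraps the query in single quotes because "&" is a shell
    # metacharacter that would otherwise background the command.
    return f"{htsearch_path} {shlex.quote(query)}"


if __name__ == "__main__":
    print(htsearch_command("/opt/www/cgi-bin/htsearch", "butterfly valve"))
```

Without the quoting, the shell would treat everything after the "&" as a separate command, and htsearch would only ever see the words parameter.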
Re: [htdig] locale:ru on Solaris
Sorry, but there's absolutely nothing I can do with the core file itself, as I don't have a Solaris system. What I want is for you to get a stack backtrace using your htsearch executable and your core dump file using your debugger. Either that or run htsearch directly under your debugger, and when it fails get the stack backtrace directly from the in-memory copy of the program. If you use gdb, the procedure is described in the FAQ. If you use another debugger, you'll need to figure out how to do it with that debugger. According to Eldar Imangulov: Hello! Thanks for your willingness to help me. Here is the core. Regards, Eldar Imangulov project manager (design hosting) [EMAIL PROTECTED] phone/fax.: +7 095 777.09.10 Global Chance Bld.1, 42 Bolshaya Yakimanka st., Moscow 117049 Russia -----Original Message----- From: Gilles Detillieux [mailto:[EMAIL PROTECTED]] Sent: Tuesday, December 12, 2000 8:33 PM To: Eldar Imangulov Cc: [EMAIL PROTECTED] Subject: Re: [htdig] locale:ru on Solaris According to Eldar Imangulov: I'm using Solaris 7. I built ht://Dig and now I am trying to make my site searchable in Russian (windows-1251). In htdig.conf I set locale : ru The website indexing goes well but htsearch does not work (core dump). But without the Russian language setting (indexing by default, i.e. without locale:ru), indexing and htsearch work well together. What is the problem??? Hard to say, but from what you describe it sounds like a problem with the locale tables for your locale, or a database corruption problem of some sort, perhaps. Could you give us a stack backtrace of htsearch's core dump to narrow things down a bit? See the latter part of http://www.htdig.org/FAQ.html#q5.14 -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] Htdig as external Link Checker? (Maybe off-topic)
According to Reich, Stefan: I need to generate a List for my boss, which contains all external Links of our Web-Site (which gets already indexed by htdig) including the status (means if the target of this link exists or not) You should have a look at Gabriele's ht://check program, which is partly based on htdig. It's on the sourceforge.org web site, I believe. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
Re: [htdig] result count is too small ?
According to Dennis Director: I am running htdig-3.2.0b2, I recently moved from htdig-3.1.5. Sometimes, the result count that I get back from a search is too small. For instance, below it said I have ten matches but only gave me two. It's hard to say for sure what's happening, but 3.2.0b2 has a number of known bugs, which are fixed in the latest development snapshot for 3.2.0b3. The infamous scoring bugs might account for the behaviour you see. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
[htdig] Re: I need your help [from ellenliu]
Hi, Ellen. First of all, you should always send these questions to the list, and not to me personally. I don't have all the answers. See http://www.htdig.org/FAQ.html#q1.16 According to ellenliu: Dear Gilles R. Detillieux: I'm very grateful for your kind help last time. All these problems happened before compilation, during the configure process. Because I can't get the most recent development snapshot of 3.2.0b3 They're in http://www.htdig.org/files/snapshots/ However, if you don't need any of the new features in the 3.2 series, you're probably better off with 3.1.5. I ran 3.1.5 instead, but there still exist some problems. I entered: "sh ./configure", and it prompts: "... checking host system type ... ./configure: ./config.guess: no such file or directory configure: error: can not guess host type; you must specify one configure: error: ./configure failed for db/dist" I think that it can't get past the check of 'host system type'. I have read through the ./config.guess file, but I'm still not clear what I should do. I know the default value of $host is NONE; do I need to set a type according to my machine? As I said last time, when I run 3.2.0b2 the output prompts: "... checking whether make sets ${MAKE} (cached) yes configure: error: can not run ./config.sub" In the ./configure file I find the line (933): "if ${CONFIG_SHELL-/bin/sh} $ac_config_sub sun4 >/dev/null 2>&1; then" Why is the parameter sun4 set? Would you tell me what I should do next? Thanks. Configuration: CPU: PIII 550M, OS: Red Hat Linux 6.2, kernel 2.2.14-5.0 We've never seen anything like this before on Red Hat Linux systems of any version. Certainly not on 6.2. As I said last time, you may very well be missing some critical packages from your Red Hat distribution which are needed to compile and install software. The other thing I'm noticing is that there seems to be a problem with execution of scripts on your system. How did you extract the files from the .tar.gz distributions of either 3.1.5 or 3.2.0b2? Did you use chmod on any of the files, and in doing so turn off execute permissions on them? If you did, that's definitely going to be a problem! -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] Hi, need help with searching database.
According to Akshay Guleria: I just installed Redhat7.0 on my machine. And then installed htdig rpm. I can see the page http://myhost/htdig/ which is the search page. Which htdig rpm did you install? For Red Hat 7.0, you should use the RPM for htdig-3.1.5-6 that comes with the 7.0 PowerTools. I make a search and for any search I make, it returns a page saying "No matches found for ... " Now, I ran rundig and it increased the file sizes in /var/lib/htdig. So, I presume the database was created. And then I ran htmerge. But I still get the "No matches found .." page. If you run rundig, you don't need to run htmerge separately. The rundig script will run htdig followed by htmerge. You should try running your /var/www/cgi-bin/htsearch program right from the command line first, to see if that works. If it does, it may be an Apache server configuration problem, or a problem with your search form. Did you make any changes to the /var/www/html/htdig/search.html search form? If so, see http://www.htdig.org/FAQ.html#q5.17 -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
Re: [htdig] htdig missing subdirectories (was: Incremental indexing)
Please direct your questions to the list, not to me personally. See FAQ 1.16. Also, you're off topic, as this has nothing to do with last week's "Incremental indexing" thread, so you should pick a more descriptive subject. According to crosstar: I have pored over the messages in the mailing list at length, as well as the references in the FAQ. I am not very technical, but my situation is that htdig is missing a lot of files, words and subdirectories altogether. I'm wondering if there is a simpler adjustment in htdig.conf to remedy this? I simply do not understand the instructions as given, unfortunately, and note that one reader says that he thinks tinkering with the server is not the answer. Did you follow the recommendations in FAQ 5.25 & 5.27? That's probably where you should focus your attention. Running htdig with the -vvv option will give you tons of output, but if you trace your way through there you might be able to see why it's missing parts of your site. I tried running htfuzzy but get the error: htfuzzy: No algorithms specified You need to tell htfuzzy which database to build. This won't solve your problem above, though. It's just for building databases for fuzzy match algorithms. I have changed one default by upping it to: max_head_length:5 That will make htdig keep more of each document for use in excerpts for matched pages, but it won't get you more matches. However, upping the max_doc_size may get htdig to index more stuff if it was missing links from really large pages. See FAQ 5.1. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] locale:ru on Solaris
According to Eldar Imangulov: I'm useing Solaris 7 I made the htDig and now I try to make search my site in russian (windows-1251). in htdig.conf I said the locale : ru The website indexing is going well but the htsearch does not work (coredump). But without russian language (indexing by default = without locale:ru) indexing htsearch works well togather. What is the problem??? Hard to say, but from what you describe it sounds like a problem with the locale tables for your locale, or a database corruption problem of some sort, perhaps. Could you give us a stack backtrace of htsearch's core dump to narrow things down a bit? See the latter part of http://www.htdig.org/FAQ.html#q5.14 -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
Re: [htdig] htdig dumps core on Linus
According to B.G. Mahesh: Linux: 2.2.14-5.0smp (Redhat 6.2) HTDIG: 3.1.5 Apache: 1.3.14 When I search for the word "rajkumar" in the news finder window on http://news.indiainfo.com/2000/12/08/india-index.html it gives me an error. When I check the cgi-bin dir I see a core file. % file core core: ELF 32-bit LSB core file of 'htsearch' (signal 11), Intel 80386, version 1 Why does this happen? It's hard to say without getting a stack backtrace from the core dump. First of all, did you install from an RPM or compile from sources? If you installed the wrong RPM, that could potentially lead to such problems. Mostly, though, this is a symptom of database corruption, which can happen if for example you have two htdig processes updating the database simultaneously. Did you try rebuilding the database from scratch, e.g. using "rundig", to see if that makes the problem go away? See the last paragraph of FAQ 5.14: http://www.htdig.org/FAQ.html#q5.14 -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] PDF problem
According to [EMAIL PROTECTED]: I am using htdig 3.1.5 on Linux. I get these errors when I try to index the files How can I fix the problem [ii@iinj-lxs015 bin]$ /disk2/v/apache/htdocs/VIRTUAL/ii/search/HTDIG//db/htdig11551.pdf: Unterminated string. PDF::parse: cannot open acroread output from http://www.indiainfo.com/awards/ET-ArmyInKashmir.pdf /disk2/v/apache/htdocs/VIRTUAL/ii/search/HTDIG//db/htdig11551.pdf: Could not repair file. PDF::parse: cannot open acroread output from http://travel.indiainfo.com/utilities/passport/passport_app.pdf /disk2/v/apache/htdocs/VIRTUAL/ii/search/HTDIG//db/htdig11551.pdf: Could not repair file. PDF::parse: cannot open acroread output from http://travel.indiainfo.com/utilities/passport/lostpp.pdf The "Could not repair file" error message is usually a sign that the PDF files are being truncated because of a setting of max_doc_size that's too low. See http://www.htdig.org/FAQ.html#q5.2 -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
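To act on the max_doc_size hint in the reply above, it can help to list which documents are larger than the current limit. The following is only a sketch under stated assumptions: it checks files on local disk rather than over HTTP, the function name find_oversized is made up, and the 100000-byte limit in the demo is a placeholder to be replaced with the value from your own htdig.conf.

```python
import os
import tempfile


def find_oversized(root, max_doc_size):
    """Yield (path, size) for every file under root larger than max_doc_size bytes."""
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            size = os.path.getsize(path)
            if size > max_doc_size:
                yield path, size


if __name__ == "__main__":
    # Self-contained demo: build a throwaway directory with one small and
    # one large file, then report what exceeds the placeholder limit.
    with tempfile.TemporaryDirectory() as root:
        with open(os.path.join(root, "small.pdf"), "wb") as handle:
            handle.write(b"x" * 500)
        with open(os.path.join(root, "large.pdf"), "wb") as handle:
            handle.write(b"x" * 250000)
        for path, size in find_oversized(root, 100000):
            print(f"{size:>10}  {os.path.basename(path)}")
```

Any file the script reports would be cut off at the limit when fetched, which matches the truncation symptom described in the reply.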
Re: [htdig] Daft Question - How to Apply patch under Solaris - Bit off
According to Duncan Brannen: I'm trying to apply the aarmstrong URL rewrite patch to htdig-3.1.5 I assumed I'd use patch -i htdig.diff under Solaris (8). However, I assumed it would pick up the names of the files to be patched, since they're in there, but nope - I have to specify the names, and then I get: patch -i htdig.diff Looks like a new-style context diff. File to patch: htdig/Retriever.cc Malformed patch at line 16: patch: Line must begin with '+ ', ' ', or '! '. (This is where the next diff line starts) If I chop the file up into separate diffs and apply them individually it all works fine. The man page for patch really sounds like it should read the file and work it out for itself. Am I missing something? Yes, whenever you want to patch files in subdirectories, or use patch files with pathnames in the filenames, you need to use the -p option to tell the patch command how the pathnames are supposed to line up on your filesystem. In this case, you should go into the main htdig-3.1.5 source directory and use "patch -p1 < htdig.diff". The -p1 tells patch to strip off the first pathname component from file names in the patch file. See "man patch". I'm not sure what the -i option is for. The GNU version of this command doesn't seem to have a -i. The error message you got at line 16 is a bit worrisome, as this does seem to be a properly formed patch, so I don't know why it's expecting a bigger hunk of diff code than it's getting. You may need to switch to the GNU version, or apply the patch by hand (it consists of fairly simple additions). -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
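To make the effect of the -p option in the reply above concrete, here is a small sketch that mimics how patch strips leading pathname components from the file names in a diff. It is a simplified model (it does not collapse repeated slashes the way patch does), and the path "htdig-3.1.5/htdig/Retriever.cc" is an assumed example of what the diff headers might contain, not something taken from the actual patch file.

```python
def strip_components(path, count):
    """Drop the first `count` slash-separated components, like patch -pN."""
    parts = path.split("/")
    return "/".join(parts[count:])


if __name__ == "__main__":
    name_in_diff = "htdig-3.1.5/htdig/Retriever.cc"  # assumed diff header path
    for level in (0, 1, 2):
        print(f"-p{level}: {strip_components(name_in_diff, level)}")
```

With a header path like the assumed one, -p1 leaves htdig/Retriever.cc, which resolves correctly when patch is run from the top of the source tree.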
Re: [htdig] core dump--help
According to Shakaib Sayyid: I am getting a core dump on htsearch 3.1.5 using linux 6.2--2.2.14-5.0kernel. following is the output from "gdb htsearch core": Core was generated by `htsearch'. Program terminated with signal 11, Segmentation fault. Reading symbols from /usr/lib/libz.so.1...done. Reading symbols from /usr/lib/libstdc++-libc6.1-1.so.2...done. Reading symbols from /lib/libm.so.6...done. Reading symbols from /lib/libc.so.6...done. Reading symbols from /lib/ld-linux.so.2...done. #0 0x807f05a in __bam_cmp () at HtCodec.cc:20 20 // End of HtCodec.cc (gdb) bt #0 0x807f05a in __bam_cmp () at HtCodec.cc:20 #1 0x8085c27 in __bam_search () at HtCodec.cc:20 #2 0x8081255 in __bam_c_search () at HtCodec.cc:20 #3 0x807fa6f in __bam_c_get () at HtCodec.cc:20 #4 0x8067217 in __db_get () at HtCodec.cc:20 #5 0x805ab66 in DB2_db::Get (this=0x80d85f0, key=@0xbfffe250, data=@0xbfffe2b0) at DB2_db.cc:334 #6 0x8059ddb in Database::Get (this=0x80d85f0, key=0x80ec2a0 "peter", data=@0xbfffe2b0) at Database.cc:77 Looks like a database corruption problem. Try rebuilding the database from scratch, e.g. using "rundig", and see if that gets rid of the problem. If the problem recurs after this, it could indicate something else is going wrong. Note too that there is no lockout on the database, so if you accidentally start two processes (htdig and/or htmerge) that try to update the database simultaneously, that can really mess up the database. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
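Since the reply above notes that there is no lockout on the database, one way to avoid accidental concurrent updates is to wrap the update run in a simple lock file. The sketch below is an assumption about how one might do that in a wrapper script, not a feature of htdig; the lock file name is hypothetical, and a stale lock left behind by a crashed run would have to be removed by hand.

```python
import os
import tempfile


def acquire_lock(lock_path):
    """Create lock_path atomically; return True if this process now holds it."""
    try:
        # O_EXCL makes the call fail if the file already exists, so only
        # one process can succeed in creating it.
        fd = os.open(lock_path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        return False
    with os.fdopen(fd, "w") as handle:
        handle.write(str(os.getpid()))
    return True


def release_lock(lock_path):
    """Remove the lock file so that a later run can proceed."""
    os.remove(lock_path)


if __name__ == "__main__":
    lock = os.path.join(tempfile.gettempdir(), "htdig-update.lock")  # hypothetical name
    if acquire_lock(lock):
        try:
            print("lock held; the database update would run here")
        finally:
            release_lock(lock)
    else:
        print("another update appears to be running; exiting")
```

A cron job that starts before the previous run has finished would then exit early instead of writing to the database at the same time.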
Re: [htdig] SQL handling start_url
According to Curtis Ireland: Is there any way to have start_url get its list from an SQL back-end? Has anyone already built a patch to handle this? Here are a couple of solutions I can think of to bi-pass the problem, but I'm sure I'm not alone in desiring this feature. 1) Build a PHP link built with links to all the sites we want to index. Have htDig use this as its start_url 2) Before htDig starts its database build, dump all the links to a text file and have the htdig.conf include this file The one problem with these two solutions is how would the limit_urls_to variable work? I want to make sure the links are properly indexed without going past the linked site. Either solution seems workable - it all depends on what your preference is. For the first solution, you'd need to have a limit_urls_to setting that's liberal enough to allow through all the links that the PHP script will spit out. You should probably set your max_hop_count to 1 to avoid having htdig go beyond the first hop, from the PHP output to the documents it references. For the second solution, you could probably just leave limit_urls_to as the default, which is the same as the value of start_url, and set your max_hop_count to 0. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
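The second option discussed above (dumping the links to a text file before the dig) could look roughly like the sketch below. Everything specific in it is an assumption for illustration: the SQLite back-end, the table name sites, the column name url and the example URLs. Any SQL database with a Python driver would work the same way, and how htdig.conf then includes the file is not shown.

```python
import os
import sqlite3
import tempfile


def dump_start_urls(connection, out_path):
    """Write one URL per line from the (hypothetical) sites table; return the count."""
    rows = connection.execute("SELECT url FROM sites ORDER BY url").fetchall()
    with open(out_path, "w") as handle:
        for (url,) in rows:
            handle.write(url + "\n")
    return len(rows)


if __name__ == "__main__":
    # In-memory database standing in for the real SQL back-end.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE sites (url TEXT)")
    db.executemany("INSERT INTO sites VALUES (?)",
                   [("http://www.example.org/",), ("http://www.example.com/",)])
    out_file = os.path.join(tempfile.gettempdir(), "start_urls.txt")
    print(dump_start_urls(db, out_file), "URLs written to", out_file)
```

Running the dump immediately before each dig keeps the URL list in step with the database without patching htdig itself.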
Re: [htdig] Pb indexing HTML with htdig 3.1.5
According to André LAGADEC: I use htdig 3.1.5 on Red Hat Linux 5.0, and I want to index a new web site. But when I run rundig I get only one document. So, to see what it is doing, I use rundig -vvv and I get this output:
Header line: HTTP/1.1 200 OK
Header line: Server: Netscape-Enterprise/3.5.1C
Header line: Date: Wed, 06 Dec 2000 07:32:02 GMT
Header line: Content-type: text/html
Header line: Last-modified: Mon, 15 Nov 1999 10:45:01 GMT
Translated Mon, 15 Nov 1999 10:45:01 GMT to 1999-11-15 10:45:01 (99)
And converted to Mon, 15 Nov 1999 10:45:01
Header line: Content-length: 1258
Header line: Accept-ranges: bytes
Header line: Connection: close
Header line:
returnStatus = 0
Read 1258 from document
Read a total of 1258 bytes
Tag: html, matched -1
head: size = 1258
pick: x.y.z.t, # servers = 1
htdig: Run complete
htdig: 1 server seen:
htdig: x.y.z.t:8000 1 document
You should be getting much more output than that with a verbosity level of 7! Is it possible that there is a NUL byte in the document, soon after the "<html>" tag? For some reason, htdig seems to be stopping right after this tag, and not getting anywhere close to the other tags in the document. I've tried it myself on the document you sent, and on that copy it worked fine. The comment around the JavaScript code is correct, and htdig was able to handle it. There must be something different in your copy of the document, such as a NUL byte, which is causing htdig's parser to end prematurely. I think that htdig doesn't like the HTML code "<!--//" and "//-->", and that it sees the beginning of the comment but not the end, and ignores the rest of the HTML code of the page. Am I right? Any other idea? What can I do? N.B.: The HTML code of the first page of the site is below this line.
_
<html>
<head>
<title>Accueil DIRECTION</title>
<base target="rtop">
<script language="JavaScript">
<!--//
var url="";
var nom="";
var bName="";
function Ouvrir() {
bName = navigator.appName
Version = navigator.appVersion
Version = Version.substring(0,1)
browserOK = ((Version >= 2))
if (browserOK) {
this.name="home";
msgWindow=window.open("actu/default2.htm","popupdpd","location=no,toolbar=no,status=no,directories=no,scrollbars=yes,width=400,height=450");
bName=navigator.appName;
if (bName=="Netscape") msgWindow.focus();
}
}
Ouvrir()
//-->
</script>
</head>
<frameset framespacing="0" border="false" frameborder="0" cols="155,*">
<frame name="gauche" scrolling="no" noresize target="haut_droite" src="defaulta.htm" marginwidth="0" marginheight="5">
<frameset rows="*,45">
<frame name="texte" target="bas_droite" src="defaultb.htm" scrolling="auto" marginwidth="0" marginheight="0" noresize>
<frame name="bas" src="basac.htm" scrolling="no" marginwidth="7" marginheight="15" noresize>
</frameset>
<noframes>
<body>
<p>Cette page utilise des cadres, mais votre navigateur ne les prend pas en charge.</p>
</body>
</noframes>
</frameset>
</html>
-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
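One quick way to test the NUL-byte theory raised in the reply is to scan the raw bytes of the page as the server delivers it. The sketch below is a minimal check on a byte string; the function name and the sample data are made up for illustration, and fetching the real page (for example by saving it to disk first) is left out.

```python
def first_nul_offset(data):
    """Return the offset of the first NUL byte in data, or -1 if there is none."""
    return data.find(b"\x00")


if __name__ == "__main__":
    # Made-up sample: a page with a stray NUL byte right after the first tag.
    sample = b"<html>\x00\n<head>\n<title>Accueil DIRECTION</title>\n"
    offset = first_nul_offset(sample)
    if offset >= 0:
        print(f"NUL byte at offset {offset}, after {sample[:offset]!r}")
    else:
        print("no NUL byte found")
```

If the offset falls just after the opening tag, that would be consistent with a parser that treats the document as a C string and stops at the first NUL.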
Re: [htdig] Incremental indexing
According to Wanrong Qiu: Does htdig support incremental indexing? I mean it is possible to only index new created or modified files. Thanks in advance. Yes, this is what htdig does by default if there is an existing database, and the htdig program is called without the -i (initialize) option. However, the rundig script that comes with the package calls htdig with the initialize option, as its main purpose is to create all the initial databases, so don't use the standard rundig script for update runs. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this. List archives: http://www.htdig.org/mail/menu.html FAQ:http://www.htdig.org/FAQ.html
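The difference between an initial dig and an update run, as described in the reply above, comes down to whether htdig is given the -i option. Here is a hedged sketch of a wrapper that builds the two command lines; the config file path is a placeholder, and the stock rundig script may run further steps that this sketch leaves out.

```python
import subprocess


def dig_commands(config_path, initial=False):
    """Return the htdig and htmerge argument lists for an initial or update run."""
    htdig = ["htdig", "-c", config_path]
    if initial:
        # -i discards the existing database and starts from scratch.
        htdig.append("-i")
    htmerge = ["htmerge", "-c", config_path]
    return [htdig, htmerge]


def run_all(commands):
    """Run each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)


if __name__ == "__main__":
    # Placeholder path; only prints the commands rather than running them.
    for command in dig_commands("/etc/htdig/htdig.conf"):
        print(" ".join(command))
```

Calling run_all(dig_commands(path)) from a scheduled job would give an update run, while initial=True reproduces the from-scratch behaviour.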
Re: [htdig] htdig fails to parse all files
According to Jeffery T Aiken: I've compiled htdig 3.1.5 on a Solaris 2.6 system. I have 5 directories on my web server containing a total of 54190 html docs and when I run htdig it only finds just over 18,000. I've used the -vvv -s options and see no errors during the dig. I am able to successfully htmerge these into the database and search, but can't figure out why htdig doesn't see them all. Anybody have an idea where I can go from here? Have you looked at FAQ 5.25 & 5.1? FAQ:http://www.htdig.org/FAQ.html -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] Htdig with German umlaut under Slackware
According to Jun Dong:

Thanks for your tips. In the Slackware 7.0 packages there are no LC_CTYPE, LC_* etc. files under /usr/lib/locale/de or deutsch. Under /usr/lib/locale/de there is only the directory LC_MESSAGES. I copied the directory de_DE, which includes all the LC_* files, from SUSE 6.2 to Slackware's /usr/lib/locale and made a symbolic link de -> de_DE. With your testlocale.cc code, after compiling it, I ran the command testlocale de and the screen printed out the German accented characters with umlauts exactly. But unfortunately htdig still does not work with German accents, no matter how exactly I configure htdig.conf. This really is a system problem with Slackware.

The problem with copying from a different system is that the C library may be different, and therefore may require a different set of file formats for locale support. This was the case in the transition from libc5 to glibc. However, if testlocale.c did recognize the German umlauts as alphanumeric, then it would suggest that things are mostly working correctly. I don't know why, but there are a few systems where this test program works, but htdig's locale support doesn't. I don't know what else to point the finger at besides the C library, though.

Alternatively, I found the tips at: ftp://sol.ccsf.cc.ca.edu/htdig/paches/3.1.5/accents.zip.README I modified HTML.cc and htsearch.cc and recompiled htdig, this time with no locale definition. Finally htdig with German accents is successfully installed. You can find it at the URL where I installed it: http://www.homepagemagazin.de/htdig/

The problem with the accents.zip patch is that it ends up stripping off all accents by converting all accented letters in the ISO-8859-1 character set to their unaccented counterparts. So, the excerpts won't contain the accents. While this isn't as nice as the accents.5 patch, which adds accent support as a new fuzzy match method, the patch you used is at least better than nothing for a system that doesn't properly support locales.

Gilles Detillieux wrote:

I believe there are still problems with locale support on Slackware Linux systems. See the thread entitled "Portuguese" from this past May: http://www.htdig.org/mail/2000/05/index.html#61 I never did get a followup message from Rodrigo indicating whether he had found a solution, but you may want to try the tips I gave him.

-- Gilles R. Detillieux
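To make the trade-off described in the message above concrete, here is a small Python sketch of what "stripping off all accents" does to text. It is an illustration only and not the code from accents.zip, which patches HTML.cc and htsearch.cc in C++; the function name strip_accents is made up for this example.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Return text with accented letters replaced by their base letters.

    Hypothetical helper for illustration; it is not part of ht://Dig.
    """
    # NFD splits a letter such as "é" into "e" plus a combining accent mark.
    decomposed = unicodedata.normalize("NFD", text)
    # Dropping the combining marks leaves only the unaccented base letters.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


if __name__ == "__main__":
    print(strip_accents("rénovation tourisme"))
    print(strip_accents("für"))
```

With this kind of folding, an accented and an unaccented spelling end up as the same indexed word, so both match at search time, but any excerpt built from the folded text no longer shows the accents.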
Re: [htdig] 3.1.5 Compile problems on Linux
According to Foerst, Daniel P.:

I am using RedHat 6.2 with GCC 2.95.2 and GNU ld 2.9.5, and I have libstdc++ 2.9.0-30 installed (latest version). This is htdig-3.1.5. I am not able to figure out what is going wrong. Any assistance you can lend is greatly appreciated! ... I run the configure and have the following... ...

    prefix= /home2/htdig
    # This specifies the root of the directory tree to be used for programs
    # installed by ht://Dig
    exec_prefix=${prefix}

I'm not positive about this, but I think in makefiles like this one, you need to use the syntax $(prefix), and not ${prefix} (i.e. use parentheses instead of braces).

... When I run make, everything works well, but then this slew of errors takes place.

    Entering directory `/sys2/installs/htdig-3.1.5/htfuzzy'
    gcc -o htfuzzy -L../htlib -L../htcommon -L../db/dist -L/usr/lib Endings.o EndingsDB.o Exact.o Fuzzy.o Metaphone.o Soundex.o SuffixEntry.o Synonym.o htfuzzy.o Substring.o Prefix.o ../htcommon/libcommon.a ../htlib/libht.a ../db/dist/libdb.a
    EndingsDB.o: In function `Endings::createDB(Configuration &)':
    /sys2/installs/htdig-3.1.5/htfuzzy/EndingsDB.cc:46: undefined reference to `cout'
    /sys2/installs/htdig-3.1.5/htfuzzy/EndingsDB.cc:46: undefined reference to `ostream::operator<<(char const *)'
    /sys2/installs/htdig-3.1.5/htfuzzy/EndingsDB.cc:52: undefined reference to `cout'
    /sys2/installs/htdig-3.1.5/htfuzzy/EndingsDB.cc:52: undefined reference to `ostream::operator<<(char const *)'

All of these should be in the libstdc++ library. However, the makefile is trying to link these with gcc rather than g++ or c++, which is probably a big part of the problem. I suspect something went wrong during the run of ./configure, most likely because your C++ compiler and libraries aren't installed where the configure program expected to find them.

...

    /sys2/gcc/lib/gcc-lib/i686-pc-linux-gnu/2.95.2/../../../../include/g++-3/iostream.h:106: undefined reference to `endl(ostream &)'
    /sys2/gcc/lib/gcc-lib/i686-pc-linux-gnu/2.95.2/../../../../include/g++-3/iostream.h:106: undefined reference to `cout'
    /sys2/gcc/lib/gcc-lib/i686-pc-linux-gnu/2.95.2/../../../../include/g++-3/iostream.h:106: undefined reference to `endl(ostream &)'

These error messages suggest that the C++ header files are not in the standard location. The compiler found them OK, but things are messing up at the linking stage. Is there a reason why you didn't just use the egcs-c++ and libstdc++ RPM packages that came with Red Hat 6.2? Those work fine with ht://Dig. I suspect that your setup as it is now wouldn't work well with any software that needs C++.

-- Gilles R. Detillieux
Re: [htdig] Named characters in search output
According to Tamas Nagy:

Hello, When using the "&rarr;" (right arrow) named character in the first part of HTML documents, htdig seems to generate "&amp;rarr; rom&aacute;" in the preview of documents. It is a bit strange, maybe a bug, because this string should generate a right arrow... Cheers, Tamas

PS: Config: HtDig 3.0.2b2, RedHat 7

I assume you mean 3.2.0b2. This is a known problem, which is fixed in the 3.2.0b3 development snapshots. See http://www.htdig.org/FAQ.html#q5.22

-- Gilles R. Detillieux
Re: [htdig] How Can I use htdig to index two or more websites?
According to Sean Harris:

How can I use htdig to index two or more websites? Thank you for your help! :-)

Just add all the URLs you want to the start_url attribute, and possibly adjust limit_urls_to if you want something less limiting than what you've put in start_url.

-- Gilles R. Detillieux
Re: [htdig] Help me to Search using Chinese!!!!!!!
According to Sean Harris:

Help me to Search using Chinese!!!

I'm afraid the answer hasn't really changed from 2-1/2 weeks ago. ht://Dig only supports 8-bit character sets. See http://www.htdig.org/FAQ.html#q4.10 This topic has been discussed many times on the list, and there are still no volunteers to take on the huge amount of work it would require to adapt ht://Dig for full Unicode support, and to add in the word splitting algorithms needed for many Asian languages.

-- Gilles R. Detillieux
Re: [htdig] Do me a favor
According to ellenliu:

I have downloaded 'htdig-3.2.0b2.tar' from your site, but I have trouble running it on my personal computer. My computer has Red Hat 6.2 installed, with kernel 2.2.14. However, when I run './configure', on line 993 it calls 'config.sub', and then the program exits with the message "can't run config.sub". Would you do me a favor and tell me why this happened and, most importantly, how I can run it successfully? Moreover, when should the embedded database be compiled, and how is it compiled?

Hardware configuration:
  CPU: Pentium processor 550
  Hard disk: 20G
  Memory: 64M

It would probably be helpful to see the full output from the ./configure program. This package has been successfully installed before on Red Hat systems (6.1, 6.2 and others), so I would think that the most likely problem is a missing component on your system. You may also want to try the most recent development snapshot of 3.2.0b3, instead of 3.2.0b2, as many known bugs in 3.2.0b2 have since been fixed.

-- Gilles R. Detillieux
Re: [htdig] restrict values and htdig.conf
According to [EMAIL PROTECTED]:

We have 3 htdig searches on our website. There are 3 different databases that are indexed:

    ../htdig/db/database1 indexed with ../conf/htdig1.conf
    ../htdig/db/database2 indexed with ../conf/htdig2.conf
    ../htdig/db/database3 indexed with ../conf/htdig3.conf

database3 includes all sites, while the other 2 databases contain only parts of the whole. Now I want to expand the HTML form with a select option as follows:

    <select name="restrict">
    <option value="" selected>.. Database1</OPTION>
    <option value="http://www.../">on Database2
    <option value="http://www.../">on Database3</OPTION>
    </SELECT>

O.k.! But how can I use this restrict value in my htdig.conf? According to the selection in the HTML form I must call the right htsearch with the right database!

You seem to be confusing two alternate methods of restricting search results. You use the restrict parameter on htsearch only when searching a database that contains everything, in order to restrict the results to a subset of that database, i.e. only the URLs that match a particular pattern. If you want the user to select separate databases, then you should leave the restrict input parameter as an empty string, and have the user select the value of the "config" input parameter, which should be one of htdig1, htdig2 or htdig3, i.e. the three configuration files you mentioned above with the directory and .conf file name extension stripped off.

-- Gilles R. Detillieux
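As a rough illustration of the second method in the reply above (choosing a database through "config" while leaving "restrict" empty), the Python sketch below builds the query string such a form submission would produce. The /cgi-bin/htsearch path and the helper name htsearch_url are assumptions made for the example; "config", "restrict" and "words" are the htsearch input parameters discussed in the thread.

```python
from urllib.parse import urlencode


def htsearch_url(words: str, config: str, restrict: str = "",
                 base: str = "/cgi-bin/htsearch") -> str:
    """Build a GET URL for htsearch.

    base is a hypothetical CGI location; adjust it to the local install.
    config is a configuration file name without directory or .conf suffix.
    """
    query = urlencode({"config": config, "restrict": restrict, "words": words})
    return base + "?" + query


if __name__ == "__main__":
    # Search the database described by htdig2.conf, with no URL restriction.
    print(htsearch_url("spinal cord", config="htdig2"))
```

The point of the sketch is only that the database choice travels in "config"; "restrict" stays empty unless a single all-inclusive database is being narrowed by URL pattern.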
Re: [htdig] Htdig with German umlaut under Slackware
According to jdong:

I have installed htdig under Slackware 7.0 and configured it as a German version. In htdig.conf the following were added:

    locale: de_DE
    lang_dir: ${common_dir}/german
    bad_word_list: ${lang_dir}/bad_words
    endings_affix_file: ${lang_dir}/german.aff
    endings_dictionary: ${lang_dir}/german.0
    endings_root2word_db: ${lang_dir}/root2word.db
    endings_word2root_db: ${lang_dir}/word2root.db

and the files bad_words, german.aff and german.0 were copied into that directory. Everything is going OK. htdig can find every word except those with German umlauts such as ä (&auml;) ... My Slackware Linux was installed as a German version. When I type locale -a on the command line:

    locale -a
    ..
    de
    deutsch
    de_DE

Whether I set LANG and LC_CTYPE to de_DE, de or deutsch, htdig never finds words with German umlauts. But the same htdig installed under SUSE Linux 6.2 and configured the same way has no problem with German umlauts. I don't know how I can configure the Slackware locale and resolve this problem.

I believe there are still problems with locale support on Slackware Linux systems. See the thread entitled "Portuguese" from this past May: http://www.htdig.org/mail/2000/05/index.html#61 I never did get a followup message from Rodrigo indicating whether he had found a solution, but you may want to try the tips I gave him.

-- Gilles R. Detillieux
Re: [htdig] 3.20b2 -- oddity
According to [EMAIL PROTECTED]:

I tried to do htsearch, using the following .conf file:

    site_id:10009
    include:/www/vhosts/a/autosearchusa.com/htdig3.2b2/conf/cv_0.conf
    database_dir: /www/vhosts/a/autosearchusa.com/htdocs/www/u-wrk/sngl/data
    database_base: ${database_dir}/dt_${site_id}

Point of interest is that, within the included file, the values of database_dir/base were

    database_dir: /www/vhosts/a/autosearchusa.com/htdocs/www/u-dvl/sngl/data # this way for htdig
    database_base: ${database_dir}/dt_${site_id}

Wanted data was in the " . . . u-wrk . . " node. Initial search found wanted data. Second search (for 11th, etc, result), however, tried to obtain data from the "u-dvl" (and failed due to not there). Changed "u-dvl" to u-wrk, in the included file, and all worked as intended (did, btw, verify that SAME config file was being used at all points). Seems as if override of database_base should either happen, or not happen, consistently.

There were two bugs in 3.2.0b2, which are fixed in the 3.2.0b3 development snapshots, which would have worked together to cause the behaviour you observed. The first was that on followup pages for a given search, the "config=" parameter got doubled up, causing the configuration to be read twice. The second was that the stack that tracks includes got messed up after the first main config file was read, so when extra config parameters were handled, includes in these additional config files weren't handled properly, causing the parser to stop reading right after the include.

-- Gilles R. Detillieux
Re: [htdig] Decoding -v output.
According to Eric Bliss:

Is there any place where I can find a listing of what each field of the -v output of htdig is for and what the various values (including the part where it gives the - + and *) mean?

Here's one for the FAQ... When htdig -v spits out a line like this:

    23000:35506:2:http://xxx.yyy.zz/index.html: ***-+--++***+ size = 4056

The first number is the number of documents parsed so far, the second is the DocID for this document, and the third is the hop count of the document (number of hops from one of the start_url documents). After the URL, it shows a "*" for a link in the document that it already visited (or at least queued for retrieval), a "+" for a new link it just queued, and a "-" for a link it rejected for any of a number of reasons. To find out what those reasons are, you need to run htdig with at least 3 "v" options. If there are no "*", "+" or "-" symbols after the URL, it doesn't mean the document was not parsed or was empty, but only that no links to other documents were found within it.

-- Gilles R. Detillieux
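For anyone who wants to tally these fields from a log, here is a minimal Python sketch that splits one such line into the fields described above. The pattern is inferred only from the single sample line in this message, so treat it as an assumption; the names DigLine, parse_dig_line and link_counts are invented for the example.

```python
import re
from typing import NamedTuple


class DigLine(NamedTuple):
    parsed_so_far: int   # number of documents parsed so far
    doc_id: int          # DocID of this document
    hop_count: int       # hops from one of the start_url documents
    url: str
    links: str           # one symbol per link found in the document
    size: int


# Layout assumed from the sample: count:DocID:hops:URL: <symbols> size = N
LINE_RE = re.compile(r"^(\d+):(\d+):(\d+):(\S+?):\s*([*+-]*)\s*size = (\d+)\s*$")


def parse_dig_line(line: str) -> DigLine:
    """Parse one line of `htdig -v` output in the assumed layout."""
    match = LINE_RE.match(line)
    if match is None:
        raise ValueError("not a recognised htdig -v line: %r" % line)
    count, doc_id, hops, url, links, size = match.groups()
    return DigLine(int(count), int(doc_id), int(hops), url, links, int(size))


def link_counts(entry: DigLine) -> dict:
    """Count links by outcome: '*' already seen, '+' queued, '-' rejected."""
    return {
        "already_seen": entry.links.count("*"),
        "queued": entry.links.count("+"),
        "rejected": entry.links.count("-"),
    }


if __name__ == "__main__":
    sample = "23000:35506:2:http://xxx.yyy.zz/index.html: ***-+--++***+ size = 4056"
    entry = parse_dig_line(sample)
    print(entry.doc_id, entry.hop_count, link_counts(entry))
```

A line with no symbols after the URL still parses; its link counts are simply all zero, which matches the note above that such a document was parsed but contained no links.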
Re: [htdig] Decoding -v output.
According to Eric Bliss:

Many thanks for answering this question. You're right, it should have been in the FAQ.

I just committed to CVS the answers to FAQ 5.26 (htdig -v output) and 5.27 (reasons for rejection). They should be up on the web site within an hour.

-- Gilles R. Detillieux
Re: [htdig] powerpoint and excel to html/text filter
According to Cheng-Wei Cheng:

Re: powerpoint and excel to html/text filter. Can anyone give me some pointers? Thanks.. cheng

Have a look at the latest version of doc2html on the htdig.org web site:

    http://www.htdig.org/files/contrib/parsers/README.doc2html
    http://www.htdig.org/files/contrib/parsers/doc2html.tar.gz

You will need to obtain the actual conversion filters that doc2html uses, but its documentation will tell you where you can find them.

According to David J Adams:

Version 2.1 uses both the magic number and the MIME type to decide which conversion utility to use, and is able to cope with:

    MS Word (most versions including Word2 and Word for MAC)
    MS Excel
    MS Powerpoint
    Wordperfect (purchase of wp2html necessary)
    Adobe PDF
    Postscript
    RTF

There are a number of minor improvements, including a useful improvement in the conversion of PDF files.

-- Gilles R. Detillieux
Re: [htdig] Same problem with ~s
According to Ing. Noel Vargas Baltodano:

I just don't know how to make htdig check the /httpd subdirs and the /~username URLs. If anyone is kind enough to explain it to me AS CLEARLY as possible, or tell me where I can get the right answer to this problem, I'd be very grateful.

And I'd be grateful if someone would help me to make FAQ 5.25 as clear as possible, as I fear the current wording may be missing the mark.

-- Gilles R. Detillieux
Re: [htdig] problems building htdig on cygwin
According to Geoff Hutchison:

At 10:03 AM -0800 11/26/00, Joe R. Jah wrote:

There is a chance that stubs.h would also require other header file(s), and those files require yet other files ... ad infinitum;( ;))) In short, don't hold your breath.

And I have said already if someone can find me a strptime replacement for the systems that don't have it (e.g. BSDI and cygwin evidently), I'll use it instead. Until then, you are correct, I don't know of a way of resolving it. (And as you say, including other header files might continue ad infinitum, which seems silly to me.)

Remind me again, what was the problem with the strptime replacement in 3.1.5? I know there was a y2k bug, which I fixed almost 2 years ago, but was there anything else? If it was because it left other fields uninitialised, then I think we solved that too, didn't we?

All this nonsense about finding a langinfo.h for strptime, and then finding an nl_types.h for langinfo.h, ad infinitum, is beyond silly. Here's why: no one has stopped to question why these headers are needed in the first place. These are all for NLS support, which is precisely what we DO NOT want in htdig! The locale support in htdig deliberately sets LC_TIME handling back to the "C" locale specifically to avoid having the time in Last-Modified headers and other headers parsed under the rules of other locales. We don't want this! So why are we fighting to crowbar an NLS-ready strptime into the distribution when we had one that worked without all the extra baggage?

-- Gilles R. Detillieux
[htdig] Re: extra_word_characters (PR#952)
According to Tomas Frydrych:

Version: 3.1.5. I need to add '+' to the list of valid word characters; after doing so htdig will index all words that contain '+' inside, but refuses to index words that start with '+' (and I suspect also words that end with it).

OK, I was able to reproduce the problem after all. I had limited my tests before to htdig only, but the problem was in htmerge. It gives special meaning to lines in the db.wordlist file that begin with "+", "-", and "!", to mark document IDs that are unchanged, discarded or superseded. Trouble is htmerge reads the wordlist assuming a valid word would never begin with one of these, so its test for these is too liberal. Here's a patch to correct the problem, so that you can add any of these three special characters to extra_word_characters and allow words that begin with one of them. Apply it in the htdig-3.1.5 main source directory using "patch -p0 < this-message-file".

--- htmerge/words.cc.wordbug    Thu Feb 24 20:29:11 2000
+++ htmerge/words.cc    Fri Nov 24 09:54:27 2000
@@ -74,37 +74,40 @@ mergeWords(char *wordtmp, char *wordfile
     //
     while (fgets(buffer, sizeof(buffer), sorted))
     {
-        if (*buffer == '+')
+        //
+        // Split the line up into the word, count, location, and
+        // document id.
+        //
+        word = good_strtok(buffer, '\t');
+        pair = good_strtok(NULL, '\t');
+        if (!word || !*word || !pair || !*pair)
         {
+            if (*buffer == '+')
+            {
             //
             // This tells us that the document hasn't changed and we
             // are to reuse the old words
             //
-        }
-        else if (*buffer == '-')
-        {
+            }
+            else if (*buffer == '-')
+            {
             if (removeBadUrls)
             {
                 discard_list.Add(strtok(buffer + 1, "\n"), 0);
                 if (verbose)
                     cout << "htmerge: Removing doc #" << buffer + 1 << endl;
             }
-        }
-        else if (*buffer == '!')
-        {
+            }
+            else if (*buffer == '!')
+            {
             discard_list.Add(strtok(buffer + 1, "\n"), 0);
             if (verbose)
                 cout << "htmerge: doc #" << buffer + 1
                      << " has been superceeded." << endl;
+            }
         }
         else
         {
-            //
-            // Split the line up into the word, count, location, and
-            // document id.
-            //
-            word = good_strtok(buffer, '\t');
-            pair = good_strtok(NULL, '\t');
             wr.Clear();     // Reset count to 1, anchor to 0, and all that
             sid = "-";
             while (pair && *pair)

-- Gilles R. Detillieux
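The order of the tests is the whole point of the patch above, so here is a short Python paraphrase of the patched decision. It is a sketch of the logic only, not htmerge code: the record contents in the examples are made up, the exact wordlist field layout is not shown in this message, and str.split stands in for htdig's good_strtok on the assumption that empty fields are preserved.

```python
def classify_wordlist_line(line: str) -> str:
    """Classify a db.wordlist line the way the patched loop appears to.

    Returns "word", "unchanged", "discarded", "superseded" or "ignored".
    The return labels are invented for this illustration.
    """
    fields = line.rstrip("\n").split("\t")
    word = fields[0]
    pair = fields[1] if len(fields) > 1 else ""
    # Patched behaviour: a line with a non-empty word AND a non-empty
    # second field is a word record, whatever its first character is.
    if word and pair:
        return "word"
    # Only lines that are not word records are checked for the markers.
    if line.startswith("+"):
        return "unchanged"
    if line.startswith("-"):
        return "discarded"
    if line.startswith("!"):
        return "superseded"
    return "ignored"


if __name__ == "__main__":
    print(classify_wordlist_line("+plus\tsome-fields\n"))  # word record
    print(classify_wordlist_line("+1234\n"))               # marker line
```

Before the patch the first character was tested first, so a word such as "+plus" was mistaken for a marker line; testing for the tab-separated fields first removes that ambiguity.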
[htdig] Re: valid_punctuation setting (was: extra_word_characters (PR#952))
According to Tomas Frydrych:

I do have one question though; when defining valid_punctuation, do I have to include ' ' (i.e. space), or is ' ' always included, and if I have to include it explicitly, where/how do I put it in the string?

No, white space characters (space, tab, newline) are treated separately from valid_punctuation and any other punctuation characters. The htdig parser uses the C library function isspace() to test if a character is a white space character, and these are usually defined by your locale, although with any ASCII or ISO character set these will be pretty much the standard three characters above, and perhaps a few more obscure ones. It would not make sense to add a space to valid_punctuation, nor can you.

The valid_punctuation characters are those that are allowed within a compound word. Historically, a word like "post-doctoral" was indexed only as "postdoctoral" if the "-" was in valid_punctuation. In more recent versions, it is indexed as "postdoctoral", "post" and "doctoral". But you see how valid_punctuation characters have a special meaning within a word. They don't cause a distinct break between words the way that any other punctuation character would, or the way that white space would. E.g. the comma "," is not normally included in valid_punctuation so it always breaks words apart, while the hyphen or apostrophe can appear within a word (in English, in any case).

-- Gilles R. Detillieux
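The "post-doctoral" example above can be mimicked in a few lines of Python. This is only a sketch of the behaviour as described for the more recent versions, not htdig's parser: the function name index_forms is made up, it handles a single token that has already been separated by white space, and it ignores other settings such as minimum word length.

```python
import re
from typing import List


def index_forms(token: str, valid_punctuation: str) -> List[str]:
    """Return the forms a compound word would be indexed under.

    The compound with valid_punctuation characters removed comes first,
    followed by the individual parts, as in the "post-doctoral" example.
    """
    if not valid_punctuation:
        return [token] if token else []
    # Split only on the characters listed in valid_punctuation.
    pattern = "[" + re.escape(valid_punctuation) + "]"
    parts = [part for part in re.split(pattern, token) if part]
    compound = "".join(parts)
    if len(parts) <= 1:
        return [compound] if compound else []
    return [compound] + parts


if __name__ == "__main__":
    print(index_forms("post-doctoral", "-'"))
    print(index_forms("hello", "-'"))
```

A comma is not in the valid_punctuation string passed here, so this helper leaves it alone; in the real parser a comma would already have split the text into separate words before this step.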
Re: [htdig] libstdc++.so.2.10.0
According to David Robley:

On Fri, 24 Nov 2000, NSWPS Intranet Project wrote:

Receive the following on rundig:

    intranet02 # rundig
    ld.so.1: /usr/pkgs/www/bin/htdig: fatal: libstdc++.so.2.10.0: open failed: No such file or directory
    Killed
    ld.so.1: /usr/pkgs/www/bin/htmerge: fatal: libstdc++.so.2.10.0: open failed: No such file or directory
    Killed
    ld.so.1: /usr/pkgs/www/bin/htnotify: fatal: libstdc++.so.2.10.0: open failed: No such file or directory
    Killed
    ld.so.1: /usr/pkgs/www/bin/htfuzzy: fatal: libstdc++.so.2.10.0: open failed: No such file or directory
    Killed
    ld.so.1: /usr/pkgs/www/bin/htfuzzy: fatal: libstdc++.so.2.10.0: open failed: No such file or directory
    Killed

PLEASE HELP!!! Sean

What OS are you running (before someone else asks)?

I'd bet it's Solaris, although the problem may occur on other platforms, and the solution is likely the same. Please see http://www.htdig.org/FAQ.html#q3.6

-- Gilles R. Detillieux
Re: [htdig] Does htmerge remove URL from database ?
According to Olivier Korn:

3. Once a week, htdig is called on each site with "htdig -i -c site1.conf" then "htdig -i -c site2.conf" (and so on).

4. After all the sites have been htdigged, I run htmerge in sequence in order to merge all the small databases into one. The first call is "htmerge -c site1.conf", subsequent calls are "htmerge -c site1.conf -m site2.conf", "htmerge -c site1.conf -m site3.conf" (and so on).

...

2. Now let's hear the amazing part of my story. If I do a "htmerge -c site5.conf" (notice there is no -m this time) and if I htsearch -c site5.conf with "rénovation tourisme", my document is said to be found! Said another way, the document was indexed but was certainly ripped out when merging with another database.

I think after each separate htdig -i -c site#.conf you should run a separate htmerge -c site#.conf, not just on the first site, before you merge everything together. Try that and see if it solves the problem. I think the intention was that these extra merges should not have been necessary, but this has come up before, and I think there's a problem with merging multiple DBs when they haven't already been cleaned up by a simple htmerge.

-- Gilles R. Detillieux
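The ordering suggested in the reply above can be written down as a small Python sketch that only builds the list of commands; it does not run anything. The helper name merge_plan is invented, the commands and options (-i, -c, -m) are the ones quoted in the thread, and the site file names are placeholders.

```python
from typing import List


def merge_plan(configs: List[str]) -> List[List[str]]:
    """Return the command sequence: dig and merge each site on its own,
    then fold every other site into the first one with -m."""
    plan: List[List[str]] = []
    for conf in configs:
        plan.append(["htdig", "-i", "-c", conf])
        # The extra per-site htmerge is the step suggested in the reply.
        plan.append(["htmerge", "-c", conf])
    for conf in configs[1:]:
        plan.append(["htmerge", "-c", configs[0], "-m", conf])
    return plan


if __name__ == "__main__":
    for command in merge_plan(["site1.conf", "site2.conf", "site3.conf"]):
        print(" ".join(command))
```

Each inner list can be handed to whatever runs the weekly job; keeping the plan as data makes it easy to check that no site is merged with -m before it has had its own plain htmerge.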
Re: [htdig] different search results
According to Geoff Hutchison:

On Mon, 20 Nov 2000, David Adams wrote:

Or does one really get a link which when followed brings up the .PDF document open at the relevant page? If so, that would be quite something, especially if it worked for a range of browsers. What would be the correct HTML <a name="..."> tags for the anchors?

This is on the right track. Basically, you can pass along information to Acrobat to open to a particular page. So AFAIK, it works with all browsers that support the Acrobat PDF plugin.

Pierre Olivier discussed the technique some months ago on this list, and has a web page that describes it. I forget the URL, but you'll find it quickly with a Google.com search for "pdftodig". There's also a little script that implements the same capability in xpdf, for locating the right page. The technique involves using a cgi script URL in the anchor tag, with the cgi script spitting out some XML for Acrobat.

-- Gilles R. Detillieux
Re: [htdig] Redirection of Htdig output -- 3.20b2
According to [EMAIL PROTECTED]:

The following line of Perl code is intended to run htdig and send STDOUT to /htdig3.2b2/autoshop-online._htdig.log:

    system "/htdig3.2b2/bin/htdig","-svic","/htdig3.2b2/sngl/conf/autoshop-online.conf",">", "/htdig3.2b2/autoshop-online._htdig.log";

The execution of htdig produces valid content in STDOUT, but it goes to STDOUT itself (as opposed to the specified file). As best I can tell from a review of the Perl (5.005_03) documentation, the syntax of the above command is valid.

While I'm no Perl expert, I've never seen "system" used in this way. I think

    system("/htdig3.2b2/bin/htdig -svic /htdig3.2b2/sngl/conf/autoshop-online.conf > /htdig3.2b2/autoshop-online._htdig.log");

will do what you want. The string just gets passed to the shell for parsing, as far as I know, so you use standard sh/ksh/bash syntax in the string. Perhaps in the syntax you used, the ">" got passed literally as argument 3 to the htdig program.

-- Gilles R. Detillieux
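The same distinction between an argument list and a shell string shows up in other languages too. As an illustration (in Python rather than Perl, and not a replacement for the one-line fix above), the sketch below runs a command from an argument list with no shell involved and sends its standard output to a file by opening the file directly. The function name run_logged is made up for the example.

```python
import subprocess
import sys
from typing import List


def run_logged(argv: List[str], log_path: str) -> int:
    """Run argv without a shell and write its standard output to log_path.

    Because no shell parses the arguments, a ">" placed in argv would be
    handed to the program as an ordinary argument, not treated as a
    redirection; the redirection is done here by passing an open file.
    """
    with open(log_path, "w") as log:
        completed = subprocess.run(argv, stdout=log, check=False)
    return completed.returncode


if __name__ == "__main__":
    # Demonstration with the Python interpreter itself as the child program:
    # the child prints the arguments it received, including the literal ">".
    code = "import sys; print(sys.argv[1:])"
    status = run_logged([sys.executable, "-c", code, ">", "ignored.log"], "demo.log")
    print("exit status:", status)
```

For the htdig case, argv would hold the program path followed by its options and the config file name, and log_path the log file; nothing in the list should try to express the redirection.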
Re: [htdig] SSL Patch
According to Michael Arndt:
When applying SSL.0 or SSL.2 (SSL.1 doesn't apply) to a htdig 3.1.5 fresh from Server, I get problems when trying to compile on a Linux box:
...
Server.cc: In method `Server::Server(char *, int, int, StringList * = 0)':
Server.cc:44: passing `const char *' as argument 1 of `String::operator =(char *)' discards qualifiers

Try replacing lines 43-44 of the patched Server.cc with the following construct to see if it would keep your compiler happy:

String url = "http://";
if (ssl)
    url = "https://";

-- Gilles R. Detillieux
Re: [htdig] ssl patch for ht://dig
According to Jeremy Lyon:
Gilles, thank you so much. That worked. Now I have a new problem. I am indexing from the local file system. When I do a search, everything works fine except that the URLs returned for the SSL sites appear like this:

http://ecom.uswest:443/path

It's storing them as regular http:// instead of https://, and it's cutting off the .com. Any ideas?

Not a clue, but then I haven't had a good long look at the SSL patch to see what it's doing. You should probably ask the developer of the original SSL patch (for 3.1.3, I think), as the current one is supposedly a straight port of it to 3.1.5.

-- Gilles R. Detillieux