from:"Gilles Detillieux"

Re: [htdig] db.docdb db.wordlist

2001-01-22 Thread Gilles Detillieux


Check the FAQ:  http://www.htdig.org/FAQ.html#q5.16

You don't need to pre-create any of the database files.  They're created
as needed.  The problem is htdig isn't finding anything that it can index.
You need to find out why.  The FAQ should give you a starting point.

According to Cormac Robinson:
 I'm running 3.1.5. Unfortunately I'm building the search engine on a server
 which was set up and is maintained over in England. Knowing them they
 probably installed Red Hat 6. from source.
 
 Can anybody send me a blank db.docdb and db.wordlist? I'll try running it again
 if I get these files. If not, I might try to get an earlier version of the
 search engine.
 
 Cormac.
 
 
 
  What version are you running? Did it come in one of Red Hat's .RPM
  packages or did you download it?
 
  I'm using 3.1.5 (I think that's the right version number...), which came
  with RH 6.2 Pro. I installed the RPM, modified the config to point to the
  server's homepage (you may add other URLs, it's up to you), then ran htdig.
 
  After that, I ran htmerge, and that was it. I had a little problem with
  user accounts, which was promptly fixed (these guys really know their
  stuff), but I didn't have much trouble.
 
   same thing here
  
   On Sat, 20 Jan 2001, Cormac Robinson wrote:
  
    I've just installed htdig and after running rundig I get the following
    error:

    htmerge: Unable to open word list file
    '/home/httpd/docs/search/test/db/db.wordlist'

    DB" problem ...: /home/httpd/docs/search/test/db/db.docdb no such file
    or directory.

    I've created a file in the db directory db.docdb - but I then get an
    error that it is not the correct file format.
    Is there an initial startup file I need to run on a first run?
    As far as I know the server I am running on is Red Hat Linux 6.0.

    Thanks
   
  
  
   
   To unsubscribe from the htdig mailing list, send a message to
   [EMAIL PROTECTED]
   You will receive a message to confirm this.
   List archives:  http://www.htdig.org/mail/menu.html
   FAQ:http://www.htdig.org/FAQ.html
  
 
  --
 
  Noel Vargas Baltodano
  [EMAIL PROTECTED]
 
  Gerente de Sistemas
  Nicatechnologies, S.A.
  http://www.nicatech.com.ni
 
 
 
 


-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Re: Fw: doc2html

2001-01-22 Thread Gilles Detillieux


According to Leong Peck Yoke:
 Actually I was wrong, the code means replacing the soft hyphen \255 with 
 \055. I didn't read it carefully. Sorry for the inconvenience caused.
 
 Regards,
 Peck Yoke

No problems.  The octal code 055 is the ASCII hyphen (-), while 255 octal
is the ISO-8859-1 code for soft hyphen, which oddly enough is often used
to encode a long dash rather than a soft hyphen.  This little substitution
was something I added to parse_doc.pl, and kept in conv_doc.pl, because
it solved a problem for me in dealing with some of my PDFs.  I don't
think it should pose a problem for anyone else, but if it ever does it's
easily removed from the script.
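
For anyone who wants to see the byte values involved, here is a minimal
sketch of the same substitution in Python rather than Perl.  The function
name and the sample bytes are made up for illustration; only the two octal
codes come from the script.

```python
# Octal 255 is the ISO-8859-1 soft hyphen (0xAD); octal 055 is the ASCII hyphen.
SOFT_HYPHEN = 0o255
ASCII_HYPHEN = 0o055


def replace_soft_hyphens(data: bytes) -> bytes:
    """Replace every ISO-8859-1 soft hyphen byte with an ASCII hyphen."""
    return data.replace(bytes([SOFT_HYPHEN]), bytes([ASCII_HYPHEN]))


# Hypothetical sample: text extracted from a PDF that used 0xAD as a dash.
sample = b"1999\xad2000"
print(replace_soft_hyphens(sample))
```

This works on raw bytes, as the Perl substitution does on ISO-8859-1 text,
so no character decoding is needed first.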

 David Adams wrote:
 
  When I wrote doc2html I copied this without change from conv_doc, and I
  think it is the same in the original parse_doc parser script.  Is Leong
  correct?
  --
  David Adams
  Computing Services
  Southampton University
  
  
  - Original Message -
  From: "Leong Peck Yoke" [EMAIL PROTECTED]
  To: [EMAIL PROTECTED]
  Sent: Sunday, January 21, 2001 1:18 PM
  Subject: doc2html
  
  
  Hi,
  
  I am looking at your code doc2html.pl for a project. I notice that in
  function try_text at line 366, the following code
  
  s/\255/-/g; # replace dashes with hyphens
  
  seems to be wrong. Shouldn't it be "s/\055/-/g" instead?
  
  
  
  Regards,
  Peck Yoke
  
  
  
 
 
 

Re: [htdig] Avoiding search on file name

2001-01-18 Thread Gilles Detillieux


According to Loys Masquelier:
 In fact, it seems that htsearch results include directories and files where
 the searched word is inside the directory or file name.
 
 Ex :
 /foo/foo.html
 Searched word : foo
 Result :
 /foo
 /foo/foo.html
 
 Is there a way to keep htsearch from finding those directories and files?

That's exactly what I thought the problem was.  Setting description_factor
to 0 and reindexing should prevent the foo.html file from coming up in a
search for foo, but suppressing the foo directory is a little more tricky.
For that, you should look into the suggestions in the "new ask" thread
from this past September, at

http://www.htdig.org/mail/2000/09/index.html#111


 Thanks.
 
 Loys.
 
 Gilles Detillieux wrote:
 
  Or perhaps, if I understand correctly, setting description_factor to 0
  and reindexing would be the way to avoid this.  If you point htdig to
  a directory that doesn't contain an index.html or equivalent file, and
  the web server automatically generates the directory index, then the file
  names will be used as link description text for the links to those files.
  If that's what is happening here, then you want to tell htdig not to
  put any weight on the words appearing in link description text, as above.
 
  According to Peterman, Timothy P:
   I think setting "title_factor" to 0 in the config file will do that.
   You'll probably need to reindex for that change to take effect.
  
   Loys Masquelier wrote:
   
    Hello,

    I have a problem in indexing a file hierarchy. Htdig by default indexes
    all the names of all the files. When I search for a word, if that word
    is found in a file name, htsearch returns the file path. But I only want
    files which contain the given word.

    Is there a way to avoid that file name indexing?

    Thanks in advance.

    Best regards.

    Loys.


Re: [htdig] Spelling Help

2001-01-18 Thread Gilles Detillieux


According to Geoff Hutchison:
 At 1:34 PM + 1/18/01, David Adams wrote:
 1)What have other sites done to address this problem?  (Spell checking
 and correcting our own
 
 Use good fuzzy methods, including the synonym file. We are working on 
 additional fuzzy matching code, but of course if anyone can come up 
 with sample code that produces a list of suggestion words from an 
 input, we can probably port it.
 
 2)Can anybody recommend a _good_ (UK English) spell checker for IRIX
 6.5?
 
 Yes. Try ispell with the UK dictionaries.

Back in October, Greg Holmes posted a python wrapper script for htsearch,
which used ispell to suggest alternative spellings.  The thread that
ensued is at http://www.htdig.org/mail/2000/10/index.html#295

The ispell package is GNU software, so it should port to IRIX easily
enough, I'd think, and the dictionaries are very customisable.


Re: [htdig] Indexing a given list of file

2001-01-18 Thread Gilles Detillieux


According to Loys Masquelier:
 I want to check that it is not possible to index a list of changed files
 without reindexing all the data.
 In fact the situation is that I know that that list of files needs to be
 reindexed and I want to do that as fast as possible.

You may be out of luck with the current version.  However, doing an update
dig is usually much, much faster than reindexing from scratch.  htdig will
ask the server about each document in the database, but will only fetch it
again if it's been modified since the last indexing run, so it can do this
pretty quickly.
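
To illustrate why checking an unmodified document is cheap, here is a rough
sketch of a conditional request in Python.  It assumes the check is done
with the standard HTTP If-Modified-Since header and a 304 reply; the
function names are hypothetical and this is not htdig's own code.

```python
from email.utils import formatdate


def conditional_headers(last_indexed_epoch: float) -> dict:
    """Build headers asking the server to send the body only if it changed."""
    return {"If-Modified-Since": formatdate(last_indexed_epoch, usegmt=True)}


def needs_reindex(status_code: int) -> bool:
    """A 304 (Not Modified) reply means the indexed copy is still current."""
    return status_code != 304


# Hypothetical timestamp of the previous indexing run (seconds since epoch).
print(conditional_headers(979776000.0))
print(needs_reindex(304), needs_reindex(200))
```

With a reply of 304 the server sends no document body at all, which is
where the time saving comes from.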

There was talk of adding to the 3.2 code a feature whereby you can tell
htdig not to recheck all the indexed documents, but only check a given
list of URLs.  I don't remember if this feature is already in the current
development snapshots.


Re: [htdig] Memory requirements

2001-01-18 Thread Gilles Detillieux


According to Pat Lennon:
 I have a Linux box with approx 1 gig of html and pdf books. I want to
 use htdig for the search engine. I don't want to assume too much,
 but will 1 additional gig of hard disk cover the size of the index
 database? I figure double may be a safe starting point. Also what type
 of memory requirements should I consider at a minimum? The hardware is a
 Cyrix 150, 64 meg RAM, Red Hat 6.2, Apache web server. I know this is a
 vague question... I would just like some reasonable starting points.

I don't know for sure, but my gut reaction is that 64 MB of RAM is pretty
small for a 1 GB web site.  I'd think you'd at least want to double that.
However, it may work with what you have, although probably quite a bit
more slowly.

The web site size to database size ratio is a little hard to predict.
It depends a lot on how much of your web site is indexable text vs.
unindexable content (e.g. images, etc.), and what your config attribute
settings are (max_doc_size, max_head_length).  A 1:1 ratio is probably
safe enough for htdig 3.1.5, but databases tend to be bigger in 3.2.0bx.


Re: [htdig] solaris 2.6 and htdig 3.1.5

2001-01-17 Thread Gilles Detillieux


According to Ronald Edward Petty:
 I think it's 3.1.5 (whatever the latest stable is).  Anyway, I emailed
 yesterday about this:
 ares:/export/netapp/user/rpy/htdig-3.1.5/htfuzzy/ make
 c++ -o htfuzzy -L../htlib -L../htcommon -L../db/dist -L/usr/lib Endings.o
 EndingsDB.o Exact.o Fuzzy.o Metaphone.o Soundex.o SuffixEntry.o Synonym.o
 htfuzzy.o Substring.o Prefix.o ../htcommon/libcommon.a ../htlib/libht.a
 ../db/dist/libdb.a -lz -lnsl -lsocket
 /usr/local/lib/gcc-lib/sparc-sun-solaris2.6/2.95.2/libgcc.a: could not
 read symbols: Bad value
 collect2: ld returned 1 exit status
 make: *** [htfuzzy] Error 1
 
 and now I was wondering, everywhere I search on the net I get the
 impression that gcc is calling the wrong linker.  I type as -version and
 it's the GNU assembler in my path, and same for ld.  So I am assuming that
 there is a version of the Solaris as or ld that is messing me up.  2
 questions:
 
 1) Is it possible there is another problem that can be generating this?  I
 ask this so I don't have to manually link all this. I have never done that
 before, so maybe I should, to learn... gee
 2) If no one thinks it is another problem... how can I "watch" the makefile
 call the linker, etc.?  If I use top it doesn't show.  I do where ld and get
 5 choices, and if I do /asdf/asdf/asdf/ld -version on 2 of them I get GNU
 and on the other 3 I get invalid option. Could these maybe be the Solaris
 versions? I can't tell; there is no option listed to tell.
 
 HELP (whiny voice)
 Thanks
 Ron Petty

Try compiling some different C++ code on your system.  I'm almost certain
that this is a problem with the setup of GNU C++ on your particular machine,
and not an ht://Dig problem.  If so, then this is not the best place to get
help, and you'd probably have better luck on a GNU C++ related mailing list
or newsgroup.


[htdig] Re: Reindex

2001-01-17 Thread Gilles Detillieux


According to Elsa Chan:
 We just launched a new site, but the search engine is indexing pages that
 don't exist anymore. I think I just need to restart htdig except I don't
 know how. I tried searching for info on the htdig web site but I couldn't
 find anything. Would you be able to help me?

Running the standard "rundig" script will rebuild your database from scratch.
You can also manually run "htdig -i" and "htmerge" to do this.


Re: [htdig] /usr/include/netinet/in.h:837: syntax error before `struct' on

2001-01-17 Thread Gilles Detillieux


According to Ramesh Veema:
 While I do a make on my application ported to Solaris 8,
 in the middle of the make I get the following error
 when a C file tries to include netinet/in.h. I have
 doubts since this header file supports IPv6 as well, so I am
 not clear how to correct this error. Please help me if anyone
 has come across this error.
 
 
 /usr/include/netinet/in.h:837: syntax error before `struct'
 /usr/include/netinet/in.h:838: syntax error before `struct'

What application are you porting here?  This is a mailing list for the
ht://Dig search engine only.  If that's the application in question,
please send in the complete output from ./configure and make, as well
as information on which version you're compiling, and what patches,
if any, you've made.  If you're talking about some other application,
I'm afraid you have the wrong list.

In any case, it sounds like something is overriding a header file
definition in a way that's incompatible with what /usr/include/netinet/in.h
expects.


Re: [htdig] Can't exclude a directory in results

2001-01-17 Thread Gilles Detillieux


Last I heard, Mindspring was still using a rather ancient beta release of
htdig-3.0.8b2, which had numerous bugs.  The exclude parameter handling
didn't work correctly until version 3.1.0b2.  Also, we had problems
with the StringMatch class used to implement the restrict and exclude
parameters to htsearch, among other things, until 3.1.0b4.  That may be
the cause of these problems.  Also, you must make absolutely sure that you
only have one definition of the input parameters "restrict" and "exclude"
in your search form, as versions before 3.1.0b4 didn't handle multiple
parameter definitions for these.  The current stable release is 3.1.5,
which has been out for almost a year now, and fixes these and many, many
other bugs.

According to Dudley Jane:
 TKO,
 
 This isn't exactly what you're doing, but, we have a form to restrict,
 but we couldn't get it to work until we said:
 
 <input type=hidden name=config value=htdig>
 <input type=hidden name=restrict value="www.co.henrico.va.us/hr">
 
 It didn't work if we just said value="/hr" - we had to add the www.etc.
 in front.
 
 JD
 
 
 Carrot-Top Creative wrote:
 I've searched and double checked the instructions on how to exclude a
 directory from results and can't get it to work.  I'm using htdig on
 MindSpring which means I can't do any custom configuration to the server or
 have more than one htdig install.  I need to have 2 different search pages
 that return different results.  I've successfully created a search form that
 uses restrict:
 
 <input type=hidden name=config value="www57080">
 <input type=hidden name=restrict value="/98study/">
 <input type=hidden name=exclude value="">
 
 But I can't get exclude to work using this code to not include a particular
 directory.
 
 <input type=hidden name=config value="www57080">
 <input type=hidden name=restrict value="">
 <input type=hidden name=exclude value="/98study/">
 
 Help what can I do?  tko



Re: [htdig] hidden keywords

2001-01-17 Thread Gilles Detillieux


According to Stephen L Arnold:
 I'm trying to achieve some requested behavior with htdig; i.e., I've
 set up some selections in the search page (using restrict and
 exclude) and the desired behavior is to have the user enter nothing 
 in the search field and have htdig serve up a list of all documents 
 in the directories specified by restrict/exclude.
 
 All documents are Word docs (at the moment) and previously I had 
 added the keyword "doc" to the bottom of the search form (in the 
 hidden keyword field) and it worked.  However, that was when Apache 
 was configured to allow directory indexing (and the indexes would 
 show up in the search results, along with the documents).  I turned 
 off the Apache auto-index stuff, and built a single html file with 
 URLs for all the documents for htdig to do the actual dig.
 
 However, and here's the rub, now I get a boolean search error when 
 I submit a search with no keywords, even if I put more hidden 
 keywords in the search form (that are guaranteed to be in the 
 documents).
 
 The only thing that changed was the Apache auto-index stuff; is 
 there anything I can do to get the behavior I want back again?

I'm not sure how you had it working in the first place, if this was
with 3.1.5.  You must have had some value in the "words" input parameter,
because htsearch 3.1.5 (and earlier) doesn't like it when you have
"keywords" but no "words".  Here's a patch that fixes this:

ftp://ftp.ccsf.org/htdig-patches/3.1.5/any_keywords.0

The Apache auto-index stuff made the keyword "doc" match any .doc file,
because the index uses the file names as link description text for the
link to the file, and with a non-zero description_factor, these words
have weight in the search.


[htdig] Re: Reindex

2001-01-17 Thread Gilles Detillieux


According to Elsa Chan:
 I tried doing that, but only one file gets updated by htdig.
 
 /usr/local/htdig/db/db.docdb is the only file that gets updated.
 
 db.docs.index is still old and db.wordlist.new is created but it has 0 bytes.
 
 When I try to run htmerge it gives me 
 
 htmerge: Unable to open word list file '/usr/local/htdig/db/db.wordlist'

As FAQ 5.16 explains, this happens because htdig didn't index any documents.

 I also tried running htdig -vvv, but I get this:
 
 1:0:http://www.site.net/
 New server: www.site.net, 80
 
 
 I specified a different port in the config file and I put the URL in
 quotes, but it doesn't seem to work properly.
 
 Any ideas?

You can't use quotes in the start_url, because htdig doesn't parse it as
a quoted string list.  See http://www.htdig.org/attrs.html

The port number should be tacked right on to the end of the URL with a
colon, e.g.  start_url: http://www.site.net:8001
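
The "New server: www.site.net, 80" line in the -vvv output suggests the
port never made it into the URL that was parsed.  As a rough illustration
only, using Python's standard URL parser rather than htdig's own code, this
is how a port tacked on with a colon is picked up, and how the default of 80
applies when it is missing.  The helper name is made up for the example.

```python
from urllib.parse import urlsplit


def server_and_port(url: str, default_port: int = 80):
    """Return (host, port), falling back to the default HTTP port."""
    parts = urlsplit(url)
    return parts.hostname, parts.port or default_port


print(server_and_port("http://www.site.net:8001"))
print(server_and_port("http://www.site.net/"))
```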

As for figuring out why it's hanging, and what constitutes a long while,
please see Geoff's response.

 -Original Message-
 From: Gilles Detillieux [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, January 17, 2001 10:18 AM
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Subject: Re: Reindex
 
 
 According to Elsa Chan:
  We just launched a new site, but the search engine is indexing pages that
  don't exist anymore. I think I just need to restart htdig except I don't
  know how. I tried searching for info on the htdig web site but I couldn't
  find anything. Would you be able to help me?
 
 Running the standard "rundig" script will rebuild your database from
 scratch.
 You can also manually run "htdig -i" and "htmerge" to do this.



[htdig] Re: Problem with exclude_url

2001-01-17 Thread Gilles Detillieux


According to [EMAIL PROTECTED]:
 We have htdig 3.1.5.
 We wanted to index a big directory of the SAP documentation.
 The structure is as follows:
 
 directory1
  directory2
   directory3
    directory4
     content.html
     frameset.html
 
 directory1
  directory2
   directory3
    other_directory4
     content.html
     frameset.html
 
 and so on...
 
 We want to exclude all (!) files named frameset.htm in all directories.
 When I set exclude_url: frameset.htm, nothing happened.
 I think that you must use the fully qualified path, but there are so many
 different paths in this case.
 
 I need something like:
 exclude_url: /directory1/directory2/directory3/*/frameset.htm  (the asterisk is
 important)
 Is this possible?

First of all, please see http://www.htdig.org/FAQ.html#q1.16
Such questions should go to the list, not to me personally.  This isn't a
one-man show.

Secondly, could you elaborate on what you mean by "nothing happened"?
Do you mean that htdig didn't index anything, or that the frameset.htm
or frameset.html files were not excluded?  Also, is the above
a typo, or did you really omit the "s" from exclude_urls?  See
http://www.htdig.org/attrs.html for correct spellings of attribute names.

Thirdly, there is no wildcard support for exclude_urls.  In version 3.2,
we're adding support for regular expressions to exclude_urls and other
attributes, which will be like wildcards only more powerful, but with
a somewhat more complicated syntax.  This is still a work in progress,
however.

You shouldn't need wildcards for this case, though, because it's a
pretty simple exclusion you're trying to do here.  However, if the only
links to some of your files, such as the content.html files, are in the
frameset.html, then you may not want to exclude them, or you'll end up
missing a whole lot more besides.  This is why I asked what you mean by
"nothing happened".  If none of the files were indexed, this may be why.
Remember that htdig only follows HTML links from one document to the
next.
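
To show why no wildcard is needed, here is a minimal sketch in Python of
plain substring exclusion, which is how I understand exclude_urls to work:
a URL is rejected if it contains any of the listed patterns.  The function
name and the example host are hypothetical; this is not the htdig source.

```python
def is_excluded(url: str, patterns: list) -> bool:
    """Reject a URL if any pattern occurs anywhere inside it."""
    return any(pattern in url for pattern in patterns)


patterns = ["frameset.htm"]  # also matches frameset.html, as a substring
base = "http://www.example.com/directory1/directory2/directory3/"

print(is_excluded(base + "directory4/frameset.html", patterns))
print(is_excluded(base + "other_directory4/frameset.html", patterns))
print(is_excluded(base + "directory4/content.html", patterns))
```

Because the match is on any part of the URL, the one short pattern covers
every directory without spelling out the paths.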


Re: [htdig] how do you index local pages in 3.1.5?

2001-01-17 Thread Gilles Detillieux


According to Jon Beyer:
 This is probably a really easy thing, but I can't get
 htdig to index HTML from my hard drive.  I tried
 setting start_url to file:/, but that didn't work
 and I played around with local_urls_only and
 local_urls but couldn't get it to work.  Any advice is
 greatly appreciated.  Thanks.

htdig 3.1.5 doesn't handle file:/ URLs, only http://... URLs.  You can
make local_urls work with this style of URL, if the documents are on the
same system as the one on which you run htdig, using a syntax similar to
this example from my system:

start_url:       http://www.scrc.umanitoba.ca/
local_urls:      http://www.scrc.umanitoba.ca/=/home/httpd/html/
local_user_urls: http://www.scrc.umanitoba.ca/=/home/,/public_html/

where /home/httpd/html corresponds to my Apache DocumentRoot setting.

Note that local_urls only indexes a certain limited set of file types,
determined by file extension.  For any other file type, or for directory
URLs where there's no index.html, it falls back to the HTTP server.

See http://www.htdig.org/attrs.html#local_urls
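
As a rough sketch of the idea behind the local_urls mapping, here is the
prefix substitution written out in Python.  The function name is made up,
and the extension list is only a placeholder assumption, not the actual set
that htdig accepts.

```python
def map_local_url(url, local_urls, extensions=(".html", ".htm")):
    """Map a URL to a local file path, or return None to fall back to HTTP.

    local_urls is a list of (url_prefix, path_prefix) pairs.  The default
    extensions tuple is a hypothetical stand-in for the limited set of file
    types that are read locally.
    """
    for url_prefix, path_prefix in local_urls:
        if url.startswith(url_prefix) and url.endswith(extensions):
            return path_prefix + url[len(url_prefix):]
    return None


mapping = [("http://www.scrc.umanitoba.ca/", "/home/httpd/html/")]
print(map_local_url("http://www.scrc.umanitoba.ca/docs/a.html", mapping))
print(map_local_url("http://www.scrc.umanitoba.ca/docs/", mapping))
```

The second call returns None because a directory URL has no recognised
extension, which corresponds to the fallback to the HTTP server.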


Re: [htdig] solaris 2.6 and htdig 3.1.5

2001-01-17 Thread Gilles Detillieux


According to Ronald Edward Petty:
 Sorry for the questions. I just thought someone said that there is a
 shared memory problem on Solaris with htdig, and that you should
 use a certain linker (namely GNU) instead of the Solaris one.

I think you're confusing two unrelated responses to other problems.

There's a problem with shared library support, not shared memory,
and it affects Solaris systems running the 3.2.0 betas of htdig only.
It is resolved by using the --disable-shared option to ./configure.
This doesn't affect 3.1.5, because it's not a problem with the standard
C++ libraries, but only the libraries built in the htdig package.
3.1.5 doesn't build any shared libraries.

I doubt anyone recommended using a GNU linker.  We often recommend
using GNU make if there are problems with some of htdig's Makefiles
on some platforms.  If GNU makes a linker for Solaris, it's the first
I hear about it.  Usually, the GNU compilers will use the linker that
comes with the target operating system, if I understand things correctly.

  However, with the makefile
 that htdig comes with I cannot really figure out if there is a
 certain linker to use. This is an htdig thing, not gcc.  If I use the proper
 compiler and linker etc. then it's a GNU problem... that is the question,
 and I have not found an answer on the htdig site about what is the
 proper setup of compiler, linker, assembler.   I will ask the GNU people
 and see if this makes any sense to them. Thanks for the help.

It's very, very rare to call the linker or assembler directly from a
Makefile for standard C or C++ code.  The htdig Makefiles certainly
don't attempt to do this!  Generally, for linking C programs, it's
the C or C++ front-end (e.g., cc, gcc, c++, g++) that gets called,
and this front-end is preconfigured to call the correct linker and
pass it all the required libraries.  Problems such as you reported
are a symptom of a mis-configured front-end to the C or C++ compilers,
or incompatible libraries, or both.  So, this is indeed a gcc thing.

The same goes for the assembler: gcc will typically call the first and
second stage back-end compilers for a .c file, to create a temporary .s
file, and then call the assembler to assemble it into a .o file.

If you can compile and link other C++ programs, it may be a problem with
the libraries your htdig Makefiles are telling the compiler to use,
but it could also be that your other programs are simple enough that
they don't run into similar compatibility issues.  In any case, the
htdig Makefiles don't call the linker directly.  If they call the wrong
front-end compiler, or use the wrong libraries or library directories,
you may need to change that in your Makefile.config, but this would
be a problem specific to your installation.  Lots of users have built
htdig successfully on Solaris, with nothing like the sort of errors you
reported occurring.

It might help to compare the options to gcc or g++, or whatever front-end
is used for linking the .o's and .a's in htdig, to the options used
for C++ programs you were able to link successfully.  Especially the -l
and -L options.  That might point the way to the problem you're having.
E.g., if you have different versions of libstdc++ or libg++ in different
directories, don't point g++ to an incompatible version with one of your
-L options!
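
If comparing the link commands by eye is tedious, something like this
small Python sketch could pull out the -L and -l options and show the
differences.  The function names and the "working" command are made up for
illustration; the "failing" command is shortened from the htfuzzy link line
reported earlier.

```python
import shlex


def link_options(command: str) -> list:
    """Return the -L and -l options of a link command, in order."""
    return [tok for tok in shlex.split(command) if tok.startswith(("-L", "-l"))]


def diff_options(failing: str, working: str) -> dict:
    """List the library options that appear in only one of two commands."""
    f, w = link_options(failing), link_options(working)
    return {
        "only_in_failing": [opt for opt in f if opt not in w],
        "only_in_working": [opt for opt in w if opt not in f],
    }


failing = ("c++ -o htfuzzy -L../htlib -L../htcommon -L../db/dist -L/usr/lib "
           "htfuzzy.o ../htlib/libht.a -lz -lnsl -lsocket")
working = "c++ -o hello hello.o -lnsl -lsocket"  # hypothetical working link
print(diff_options(failing, working))
```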


Re: [htdig] Problems compiling 3.20b2

2001-01-16 Thread Gilles Detillieux


According to richard:
 Compiling went fine, after I installed zlib-1.1.3.
 But running rundig -c my.conf:
 
 Arithmetic Exception - core dumped
 
 core file from htfuzzy.

Oh, right.  On Solaris, you must use the --disable-shared option on
./configure to avoid this problem.  We still haven't gotten to the
bottom of this, but C++ objects in shared libraries don't seem to get
initialized properly on Solaris, causing this error.  Avoiding shared
libraries for ht://Dig's C++ classes avoids this problem.


Re: [htdig] Avoiding search on file name

2001-01-16 Thread Gilles Detillieux


Or perhaps, if I understand correctly, setting description_factor to 0
and reindexing would be the way to avoid this.  If you point htdig to
a directory that doesn't contain an index.html or equivalent file, and
the web server automatically generates the directory index, then the file
names will be used as link description text for the links to those files.
If that's what is happening here, then you want to tell htdig not to
put any weight on the words appearing in link description text, as above.

According to Peterman, Timothy P:
 I think setting "title_factor" to 0 in the config file will do that.
 You'll probably need to reindex for that change to take effect.
 
 Loys Masquelier wrote:
  
  Hello,
  
  I have a problem in indexing a file hierarchy. Htdig by default indexes
  all the names of all the files. When I search for a word, if that word
  is found in a file name, htsearch return the file path. But I only want
  files which contain the given word.
  
  Is there a way to avoid that file name indexing ?
  
  Thanks in advance.
  
  Best regards.
  
  Loys.
 
 -- 
 Tim Peterman - Web Master,
 ITP Unix Support Group Technical Lead
 Lockheed Martin EIS/NESS, Moorestown, NJ
 
 
 



Re: [htdig] Problem with PDF files....

2001-01-16 Thread Gilles Detillieux


According to Elijah Kagan:
 Gilles,
 
 I greatly appreciate your help! Thanks!
 
 There are two parameters in Apache config file that tell it to add a
 charset field by default. They are: AddDefaultCharset and
 AddDefaultCharsetName. The first one should be set to off to prevent
 Apache from replying with a charset field set after the content type.
 
 After disabling AddDefaultCharset htdig worked as expected.
 
 Thanks again,
 
 Elijah

That's good to know.  However, I'd like to know if the 2nd patch I sent
you yesterday fixes the problem, even with AddDefaultCharset enabled.
If it's not too much bother, would you mind giving it a try and letting
me know?  Thanks.

 On Mon, 15 Jan 2001, Gilles Detillieux wrote:
 
  According to Elijah Kagan:
   I run htdig 3.1.5.
   I tried both the Debian package and a compiled one with the same result.
   I am absolutely sure there is something stupid I forgot to put into the
   configuration.
   
   Attached is the config file.
   
   Thanks for your help.
   
   Elijah
   
   
   On Fri, 12 Jan 2001, Gilles Detillieux wrote:
   
    According to Elijah Kagan:
     1. I run htdig with an explicit -c option, so it uses the correct conf
     file.
     2. I rewrote the external_parsers so it includes only one line...
     3. ..and it is the first line in the file
     
     Results are the same! It is still looking for an acroread!
     
     Please, help. I am getting desperate...
    
    Hmm.  You're sure you're running version 3.1.5 of htdig, and you
    don't have a pre-3.1.4 binary of htdig kicking around that you might be
    unknowingly running instead?  External converter support was added to the
    external_parsers attribute only in version 3.1.4 and above.  If you're
    sure this isn't the problem either, please send me a copy of your conf
    file as it stands now (preferably uuencoded right on your htdig box to
    prevent e-mail mangling of it), and I'll have a look and try a test or two.
    
    Oh, another thing.  You mentioned this was on a Debian system.  Did you
    compile htdig yourself, or did you use a pre-compiled binary?  If the
    latter, which one?
  
  OK, it took a while, but the light finally came on!  If you look up the
  following thread on the mailing list archives:
  
  http://www.htdig.org/mail/2000/09/index.html#75
  
  you'll see that the bug has come up before.  I think there's something
  about the Debian configuration for Apache that causes it to add the
  "; charset=..." string to the Content-Type header, which is the source
  of the problem here.  At least I strongly suspect it must be the same
  problem, as I can't see anything else that would explain the behaviour
  you're reporting.  If you run htdig -vvv -i -c ..., you can then look
  at the header lines returned by your server for the PDF files, and see
  if the Content-Type header does indeed have something on the line after
  the application/pdf string.
  
  Geoff and I made some hacks to ExternalParser.cc in the 3.2.0b3
  development code to address this, but none of this has been backported
  to 3.1.5 yet.  I'll see if I can backport some or all of the external
  parser patches to 3.1.5 in the next day or two.  In the meantime,
  you can try working around this either by using local_urls, if you're
  running htdig on the same machine as your Apache server, or by using
  the same hack that Klaus used, i.e. add a line like the following to
  your external_parsers definition.
  
  "application/pdf; charset=iso-8859-1->text/html" /usr/share/htdig/conv_doc.pl
  



Re: [htdig] Problem with PDF files....

2001-01-15 Thread Gilles Detillieux


According to Elijah Kagan:
 I run htdig 3.1.5.
 I tried both the Debian package and a compiled one with the same result.
 I am absolutely sure there is something stupid I forgot to put into the
 configuration.
 
 Attached is the config file.
 
 Thanks for your help.
 
 Elijah
 
 
 On Fri, 12 Jan 2001, Gilles Detillieux wrote:
 
  According to Elijah Kagan:
   1. I run htdig with an explicit -c option, so it uses the correct conf
   file.
   2. I rewrote the external_parsers so it includes only one line...
   3. ..and it is the first line in the file
   
   Results are the same! It is still looking for an acroread!
   
   Please, help. I am getting desperate...
  
  Hmm.  You're sure you're running version 3.1.5 of htdig, and you
  don't have a pre-3.1.4 binary of htdig kicking around that you might be
  unknowingly running instead?  External converter support was added to the
  external_parsers attribute only in version 3.1.4 and above.  If you're
  sure this isn't the problem either, please send me a copy of your conf
  file as it stands now (preferably uuencoded right on your htdig box to
  prevent e-mail mangling of it), and I'll have a look and try a test or two.
  
  Oh, another thing.  You mentioned this was on a Debian system.  Did you
  compile htdig yourself, or did you use a pre-compiled binary?  If the
  latter, which one?

OK, it took a while, but the light finally came on!  If you look up the
following thread on the mailing list archives:

http://www.htdig.org/mail/2000/09/index.html#75

you'll see that the bug has come up before.  I think there's something
about the Debian configuration for Apache that causes it to add the
"; charset=..." string to the Content-Type header, which is the source
of the problem here.  At least I strongly suspect it must be the same
problem, as I can't see anything else that would explain the behaviour
you're reporting.  If you run htdig -vvv -i -c ..., you can then look
at the header lines returned by your server for the PDF files, and see
if the Content-Type header does indeed have something on the line after
the application/pdf string.

Geoff and I made some hacks to ExternalParser.cc in the 3.2.0b3
development code to address this, but none of this has been backported
to 3.1.5 yet.  I'll see if I can backport some or all of the external
parser patches to 3.1.5 in the next day or two.  In the meantime,
you can try working around this either by using local_urls, if you're
running htdig on the same machine as your Apache server, or by using
the same hack that Klaus used, i.e. add a line like the following to
your external_parsers definition.

"application/pdf; charset=iso-8859-1->text/html" /usr/share/htdig/conv_doc.pl
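
As a rough illustration of why the "; charset=..." suffix matters (a
minimal sketch, not the actual ht://Dig code): if the whole header value,
parameters included, is used as the lookup key, it will not match a plain
"application/pdf" entry.  Reducing the value to its bare MIME type first
avoids that.  The function name bareMimeType is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical helper (not part of ht://Dig): reduce a Content-Type header
// value to its bare MIME type by lowercasing it and cutting at the first ';'.
std::string bareMimeType(const std::string &contentType)
{
    std::string mime = contentType;
    std::transform(mime.begin(), mime.end(), mime.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });

    // Drop any parameters such as "; charset=iso-8859-1".
    std::string::size_type sep = mime.find(';');
    if (sep != std::string::npos)
        mime.erase(sep);

    // Trim whitespace that may have preceded the ';'.
    while (!mime.empty() && std::isspace(static_cast<unsigned char>(mime.back())))
        mime.pop_back();

    return mime;
}
```

With this, "application/pdf; charset=iso-8859-1" and "application/pdf"
reduce to the same key, so a single external_parsers entry would cover both.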


Re: [htdig] Phrases

2001-01-15 Thread Gilles Detillieux


According to Bill Vick:
 We have tried both the current and beta versions and
 are having problems getting the phrase search to work
 correctly and consistently. Any patches or should we
 hang tight for the next version?

What do you mean by current?  If you mean the current stable release,
3.1.5, it does not support phrase searching, as explained in FAQ 1.9.
The 3.2.0b2 beta, the last one released, has a number of known bugs.
The upcoming 3.2.0b3 beta should be much more reliable than the last
beta.  You can wait for it, or you can try the latest development
snapshot of it...

   http://www.htdig.org/files/snapshots/htdig-3.2.0b3-011401.tar.gz


Re: [htdig] Problems compiling 3.20b2

2001-01-15 Thread Gilles Detillieux


According to Richard van Drimmelen:
 I'm trying to compile 3.20b2 on a Sparc Solaris 7 machine with gcc
 2.95.2
 
 During 'make':
 
 ld: warning: symbol `Object type_info node' has differing alignments:
         (file Endings.o value=0x8; file ../htlib/libht.a(StringMatch.o) value=0x4);
         largest value applied
 Undefined                       first referenced
  symbol                             in file
 __eh_pc                             Endings.o
 ld: fatal: Symbol referencing errors. No output written to htfuzzy
 collect2: ld returned 1 exit status
 make[1]: *** [htfuzzy] Error 1
 
 Any suggestions ?

I can't say for sure that the next beta will solve this problem, but could
you please try the latest development snapshot of it to see if it does?
The 3.2.0b2 beta has a number of known bugs and many compilation problems
that are fixed in the upcoming 3.2.0b3 beta.  You can try the latest
development snapshot of it at...

   http://www.htdig.org/files/snapshots/htdig-3.2.0b3-011401.tar.gz

In either case, let us know whether or not it solves this problem, so we
can know if it still needs fixing before releasing it.


[htdig] PATCH: backport ExternalParser.cc from 3.2.0b3 to 3.1.5

2001-01-15 Thread Gilles Detillieux


According to Elijah Kagan:
 I run htdig 3.1.5.
 I tried both the Debian package and a compiled one with the same result.
 I am absolutely sure there is something stupid I forgot to put into the
 configuration.

OK, after getting to the bottom of this (I think!), I have backported
the 3.2.0b3 development code for htdig/ExternalParser.cc to version
3.1.5, to fix this and other problems.  Please give this patch file
a try and let me know if it works.  You will probably get a warning
about the wait() function being implicitly declared, unless you manually
define HAVE_WAIT_H or HAVE_SYS_WAIT_H (depending on whether your system
has wait.h or sys/wait.h).  Also, if your system has the mkstemp()
function, you may want to define HAVE_MKSTEMP manually as well, as this
will enhance security.  I didn't have time to figure out how to patch
aclocal.m4 and configure to add tests for all of these.
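
To illustrate what the mkstemp() option buys you, here is a minimal sketch
(not the code from the patch) of creating a temporary file with mkstemp()
instead of building a predictable name by hand.  The helper name
openTempFile and the "htdext" file name prefix are made up for the example.

```cpp
#include <cassert>
#include <string>
#include <vector>
#include <stdlib.h>   // mkstemp() (POSIX)
#include <unistd.h>   // close(), unlink()

// Hypothetical helper: create a uniquely named temporary file in 'dir'.
// Returns an open file descriptor, or -1 on failure.  On success the
// generated file name is stored in 'path'.
int openTempFile(const std::string &dir, std::string &path)
{
    // mkstemp() rewrites the trailing "XXXXXX" in place, so it needs a
    // writable, NUL-terminated buffer rather than a const string.
    std::string tmpl = dir + "/htdext.XXXXXX";
    std::vector<char> buf(tmpl.begin(), tmpl.end());
    buf.push_back('\0');

    // The file is created exclusively (O_EXCL), so a symlink planted at
    // the chosen name makes the attempt fail instead of being followed.
    int fd = mkstemp(buf.data());
    if (fd != -1)
        path.assign(buf.data());
    return fd;
}
```

The caller is responsible for close() and unlink() once the external
program has finished with the file.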

The patch fixes the following problems in external_parsers support in
3.1.5:
  - it got confused by "; charset=..." in the Content-Type header,
    as described in "http://www.htdig.org/mail/2000/09/index.html#75".
  - security problems with using popen(), and therefore the shell,
    to parse URL and content-type strings from untrusted sources
    (now uses pipe/fork/exec instead of popen) - PR#542, PR#951.
  - used predictable temporary file name, which could be exploited
    via symlinks - fixed if mkstemp() exists and HAVE_MKSTEMP is defined.
  - binary output from an external converter could get mangled.
  - error messages were sometimes ambiguous or missing altogether.
  - didn't open temporary file in binary mode for non-Unix systems
    (attempts were made to fix this, but it's not clear yet whether
     the security fixes and pipe/fork/exec will port well to Cygwin).
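
The popen() point in the list above can be sketched as follows.  This is
an illustrative example only, not the code from the patch: it runs a
program with an explicit argument vector via pipe/fork/exec, so that no
shell ever sees the arguments.  The function name runAndCapture is
hypothetical.

```cpp
#include <cassert>
#include <string>
#include <vector>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Hypothetical helper: run args[0] with the given arguments, without a
// shell, and collect its standard output.  Returns the child's exit
// status, or -1 if it could not be run or did not exit normally.
int runAndCapture(const std::vector<std::string> &args, std::string &output)
{
    if (args.empty())
        return -1;

    int fds[2];
    if (pipe(fds) == -1)
        return -1;

    pid_t pid = fork();
    if (pid == -1) {
        close(fds[0]);
        close(fds[1]);
        return -1;
    }

    if (pid == 0) {
        // Child: send stdout into the pipe, then replace the process image.
        close(fds[0]);
        if (fds[1] != STDOUT_FILENO) {
            dup2(fds[1], STDOUT_FILENO);
            close(fds[1]);
        }

        std::vector<char *> argv;
        for (const std::string &a : args)
            argv.push_back(const_cast<char *>(a.c_str()));
        argv.push_back(nullptr);

        execv(argv[0], argv.data());
        _exit(127);                 // only reached if execv() failed
    }

    // Parent: read everything the child writes, then reap it.
    close(fds[1]);
    char buf[4096];
    ssize_t n;
    while ((n = read(fds[0], buf, sizeof buf)) > 0)
        output.append(buf, static_cast<size_t>(n));
    close(fds[0]);

    int status = 0;
    if (waitpid(pid, &status, 0) == -1)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

Because each argument is passed to execv() as-is, characters such as ';',
'$' or '`' in a URL or content-type string are just data, which is the
property popen() cannot give you.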

Here's the patch, which you can apply in the main source directory for
htdig-3.1.5 using "patch -p0 < this-file":

--- htdig/ExternalParser.cc.orig	Thu Feb 24 20:29:10 2000
+++ htdig/ExternalParser.cc	Mon Jan 15 13:18:47 2001
@@ -1,14 +1,24 @@
 //
 // ExternalParser.cc
 //
-// Implementation of ExternalParser
-// Allows external programs to parse unknown document formats.
-// The parser is expected to return the document in a specific format.
-// The format is documented in http://www.htdig.org/attrs.html#external_parser
+// ExternalParser: Implementation of ExternalParser
+// Allows external programs to parse unknown document formats.
+// The parser is expected to return the document in a 
+// specific format. The format is documented 
+// in http://www.htdig.org/attrs.html#external_parser
 //
-#if RELEASE
-static char RCSid[] = "$Id: ExternalParser.cc,v 1.9.2.3 1999/11/24 02:14:09 grdetil Exp $";
-#endif
+// Part of the ht://Dig package   http://www.htdig.org/
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+// http://www.gnu.org/copyleft/gpl.html
+//
+// $Id: ExternalParser.cc,v 1.9.2.4 2001/01/15 13:18:47 grdetil Exp $
+//
+
+#ifdef HAVE_CONFIG_H
+#include "htconfig.h"
+#endif /* HAVE_CONFIG_H */
 
 #include "ExternalParser.h"
 #include "HTML.h"
@@ -19,9 +29,18 @@ static char RCSid[] = "$Id: ExternalPars
 #include "QuotedStringList.h"
 #include "URL.h"
 #include "Dictionary.h"
+#include "good_strtok.h"
+
 #include <ctype.h>
 #include <stdio.h>
-#include "good_strtok.h"
+#include <unistd.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#ifdef HAVE_WAIT_H
+#include <wait.h>
+#elif HAVE_SYS_WAIT_H
+#include <sys/wait.h>
+#endif
 
 static Dictionary  *parsers = 0;
 static Dictionary  *toTypes = 0;
@@ -32,9 +51,18 @@ extern String configFile;
 //
 ExternalParser::ExternalParser(char *contentType)
 {
+  String mime;
+  int sep;
+
 if (canParse(contentType))
 {
-   currentParser = ((String *)parsers->Find(contentType))->get();
+String mime = contentType;
+   mime.lowercase();
+   sep = mime.indexOf(';');
+   if (sep != -1)
+ mime = mime.sub(0, sep).get();
+   
+   currentParser = ((String *)parsers->Find(mime))->get();
 }
 ExternalParser::contentType = contentType;
 }
@@ -89,6 +117,8 @@ ExternalParser::readLine(FILE *in, Strin
 int
 ExternalParser::canParse(char *contentType)
 {
+  int  sep;
+
 if (!parsers)
 {
parsers = new Dictionary();
@@ -97,7 +127,6 @@ ExternalParser::canParse(char *contentTy
QuotedStringList qsl(config["external_parsers"], " \t");
String  from, to;
int i;
-   int sep;
 
for (i = 0; qsl[i]; i += 2)
{
@@ -109,11 +138,22 @@ ExternalParser::canParse(char *contentTy
to = from.sub(sep+2).get();
from = from.sub(0, sep).get();
}
+   from.lowercase();
+   sep = from.indexOf(';');
+   if (sep != -1)

Re: [htdig] make error on solaris 2.6

2001-01-15 Thread Gilles Detillieux


According to Ronald Edward Petty:
 When I was doing make I got this error for DocumentDB.cc and I did a work
 around doing this, but then I type make again and it gets past
 DocumentDB.cc and does this for the next file... Is there something wrong
 with my shell or something...  I dont feel like typing
 #!/usr/bin/tcsh
 
 setenv BIN_DIR /export/netapp/user/rpy/htdig/bin
 setenv DCOMMON_DIR "/export/netapp/user/rpy/htdig/common"
 setenv DCONFIG_DIR "/export/netapp/user/rpy/htdig/conf"
 setenv DATABASE_DIR "/export/netapp/user/rpy/htdig/db"
 setenv IMAGE_URL_PREFIX "/export/netapp/user/rpy/htdig/images"
 setenv PDF_PARSER "/usr/local/bin/acroread"
 setenv SORT_PROG "/bin/sort"
 setenv DEFAULT_CONFIG_FILE "/export/netapp/user/rpy/htdig/conf/htdig.conf"
 
 
 c++ -c -DBIN_DIR -DCOMMON_DIR -DCONFIG_DIR -DDATABASE_DIR
 -DIMAGE_URL_PREFIX -DPDF_PARSER -DSORT_PROG -DDEFAULT_CONFIG_FILE
 -I../htlib -I../htcommon -I../db/dist -I../include -g -O2 DocumentDB.cc
 
 
 -
 Any idea why this top thing worked but the other doesn't
 -
 
 
 ares:/export/netapp/user/rpy/htdig-3.1.5/ make
 make[1]: Entering directory `/export/netapp/user/rpy/htdig-3.1.5/db/dist'
 make[1]: Nothing to be done for `all'.
 make[1]: Leaving directory `/export/netapp/user/rpy/htdig-3.1.5/db/dist'
 make[1]: Entering directory `/export/netapp/user/rpy/htdig-3.1.5/htlib'
 make[1]: Nothing to be done for `all'.
 make[1]: Leaving directory `/export/netapp/user/rpy/htdig-3.1.5/htlib'
 make[1]: Entering directory `/export/netapp/user/rpy/htdig-3.1.5/htcommon'
 c++ -c -DBIN_DIR=\"/export/netapp/user/rpy/htdig/bin\"
 -DCOMMON_DIR=\"/export/netapp/user/rpy/htdig/common\"
 -DCONFIG_DIR=\"/export/netapp/user/rpy/htdig/conf\"
 -DDATABASE_DIR=\"/export/netapp/user/rpy/htdig/db\"
 -DIMAGE_URL_PREFIX=\"/export/netapp/user/rpy/htdig/images \"

I think the problem is right here, at the end of the line above: there seems
to be a space (or maybe a control character) just before the closing \" in
your definition for IMAGE_URL_PREFIX, which is messing things up.

 -DPDF_PARSER=\"/usr/local/bin/acroread\" -DSORT_PROG=\"/bin/sort\"
 -DDEFAULT_CONFIG_FILE=\"/export/netapp/user/rpy/htdig/conf/htdig.conf\"
 -I../htlib -I../htcommon -I../db/dist -I../include -g -O2 DocumentRef.cc
 c++: ": No such file or directory
 DocumentRef.cc:0: unterminated string or character constant
 DocumentRef.cc:0: possible real start of unterminated constant
 make[1]: *** [DocumentRef.o] Error 1
 make[1]: Leaving directory `/export/netapp/user/rpy/htdig-3.1.5/htcommon'
 make: *** [all] Error 1
 ares:/export/netapp/user/rpy/htdig-3.1.5/

By the way, I think you may be misunderstanding what the IMAGE_URL_PREFIX
is supposed to be.  It's supposed to be an URL path, relative to the
DocumentRoot of your web server, not relative to your system's root
directory.  It's the IMAGE_DIR that is relative to the system's root
directory, but it must point to a directory that will be somewhere under
the DocumentRoot, so that the installed image files can be accessed by
web clients.

E.g., on my system, IMAGE_DIR is set to /home/httpd/html/htdig, and
my Apache configuration sets DocumentRoot to /home/httpd/html, so my
IMAGE_URL_PREFIX is simply "/htdig".
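
The relationship between the three settings can be sketched in code.  This
is only an illustration under the assumption that images are served
straight from under DocumentRoot (no Alias directives or symlinks); the
function name imageUrlPrefix is hypothetical and not part of ht://Dig.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: derive the URL prefix for images from the web
// server's DocumentRoot and the file system directory they are installed
// in.  Returns "" if imageDir does not lie under docRoot.
std::string imageUrlPrefix(std::string docRoot, std::string imageDir)
{
    // Ignore trailing slashes (and stray trailing spaces) in either path.
    auto strip = [](std::string &s) {
        while (s.size() > 1 && (s.back() == '/' || s.back() == ' '))
            s.pop_back();
    };
    strip(docRoot);
    strip(imageDir);

    if (docRoot == "/")
        return imageDir;

    // imageDir must start with docRoot ...
    if (imageDir.compare(0, docRoot.size(), docRoot) != 0)
        return "";

    // ... and the match must end on a path component boundary.
    std::string rest = imageDir.substr(docRoot.size());
    if (rest.empty())
        return "/";
    if (rest[0] != '/')
        return "";
    return rest;
}
```

For the values in my example, imageUrlPrefix("/home/httpd/html",
"/home/httpd/html/htdig") gives "/htdig", while a directory outside the
DocumentRoot gives an empty string, meaning web clients could not reach it.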


Re: [htdig] NEED HELP with indexing

2001-01-15 Thread Gilles Detillieux


On Mon, 15 Jan 2001, George Roberts wrote:
  I'm completely new to this software, but inherited a large site which
  uses it.  I made a simple change to some javascript on one of the
  indexed pages, and I have NO CLUE how to reindex the whole site.  Could
  someone please help?

According to Geoff Hutchison:
 It depends a lot on how the original person installed it and your system.
 But usually there's a program "rundig" that creates the databases. Many
 people just hack this script to fit local needs, others create local
 versions (e.g. mine is "rundig.sh"). Of course the best thing to do is to
 write a version that can be run through the cron program which ensures the
 indexes are updated on a regular basis automatically. But I digress.
 
 So first, I'd suggest finding the directory containing the databases, e.g.
 
 locate db.wordlist
 
 Next, make a backup of the files in there. Then see if you can find the
 rundig script. If so, look for any evidence of a local version with a
 possibly newer date. If the rundig script you have mentions "alt"
 somewhere in it, try running "rundig -a" which will update the databases
 using alternate .work files.
 
 That should get you started in the right direction.

But bear in mind that htdig does not index JavaScript, so your
changes to the JavaScript on one of the indexed pages may not
have any effect at all on searches even after you reindex.
See http://www.htdig.org/FAQ.html#q5.18


[htdig] PATCH correction: backport ExternalParser.cc from 3.2.0b3 to 3.1.5

2001-01-15 Thread Gilles Detillieux


I discovered some problems with the argument handling in the patch I posted
earlier today.  Please ignore that one and apply this one instead...

According to Elijah Kagan:
 I run htdig 3.1.5.
 I tried both the Debian package and a compiled one with the same result.
 I am absolutely sure there is something stupid I forgot to put into the
 configuration.

OK, after getting to the bottom of this (I think!), I have backported
the 3.2.0b3 development code for htdig/ExternalParser.cc to version
3.1.5, to fix this and other problems.  Please give this patch file
a try and let me know if it works.  You will probably get a warning
about the wait() function being implicitly declared, unless you manually
define HAVE_WAIT_H or HAVE_SYS_WAIT_H (depending on whether your system
has wait.h or sys/wait.h).  Also, if your system has the mkstemp()
function, you may want to define HAVE_MKSTEMP manually as well, as this
will enhance security.  I didn't have time to figure out how to patch
aclocal.m4 and configure to add tests for all of these.

The patch fixes the following problems in external_parsers support in
3.1.5:
  - it got confused by "; charset=..." in the Content-Type header,
    as described in "http://www.htdig.org/mail/2000/09/index.html#75".
  - security problems with using popen(), and therefore the shell,
    to parse URL and content-type strings from untrusted sources
    (now uses pipe/fork/exec instead of popen) - PR#542, PR#951.
  - used predictable temporary file name, which could be exploited
    via symlinks - fixed if mkstemp() exists and HAVE_MKSTEMP is defined.
  - binary output from an external converter could get mangled.
  - error messages were sometimes ambiguous or missing altogether.
  - didn't open temporary file in binary mode for non-Unix systems
    (attempts were made to fix this, but it's not clear yet whether
     the security fixes and pipe/fork/exec will port well to Cygwin).

Here's the patch, which you can apply in the main source directory for
htdig-3.1.5 using "patch -p0 < this-file":

--- htdig/ExternalParser.cc.orig	Thu Feb 24 20:29:10 2000
+++ htdig/ExternalParser.cc	Mon Jan 15 17:16:50 2001
@@ -1,14 +1,24 @@
 //
 // ExternalParser.cc
 //
-// Implementation of ExternalParser
-// Allows external programs to parse unknown document formats.
-// The parser is expected to return the document in a specific format.
-// The format is documented in http://www.htdig.org/attrs.html#external_parser
+// ExternalParser: Implementation of ExternalParser
+// Allows external programs to parse unknown document formats.
+// The parser is expected to return the document in a 
+// specific format. The format is documented 
+// in http://www.htdig.org/attrs.html#external_parser
 //
-#if RELEASE
-static char RCSid[] = "$Id: ExternalParser.cc,v 1.9.2.3 1999/11/24 02:14:09 grdetil Exp $";
-#endif
+// Part of the ht://Dig package   http://www.htdig.org/
+// Copyright (c) 1995-2001 The ht://Dig Group
+// For copyright details, see the file COPYING in your distribution
+// or the GNU Public License version 2 or later
+// http://www.gnu.org/copyleft/gpl.html
+//
+// $Id: ExternalParser.cc,v 1.9.2.4 2001/01/15 17:16:50 grdetil Exp $
+//
+
+#ifdef HAVE_CONFIG_H
+#include "htconfig.h"
+#endif /* HAVE_CONFIG_H */
 
 #include "ExternalParser.h"
 #include "HTML.h"
@@ -19,9 +29,18 @@ static char RCSid[] = "$Id: ExternalPars
 #include "QuotedStringList.h"
 #include "URL.h"
 #include "Dictionary.h"
+#include "good_strtok.h"
+
 #include <ctype.h>
 #include <stdio.h>
-#include "good_strtok.h"
+#include <unistd.h>
+#include <stdlib.h>
+#include <fcntl.h>
+#ifdef HAVE_WAIT_H
+#include <wait.h>
+#elif HAVE_SYS_WAIT_H
+#include <sys/wait.h>
+#endif
 
 static Dictionary  *parsers = 0;
 static Dictionary  *toTypes = 0;
@@ -32,9 +51,18 @@ extern String configFile;
 //
 ExternalParser::ExternalParser(char *contentType)
 {
+  String mime;
+  int sep;
+
 if (canParse(contentType))
 {
-   currentParser = ((String *)parsers->Find(contentType))->get();
+String mime = contentType;
+   mime.lowercase();
+   sep = mime.indexOf(';');
+   if (sep != -1)
+ mime = mime.sub(0, sep).get();
+   
+   currentParser = ((String *)parsers->Find(mime))->get();
 }
 ExternalParser::contentType = contentType;
 }
@@ -89,6 +117,8 @@ ExternalParser::readLine(FILE *in, Strin
 int
 ExternalParser::canParse(char *contentType)
 {
+  int  sep;
+
 if (!parsers)
 {
parsers = new Dictionary();
@@ -97,7 +127,6 @@ ExternalParser::canParse(char *contentTy
QuotedStringList qsl(config["external_parsers"], " \t");
String  from, to;
int i;
-   int sep;
 
for (i = 0; qsl[i]; i += 2)
{
@@ -109,11 +138,22 @@ ExternalParser::canParse(char *contentTy
to = from.sub(sep+2).get();

Re: [htdig] Problem with PDF files....

2001-01-12 Thread Gilles Detillieux


According to Elijah Kagan:
 1. I run htdig with an explicit -c option, so it uses the correct conf
 file.
 2. I rewrote the external_parsers so it includes only one line...
 3. ..and it is the first line in the file
 
 Results are the same! It is still looking for an acroread!
 
 Please, help. I am getting desperate...

Hmm.  You're sure you're running version 3.1.5 of htdig, and you
don't have a pre-3.1.4 binary of htdig kicking around that you might be
unknowingly running instead?  External converter support was added to the
external_parsers attribute only in version 3.1.4 and above.  If you're
sure this isn't the problem either, please send me a copy of your conf
file as it stands now (preferably uuencoded right on your htdig box to
prevent e-mail mangling of it), and I'll have a look and try a test or two.

Oh, another thing.  You mentioned this was on a Debian system.  Did you
compile htdig yourself, or did you use a pre-compiled binary?  If the
latter, which one?


Re: [htdig] 3.1.3 engine on 3.1.5 db

2001-01-12 Thread Gilles Detillieux


According to Dave Salisbury:
  If
  you created your database with htdig 3.1.5, and want to search it with
  htsearch 3.1.3, that's a bad idea.  The most glaring bug in releases
  before 3.1.5 is in htsearch, so you really should upgrade it.
 
 I take it one of the worst things is the security hole which allows
 a user to view any file with read permissions ( ouch! )

That's the one!

 Is there any way to correct for this with a wrapper around htsearch?
 Reading the indices using 3.1.3 that were created by a 3.1.5 engine
 seems to work just fine.

There would be, but it might be a tad tricky.  The idea is to use a
backslash to quote any left quote (`), dollar sign ($) or backslash
(\) in the query string that is part of an input parameter value that
will get added to the config object as an internal attribute setting.
The lines in htsearch/htsearch.cc that do this are (from a grep):

config.Add("match_method", input["method"]);
config.Add("template_name", input["format"]);
config.Add("matches_per_page", input["matchesperpage"]);
config.Add("config", input["config"]);
config.Add("restrict", input["restrict"]);
config.Add("exclude", input["exclude"]);
config.Add("keywords", input["keywords"]);
config.Add("sort", input["sort"]);
config.Add(form_vars[i], input[form_vars[i]]);

The last one above is the tricky one, as it can be any input parameter
name that you use in allow_in_form.  Rather than limiting the backslash
escaping of special characters to only the values of these parameters,
it might be better to do the whole query string, but exclude a few
parameters where this might be undesirable.  I'd recommend NOT doing
this for the "words" input parameter, for instance, but I can't think
of any others right off-hand where you would not want to do this.
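
The escaping step itself is simple; a minimal sketch follows, with a
made-up function name (escapeForConfig).  It assumes the wrapper has
already split the query string into parameters and URL-decoded each value,
since otherwise an encoded character such as %60 would slip past it.  It
is an illustration of the idea only, not a tested fix for 3.1.3.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: put a backslash in front of every left quote (`),
// dollar sign ($) and backslash (\) in a decoded parameter value.
std::string escapeForConfig(const std::string &value)
{
    std::string out;
    out.reserve(value.size());
    for (char c : value) {
        if (c == '`' || c == '$' || c == '\\')
            out += '\\';
        out += c;
    }
    return out;
}
```

A wrapper would apply this to every parameter value except "words",
re-encode the result, and pass the rebuilt query string on to htsearch.
Upgrading htsearch to 3.1.5 is still the better answer.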

 Anyone out there want to bash Glimpse before I look into it.  
 I'm hoping to get it at least to compile on an SGI.

I won't do any bashing, but if htdig is your preference, I'd suggest not
giving up on it too quickly.  Did you have a look at David Adams' recent
post about an "IRIX compile fix"?  In it, he forwarded a message from
Bob MacCallum that explains a workaround to some problems on IRIX 6.5,
using cc, not gcc.  If you haven't already, you ought to try that before
abandoning htdig.

  On the other hand, if you have an existing database built with version
  3.1.3, and want to use it with the latest htsearch, that should work
  without any difficulty.  However, you'll lose out on several benefits
  in the latest htdig (better parsing of meta tags, parsing img alt text,
  fixed parsing of URL parameters, etc.), 
 
 Couldn't find what "fixed parsing of URL parameters" means.
 The query string is part of what's indexed??

The query string isn't indexed, but it's part of the URL.  3.1.3 mangled
bare ampersands (&) in the query string in an URL, and versions before
that didn't decode sequences like &eacute; within an URL.  I think the
ChangeLog explains it better than the release notes.

Tue Nov 23 19:52:27 1999  Gilles Detillieux  [EMAIL PROTECTED]

* htdig/HTML.cc(transSGML), htdig/SGMLEntities.cc(translateAndUpdate):
    Fix the infamous problem in htdig 3.1.3 of mangling URL parameters that
contain bare ampersands (&), and not converting &amp; entities in URLs.
...
Wed Sep  1 15:39:41 1999  Gilles Detillieux  [EMAIL PROTECTED]

* htdig/HTML.h, htdig/HTML.cc(do_tag, transSGML): Fix the HTML parser
to decode SGML entities within tag attributes.

  which you'll only get if you
  reindex with htdig 3.1.5.  Maybe none of these matter for your site,
  though.  See the release notes and ChangeLog for details.
 
 I don't think they're essential.

Except for the URL parameter mangling fix, of course.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html

Re: [htdig] Performance problems with htdig 3.2.0b2

2001-01-12 Thread Gilles Detillieux


According to Mathias Rohland:
 I have a problem with the performance of htdig 3.2.0b2. I'm indexing 
 about +25.500 HTML-docs at the moment and it takes several (+8) hours 
 to index them on a machine that's not too busy with other tasks (PII 233
 w/ 512K Cache and 128MB RAM).
...
 I need to use htdig 3.2.0b2 as we need phrase searching and a second
 machine in another location that runs with solaris won't like 3.2.0b3.

3.2.0b3 is still a work in progress, but it already fixes a large number
of bugs in 3.2.0b2.  Try the latest snapshot of b3, and if you still
can't compile it on Solaris, please e-mail us at this list the output
of the configure and make runs.  I don't think it makes sense for us to
take time debugging an old beta version when the real problem here is
you can't build the latest beta pre-release.


Re: [htdig] security hole (was: how to set the $(PERCENT)? -it always show 1%)

2001-01-12 Thread Gilles Detillieux


According to Edward Lu:
 Geoff,
 What is the security hole in version 3.1.5?
 It sounds scary. 

The security hole is in versions BEFORE 3.1.5, and is fixed in 3.1.5.  It
allowed a user to snoop through any file on your web server's file system,
as long as it was readable by the user ID under which the web server process
runs, just by passing it a special query string in the htsearch URL.


[htdig] Re: any suggestions for using 3.1.5 or 3.2.0b2?

2001-01-12 Thread Gilles Detillieux


According to Edward Lu:
 According to the release notes for htdig-3.2.0b2, it added more functionality
 and fixed all known bugs after 3.1.5.
 But apparently it still has the relevance ($(PERCENT)) bug and is not stable
 enough. 
 I am asking for any suggestions about which version (3.1.5 or 3.2.0b2)
 should be used for our company web site. 
 Any experience about the advantage and disadvantage of both the versions?
 
 Any suggestions will be greatly appreciated.
 
 -Edward

It's correct that 3.2.0b2 fixed many known bugs in 3.1.5, but none of
these were earth-shattering problems.  There were many limitations,
though, in the 3.1.x series that required a pretty radical redesign of
many components.

While 3.2.0b2 did fix some bugs, it introduced a whole lot of new ones because of
the large number of redesigned/rewritten components.  That's why 3.2 is
still in beta.  The latest 3.2.0b3 pre-release source snapshot fixes a
lot of the 3.2.0b2 bugs, but there are still some that remain.

If you need the features of 3.2, then use the b3 snapshots, not the
b2 release.  If you don't need these features, and the limitations of
3.1 aren't a problem for you, then you'd be wise to stick to 3.1.5 for
a production system until 3.2 gets a bit more of a shakeout.


Re: [htdig] keep temp files while running indexer? How to...

2001-01-11 Thread Gilles Detillieux


According to Stephen Murray:
 Hi Gilles,
 
 When you wrote:
 
 "Only the one on the contrib section of the FTP site and web site is
 current."
 
 You were referring to rundig.sh at http://www.htdig.org/contrib/
 -- right? That's the one I should use? (As Geoff suggested?)

I answered this yesterday evening, after the first time you asked, but
here goes again...

  Yes, it's the Scripts sub-section of that part of the web site, which
  actually takes you to the http://www.htdig.org/files/contrib/scripts/
  directory.

To clarify further, the one you should NOT use is the one in the contrib
directory of the htdig-3.1.5.tar.gz source distribution, or any other
source distribution, as these are the ones that are outdated.

Using the URL you mentioned in your e-mail above will get you to the
correct script in two clicks.  First, the link "Scripts" in the left
frame, then the link "rundig.sh" in the right frame.  Using the URL
I mentioned in my reply above will get you there in one click.  Either
way, it's the same directory and the same file, presented either with
or without the frame structure on the web page.


Re: [htdig] using perl/cron to find badwords on site

2001-01-11 Thread Gilles Detillieux


According to Jerry Preeper:
 I don't know if anyone else has run across this yet, but I have a number of
 guestbooks and things like that where people can post and I would love to
 be able to find a way to set up a daily cron job with perl script that
 basically runs a set of badwords through htsearch and then emails me a list
 of just the urls it finds with those words in it... I don't really need
 things like the page title or description or stuff like that..  I'm
 assuming I'll need to use a system call in the script to some sort of
 command line option and loop it for each word...  Any input would be
 greatly appreciated.

I assume that you want your htdig database updated through this same
cron job, before running htsearch, so that the database you search will
contain any new postings to the guestbooks.  The simplest way I can
think of, assuming the correct settings are already made in htdig.conf,
would be a shell script with these commands...

  htdig
  htmerge
  /path/to/cgi-bin/htsearch "words=badword1+badword2+badword3+badword4"

Of course, if you want to write it in Perl, especially if you need more
processing than simply running these programs, you can call the above
commands in one or more calls to the system("...") function in Perl.

You may want to customise the htsearch templates to get just the URL,
if that's all you want (see template_map, search_results_header and
search_results_footer in http://www.htdig.org/attrs.html).  If you want
to search for each word separately, rather than one query for all words,
then you'd need to call htsearch once for each individual word.  E.g. in
a shell script, you could do:

  htdig; htmerge
  for word in badword1 badword2 badword3 badword4
  do
    echo "${word}:"
    /path/to/cgi-bin/htsearch "words=${word}"
  done

or:

  htdig; htmerge
  while read word
  do
    echo "${word}:"
    /path/to/cgi-bin/htsearch "words=${word}"
  done < /path/to/bad-word-file

However, it seems to me it would be better to search for all at once,
unless you need a word by word summary of URLs.
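
For anyone who prefers a scripting language over shell, the same job might
look roughly like this in Python.  It is a sketch under assumptions: the
htsearch path is the placeholder used above, the helper names are
hypothetical, and it assumes your result templates put each matching URL
in an href attribute.

```python
import re
import subprocess
from urllib.parse import quote_plus

HTSEARCH = "/path/to/cgi-bin/htsearch"  # placeholder path, as in the examples above


def build_query(words):
    """Build the query string passed to htsearch on the command line."""
    return "words=" + "+".join(quote_plus(w) for w in words)


def extract_urls(output):
    """Collect unique href targets from htsearch output, in first-seen order."""
    seen = []
    for url in re.findall(r'href="(https?://[^"]+)"', output):
        if url not in seen:
            seen.append(url)
    return seen


def find_bad_pages(words):
    """Update the index, run one search for all words, return matching URLs."""
    subprocess.run(["htdig"], check=True)
    subprocess.run(["htmerge"], check=True)
    result = subprocess.run([HTSEARCH, build_query(words)],
                            capture_output=True, text=True, check=True)
    return extract_urls(result.stdout)
```

The list returned by find_bad_pages() could then be mailed from the cron
job; nothing is run until that function is called.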


Re: [htdig] External Converter Prob

2001-01-11 Thread Gilles Detillieux


According to Reich, Stefan:
 all my descriptions are starting with "content-type: text/html".
 
 Is this normal behavior, or is it because I'm using an external converter to
 do some modifications on the spidered html files? I registered my converter
 for text/html -> text/myhtml conversion. I've patched the html parser to
 recognize this in addition to text/html. 
 
 I'm sure my external converter doesn't write text/html to the output stream.
 
 Any ideas?

No, this is not normal behaviour.  If you're certain that your external
converter doesn't write this out, then we'd have to assume it comes
from elsewhere.  It may be a stupid question, but are you sure the pages
you're indexing don't contain this extra header?  I've seen defective
CGI scripts, for example, that inadvertently output two such headers in
some situations.  Ditto for SSI pages that call CGI scripts incorrectly.
Finally, it's hard to be sure it isn't a problem with your patches
to htdig, or to your particular configuration, without being able to
see them.  I don't know if this helps or not, but it may give you a few
more places to look.


Re: [htdig] Unable to contact server-revisisted

2001-01-11 Thread Gilles Detillieux


According to Roger Weiss:
 I'm running htdig v3.1.5 and my digging seems to be running out of steam
 after it runs for anywhere from 20 minutes to an hour or so. The initial msg
 was "Unable to connect to server". So, I ran it again with -v v v   to get
 the error message below.
 
 pick: ponderingjudd.xxx.com, # servers = 550
 3213:3622:2:http://ponderingjudd.xxx.com/ponderingjudd/id6.html: Unable to
 build
  connection with ponderingjudd.xxx.com:80
  no server running
 
 I've replaced part of the URL with xxx to protect the innocent. The server
 certainly is running and I had no trouble finding the mentioned url. Is
 there some parm I need to set or limit I need to raise?
 We're running an apache server with startservers =25 and minspace=10.

I guess the next question, if you're sure the server is running, is can
you access it from a client?  More specifically, can you access it using a
different web client on the same system as the one on which you're running
htdig (e.g. from lynx, Netscape, kfm, or some other Linux/Unix-based web
browser)?  If you can, then the problem will be to figure out why htdig
can't build the connection while other programs on the same system can.
If you can't access the server from any client program on the same
system, then the problem isn't with htdig, but with your network setup
(e.g. firewall, packet filtering, or a bad connection from that system).
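
If no browser is installed on the indexing machine, a bare TCP connection
attempt gives a similar answer.  A minimal sketch, assuming Python is
available there; the function name is hypothetical and the check says
nothing about htdig itself, only about whether the host and port are
reachable from that system.

```python
import socket


def can_connect(host, port=80, timeout=10.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Running can_connect() against the server name from the error message, on
the same machine that runs htdig, separates a network problem from an
htdig problem.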


Re: [htdig] htdig

2001-01-11 Thread Gilles Detillieux


According to Geoff Hutchison:
 No regular expressions needed. You can limit URLs based on query patterns
 already. See the bad_querystr attribute:
 http://www.htdig.org/attrs.html#bad_querystr
...
 On Thu, 11 Jan 2001, Richard Bethany wrote:
  I'm the SysAdmin for our web servers and I'm working with Chuck (who does
  the development work) on this problem.  Here's the "nuts & bolts" of the
  problem.  Our entire web server is set up with a menuing system being run
  through PHP3.  This menuing system basically allows local documents/links to
  be reached via a URL off of the PHP3 file.  In other words, if I try to
  access a particular page it will be accessed as
  http://ourweb.com/DEPT/index.php3?i=1&e=3&p=2:3:4:.
  
  In this scenario the only relevant piece of info is the "i" value; the
  remainder of the info simply describes which portions of the menu should be
  displayed.  What ends up happening is that, for a page with eight(8) main
  menu items, 40,320 (8*7*6*5*4*3*2*1) different "hits" show up in htDig for
  each link!!  I essentially need to exclude any URL where "p" has more than
  one value (i.e. - p=1: is okay, p=1:2: is not).
  
  I've looked through the mailing list archives and found a great deal of
  discussion on the topic of regular expressions with exclusions and also some
  talk of stripping parts of the URL, but I've seen nothing to indicate that
  any of this has actually been implemented.  Do you know if there is any
  implementation of this?  If not, I saw a reply to a different problem from
  Gilles indicating that the URL::normalizePath() function would be the best
  place to start hacking so I guess I'll try that.

I guess the problem, though, is that without regular expressions it
could mean a large list of possible values that need to be specified
explicitly.  The same problem exists for exclude_urls as for bad_querystr,
as they're handled essentially the same way, the only difference being
that bad_querystr is limited to patterns occurring on or after the last
"?" in the URL.

So, if p=1: is valid, but p=[2-9].* and p=1:[2-9].* are not, then
the explicit list in bad_querystr would need to be:

bad_querystr:   p=2 p=3 p=4 p=5 p=6 p=7 p=8 p=9 \
p=1:2 p=1:3 p=1:4 p=1:5 p=1:6 p=1:7 p=1:8 p=1:9

It gets a bit more complicated if you need to deal with numbers of two
or more digits too, because then you can allow p=1: but not p=1[0-9]:,
so you'd need to include these patterns in the list too:

p=10 p=11 p=12 p=13 p=14 p=15 p=16 p=17 p=18 p=19 p=1:1

So, while it's not pretty, it is feasible provided the range of
possibilities doesn't get overly complex.  This will be easier in 3.2,
which will allow regular expressions.
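
The matching rule described above (plain substrings, looked for only from
the last "?" onward) can be sketched like this.  It is an illustration in
Python of the rule as stated here, not the htdig source, and the function
names are hypothetical.

```python
def query_part(url):
    """Return the part of the URL from the last '?' on, or '' if there is none."""
    pos = url.rfind("?")
    return url[pos:] if pos >= 0 else ""


def is_rejected(url, patterns):
    """True if any pattern occurs as a plain substring of the query part."""
    query = query_part(url)
    return any(p in query for p in patterns)


# The explicit list from the bad_querystr example above.
PATTERNS = (["p=%d" % n for n in range(2, 10)]
            + ["p=1:%d" % n for n in range(2, 10)])
```

Because the match is a substring test, a short pattern such as p=2 also
covers p=2:3:4: and every longer value that starts the same way.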

I think my suggestion for hacking URL::normalizePath() involved much more
complicated patterns, and search-and-replace style substitutions based
on those patterns.  That may still be the way to go if you want to do
normalisations of patterns rather than simple exclusions, e.g. if you're
not guaranteed to hit a link to each page using a non-excluded pattern.


Re: [htdig] Problem with PDF files....

2001-01-11 Thread Gilles Detillieux


According to Elijah Kagan:
 
 Dear Everyone
 
 Hope this is the correct list to send such questions. If not, accept my
 apologies.
 
 When I run htdig on my files I get the following message when it comes to
 a PDF document:
 
 41:41:3:http://myserver/~elijah/document.pdf: PDF::parse: cannot find pdf
 parser /usr/local/bin/acroread  size = 1965732 
 
 For some reason htdig looks for an Acrobat while its config file clearly
 states:
 
 external_parsers: application/msword->text/html /usr/local/bin/conv_doc.pl \
   application/postscript->text/html /usr/local/bin/conv_doc.pl \
   application/pdf->text/html /usr/local/bin/conv_doc.pl
 
 The conv_doc.pl exists and working and the content type received from the
 server is application/pdf.
 
 Any ideas?
...
 P.S.  I am running htdig 3.1.5 on a Debian system.

There are a few possibilities:

1) htdig isn't looking at this config file, but another one, without
the external_parsers definition;
2) there's a typo in the external_parsers definition that isn't showing up 
in the text you e-mailed above, e.g. a misspelled word or a space after
one of the backslashes at the end of the first two lines; or
3) there's a definition right above your external_parsers definition that
mistakenly ends with a backslash at the end of the line, causing your
external_parsers definition to be swallowed up by the previous line.

That htdig is attempting to invoke acroread confirms two things:  a)
the PDF file is correctly being tagged by the server as application/pdf,
and b) htdig is not seeing a usable definition of an external parser
for that content-type, for any of the reasons outlined above.
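
Possibilities 2 and 3 are hard to spot by eye, so a quick scan of the
config file may help.  The following Python sketch is hypothetical (it is
not part of htdig) and only checks the two backslash problems listed
above.

```python
def continuation_problems(config_text, attribute="external_parsers"):
    """Return (line_number, message) pairs for suspicious continuations."""
    problems = []
    lines = config_text.splitlines()
    for number, line in enumerate(lines, start=1):
        stripped = line.rstrip()
        # A backslash followed by trailing whitespace no longer ends the line.
        if stripped.endswith("\\") and stripped != line:
            problems.append((number, "whitespace after trailing backslash"))
        # A continued line directly above the attribute swallows its definition.
        if (line.lstrip().startswith(attribute + ":") and number > 1
                and lines[number - 2].endswith("\\")):
            problems.append((number, "previous line ends with a backslash"))
    return problems
```

An empty result rules out these two causes and points back to
possibility 1, a different config file being read.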


Re: [htdig] htdig

2001-01-11 Thread Gilles Detillieux


According to Richard Bethany:
 That was my fear as well.  For the one link below with eight menu items, I
 need to accept p=1: through p=8: to pick up any/all links in the submenus,
 but I would have to reject the other 40,312 possible combinations of values
 that "p" can have.  As you stated, that would be a mite cumbersome and, if
 we had pages with more menu items (we do), it would become exponentially
 more impossible (<-- can something be "more" impossible?  How about more
 improbable?) to limit the accepted values.
 
 Does the 3.2 beta release seem pretty stable?  Does the regex functionality
 work properly?  If so, perhaps I'll give that a shot.  If not, I suppose
 I'll just dig around in the code to see if I can find a way to get it to do
 what we need.

The current 3.2 beta release (b2) isn't stable.  The latest development
snapshot for 3.2.0b3 is much more so, but IMHO still not quite ready
for prime-time.  Ironically, one of the remaining problems is that long,
complex regular expressions seem to be silently failing right now,
so we still need to get to the bottom of that.

However, even if you need to reject 40,312 possible combinations of
values, it doesn't mean you'd need to explicitly list each of those,
as many of them could be covered by the same substring.  The current
handling of exclude_urls and bad_querystr does substring matching, so
there's an implied .* on either side of each string you give for these
two attributes.  Because any of 1 through 8 can be used as the initial p=
value, it makes the problem more complicated than I assumed, but not
by a huge amount.  If I understand correctly, as long as there's only
one menu value specified, it's OK, but if there are two or more, it's
not OK, and only 1 through 8 will appear as possible menu values.  Now,
a string of more than two menu values will be matched by a substring of
only two values, so all you need are all possible series of two values,
or 8 x 8 = 64 patterns, p=1:1 through to p=8:8.  Correct?
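
Rather than typing the 64 patterns by hand, they can be generated.  A
short Python sketch, assuming menu values 1 through 8 as in the example;
the variable and function names are made up for illustration.

```python
MENU_VALUES = range(1, 9)  # menu items 1 through 8, as in the example above

# Every ordered pair of menu values, "p=1:1" through "p=8:8".
patterns = ["p=%d:%d" % (a, b) for a in MENU_VALUES for b in MENU_VALUES]

# Format as a bad_querystr attribute, eight patterns per continued line.
rows = [" ".join(patterns[i:i + 8]) for i in range(0, len(patterns), 8)]
config_line = "bad_querystr: " + " \\\n\t".join(rows)


def is_rejected(query):
    """Substring match, as described for bad_querystr."""
    return any(p in query for p in patterns)
```

Printing config_line gives text that can be pasted into htdig.conf; a
query with a single menu value such as p=5: matches none of the patterns
and is still indexed.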


Re: [htdig] PDFs, numbers, and percent signs

2001-01-10 Thread Gilles Detillieux


According to Philip E. Varner:
 1) The directive minimum_word_length defaults to 3, but when dealing with
 two-digit numbers, this should be set to two.  The default would catch
 "25%", but not other numbers.  This needs to be set in htdig.conf, AND in
 parse_doc.pl, if using it.  parse_doc.pl should probably be changed to
 read variables from htdig.conf at some point in time, but that's not my
 call.
 
 2) In addition to minimum_word_length, I added these attributes to
 htdig.conf
 
 allow_numbers: true
 extra_word_characters: %$#
 valid_punctuation: .-_/!^'
 
 By default, htdig ignores numbers, so I set it to count them.  It also
 ignores most punctuation, so I allow the characters %$# since they are
 common pre/suffixes for numbers.  valid_punctuation then says what to
 ignore.  Also, these need to be accounted for in parse_doc.pl.
 
 3) The default for parse_doc.pl is to strip all punctuation, with the
 command
 
 tr{-\255._/!#$%^'}{}d;
 
 I changed this to
 
 tr{-\255._/!^'}{}d;
 
 to leave the punctuation I wanted.  However, this punctuation was still
 deleted because of the way the text is split() into an array.  I changed
 the command
 
 push @allwords, grep { length >= $minimum_word_length } split/\W+/;
 
 to
 
 push @allwords, grep { length >= $minimum_word_length } split /\s+/;
 
 \W matches anything that's not a word, which includes punctuation.  So,
 punctuation was still getting stripped out.  \s matches all whitespace,
 which is what I really want, since all "offending" punctuation was removed
 earlier.  This works for me, but might not work for everyone.
 
 4) I increased the limit on these two attributes, since PDFs are larger, I
 only had a few dozen, and I wanted good matches.  This is probably not a
 good idea if you have a lot of files, though.
 
 max_head_length:50
 max_doc_size:   5000
 
 
 If anyone has any other suggestions, I'd like to hear about them.

Most of the problems you ran into could have easily been avoided if you
tossed parse_doc.pl into the bit bucket and used an external converter
like doc2html.pl or conv_doc.pl instead.

As you realised, external parsers don't read your config file attributes,
and it would mean making them extremely big and complicated, with a lot
of duplication of code, to get them to do this properly.  That's why
external parsers, in most cases, are a bad idea.  That's also why I
added external converter support back in version 3.1.4.  That way, you
just need a simple conversion to plain text or HTML, and all the gory
details of parsing the document in accordance with the user's wishes are
handled internally by the text or HTML parser.

So, no, parse_doc.pl should not be changed to read the htdig.conf
attributes.  It should be given a decent burial and forgotten.


Re: [htdig] keep temp files while running indexer? How to...

2001-01-10 Thread Gilles Detillieux


According to Geoff Hutchison:
 At 10:43 AM -0800 1/9/01, [EMAIL PROTECTED] wrote:
 1) Are we right in the assumptions we're making above (the  temp 
 files are being destroyed and are thus not available  during 
 indexing) and
 
 If you are not specifying the -a flag to htdig/htmerge then it will 
 modify the filenames specified in your htdig.conf. This would 
 probably not be what you want.
 
 2) if so, how do I change the conf file so that the temporary  files 
 will be available to the search engine during indexing - - so that 
 the search engine will still work during indexing?
 
 You might want to take a look at the rundig.sh script in the contrib/ 
 section (I'm pretty sure it's in the releases, but it's definitely on 
 the FTP server.)

The version in the release distributions, and even in the current snapshots,
is out of date, and still refers to a .gdbm database.  Only the one on the
contrib section of the FTP site and web site is current.


Re: [htdig] keep temp files while running indexer? How to...

2001-01-10 Thread Gilles Detillieux


According to [EMAIL PROTECTED]:
 A couple questions:
 
 1) The file you're referring to is rundig.sh on 
 http://www.htdig.org/contrib/  (right?)

Yes, it's the Scripts sub-section of that part of the web site, which
actually takes you to the http://www.htdig.org/files/contrib/scripts/
directory.

 2) Does the file have to be modified for my system or can I use it 
 as is? (I know, dumb question)

Well, shell scripts don't read the htdig.conf file, so it may be that
some changes there will require corresponding changes in the scripts,
especially with regard to file and directory names.  Take a close look
at it.

 3) Does the file go in bin/rundig or in cgi-bin/rundig?

Only htsearch should go in cgi-bin.


Re: [htdig] Multiple domain names pointing on the same site

2001-01-08 Thread Gilles Detillieux


According to Malcolm Austen:
 On Mon, 8 Jan 2001 [EMAIL PROTECTED] wrote:
 + If a site can be reached via different domain names,
 + is there a trick to make htsearch generate result
 + links pointing to the domain name the user reached
 + the site with ?
 
 Check out the server_aliases: options. It does just what you want.
 
   server_aliases: a.com:80=b.com:80
 
 will result in references to a.com being treated as if they were
 references to b.com

Well, this is a start, but it's only part of the solution.  What this will
do is ensure that only the canonical server name, b.com in this example,
is used for entries in the database.  However...

Greg also wrote:
+ A user reaches the site a.com, makes a search, the result
+ would be www.a.com/searchedpage.html
+ If another user reaches the site b.com (which is the same
+ document as a.com), the result would link to
+ www.b.com/searchedpage.html

This is trickier, as what you want is for a given, presumably static
database for all domains, to alter the search results' domain names to
match the domain name used in the URL that called htsearch.  I think this
would require a combination of server_aliases as above for canonicalising
the domain name, and url_part_aliases to encode the canonical domain in
the database.  Then, the search wrapper would figure out the domain name
used in the CGI URL, and pass that to the real htsearch which would use
it in its own url_part_aliases to decode the encoded canonical domain
into the desired domain name.

For example, in htdig and htmerge's htdig.conf:
server_aliases: www.a.com:80=www.real.com:80 \
www.b.com:80=www.real.com:80
url_part_aliases: www.real.com *site

Then in htsearch's config file:
url_part_aliases: ${searchdomain} *site
searchdomain: www.real.com
allow_in_form: searchdomain

Then, the search form would set the "config" input parameter to set this
particular search config file, and set the action to call a wrapper script
like this one, using the "GET" method:

-
#!/bin/sh

case "$QUERY_STRING" in
*searchdomain=*) ;;  # searchdomain is already set, so leave it
*)  # set searchdomain to HTTP host name used in request
    QUERY_STRING="${QUERY_STRING}&searchdomain=$HTTP_HOST"
    export QUERY_STRING
    ;;
esac

exec /some/path/to/real/htsearch
-

I'm pretty sure this should work, because htsearch seems to parse
allow_in_form's value, and make its input parameters override the
corresponding config attributes, before the url_part_aliases value is
parsed by the HtURLCodec class.
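
The two steps involved (adding searchdomain to the query, then decoding
the stored alias into the requested domain) can be modelled in a few
lines.  This Python sketch only mirrors the logic described above; the
function names are hypothetical, and the stored form "http://*site/..."
is an assumption about what the encoded URL looks like.

```python
def add_searchdomain(query_string, http_host):
    """Append searchdomain=<host> unless the query string already sets it."""
    if "searchdomain=" in query_string:
        return query_string
    separator = "&" if query_string else ""
    return "%s%ssearchdomain=%s" % (query_string, separator, http_host)


def decode_url(stored_url, searchdomain, code="*site"):
    """Replace the encoded canonical domain with the requested one."""
    return stored_url.replace(code, searchdomain)
```

A request arriving at www.b.com would thus see results pointing at
www.b.com, while the database itself holds only the encoded canonical
form.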


[htdig] Re: Enhancement request (PR#991)

2001-01-08 Thread Gilles Detillieux


According to [EMAIL PROTECTED]:
 Is it possible for you to add a feature to the config file to allow 
 custom information in anchor urls in excerpts.
 
 e.g. I would like to add the "target" attribute to the anchor urls so 
 that I can direct the matching url to another frame on the page.

A pretty quick and easy way of doing this would be to change this line in
htsearch/Display.cc's Display::hilight() method (line 1215 in unpatched
3.1.5 code):

result << "<a href=\"" << urlanchor << "\">";

to something like:

result << "<a href=\"" << urlanchor << "\" "
       << config["urlanchor_parameters"] << ">";

Then, you could set this in your htsearch config file:

urlanchor_parameters:   target="body"

I can think of other, more powerful and flexible ways of doing this but
they'd involve much more complicated changes to the code.  This one should
do the job for you.


Re: [htdig] 3.1.5 engine on 3.1.3 db

2001-01-08 Thread Gilles Detillieux


According to Dave Salisbury:
 From: Geoff Hutchison [EMAIL PROTECTED]
  But the root question is:
  Why are you having problems compiling 3.1.5 on IRIX?
 
 I posted this a day or two ago, with no responses.
 Any help would be appreciated!  IRIX 6.5 things go well until
 these errors show up.

Unfortunately, there aren't a lot of people on the list with IRIX experience,
so it's hard to make rapid headway on that front.

 make[1]: Entering directory `/home/salisbur/htdig-3.1.5/htfuzzy'
 g++ -o htfuzzy -L../htlib -L../htcommon -L../db/dist -L/usr/lib32 Endings.o EndingsDB.o Exact.o Fuzzy.o Metaphone.o Soundex.o SuffixEntry.o Synonym.o htfuzzy.o Substring.o Prefix.o ../htcommon/libcommon.a ../htlib/libht.a ../db/dist/libdb.a -lnsl -lsocket
 ld32: WARNING 131: Multiply defined weak symbol:(Deserialize__6ObjectR6StringRi) in Endings.o and EndingsDB.o (2nd definition ignored).
...

It's hard to say for sure what's causing these warnings.  It seems perhaps
the override of virtual methods in the Object class with non-virtual ones
in the String class is causing this.  Maybe it's just because SGI's ld32
doesn't like the way g++ builds these objects.

 and on till a warning message limit is reached and then many errors like:
...
 ld32: Giving up after printing 50 warnings.  Use -wall to print all warnings.
 ld32: ERROR   33 : Unresolved text symbol "cout" -- 1st referenced by EndingsDB.o.
 Use linker option -v to see when and which objects, archives and dsos are loaded.
 ld32: ERROR   33 : Unresolved text symbol "__ls__7ostreamPCc" -- 1st referenced by EndingsDB.o.
 Use linker option -v to see when and which objects, archives and dsos are loaded.
...

Now these errors seem to be the result of ld32 not finding the required
C++ classes in the C++ system library.  Either it's not finding the
library at all (or not told where to find it), or the library is somehow
incompatible with the g++ compiler you have installed.  Given these errors,
I doubt your system could even compile and link a simple "Hello, World"
program in C++.  You'd need either to get to the bottom of this and fix
it, or you'd need to get 3.1.5 built on the system from which you got the
3.1.3 binaries.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Re: Enhancement request (PR#991)

2001-01-08 Thread Gilles Detillieux


According to Kapil Biyani:
 Instead of editing the .cc files from the source, just to add the target param,
 I guess you can even change the long.html file in the $commondir
 
 All one has to do is enable the long file in the source and then edit the
 particular file and add parameters required to it. See below what to add
 in config file.
 
 You can see a working example of it at
 http://www.indiainfoline.com/search/
 
 I have in fact added a complete line of code to it which opens the results
 in a new frame which also has another frame where the search box is again
 displayed...(sounds confusing, check the site :-(

That fix works for the main search result URL shown by the result
template, but Stephen was asking about the anchor URLs that pop up
in the excerpt, when the first matched word appears after an anchor
tag in the source document, and when add_anchors_to_excerpt is true.
There's nothing you can do about these links in an unpatched htsearch,
because the HTML that generates them is hardcoded in the hilight() method.

If you want to use a frames-based setup for htsearch, where the text
from matched pages is displayed in a different frame than the htsearch
results (and their links to these matched pages), then you must use
target specifications for the main URL in the template as well as the
URLs in the excerpt, unless you disable add_anchors_to_excerpt.

As Stephen was asking only about the excerpts, I assumed he had figured
out about the templates, and didn't want to disable the anchors in
excerpts.

 In fact, if you add the long.html file in your config file you can make it
 configurable to any extent you want. Here is what to add:
 -
  template_map: Long long ${common_dir}/long.html \
 Short short ${common_dir}/short.html
  template_name: long   
 --
 
 for more info. check the http://www.indiainfoline.com/search/ page...
 
 (*Hope I am not wrong, if I am then SORRY*)

No apologies necessary.  If you search for "name" on your site above,
you'll see a couple excerpts where the highlighted word is hyperlinked
to an anchor within the document.  That will illustrate what I'm talking
about.  Your template changes don't affect these links.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] [PB] reference count overflow

2001-01-05 Thread Gilles Detillieux


According to heddy Boubaker:
   "Gilles" == Gilles Detillieux [EMAIL PROTECTED] writes:
 
   I sometimes have errors like that when searching:
   DB2 problem...: ...intranet-db.words.db: page 0: reference count overflow
 
  Gilles My guess would be a corrupt database.  Try rebuilding it from
  Gilles scratch.
 
  Thanks, I solved the pb by doing that. 
  
  But what could corrupt the db? The only actions I did on the concerned db
  were: htdig init then htmerge from scratch and then, once a week, htdig and
  htmerge are run again with the same config... If something corrupted my db it
  should be a bug somewhere, no?

Potentially, yes.  We've received scattered reports of inexplicable
database corruption in the 3.1.x series, but never anything solid or
consistent enough that we could nail down to a specific bug.  We don't
know for sure even if it is a bug, but we suspect that it is, albeit an
obscure and infrequent one.  If the problem happens frequently and/or
consistently, please let us know and we can try to track it down.
Otherwise, all we can recommend is to rebuild the index from time to
time to correct this.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] $(NEXTPAGE) does not work in header

2001-01-05 Thread Gilles Detillieux


According to SMantscheff:
 --- start of header file ---
 <p>Die Suche nach <i>${WORDS}</i> ergab folgende Resultate [Seite$ 
 {PAGE}/${PAGES}]:</p>
 $(PREVPAGE)
 $(NEXTPAGE)
 --- end of header file ---
 
 It seems that the $(NEXTPAGE) variable does not work in the header files, 
 while $(PREVPAGE) does. I've got a search result with 4 pages, so both items 
 should appear. What am I missing?

Hard to say for sure.  Which version are you running, and what is the
value of maximum_pages in your config?  The logic htsearch uses is that
if the current page number is less than maximum_pages, it will create
a link for the next page, using the value of next_page_text as the
link description text.  If this string is empty, it could result in an
"invisible" link, i.e. the <a href=...> tag and the </a> tag with nothing
in between.  If the current page number is equal to maximum_pages, it
will set NEXTPAGE to the value of no_next_page_text, which commonly is
an empty string.  In other words, NEXTPAGE will normally be empty for
the last page of search results, while PREVPAGE will normally be empty
for the first page.  Check your values for all of these attributes above,
and take a good look at the resulting HTML output of htsearch.
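
As a rough illustration of the logic just described, here is a small Python model. It is a sketch, not htsearch's actual code: the function name page_links, the "?page=N" link format and the "[next]"/"[prev]" default texts are assumptions made for the example, and the PREVPAGE test (page greater than 1) is inferred from the behaviour described above.

```python
def page_links(page, maximum_pages,
               next_page_text="[next]", no_next_page_text="",
               prev_page_text="[prev]", no_prev_page_text=""):
    """Return (PREVPAGE, NEXTPAGE) for the given results page.

    Simplified model of the description above; "?page=N" stands in for
    the real URL htsearch would generate.
    """
    if page < maximum_pages:
        # A link is created even if next_page_text is empty, which gives
        # an "invisible" link with nothing between the tags.
        nextpage = '<a href="?page=%d">%s</a>' % (page + 1, next_page_text)
    else:
        nextpage = no_next_page_text
    if page > 1:
        prevpage = '<a href="?page=%d">%s</a>' % (page - 1, prev_page_text)
    else:
        prevpage = no_prev_page_text
    return prevpage, nextpage


if __name__ == "__main__":
    for page in range(1, 5):
        print(page, page_links(page, maximum_pages=4))
```

In this model, calling page_links(2, 4, next_page_text="") gives a NEXTPAGE value that is present in the HTML but shows nothing in a browser, which is one way the variable can appear not to work.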

Also, take a close look at your config file and templates for any typos.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Fw: [htdig] - Question for start_url and exclude_urls

2001-01-05 Thread Gilles Detillieux


According to "Mohai Wang" [EMAIL PROTECTED]:
  1. start_url:
 as long as start_url = "http://stagsite.coreon.com/download/". When I run
  "rundig -vvv log", I got error message from screen "DB2 problem...: missing
  or empty key value specified".  I also attached debug mode "log" and
  "htdig.conf" files, please take a look. Did I set wrong option?
  If start_url = "http://stagsite.coreon.com/" that it will go through to
  write index, because I only need to write everything under "download"
  nothing else.

The "missing or empty key value specified" error happens when the one
and only entry in the db.docdb database is deleted because the document
could not be fetched.  I.e. this is a symptom, and not the root cause of
the problem.  The root cause is very clearly indicated in your attached
"log.dat" file:

 0:0:0:http://stagesite.coreon.com/download/: Retrieval command for http://stagesite.coreon.com/download/: GET /download/ HTTP/1.0
 User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
 Host: stagesite.coreon.com
 
 Header line: HTTP/1.1 403 Forbidden

The 403 Forbidden error means htdig could not fetch the only document
specified in your start_url, i.e. the /download/ directory.  403 errors
are almost always the result of file permission problems.  The web
server's user ID does not have read permission (or search/execute
permission) on that directory, so no web client can access it from your
web server.  You'd almost certainly get the same error from your web
browser if you attempted to look at that directory from there using this
same URL.
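
When the verbose log is long, errors like this are easy to miss. Below is a minimal Python sketch that scans a log for fetches whose HTTP status was not 200. It assumes the log lines look like the excerpt above (with the "Retrieval command for" line unwrapped); failed_fetches and SAMPLE_LOG are names invented for this example, and other htdig versions may format their output differently.

```python
import re

# Patterns based on the log excerpt quoted above (an assumption about
# the format, not a documented interface).
RETRIEVAL = re.compile(r"Retrieval command for (\S+): ")
STATUS = re.compile(r"Header line: HTTP/\d\.\d (\d{3})")


def failed_fetches(log_text):
    """Return (url, status_code) pairs for responses that were not 200."""
    failures = []
    url = None
    for line in log_text.splitlines():
        m = RETRIEVAL.search(line)
        if m:
            url = m.group(1)  # remember which URL is being fetched
            continue
        m = STATUS.search(line)
        if m and url is not None:
            code = int(m.group(1))
            if code != 200:
                failures.append((url, code))
            url = None  # wait for the next retrieval command
    return failures


SAMPLE_LOG = """\
0:0:0:http://stagesite.coreon.com/download/: Retrieval command for http://stagesite.coreon.com/download/: GET /download/ HTTP/1.0
User-Agent: htdig/3.1.5
Host: stagesite.coreon.com

Header line: HTTP/1.1 403 Forbidden
"""

if __name__ == "__main__":
    for url, code in failed_fetches(SAMPLE_LOG):
        print(code, url)
```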

  2. exclude_urls:
 I try to do something differently, start_url =
  "http://stagsite.coreon.com/" then I added exclude_urls = "/cgi-bin/
  /calendar/ /coreonlib/". When I run "rundig -vvv log3", it will read
  /coreonlib/ first then stop.  After I took off "coreonlib" from exclude_urls
  then rerun "rundig -vvv log2" that everything are indexing and reject
  "cgi-bin" and "calendar".  Could you tell me why?  Please take a look log3
  file.
...
 0:0:0:http://stagesite.coreon.com/: Retrieval command for http://stagesite.coreon.com/: GET / HTTP/1.0
 User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
 Host: stagesite.coreon.com
 
 Header line: HTTP/1.1 200 OK
 Header line: Date: Thu, 04 Jan 2001 16:27:48 GMT
 Header line: Server: Apache/1.3.12 (Unix) tomcat/1.0 mod_perl/1.24 mod_ssl/2.6.6 OpenSSL/0.9.4
 Header line: Last-Modified: Tue, 12 Dec 2000 02:14:53 GMT
 Translated Tue, 12 Dec 2000 02:14:53 GMT to 2000-12-12 02:14:53 (100)
 And converted to Tue, 12 Dec 2000 02:14:53
 Header line: ETag: "48890-dc0-3a358a1d"
 Header line: Accept-Ranges: bytes
 Header line: Content-Length: 3520
 Header line: Connection: close
 Header line: Content-Type: text/html
 Header line: 
 returnStatus = 0
 Read 3520 from document
 Read a total of 3520 bytes
 
 title: Insite
 href: http://stagesite.coreon.com/coreonlib/html/top_index.htm ()
 
   Rejected: Item in the exclude list: item # 1 length: 11
 
 url rejected: (level 1)http://stagesite.coreon.com/coreonlib/html/top_index.htm
 href: http://stagesite.coreon.com/coreonlib/html/main.html ()
 
   Rejected: Item in the exclude list: item # 1 length: 11
 
 url rejected: (level 1)http://stagesite.coreon.com/coreonlib/html/main.html
  size = 3520
 pick: stagesite.coreon.com, # servers = 1
 htmerge: Sorting...
 htmerge: Merging...
 
 0/http://stagesite.coreon.com/

This log3.dat file doesn't look complete to me.  With the third level of
verbosity that you'd need to get detailed rejection messages like above,
I think you should be getting much more detail than that.  Is this
just an excerpt of the full log?  From what I can see above, it seems
that htdig is only picking up two links from your main index page, and
both are rejected.  This is what you want, according to your comments
above, because log3 is the result of running htdig with /coreonlib/
in exclude_urls.  The question is why does htdig not pick up and use any
other links, and I can't answer that if I don't have the complete log.
Does the complete log indicate more links than that, and if so, what
are the reasons for rejection?  If htdig doesn't see any links other
than those two, you need to find out why.  Are you expecting it to see
JavaScript links?  It won't!  See the FAQ (http://www.htdig.org/FAQ.html),
especially questions 5.25 and 5.27.  Perhaps htdig doesn't see any links
to the rest of your site on the main index page, but does find them
somewhere in coreonlib when you allow it to look there.  In this case,
you'd need to add something on your main index page that htdig can follow
to get to the rest of the site.
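
For reference, the rejections in the log are what you would expect from exclude_urls, which rejects any URL that contains one of its space-separated patterns. The following Python sketch models just that substring test; rejected_by is a hypothetical helper for illustration and says nothing about how htdig implements the check internally.

```python
def rejected_by(url, exclude_urls):
    """Return the first exclude_urls pattern contained in url, or None."""
    for pattern in exclude_urls.split():
        if pattern in url:  # plain substring match
            return pattern
    return None


if __name__ == "__main__":
    excludes = "/cgi-bin/ /calendar/ /coreonlib/"
    links = [
        "http://stagesite.coreon.com/coreonlib/html/top_index.htm",
        "http://stagesite.coreon.com/coreonlib/html/main.html",
    ]
    for link in links:
        print(link, "->", rejected_by(link, excludes))
```

Both links found on the main page contain /coreonlib/, so with that pattern in place there is nothing left for htdig to follow.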

Also, please try to examine your logs more thoroughly, as errors like
the 403 error above shouldn't be dismissed so easily as inconsequential.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7

Re: [htdig] 3.1.5 engine on 3.1.3 db

2001-01-05 Thread Gilles Detillieux


According to Dave Salisbury:
 does anyone know if I can read the database created using
 3.1.5 with a 3.1.3 engine?
 ( just hoping to perhaps save some time before setting things up )
 
 I don't see anything in the release notes to indicate it can't be done.

The subject of your message and the question above seem to contradict
each other, so it's not clear in which direction you want to go.  If
you created your database with htdig 3.1.5, and want to search it with
htsearch 3.1.3, that's a bad idea.  The most glaring bug in releases
before 3.1.5 is in htsearch, so you really should upgrade it.

On the other hand, if you have an existing database built with version
3.1.3, and want to use it with the latest htsearch, that should work
without any difficulty.  However, you'll lose out on several benefits
in the latest htdig (better parsing of meta tags, parsing img alt text,
fixed parsing of URL parameters, etc.), which you'll only get if you
reindex with htdig 3.1.5.  Maybe none of these matter for your site,
though.  See the release notes and ChangeLog for details.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Stats

2001-01-05 Thread Gilles Detillieux


According to htdighelp:
 Is there any way of generating stats of what users are searching for?
 I assume just the standard web logs. Does anyone suggest anything that is very good at
 not just stats but a clear picture of what users are doing.

There are a couple techniques you can use.  One is described at
http://www.htdig.org/attrs.html#logging which uses the syslog facility.
The other is to set up your search forms to use the GET method, rather
than POST, so that the query strings always appear in the web server logs.
Either way, you'll get raw data for every query htsearch processes,
and you can develop some scripts to summarise it any which way you want.
I don't know of any canned scripts to do this for you.
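
As a starting point for the second technique, here is a minimal Python sketch that tallies search words from web server log lines. It rests on several assumptions: the log is in Common Log Format, htsearch is installed as /cgi-bin/htsearch, and the search form passes the query in the "words" parameter. The function name and the sample lines are made up for the example.

```python
import re
from collections import Counter
from urllib.parse import urlsplit, parse_qs

# Matches the request part of a Common Log Format line (assumed format).
REQUEST = re.compile(r'"GET (\S+) HTTP/[\d.]+"')


def count_search_words(log_lines, script="/cgi-bin/htsearch"):
    """Count how often each word was searched for via GET requests."""
    counts = Counter()
    for line in log_lines:
        m = REQUEST.search(line)
        if not m:
            continue
        parts = urlsplit(m.group(1))
        if parts.path != script:
            continue  # not a search request
        # parse_qs decodes "+" and %xx escapes in the query string.
        for query in parse_qs(parts.query).get("words", []):
            counts.update(query.lower().split())
    return counts


# Made-up sample lines, for illustration only.
SAMPLE_LINES = [
    '10.0.0.1 - - [05/Jan/2001:10:00:00 -0600] "GET /cgi-bin/htsearch?words=berkeley+db&method=and HTTP/1.0" 200 5120',
    '10.0.0.2 - - [05/Jan/2001:10:01:00 -0600] "GET /cgi-bin/htsearch?words=Berkeley HTTP/1.0" 200 4096',
    '10.0.0.2 - - [05/Jan/2001:10:02:00 -0600] "GET /index.html HTTP/1.0" 200 1024',
]

if __name__ == "__main__":
    for word, n in count_search_words(SAMPLE_LINES).most_common():
        print(n, word)
```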

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] no data

2001-01-04 Thread Gilles Detillieux


According to Paco Martinez:
  Without doing any changes in my Linux, when I execute /cgi-bin/htsearch, it
  this message appears...
 
  "The document contained no data."
  "Try again later, or contact the server's administrator."
 
  How can I solve it

This sounds like a problem with unreadable templates.  Make sure the
template files (common/*.html) are readable by the user ID under which
your web server runs, and that all directories leading up to and including
the common directory are searchable (executable) by this same user ID.
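
One quick way to check this is sketched below in Python. It only looks at the "other" permission bits, so it assumes the web server's user is neither the owner of the files nor in their group, and it ignores ACLs; world_access_problems is a name made up for this example.

```python
import os
import stat


def world_access_problems(template_path):
    """List (path, problem) pairs that would block an unrelated user.

    Checks that the file is world-readable and that every directory
    leading up to it is world-searchable (executable).
    """
    problems = []
    path = os.path.abspath(template_path)
    if not os.stat(path).st_mode & stat.S_IROTH:
        problems.append((path, "file not world-readable"))
    directory = os.path.dirname(path)
    while True:
        if not os.stat(directory).st_mode & stat.S_IXOTH:
            problems.append((directory, "directory not world-searchable"))
        parent = os.path.dirname(directory)
        if parent == directory:  # reached the filesystem root
            break
        directory = parent
    return problems


if __name__ == "__main__":
    import sys
    for target in sys.argv[1:]:
        for path, problem in world_access_problems(target):
            print(problem + ":", path)
```

Run it on each of the template files in your common directory; an empty result means the permission bits are not the obstacle.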

If that doesn't help, try running htsearch directly from the command line
to see if you can get results that way.  If you can, it's some sort of
web server configuration problem.  If you can't, then you'll need to
look a little deeper into why htsearch is failing.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] what does --host in configure do?

2001-01-04 Thread Gilles Detillieux


According to Foskett Roger:
 Hi, according to this post, http://www.htdig.org/mail/1998/12/0169.html 
 
 using ./configure --host=POSIX.2 sorts out a compile problem I am having on
 HPUX11.
 
 However, configure complains that 'POSIX.2' is invalid when used.  Can
 anyone please tell me what options can be specified for '--host' as I have
 been unable to find anything explaining it

Nor have I.  I'm not a configure expert, but as far as I can tell, the
--host option doesn't seem to do a whole lot.  If you're experiencing the
same sort of errors as in the message above, I'd suggest trying to set
the CFLAGS and CPPFLAGS environment variables before calling ./configure,
to tell the compiler or preprocessor where to find the files it needs.
See http://www.htdig.org/mail/2000/09/0206.html

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] htdig refuses to compile on HPUX11

2001-01-04 Thread Gilles Detillieux


According to Foskett Roger:
 Hi, I am having immense problems getting htdig 3.1.5 to build on HPUX-11.
 
 I've tried this:
 
 CC='cc' \
 ./configure \
 --prefix=/opt/www/htdig \
 --host=POSIX.2
 
 with these compilers (tried them all in various combinations)
 CC='gcc'  or CC='cc'
 CPP='g++' or CPP='aCC'
 CXX='g++' or CXX='aCC'
 
 But keep getting the sort of errors below when running make (C stuff builds
 ok, but not the C++):
 
 gcc -c  -DDEFAULT_CONFIG_FILE=\"/opt/www/htdig/conf/htdig.conf\"
 -I../htlib -I../htcommon  -I../db/dist -I../include -O2 Substring.cc
 In file included from ../htlib/lib.h:22,
  from ../htlib/Object.h:23,
  from Fuzzy.h:24,
  from Substring.h:14,
  from Substring.cc:22:
 /usr/include/string.h:29: warning: declaration of `int memcmp(const void *,
 const void *, long unsigned int)'
 /usr/include/string.h:29: warning: conflicts with built-in declaration `int
 memcmp(const void *, const void *, unsigned int)'
 /usr/include/string.h:85: warning: declaration of `void * memcpy(void *,
 const void *, long unsigned int)'
 /usr/include/string.h:85: warning: conflicts with built-in declaration `void
 * memcpy(void *, const void *, unsigned int)'
 /usr/include/string.h:93: warning: declaration of `size_t strlen(const char
 *)'
 /usr/include/string.h:93: warning: conflicts with built-in declaration
 `unsigned int strlen(const char *)'
 as: "/var/tmp/ccWQJsCl.s", line 53: warning 36: Use of %fr21 is incorrect
 for the current LEVEL of 1.0
 as: "/var/tmp/ccWQJsCl.s", line 75: warning 36: Use of %fr20 is incorrect
 for the current LEVEL of 1.0
 as: "/var/tmp/ccWQJsCl.s", line 76: warning 36: Use of %fr19 is incorrect
 for the current LEVEL of 1.0
 
 Eventually, the whole thing falls over when it comes to the link stage.

This isn't the link stage, but the assembler stage.  Evidently your C++
compiler is generating incorrect code for your assembler, "as".  Do you
get the same error when you set CXX to g++ or aCC?  I find it a bit odd
that you're using gcc to compile C++ code, but it would be interesting to
see how this changes when using a different front-end compiler than gcc.

 I have tried using '--host=POSIX.2' as suggested in this post
 http://www.htdig.org/mail/1998/12/0169.html but that doesn't seem to do
 anything (configure complains that it is invalid!?)  I have also tried using
 g++ and specifying CXXFLAGS='-lstdc++' but still no luck.
 
 Can anyone please help me on this, I am completely stuck.
 
 The weirdest thing is that I somehow managed to get it working once before,
 (but accidentally wrote over the exe's)

This raises the obvious question of what has changed since you last got
it working.  Did you change any of your compilers at all?  Did you use
a different set of configure options before, and forgot what they were?

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Lost words

2001-01-04 Thread Gilles Detillieux


According to Tuomas Jormola:
 At my company we're trying to migrate from clumsy self-implemented
 search engine to htdig but it's not quite painless. The scenario is this:
 We've two databases on separate servers. One for public sites and one for
 intranet sites. The public database is only 4.2M and it has 7 sites indexed
 in it. The intra database is 690M with 4 sites/vhosts. Public search is
 accurate, fast and working great but intra search is causing troubles.
 
 For example, if you index a single site that contains lots of on-line
 manuals, the database is about 380M and word "aix" returns over 18000 hits.
 But when this site is indexed with the other intra sites, "aix" returns
 only 27 hits, most of them points to the on-line manual server as expected
 though. But where have thousands of the hits gone?
 
 So if these sites with gigabytes of content are indexed separately,
 the search is accurate but when the index is only one big db, a great
 amount of correct links is missed. Any guesses whether this is due to
 1) feature in htdig/htmerge and if so, is there a way to disable it?
 2) bug in htdig?
 3) bug in Berkeley db?
 4) bad configuration?

Hard to say for sure.  As you're not using htmerge to build the one
big db from the separate, smaller dbs, that rules out problems in the
merging code causing this problem.  Is the size of the big db roughly
equal to the sum of the sizes of the separate ones?  It could be an
obscure htdig or htmerge bug, or an AIX-specific problem.  This sure
isn't ringing any familiar bells, if that's what you're wondering.

 We're using htdig-3.1.5 and Berkeley db that was included in htdig archive
 running on AIX 4.3. htdig was compiled using  IBM VisualAge C++ Pro for
 AIX Version 5. And here's the list of configuration options that were
 changed against the default config (excluding options that are solely
 used to control the layout of htsearch):
 
 # to make searching of words with umlauts work
 locale: fi_FI
 # everything is valid :)
 valid_punctuation:  
 # to be able to search weird chars used in example scripts etc.
 extra_word_characters: @.-_/!#$%^'

OK, that's a pretty unusual use of the above two attributes.  Are you
aware that with these settings, the following 3 words will be treated
as separate and distinct words, and a search for one of them will not
find the other two?

aix-based aix aix.

However, I don't think that's the cause of the problem you're reporting,
if you're using the same settings for these attributes in all your
databases.
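
To show what I mean, here is a deliberately simplified Python model of word splitting with your extra_word_characters setting. It is not htdig's parser, just a sketch: split_words is a made-up helper that treats letters, digits and the listed extra characters as parts of a word.

```python
import re


def split_words(text, extra_word_characters=""):
    """Split text into lowercase words.

    Simplified model, not htdig's actual parser: a word is any run of
    ASCII letters, digits and the given extra characters.
    """
    allowed = "A-Za-z0-9" + re.escape(extra_word_characters)
    return [w.lower() for w in re.findall("[" + allowed + "]+", text)]


if __name__ == "__main__":
    extras = "@.-_/!#$%^'"  # the extra_word_characters value quoted above
    words = split_words("aix-based aix aix.", extras)
    print(words)
    # An exact search for "aix" matches only one of the three words.
    print(words.count("aix"))
```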

 # numbers, too, of course
 allow_numbers: true
 # exact matches only
 search_algorithm: exact:1
 
 
 BTW. Every test mentioned above was performed using a db built from
 the scratch with htdig/htmerge. Also it isn't due to erroneous restrict
 or exclude values. When talking about the size of the db, I mean the total
 size of all files in db directory. No support for optional algorithms were
 built using htfuzzy. The same config file was used in every test
 (well, database_dir and start_url were included from site-specific
 config file if only indexing a single site). htdig/htmerge reported
 no errors while creating each db and there's plenty of disk space.
...
 Oh I forgot to mention that one reason for this could be that frigging AIX,
 right? But I don't want to test htdig on my Linux desktop machine before
 everything else is tried at the actual server side.

We sure haven't tested htdig very thoroughly on AIX, so I would be
inclined to suspect a system-specific problem is at work here.  I think
testing your configurations on a Linux system would be a very good idea.
If the problem occurs there too, then it would point more surely to a
configuration problem or a bug.  If the problem doesn't occur on Linux,
then it's almost certainly an AIX-specific thing.  Either way, we'd need
more data to narrow it down.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] [PB] reference count overflow

2001-01-04 Thread Gilles Detillieux


According to heddy Boubaker:
 I sometimes have errors like that when searching:
 
 DB2 problem...: ...intranet-db.words.db: page 0: reference count overflow
 
 And of course it will generate no matches ... Any idea how to solve that?
 
  htdig-3.1.5

My guess would be a corrupt database.  Try rebuilding it from scratch.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] questions

2001-01-04 Thread Gilles Detillieux


According to John Lunstroth:
 
 Hi - I am a beginner here and have some more questions. Apologies for
 taking up time on some of this.
 
 1. Ultimately I am interested in phrase and proximity search
 capabilities. I have been working at installing 3.1.5. I have read
 the release notes and see that the beta of 3.2 should be installed
 separately etc., since it uses different protocols, etc. I am wondering
 if I should just go ahead and start working with 3.2 - will it be
 difficult to upgrade between fixes?

That depends on how important phrase searching is to you.  There are
still a lot of bugs in the query parser in 3.2, and a rewrite of it
is slated for 3.2.0b4, so I don't know if the current phrase searching
will be adequate.  As far as upgrading between fixes, there's commonly
a bit more effort involved in working with beta releases, although for
some people they install quite smoothly.  There may be database format
changes coming down the road, so upgrading htdig may mean having to
reindex from scratch.  Switching between 3.1.5 and 3.2.0bX will certainly
require reindexing.

 2. Being new to the Linux environment, I am just getting acquainted with
 the File Hierarchy Standards ideas, and am uncertain how important it is
 to follow based on the following. I have noticed that the FHS applies
 to administrators setting up systems, but I also notice that my web
 host has other protocol in place for the section of the server I have
 access to. For example, the root of my server (I only have telnet/ftp
 access), has a /www directory that contains all of the websites, and a
 /home directory that is my home directory and contains all other home
 directories. I have real space in each subdirectory under my domain
 name - /www/myname/ and /home/myname. There is a link from /home to
 /www. This at first caused me some difficulty, but I got it figured out.
 
 The htdig configuration program/file assume that the website will be
 located under the server's /opt subdirectory - so configure by default
 produces files with this location: /opt/www. "opt" is the name of
 subdirectories in which the user should put non-system programs -
 their applications, if I understand correctly. There are also the
 "var" and "bin" subdirectories.
 
 Is there a recommended file hierarchy I should use in the
 directory I have available? I am building my site in the
 /www/myname/ subdirectory. That is where cgi-bin is located
 (/www/myname/cgi-bin). Will it be easier, in the long run, to use a
 certain file system - I assume it will be, since htdig, and probably
 other apps, use a common base of standards I am unfamiliar with.

Don't worry too much about the FHS.  It's meant for people putting
together packages for distribution.  End users are not bound by it,
and individual system installations may go with something very different
in many cases.  Go with what works.  If your web hosting company imposes
a different hierarchy, the easiest thing in the long run is to go with
their setup as much as possible.  It's easy enough to configure htdig
to use any set of directories you want.

I think the whole /opt thing is a Sun-ism that may have been adopted
by some (but certainly not all) Linux distributions.  On Red Hat systems,
I go with something more FHS-like.

 I am asking a narrow question - what subdirectories would it be best
 to use in setting up htdig:
 
 /www/myname/opt (put htdig here as separate subdirectory
  /cgi-bin (htdig automatically puts stuff here
  /var (? not sure what to put here
  /bin (? not sure how to use this one - or
  even where it should be vis-a-vis htdig
  /htdocs (? is the name important or standard?
 
 the subdirectory "htdocs" - I assume this means hypertext docs -
 and should be where the content is - is "htdocs" a standard name,
 or an abbreviation used by the htdocs people?

htdocs is commonly used as the name for the DocumentRoot, but I don't
think there's any standard involved here.  Go with whatever your web host
uses as its document root directory, and put the "htdig" subdirectory that
contains the image files right in that directory.  Put htsearch in your
web host's cgi-bin directory if at all possible (and if one is provided),
to avoid having to specify a new ScriptAlias directory for CGI programs.
The rest of the files (executables, common/* files, database directory)
can go wherever you see fit, but make sure the common and database
directories are accessible by the web server's user ID.
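
As a rough sketch of the layout advice above: database_dir and common_dir are the htdig.conf attributes that point htdig at its database and common directories. The /www/myname base and the htdig/db and htdig/common subdirectory names below are assumptions for illustration only.

```shell
# Print htdig.conf lines for a chosen base directory.
# The subdirectory names are hypothetical; any paths the web
# server's user ID can read will do.
print_htdig_dirs() {
    base=$1
    printf 'database_dir:\t%s/htdig/db\n' "$base"
    printf 'common_dir:\t%s/htdig/common\n' "$base"
}

print_htdig_dirs /www/myname
```

The printed lines would be merged into htdig.conf by hand; nothing here touches an existing configuration.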

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.

Re: [htdig] HTDig indexes virtual host homepage only

2001-01-04 Thread Gilles Detillieux


According to Paul Broome:
 I have a problem with htdig indexing a virtual host on one of my web 
 servers - the other virtual hosts work fine. The server is a Sparc 
 running Debian Slink and HTDig 3.1.5. Under the virtual domain 
 http://www.ltbp.org, only the main page gets indexed. We have tried 
 everything we can think of, but without any joy.

See http://www.htdig.org/FAQ.html#q5.25 and try with one or two more -v
options to see why the links are not being followed.
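
One possible way to work through the extra -v output is to save it to a file and count the lines that mention a rejection. This is only a sketch: the exact wording of htdig's messages differs between versions, so the case-insensitive match on "reject" is an assumption, as are the file names.

```shell
# Count lines mentioning a rejection in a saved verbose log.
# Matching on the word "reject" is an assumption about the log wording.
count_rejections() {
    grep -ci 'reject' "$1"
}

# Typical capture step (not run here; paths are placeholders):
#   htdig -vvv -c /path/to/htdig.conf > htdig.log 2>&1
#   count_rejections htdig.log
```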

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.
List archives:  http://www.htdig.org/mail/menu.html
FAQ:http://www.htdig.org/FAQ.html

Re: [htdig] A little automation

2001-01-04 Thread Gilles Detillieux


You may want to have a look at ConfigDig:

http://configdig.sourceforge.net/

According to htdighelp:
 
 I agree and the fact is, if it takes a separate database, so be it, much
 simpler than trying to mod the code for this. Maybe in the future, it might
 be a good idea. Maybe some sort of web based config interface for all these
 things. Fact is, the net is not getting any smaller and I suspect that those
 using this seriously will need more functionality.
 
 Mike
 
 
 At least IMO, true operational requirements for any such system
 would be quite user-specific.  The (full) set of user requirements would tend
 to include:
 
   Scheduling capabilities.
   Varying frequencies, perhaps even within the same URL.
   Inclusion and/or Exclusion of specified nodes, within a url.
   Statistical-recording capabilities.
   Varying underlying-database formats.


-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] PDF problems

2001-01-03 Thread Gilles Detillieux


According to The Melia Family:
 I am using HTDIG 3.1.5 on Redhat 7.0, and am having problems indexing PDF
 files. I have included my config & -vv output below.  I have no robots.txt
 file, and my max_doc_size is now 10M (one test .pdf file is only 27K and it
 also fails), as well as not rejecting pdf as an extension.
 I am using the latest xpdf with pdftotext, as well as the latest parse_doc
 and conv_doc scripts.
 
 I can manually pdftotext the pdf files and they do contain real text, not
 just images, I can also run parse_doc and conv_doc.pl; they produce proper
 text.  When I do a rundig, I get a 'URL rejected' message, I do not know
 why, this (I presume) leads to a Deleted No Excerpt message and the file (or
 any pdf file) is not indexed.  Any suggestions??

The output from htdig isn't verbose enough to pinpoint the problems,
but there is more than one problem here.  First of all, I always strongly
recommend conv_doc.pl or doc2html.pl over parse_doc.pl.  The latter has
been the source of too many problems in the past.

Secondly, the rejected URLs and the "Deleted, no excerpt:" messages
are two unrelated issues.  URLs that are rejected by htdig at this
stage (level 1 or level 2) will not even be seen by htmerge.  For the
rejection of URLs, see http://www.htdig.org/FAQ.html#q5.27 for how to
deal with this.  There isn't enough information in the htdig output or
the excerpts of your htdig.conf you sent to be certain of what the reason
for rejection is.  However, the htdig output you sent seems to suggest
a different start_url value than the one in your htdig.conf excerpt, so
I suspect that the reason for the rejection is that the parent directory
of the one you're indexing is not in the limits of limit_urls_to, which
is a reasonable thing for a test case such as this.
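
The limit_urls_to check described above can be approximated as a substring test: a URL is followed only if it contains at least one of the listed patterns. The function below is a simplified sketch of that idea, not htdig's own code, and the example.com URLs are made up.

```shell
# Succeed if the URL contains at least one of the given patterns,
# roughly how a limit_urls_to list is applied.
url_within_limits() {
    url=$1
    shift
    for pattern in "$@"; do
        case $url in
            *"$pattern"*) return 0 ;;
        esac
    done
    return 1
}

# A parent directory does not contain the longer pattern, so it is rejected.
url_within_limits 'http://www.example.com/' 'http://www.example.com/test/' \
    || echo "parent directory is outside the limits"
```

This shows why indexing a test subdirectory can produce rejections for links that point back up to its parent.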

The "Deleted, no excerpt:" messages are usually the result of documents
that contain no indexable text, or external parsers that don't emit a
usable "h" record (one more reason to use an external converter rather
than an external parser).  The challenge is to get to the bottom of why
this happens in each individual case.  You did run the scripts manually,
which is what I usually recommend, but are you sure parse_doc.pl put out
a proper "h" record and not just "w" records?  Did you try htdig with
conv_doc.pl instead, using the correct syntax for external_parsers as
shown in conv_doc.pl's comments?

Finally, I noticed you're getting the directory indexed multiple times
due to Apache's fancy indexing feature.  You can avoid this by adding
"?D=A ?D=D ?M=A ?M=D ?N=A ?N=D ?S=A ?S=D" to exclude_urls (without the
quotes) to suppress the alternately sorted views of the directory.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Indexing german pages

2001-01-03 Thread Gilles Detillieux


According to Radoy Pavlov:
 I have some questions regarding german language.
 Following the example in FAQ I've made my htdig.conf,
 extracted GermanWords.zip in $COMMON_DIR/german and edited htdig.conf.
 I've done this:
 rerun of rundig
 rerun of htfuzzy endings
 Still htdig can't find any words with umlauts (äöü etc.), although I have
 nearly 30 MB of databases.
 The search page shows that it is searching for the word .. with no
 effect.
 
 My search algorithm:
 search_algorithm:  exact:1 endings:0.5 prefix:0.4
 
 Perhaps I need to optimize the algorithm in order to get some matches?
 What is a "correct" algorithm ?

No, the search algorithms are not likely the problem.  If you can't even
get an exact match, the problem lies elsewhere, and in this case I'd bet
it's a problem with locales.

You didn't mention what system you are running htdig on, and what your
locale setting is.  Some systems don't have properly functioning locale
support at all (e.g. many libc-5 based Linux systems), and many don't have
complete locale tables installed.
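
A quick way to see whether the locale tables are installed at all is to look through the output of `locale -a`. The helper below is a minimal sketch; the de_DE name is an assumption, since locale names vary from system to system.

```shell
# Succeed if any locale name on stdin starts with the given prefix.
# The prefix is used as a regular expression, so keep it to plain names.
has_locale() {
    grep -q "^$1"
}

# On a real system (not run here):
#   locale -a | has_locale de_DE && echo "German locale tables present"
printf 'C\nPOSIX\nde_DE\n' | has_locale de_DE && echo "found"
```

If a matching name turns up, that same name is the one to try in the locale attribute of htdig.conf.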

See the thread entitled "Portuguese" from this past May, for more pointers
on locale-related problems:

http://www.htdig.org/mail/2000/05/index.html#61

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Going for the big dig

2000-12-19 Thread Gilles Detillieux


According to Terry Collins:
 Geoff Hutchison wrote:
  
  At 10:14 AM +1100 12/19/00, Terry Collins wrote:
  And make sure you don't ignore robots.txt
  
  Yes, though someone would need to alter the code to do this.
 
 If you are doing an external site, it shouldn't be too much effort to
 just read this and set the excludes.
 
 Courtesy thing.

I think you misunderstood.  htdig already does read the robots.txt file
and skips all disallowed documents.  You don't need to do this manually.
Geoff was saying you'd need to alter the code in order to ignore robots.txt,
which definitely would be a bad thing if you then use the hacked htdig to
index sites that are not your own.

Actually, on my site I don't bother with exclude_urls at all, and use the
robots.txt file instead.  This way, anything that I don't want indexed by
htdig won't be indexed by any other search engine either.
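
For illustration, a robots.txt along these lines keeps every well-behaved robot, htdig included, out of the listed directories. The two paths are invented examples, not taken from any real site.

```shell
# Print a minimal robots.txt; the disallowed paths are made-up examples.
sample_robots_txt() {
    cat <<'EOF'
User-agent: *
Disallow: /private/
Disallow: /drafts/
EOF
}

sample_robots_txt
```

The output would be saved as robots.txt at the top level of the web server's document root.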

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Hi, need help with searching database.

2000-12-18 Thread Gilles Detillieux


According to Akshay Guleria:
 thanx Gilles,
 However my problem is now fixed. I am using the following.
 htdig-3.2.0-0.b2.i386.rpm

Ah, I wasn't aware that this version was out in RPM.  There are many
known bugs in 3.2.0b2, so don't be surprised if other problems occur.
The scoring bug in htsearch is very likely to turn up, unless this
RPM included a patch for this bug.

 There were 2 problems (in case you are interested):
 1. htaccess files not allowing the rundig to connect to the server.

There's not much htdig can do about this.  If the .htaccess file sets
up Basic authorization, then you can use the -u option to provide the
user name and password to the server.  If the .htaccess file sets up
some other restrictions, you're out of luck, but then these pages would
also be inaccessible from a standard web browser coming in from the same
address.

 2. This file in /var/lib/htdig needed webserver owner's ownership.
 db.words.db_weakcmpr
 
 As soon as I owned it by "apache", it worked. I don't know, but I think the
 rpm packager should have taken care of this.
 Anyway, it works now.
 
 Thanx a lot for getting back.

This point was covered in FAQ 5.17.  I didn't realize you were running
a 3.2 beta before.  It's very important to mention which version you're
running, because many, if not most, of the bugs and problems that come
up are version-specific.  There's not much the RPM packager can do
about this particular bug in htdig, because the db.words.db_weakcmpr
file is not normally part of the RPM distribution - it's only created
after installation, when you run the rundig script.

 -----Original Message-----
 From: Gilles Detillieux [mailto:[EMAIL PROTECTED]]
 Sent: Wednesday, December 13, 2000 11:20 PM
 To: [EMAIL PROTECTED]
 Cc: [EMAIL PROTECTED]
 Subject: Re: [htdig] Hi, need help with searching database.
 
 
 According to Akshay Guleria:
  I just installed Redhat7.0 on my machine. And then installed htdig rpm. I
  can see the page
  http://myhost/htdig/ which is the search page.
 
 Which htdig rpm did you install?  For Red Hat 7.0, you should use the RPM
 for htdig-3.1.5-6 that comes with the 7.0 PowerTools.
 
  I make a search and for any search I make, it returns a page saying
  "No matches found for ... "
 
  Now, I ran rundig and it increased the file sizes in /var/lib/htdig. So, I
  presume the database was created. And then I ran htmerge. But I still get
  the
  "No matches found .." page.
 
 If you run rundig, you don't need to run htmerge separately.  The rundig
 script will run htdig followed by htmerge.  You should try running
 your /var/www/cgi-bin/htsearch program right from the command line
 first, to see if that works.  If it does, it may be an Apache server
 configuration problem, or a problem with your search form.  Did you
 make any changes to the /var/www/html/htdig/search.html search form?
 If so, see http://www.htdig.org/FAQ.html#q5.17
 
 --
 Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
 Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
 Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
 Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930
 
 
 


-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Words and files not being found or indexed

2000-12-18 Thread Gilles Detillieux


According to crosstar:
 This is a message for Gilles or anyone who is "senior" enough with the
 program to answer.
 
 I had written to Gilles, earlier, and he had said to post the questions here.

I've been away from work and e-mail since last Wednesday, so I didn't get
caught up on this thread until today.  You see, there's a reason why I
always redirect people to the list!  I'm rather glad I missed this thread,
actually, as the whole thing seems to have been an exercise in frustration.

From the outset, I referred you to FAQ 5.25, but I didn't see any evidence
from this whole, very long thread of discussion that anyone had looked at
or followed the suggestions there.  Was the language used in that question
so indecipherable that no one could get anything from it?  I realize it's
written in technical language, but setting up a search engine correctly is
a pretty technical problem, so if you don't understand the basics of Unix
or Linux and how web servers work, you really should read up on that before
attempting something like this.

Anyway, if anyone can contribute suggestions as to how this FAQ entry can
be better written, I'd be glad to hear them.  If the problem was simply
that no one bothered to look at the FAQ, then why am I wasting my time
trying to update it?

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Q

2000-12-18 Thread Gilles Detillieux


According to ellenliu:
 Dear Sir:
   First ,I send  my great gratitude to Gilles R. Detillieux
 and Daniel Naber for their warmhearted help. 3.2.0b3 has been installed
 on my system successfully.

Which snapshot did you use?  3.2.0b3 is still a work in progress, and is
slowly but surely getting closer to being ready for beta release.  The
121700 snapshot is the most stable one so far.

   Here I have another question: I had read through the source
 code before installing, but I want to trace some code also now; would
 you please tell me which development tool is good at debugging and/or
 tracing C/C++ programs for the Red Hat Linux platform?

I think most Red Hat Linux users would suggest gdb, or perhaps xxgdb.
If the C++ program you're debugging is htdig, I'd also suggest using
the debugging output already programmed into it, and activated with
multiple -v options, as you get a lot of feedback that way.  (I'm a
big believer in debugging trace prints in general, and do most of my
C/C++ debugging that way.)

   Moreover, I had run it on my LAN, but when I search for some words, it
 always gives me a "no found" page (I run it with this command line:
 htsearch word). I'd like to know whether this problem is caused by the
 way I am running it.

You should run htsearch from the command line either with no arguments
at all, and let it prompt you for the search words, or you should give
it a full CGI-style query string as an argument, e.g.:

   /opt/www/cgi-bin/htsearch 'words=butterfly+valve&method=and'

Be sure to quote the query string if it contains any shell meta characters
such as "&", ";", "*", etc.
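
A small helper can take some of the guesswork out of building such a query string. This is a sketch under simple assumptions: only spaces are encoded (as "+"), and the function name is made up for the example.

```shell
# Build a CGI-style query string for htsearch from search words and a method.
# Only spaces are encoded (as "+"); other special characters are left alone.
build_query() {
    words=$(printf '%s' "$1" | tr ' ' '+')
    printf 'words=%s&method=%s\n' "$words" "$2"
}

build_query 'butterfly valve' and

# Passing the result in double quotes keeps the shell from treating "&"
# as a control operator (not run here):
#   /opt/www/cgi-bin/htsearch "$(build_query 'butterfly valve' and)"
```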

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] locale:ru on Solaris

2000-12-13 Thread Gilles Detillieux


Sorry, but there's absolutely nothing I can do with the core file itself,
as I don't have a Solaris system.  What I want is for you to get a stack
backtrace using your htsearch executable and your core dump file using
your debugger.  Either that or run htsearch directly under your debugger,
and when it fails get the stack backtrace directly from the in-memory copy
of the program.  If you use gdb, the procedure is described in the FAQ.
If you use another debugger, you'll need to figure out how to do it with
that debugger.
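
With gdb, the backtrace can also be taken in one non-interactive step. The wrapper below is a sketch that assumes a gdb recent enough to support the -batch and -ex options; the paths in the example are placeholders.

```shell
# Print a stack backtrace from a core file without an interactive session.
# Arguments: the executable that crashed, then its core dump file.
core_backtrace() {
    gdb -batch -ex bt "$1" "$2"
}

# Example (not run here):
#   core_backtrace /path/to/cgi-bin/htsearch /path/to/core > backtrace.txt
```

The resulting text file is what is useful to post to the list, rather than the core file itself.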

According to Eldar Imangulov:
 Hello!
 
 Thanks for your will to help me.
 
 Here it is the core.
 
 
 
 Regards,
 Eldar Imangulov
 project manager (design & hosting)
 [EMAIL PROTECTED]
 phone/fax.: +7 095 777.09.10
 
 Global Chance
 Bld.1, 42 Bolshaya Yakimanka st.,
 Moscow 117049 Russia
 
 //  -----Original Message-----
 //  From: Gilles Detillieux [mailto:[EMAIL PROTECTED]]
 //  Sent: Tuesday, December 12, 2000 8:33 PM
 //  To: Eldar Imangulov
 //  Cc: [EMAIL PROTECTED]
 //  Subject: Re: [htdig] locale:ru on Solaris
 //  
 //  
 //  According to Eldar Imangulov:
 //   I'm using Solaris 7
 //   
 //   I made the htDig and now I try to make search my site in russian
 //   (windows-1251).
 //   
 //   in htdig.conf I said the
 //   locale : ru
 //   
 //   The website indexing is going well but the htsearch does not work
 //   (coredump).
 //   
 //   But without russian language (indexing by default = without locale:ru)
 //   indexing & htsearch works well together.
 //   
 //   What is the problem???
 //  
 //  Hard to say, but from what you describe it sounds like a problem with the
 //  locale tables for your locale, or a database corruption problem of some
 //  sort, perhaps.  Could you give us a stack backtrace of htsearch's core
 //  dump to narrow things down a bit?
 //  
 //  See the latter part of http://www.htdig.org/FAQ.html#q5.14


-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Htdig as external Link Checker? (Maybe off-topic)

2000-12-13 Thread Gilles Detillieux


According to Reich, Stefan:
 I need to generate a List for my boss, which contains all external Links of
 our Web-Site (which gets already indexed by htdig) including the status
 (means if the target of this link exists or not)

You should have a look at Gabriele's ht://check program, which is partly
based on htdig.  It's on the sourceforge.net web site, I believe.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] result count is too small ?

2000-12-13 Thread Gilles Detillieux


According to Dennis Director:
 I am running htdig-3.2.0b2, I recently moved from htdig-3.1.5.
 Sometimes, the result count that I get back from a search is too small.
 For instance, below it said I have ten matches but only gave me two.

It's hard to say for sure what's happening, but 3.2.0b2 has a number of
known bugs, which are fixed in the latest development snapshot for 3.2.0b3.
The infamous scoring bugs might account for the behaviour you see.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



[htdig] Re: I need your help [from ellenliu]

2000-12-13 Thread Gilles Detillieux


Hi, Ellen.  First of all, you should always send these questions to
the list, and not to me personally.  I don't have all the answers.
See http://www.htdig.org/FAQ.html#q1.16

According to ellenliu:
 Dear Gilles R. Detillieux:
   I'm very grateful for your kind help last time.
   
   All these problems happened before compilation,during the Configure process.
  
   Because I can't get the most recent development snapshot of 3.2.0b3

They're in http://www.htdig.org/files/snapshots/

However, if you don't need any of the new features in the 3.2 series, you're
probably better off with 3.1.5.

 I ran 3.1.5 instead, but there still exist some problems.
 I entered :
 "sh ./configure" ,
 
 it prompts:
 ".
 checking host system type ... ./configure: ./config.guess: no such file or directory 
configure
 configure:error:can not guess host type ;you must specify one
 configure :error :./configure failed for db/dist"
 I think that it can't get past the check of 'host system type'. I
 have read through the ./config.guess file, but I'm not clear what
 I should do yet. I know the default value of $host is NONE; do I need
 to set a type according to my machine?
 
 as I said last time when I run 3.2.0b2
 the output  prompts:
 "
 ...
 checking whether make sets ${MAKE}(cached) yes
 configure :error: can not run ./config.sub"
 
 in ./configure file I find the line (933): "if ${CONFIG_SHELL-/bin/sh} $ac_config_sub sun4 >/dev/null 2>&1; then"
 why set the parameter sun4?
 
 would you tell me what I should do next?
 Thanks.
 configure:
 cpu :PIII 550M
 os: red hat linux 6.2 kernel 2.2.14-5.0

We've never seen anything like this before on Red Hat Linux systems of any
version.  Certainly not on 6.2.  As I said last time, you may very well be
missing some critical packages from your Red Hat distribution which are
needed to compile and install software.

The other thing I'm noticing is that there seems to be a problem with
execution of scripts on your system.  How did you extract the files from
the .tar.gz distributions of either 3.1.5 or 3.2.0b2?  Did you use chmod
on any of the files, and in doing so turn off execute permissions on them?
If you did, that's definitely going to be a problem!
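
One way to check for lost execute permissions is to list the scripts that configure depends on and see which ones are not executable. This is a sketch only; the helper name is invented, and the file names in the example are the ones mentioned in the error messages above.

```shell
# Print the names of the given files that lack execute permission.
list_non_executable() {
    for f in "$@"; do
        [ -x "$f" ] || printf '%s\n' "$f"
    done
}

# In the unpacked source directory one might run (not run here):
#   list_non_executable configure config.guess config.sub
# and restore the bit on anything it reports with: chmod +x <file>
```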

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Hi, need help with searching database.

2000-12-13 Thread Gilles Detillieux


According to Akshay Guleria:
 I just installed Redhat7.0 on my machine. And then installed htdig rpm. I
 can see the page
 http://myhost/htdig/ which is the search page.

Which htdig rpm did you install?  For Red Hat 7.0, you should use the RPM
for htdig-3.1.5-6 that comes with the 7.0 PowerTools.

 I make a search and for any search I make, it returns a page saying
 "No matches found for ... "
 
 Now, I ran rundig and it increased the file sizes in /var/lib/htdig. So, I
 presume the database was created. And then I ran htmerge. But I still get
 the
 "No matches found .." page.

If you run rundig, you don't need to run htmerge separately.  The rundig
script will run htdig followed by htmerge.  You should try running
your /var/www/cgi-bin/htsearch program right from the command line
first, to see if that works.  If it does, it may be an Apache server
configuration problem, or a problem with your search form.  Did you
make any changes to the /var/www/html/htdig/search.html search form?
If so, see http://www.htdig.org/FAQ.html#q5.17

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] htdig missing subdirectories (was: Incremental indexing)

2000-12-13 Thread Gilles Detillieux


Please direct your questions to the list, not to me personally.
See FAQ 1.16.  Also, you're off topic, as this has nothing to do with
last week's "Incremental indexing" thread, so you should pick a more
descriptive subject.

According to crosstar:
 I have copiously pored over the messages
 in the mailing list, as well as references in FAQ.
 I am not very technical, but my situation is that htdig is
 missing a lot of files, words and subdirectories, altogether.
 
 I'm wondering if there is a simpler adjustment in
 htdig.conf to remedy this?  I simply do not understand
 the instructions, as given, unfortunately, and note that
 one reader says that he thinks tinkering with the
 server is not the answer.

Did you follow the recommendations in FAQ 5.25 and 5.27?  That's probably
where you should focus your attention.  Running htdig with the -vvv
option will give you tons of output, but if you trace your way through
there you might be able to see why it's missing parts of your site.

 I tried running htfuzzy but get the error:
 htfuzzy: No algorithms specified 

You need to tell htfuzzy which database to build.  This won't solve your
problem above, though.  It's just for building databases for fuzzy match
algorithms.
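
In practice that means naming one or more algorithms on the htfuzzy command line, for example "endings". The wrapper below is a sketch that refuses to run without an algorithm name; the wrapper itself and the config path in the example are assumptions.

```shell
# Run htfuzzy for the named algorithms, refusing to run with none,
# which is the case that produces "No algorithms specified".
build_fuzzy() {
    conf=$1
    shift
    if [ "$#" -eq 0 ]; then
        echo "name at least one algorithm, e.g. endings" >&2
        return 1
    fi
    htfuzzy -c "$conf" "$@"
}

# Example (not run here):
#   build_fuzzy /path/to/htdig.conf endings
```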

 I have changed one default by upping it to:
 max_head_length:5

That will make htdig keep more of each document for use in excerpts for
matched pages, but it won't get you more matches.  However, upping the
max_doc_size may get htdig to index more stuff if it was missing links from
really large pages.  See FAQ 5.1.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] locale:ru on Solaris

2000-12-12 Thread Gilles Detillieux


According to Eldar Imangulov:
 I'm using Solaris 7
 
 I made the htDig and now I try to make search my site in russian
 (windows-1251).
 
 in htdig.conf I said the
 locale : ru
 
 The website indexing is going well but the htsearch does not work
 (coredump).
 
 But without russian language (indexing by default = without locale:ru)
 indexing & htsearch works well together.
 
 What is the problem???

Hard to say, but from what you describe it sounds like a problem with the
locale tables for your locale, or a database corruption problem of some
sort, perhaps.  Could you give us a stack backtrace of htsearch's core
dump to narrow things down a bit?

See the latter part of http://www.htdig.org/FAQ.html#q5.14

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] htdig dumps core on Linus

2000-12-08 Thread Gilles Detillieux


According to B.G. Mahesh:
 Linux: 2.2.14-5.0smp (Redhat 6.2)
 HTDIG: 3.1.5
 Apache: 1.3.14
 
 When I search for the word "rajkumar" on
 the news finder window on
  http://news.indiainfo.com/2000/12/08/india-index.html
 it gives me an error. When I check the cgi-bin dir I see a core file.
 
 % file core
 core: ELF 32-bit LSB core file of 'htsearch' (signal 11), Intel 80386,
 version 1
 
 Why does this happen?

It's hard to say without getting a stack backtrace from the core dump.
First of all, did you install from an RPM or compile from sources?
If you installed the wrong RPM, that could potentially lead to such
problems.  Mostly, though, this is a symptom of database corruption,
which can happen if for example you have two htdig processes updating
the database simultaneously.  Did you try rebuilding the database from
scratch, e.g. using "rundig", to see if that makes the problem go away?
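
To guard against two indexing runs updating the database at the same time, the rebuild can be wrapped in a simple lock. This is a sketch of one common approach (a lock directory created with mkdir), not something htdig provides; the lock path in the example is an assumption.

```shell
# Run a command only if no other run holds the lock directory.
# mkdir either creates the directory or fails, so only one caller wins.
run_locked() {
    lockdir=$1
    shift
    if ! mkdir "$lockdir" 2>/dev/null; then
        echo "another run holds $lockdir; skipping" >&2
        return 1
    fi
    "$@" && status=0 || status=$?
    rmdir "$lockdir"
    return "$status"
}

# Example, e.g. from cron (not run here):
#   run_locked /tmp/rundig.lock rundig
```

If a run is killed before it reaches rmdir, the lock directory stays behind and has to be removed by hand.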

See the last paragraph of FAQ 5.14:

http://www.htdig.org/FAQ.html#q5.14

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] PDF problem

2000-12-08 Thread Gilles Detillieux


According to [EMAIL PROTECTED]:
 I am using htdig 3.1.5 on Linux. I get these errors when I try to index
 the files
 
 How can I fix the problem
 
 [ii@iinj-lxs015 bin]$ 
 /disk2/v/apache/htdocs/VIRTUAL/ii/search/HTDIG//db/htdig11551.pdf: Unterminated string.
 PDF::parse: cannot open acroread output from http://www.indiainfo.com/awards/ET-ArmyInKashmir.pdf
 /disk2/v/apache/htdocs/VIRTUAL/ii/search/HTDIG//db/htdig11551.pdf: Could not repair file.
 PDF::parse: cannot open acroread output from http://travel.indiainfo.com/utilities/passport/passport_app.pdf
 /disk2/v/apache/htdocs/VIRTUAL/ii/search/HTDIG//db/htdig11551.pdf: Could not repair file.
 PDF::parse: cannot open acroread output from http://travel.indiainfo.com/utilities/passport/lostpp.pdf

The "Could not repair file" error message is usually a sign that the PDF
files are being truncated because of a setting of max_doc_size that's too
low.  See http://www.htdig.org/FAQ.html#q5.2
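
A quick check of whether a given PDF is larger than the configured limit might look like the sketch below. It assumes a local copy of the file and takes the max_doc_size value (in bytes) as its second argument; the function name is made up.

```shell
# Succeed if the file is larger than the given max_doc_size in bytes,
# i.e. if it would be cut short when retrieved.
exceeds_max_doc_size() {
    size=$(($(wc -c < "$1")))
    [ "$size" -gt "$2" ]
}

# Example (not run here):
#   exceeds_max_doc_size passport_app.pdf 100000 && echo "raise max_doc_size"
```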

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Daft Question - How to Apply patch under Solaris - Bit off

2000-12-08 Thread Gilles Detillieux


According to Duncan Brannen:
   I'm trying to apply the aarmstrong URL rewrite patch to htdig-3.1.5
 I assumed I use
 
  patch -i htdig.diff
 
 under Solaris (8)
 
 however, I assumed it would pick up the file names to be patched since
 they're in there but nope - I have to specify the names then I get
 
   patch -i htdig.diff
Looks like a new-style context diff.
 File to patch: htdig/Retriever.cc
 Malformed patch at line 16:
 patch: Line must begin with '+ ', '  ', or '! '.
 
 (This is where the next diff line starts)
 
 If I chop the file up into separate diffs and apply them individually it 
 all works fine
 
 The man page for patch really sounds like it should read the file and work 
 it out
 for itself.  Am I missing something?

Yes, whenever you want to patch files in subdirectories, or use patch files
with pathnames in the filenames, you need to use the -p option to tell the
patch command how the pathnames are supposed to line up on your filesystem.
In this case, you should go into the main htdig-3.1.5 source directory and
use "patch -p1 < htdig.diff".  The -p1 tells patch to strip off the first
pathname component from file names in the patch file.  See "man patch".
I'm not sure what the -i option is for.  The GNU version of this command
doesn't seem to have a -i.
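
As a rough illustration of what stripping pathname components means, here
is a small Python sketch.  It is not how patch itself is implemented, and
the file name "htdig-3.1.5/htdig/Retriever.cc" is only an assumed example
of what a name inside the patch file might look like.  The sketch also
ignores details such as repeated slashes.

```python
def strip_components(path, count):
    """Drop the first `count` slash-separated components from `path`.

    Mimics the idea behind "patch -pN": -p0 leaves the name alone,
    -p1 removes everything up to and including the first slash.
    """
    if count < 0:
        raise ValueError("count must not be negative")
    if count == 0:
        return path
    parts = path.split("/", count)
    if len(parts) <= count:
        # This sketch simply refuses names with too few components.
        raise ValueError(f"{path!r} has fewer than {count} leading components")
    return parts[-1]


# Assumed example name, as it might appear in the patch file.
print(strip_components("htdig-3.1.5/htdig/Retriever.cc", 1))
```

With one component stripped, the remaining name is relative to the main
source directory, which is why the command is run from there.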

The error message you got at line 16 is a bit worrisome, as this does seem
to be a properly formed patch, so I don't know why it's expecting a bigger
hunk of diff code than it's getting.  You may need to switch to the GNU
version, or apply the patch by hand (it consists of fairly simple additions).

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] core dump--help

2000-12-08 Thread Gilles Detillieux


According to Shakaib Sayyid:
 I am getting a core dump on htsearch 3.1.5 using linux 6.2--2.2.14-5.0kernel.
 following is the output from "gdb htsearch core":
 
 Core was generated by `htsearch'.
 Program terminated with signal 11, Segmentation fault.
 Reading symbols from /usr/lib/libz.so.1...done.
 Reading symbols from /usr/lib/libstdc++-libc6.1-1.so.2...done.
 Reading symbols from /lib/libm.so.6...done.
 Reading symbols from /lib/libc.so.6...done.
 Reading symbols from /lib/ld-linux.so.2...done.
 #0  0x807f05a in __bam_cmp () at HtCodec.cc:20
 20  // End of HtCodec.cc
 (gdb) bt
 #0  0x807f05a in __bam_cmp () at HtCodec.cc:20
 #1  0x8085c27 in __bam_search () at HtCodec.cc:20
 #2  0x8081255 in __bam_c_search () at HtCodec.cc:20
 #3  0x807fa6f in __bam_c_get () at HtCodec.cc:20
 #4  0x8067217 in __db_get () at HtCodec.cc:20
 #5  0x805ab66 in DB2_db::Get (this=0x80d85f0, key=@0xbfffe250, data=@0xbfffe2b0)
 at DB2_db.cc:334
 #6  0x8059ddb in Database::Get (this=0x80d85f0, key=0x80ec2a0 "peter", 
 data=@0xbfffe2b0) at Database.cc:77

Looks like a database corruption problem.  Try rebuilding the database
from scratch, e.g. using "rundig", and see if that gets rid of the
problem.  If the problem recurs after this, it could indicate something
else is going wrong.  Note too that there is no lockout on the database,
so if you accidentally start two processes (htdig and/or htmerge) that
try to update the database simultaneously, that can really mess up
the database.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] SQL handling start_url

2000-12-07 Thread Gilles Detillieux


According to Curtis Ireland:
 Is there any way to have start_url get its list from an SQL back-end?
 Has anyone already built a patch to handle this?
 
 Here are a couple of solutions I can think of to bypass the problem,
 but I'm sure I'm not alone in desiring this feature.
 
 1) Build a PHP page with links to all the sites we want to index.
 Have htDig use this as its start_url
 2) Before htDig starts its database build, dump all the links to a text
 file and have the htdig.conf include this file
 
 The one problem with these two solutions is how would the limit_urls_to
 variable work? I want to make sure the links are properly indexed
 without going past the linked site.

Either solution seems workable - it all depends on what your preference
is.  For the first solution, you'd need to have a limit_urls_to setting
that's liberal enough to allow through all the links that the PHP script
will spit out.  You should probably set your max_hop_count to 1 to avoid
having htdig go beyond the first hop, from the PHP output to the documents
it references.

For the second solution, you could probably just leave limit_urls_to as
the default, which is the same as the value of start_url, and set your
max_hop_count to 0.
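
A minimal sketch of the second solution, in Python, might look like the
following.  It assumes a hypothetical table named "sites" with a "url"
column, uses SQLite only as a stand-in for whatever SQL back-end you have,
and assumes the resulting file is pulled into htdig.conf with an include
line.  The long start_url value is split over several lines with trailing
backslashes.

```python
import sqlite3


def fetch_urls(conn):
    """Read the site URLs from a hypothetical `sites(url)` table."""
    rows = conn.execute("SELECT url FROM sites ORDER BY url").fetchall()
    return [row[0] for row in rows]


def build_fragment(urls):
    """Format the URLs as a start_url attribute, one URL per line.

    A trailing backslash continues the attribute value on the next line.
    max_hop_count is set to 0 so that only the listed URLs are fetched.
    """
    if not urls:
        raise ValueError("no URLs to index")
    joined = " \\\n\t".join(urls)
    return f"start_url:\t{joined}\nmax_hop_count:\t0\n"


def write_fragment(conn, path):
    """Dump the fragment to `path`, to be included from htdig.conf."""
    with open(path, "w", encoding="ascii") as out:
        out.write(build_fragment(fetch_urls(conn)))
```

You would run something like write_fragment() just before each dig, so the
included file always reflects the current contents of the table.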

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Pb indexing HTML with htdig 3.1.5

2000-12-07 Thread Gilles Detillieux


According to André LAGADEC:
 I use htdig 3.1.5 on Red Hat Linux 5.0, and I want to index a new web
 site. But when I run rundig I get only one document.
 
 To see what it is doing, I ran rundig -vvv and got this output:
 Header line: HTTP/1.1 200 OK
 Header line: Server: Netscape-Enterprise/3.5.1C
 Header line: Date: Wed, 06 Dec 2000 07:32:02 GMT
 Header line: Content-type: text/html
 Header line: Last-modified: Mon, 15 Nov 1999 10:45:01 GMT
 Translated Mon, 15 Nov 1999 10:45:01 GMT to 1999-11-15 10:45:01 (99)
 And converted to Mon, 15 Nov 1999 10:45:01
 Header line: Content-length: 1258
 Header line: Accept-ranges: bytes
 Header line: Connection: close
 Header line: 
 returnStatus = 0
 Read 1258 from document
 Read a total of 1258 bytes
 Tag: html, matched -1
 head:  
  size = 1258
 pick: x.y.z.t, # servers = 1
 htdig: Run complete
 htdig: 1 server seen:
 htdig: x.y.z.t:8000 1 document

You should be getting much more output than that with a verbosity level of
7!  Is it possible that there is a NUL byte in the document, soon after the
"html" tag?  For some reason, htdig seems to be stopping right after this
tag, and not getting anywhere close to the other tags in the document.  I've
tried it myself on the document you sent, and on that copy it worked fine.
The comment around the JavaScript code is correct, and htdig was able to
handle it.  There must be something different in your copy of the document,
such as a NUL byte, which is causing htdig's parser to end prematurely.

 I think that htdig doesn't like the HTML code "<!--//" and "//-->":
 it sees the beginning of the comment but not the end, and ignores the rest
 of the HTML code of the page.
 
 Am I right? Any other ideas? What can I do?
 
 N.B. : The HTML code of the first page on the site is under this line.
 _
 <html>
 
 <head>
 <title>Accueil DIRECTION</title>
 <base target="rtop">
 <script language="JavaScript">
 <!--//
 var url="";
 var nom="";
 var bName="";
 
 function Ouvrir()
 {
 bName = navigator.appName
 Version = navigator.appVersion
 Version = Version.substring(0,1)
 browserOK = ((Version >= 2))
 
 if (browserOK) 
 {
 this.name="home";
 
 msgWindow=window.open("actu/default2.htm","popupdpd","location=no,toolbar=no,status=no,directories=no,scrollbars=yes,width=400,height=450");
 bName=navigator.appName;
 if (bName=="Netscape") msgWindow.focus();
 
 }
 }
 Ouvrir()
 
 //-->
 </script>
 </head>
 
 <frameset framespacing="0" border="false" frameborder="0" cols="155,*">
   <frame name="gauche" scrolling="no" noresize target="haut_droite"
 src="defaulta.htm"
   marginwidth="0" marginheight="5">
   <frameset rows="*,45">
 <frame name="texte" target="bas_droite" src="defaultb.htm"
 scrolling="auto"
 marginwidth="0" marginheight="0" noresize>
 <frame name="bas" src="basac.htm" scrolling="no" marginwidth="7"
 marginheight="15"
 noresize>
   </frameset>
   <noframes>
   <body>
   <p>Cette page utilise des cadres, mais votre navigateur ne les prend
 pas en charge.</p>
   </body>
   </noframes>
 </frameset>
 </html>


-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Incremental indexing

2000-12-07 Thread Gilles Detillieux


According to Wanrong Qiu:
 Does htdig support incremental indexing? I mean it is possible to only
 index new created or modified files. Thanks in advance.

Yes, this is what htdig does by default if there is an existing database,
and the htdig program is called without the -i (initialize) option.
However, the rundig script that comes with the package calls htdig with
the initialize option, as its main purpose is to create all the initial
databases, so don't use the standard rundig script for update runs.
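
To make the difference concrete, here is a small Python sketch that builds
the command lines for an initial run and for an update run.  It is only an
outline under some assumptions: the bin directory and config file path are
hypothetical, and it leaves out the other steps the standard rundig script
performs.  The only point is that the update run omits -i.

```python
def indexing_commands(config_path, initial=False, bin_dir="/opt/www/bin"):
    """Return the argument lists for one indexing run.

    With initial=True, htdig gets -i and rebuilds the database from
    scratch.  Without it, htdig updates the existing database.
    `bin_dir` is an assumed install location.
    """
    htdig = [f"{bin_dir}/htdig", "-c", config_path]
    if initial:
        htdig.append("-i")
    htmerge = [f"{bin_dir}/htmerge", "-c", config_path]
    return [htdig, htmerge]


# Each list could be passed to subprocess.run(cmd, check=True) in turn.
for cmd in indexing_commands("/opt/www/conf/htdig.conf"):
    print(" ".join(cmd))
```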

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] htdig fails to parse all files

2000-12-07 Thread Gilles Detillieux


According to Jeffery T Aiken:
 I've compiled htdig 3.1.5 on a Solaris 2.6 system.  I have 5 directories on my
 web server containing a total of 54190 html docs and when I run htdig it only
 finds just over 18,000.  I've used the -vvv -s options and see no errors during
 the dig.  I am able to successfully htmerge these into the database and search,
 but can't figure out why htdig doesn't see them all.
 
 Anybody have an idea where I can go from here?

Have you looked at FAQ 5.25 & 5.1?

 FAQ:            http://www.htdig.org/FAQ.html

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Htdig with geramn umlaut under slackware

2000-12-06 Thread Gilles Detillieux


According to Jun Dong ([EMAIL PROTECTED]):
 Thanks for your tips.
 In the Slackware 7.0 packages there are no LC_CTYPE, LC_* etc. files under
 /usr/lib/locale/de or deutsch.
 Under /usr/lib/locale/de there is only the directory LC_MESSAGES.
 I copied the directory de_DE, which includes all the LC_* files, from SUSE 6.2
 to Slackware's /usr/lib/locale and made a symbolic link de -> de_DE.
 With your testlocale.cc code, after compiling it, I ran the command
 "testlocale de"
 and the screen printed the German accented characters with umlauts correctly.
 But unfortunately htdig still does not work with German accents, however
 carefully I
 configure htdig.conf. This really is a system problem with Slackware.

The problem with copying from a different system is that the C library
may be different, and therefore may require a different set of file
formats for locale support.  This was the case in the transition from
libc5 to glibc.  However, if testlocale.c did recognize the German
umlauts as alphanumeric, then it would suggest that things are mostly
working correctly.  I don't know why, but there are a few systems where
this test program works, but htdig's locale support doesn't.  I don't
know what else to point the finger at besides the C library, though.

 As another approach, I found the tips at:
 ftp://sol.ccsf.cc.ca.edu/htdig/paches/3.1.5/accents.zip.README
 I modified HTML.cc and htsearch.cc and recompiled htdig, this time with no
 locale definition.
 Finally htdig with German accents is successfully installed.
 You can find the URL where I installed htdig:
 http://www.homepagemagazin.de/htdig/

The problem with the accents.zip patch is that it ends up stripping
off all accents by converting all accented letters in the ISO-8859-1
character set to their unaccented counterparts.  So, the excerpts won't
contain the accents.  While this isn't as nice as the accents.5 patch,
which adds accent support as a new fuzzy match method, the patch you
used is at least better than nothing for a system that doesn't properly
support locales.
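
To show the effect of that kind of conversion, here is a short Python
sketch.  It is not the code from accents.zip (that patch changes the C++
sources), and it uses Unicode decomposition instead of an ISO-8859-1 lookup
table, so treat it only as an illustration of the trade-off: once text has
gone through a step like this, the accents are gone from whatever is stored
or displayed afterwards.

```python
import unicodedata


def strip_accents(text):
    """Map accented letters to their unaccented base letters.

    The text is decomposed so that each accent becomes a separate
    combining character, and those combining characters are dropped.
    Letters with no decomposition (such as the German sharp s) are
    left unchanged.
    """
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


print(strip_accents("Käse für Müller"))
```

A search for "Muller" and one for "Müller" then match the same words, but
an excerpt built from the stripped text can only show "Muller".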

 Gilles Detillieux wrote:
 
  I believe there are still problems with locale support on Slackware Linux
  systems.  See the thread entitled "Portuguese" from this past May:
 
  http://www.htdig.org/mail/2000/05/index.html#61
 
  I never did get a followup message from Rodrigo indicating whether he had
  found a solution, but you may want to try the tips I gave him.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] 3.1.5 Compile problems on Linux

2000-12-06 Thread Gilles Detillieux


According to Foerst, Daniel P.:
 I am using RedHat 6.2 with GCC 2.95.2 with GNU ld 2.9.5, and I have
 libstdc++ 2.9.0-30 installed (latest version). This is htdig-3.1.5 
 
 I am not able to figure out what is going wrong.. any assistance you can
 lend is greatly appreciated!
...
 I run the configure and have the following...
...
 prefix= /home2/htdig
 
 # This specifies the root of the directory tree to be used for programs
 # installed by ht://Dig
 exec_prefix=${prefix}

I'm not positive about this, but I think in makefiles like this one, you
need to use the syntax $(prefix), and not ${prefix} (i.e. use parentheses
instead of braces).

...
 When I run make, everything works well, but then this slew of errors
 takes place.
 
 Entering directory `/sys2/installs/htdig-3.1.5/htfuzzy'
 gcc -o htfuzzy -L../htlib -L../htcommon -L../db/dist -L/usr/lib
 Endings.o EndingsDB.o Exact.o Fuzzy.o Metaphone.o Soundex.o
 SuffixEntry.o Synonym.o htfuzzy.o Substring.o Prefix.o
 ../htcommon/libcommon.a ../htlib/libht.a ../db/dist/libdb.a 
 EndingsDB.o: In function `Endings::createDB(Configuration &)':
 /sys2/installs/htdig-3.1.5/htfuzzy/EndingsDB.cc:46: undefined reference
 to `cout'
 /sys2/installs/htdig-3.1.5/htfuzzy/EndingsDB.cc:46: undefined reference
 to `ostream::operator<<(char const *)'
 /sys2/installs/htdig-3.1.5/htfuzzy/EndingsDB.cc:52: undefined reference
 to `cout'
 /sys2/installs/htdig-3.1.5/htfuzzy/EndingsDB.cc:52: undefined reference
 to `ostream::operator<<(char const *)'

All of these should be in the libstdc++ library.  However, the makefile
is trying to link these with gcc rather than g++ or c++, which is probably
a big part of the problem.  I suspect something went wrong during the
run of ./configure, most likely because your C++ compiler and libraries
aren't installed where the configure program expected to find them.

...
 /sys2/gcc/lib/gcc-lib/i686-pc-linux-gnu/2.95.2/../../../../include/g++-3
 /iostream.h:106: undefined reference to `endl(ostream &)'
 /sys2/gcc/lib/gcc-lib/i686-pc-linux-gnu/2.95.2/../../../../include/g++-3
 /iostream.h:106: undefined reference to `cout'
 /sys2/gcc/lib/gcc-lib/i686-pc-linux-gnu/2.95.2/../../../../include/g++-3
 /iostream.h:106: undefined reference to `endl(ostream &)'

These error messages suggest that the C++ header files are not in the
standard location.  The compiler found them OK, but things are messing
up at the linking stage.  Is there a reason why you didn't just use
the egcs-c++ and libstdc++ RPM packages that came with Red Hat 6.2?
Those work fine with ht://Dig.  I suspect that your setup as it is now
wouldn't work well with any software that needs C++.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Named characters in search output

2000-12-05 Thread Gilles Detillieux


According to Tamas Nagy:
 Hello,
 
 When using the "&rarr;" (right arrow) named character in the first part of
 HTML documents, htdig seems to generate "&amp;rarr; rom&aacute;" in the
 preview of documents. It is a bit strange, maybe a bug, because this string
 should generate a right arrow...
 
 Cheers,
 
 Tamas
 
 PS:
 Config: HtDig 3.0.2b2, RedHat 7

I assume you mean 3.2.0b2.  This is a known problem, which is fixed in
the 3.2.0b3 development snapshots.  See http://www.htdig.org/FAQ.html#q5.22

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] How Can I use htdig to index two or more websites?

2000-12-05 Thread Gilles Detillieux


According to Sean Harris:
 How Can I use htdig to index two or more websites?
 Thank you for your help!:-)

Just add all the URLs you want to the start_url attribute, and possibly
adjust limit_urls_to if you want something less limiting than what you've
put in start_url.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Help me to Search using Chinese!!!!!!!

2000-12-05 Thread Gilles Detillieux


According to Sean Harris:
 Help me to Search using Chinese!!!

I'm afraid the answer hasn't really changed from 2-1/2 weeks ago.
ht://Dig only supports 8-bit character sets.
See http://www.htdig.org/FAQ.html#q4.10

This topic has been discussed many times on the list, and there are still
no volunteers to take on the huge amount of work it would require to
adapt ht://Dig for full Unicode support, and to add in the word splitting
algorithms needed for many Asian languages.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Do me a favor

2000-12-05 Thread Gilles Detillieux


According to ellenliu:
 I have downloaded 'htdig-3.2ob2.tar' from your site,
 but I have trouble running it on my personal computer.
 My computer has Red Hat 6.2 installed, with kernel 2.214.
 However, when I run './configure', at line 993 it calls 'config.sub', then
 the program exits with the message 'can't run config.sub'.
 Would you do me a favor and tell me why this happened and, most importantly,
 how I can run it successfully?
 Moreover, when should the embedded database be compiled, and how is it
 compiled?
 HARDWARE CONFIGURATION:
   CPU: Pentium processor 550
   Hard disk: 20G
   Memory: 64M

It would probably be helpful to see the full output from the ./configure
program.  This package has been successfully installed before on Red Hat
systems (6.1, 6.2 and others), so I would think that the most likely
problem is a missing component on your system.  You may also want to try
the most recent development snapshot of 3.2.0b3, instead of 3.2.0b2, as
many known bugs in 3.2.0b2 have since been fixed.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] restrict values and htdig.conf

2000-12-05 Thread Gilles Detillieux


According to [EMAIL PROTECTED]:
 We have 3 htdig-searches in our website.
 there are 3 different databases that are indexed:
 ../htdig/db/database1 indexed with ../conf/htdig1.conf
 ../htdig/db/database2 indexed with ../conf/htdig2.conf
 ../htdig/db/database3 indexed with ../conf/htdig3.conf
 
 the database3 includes all sites while the other 2 databases contains only parts
 of the whole.
 
 Now i want to expand the html-form with a select-option as follows:
 
 <select name="restrict">
 <option value=""  selected>.. Database1</OPTION>
 <option value="http://www.../">on Database2
 <option value="http://www.../">on Database3
 </OPTION>
 </SELECT>
 
 o.k.!
 
 but how can i use this restrict-value in my htdig.conf?
 According to the selection in the html-form i must call the right htsearch with
 the right database!

You seem to be confusing two alternate methods of restricting search
results.  You use the restrict parameter on htsearch only when searching
a database that contains everything, in order to restrict the results
to a subset of that database, i.e. only the URLs that match a particular
pattern.

If you want the user to select separate databases, then you should leave
the restrict input parameter as an empty string, and have the user select
the value of the "config" input parameter, which should be one of htdig1,
htdig2 or htdig3, i.e. the three configuration files you mentioned above
with the directory and .conf file name extension stripped off.
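
As a sketch of the second method, the following Python function builds the
kind of query string such a form would submit.  The script location
/cgi-bin/htsearch and the search words are assumptions for illustration.
The config names are the three from your message, and restrict is left
empty as described above.

```python
from urllib.parse import urlencode

# The three config names mentioned above, without directory or ".conf".
ALLOWED_CONFIGS = ("htdig1", "htdig2", "htdig3")


def search_url(config, words, script="/cgi-bin/htsearch"):
    """Build a search URL that selects a database via "config"."""
    if config not in ALLOWED_CONFIGS:
        raise ValueError(f"unknown config: {config!r}")
    query = urlencode({"config": config, "restrict": "", "words": words})
    return f"{script}?{query}"


print(search_url("htdig2", "spinal cord"))
```

In the HTML form itself, this corresponds to naming the select element
"config" and giving its options the values htdig1, htdig2 and htdig3.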

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Htdig with geramn umlaut under slackware

2000-11-29 Thread Gilles Detillieux


According to jdong:
 I have installed htdig under Slackware 7.0 and configured it for
 German. In htdig.conf, the lines
 locale: de_DE
 lang_dir: ${common_dir}/german
 bad_word_list: ${lang_dir}/bad_words
 endings_affix_file: ${lang_dir}/german.aff
 endings_dictionary: ${lang_dir}/german.0
 endings_root2word_db: ${lang_dir}/root2word.db
 endings_word2root_db: ${lang_dir}/word2root.db
 were added, and the files bad_words, german.aff and german.0 were copied into
 that directory.  Everything is going OK.  htdig can find every word
 except those with German umlauts such as ä ...
 
 My Slackware Linux was installed as the German version; when I type
 locale -a on the command line:
 locale -a
 ..
 de
 deutsch
 de_DE
 
 Whatever I set LANG and LC_CTYPE to (de_DE, de or deutsch), htdig
 never finds words with German umlauts.  But the same htdig, installed
 under SUSE Linux 6.2 and configured the same way, has no problem with
 German umlauts.
 
 How can I configure the Slackware locale to resolve this
 problem?

I believe there are still problems with locale support on Slackware Linux
systems.  See the thread entitled "Portuguese" from this past May:

http://www.htdig.org/mail/2000/05/index.html#61

I never did get a followup message from Rodrigo indicating whether he had
found a solution, but you may want to try the tips I gave him.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] 3.20b2 -- oddity

2000-11-29 Thread Gilles Detillieux


According to [EMAIL PROTECTED]:
 I tried to do htsearch, using the following .conf file:
 
 site_id:10009
 include:/www/vhosts/a/autosearchusa.com/htdig3.2b2/conf/cv_0.conf
 database_dir: /www/vhosts/a/autosearchusa.com/htdocs/www/u-wrk/sngl/data
 database_base:  ${database_dir}/dt_${site_id}
 
 Point of interest is that, within the included file, values of 
 database_dir/base were
 database_dir:   /www/vhosts/a/autosearchusa.com/htdocs/www/u-dvl/sngl/data
 # this way for htdig
 database_base:  ${database_dir}/dt_${site_id}
 
 Wanted data was in the " . . . u-wrk . . " node.  
 
 Initial search found wanted data.  Second search (for 11th, etc, result), 
 however, tried to obtain data from the "u-dvl" (and failed due to not there). 

 
 Changed "u-dvl" to u-wrk, in the included file, and all worked as intended 
 (did, btw, verify that SAME config file was being used at all points).
 
 Seems as if override of database_base should either happen, or not happen, 
 consistently.  

There were two bugs in 3.2.0b2, which are fixed in the 3.2.0b3 development
snapshots, which would have worked together to cause the behaviour you
observed.  The first was that on followup pages for a given search, the
"config=" parameter got doubled up, causing the configuration to be read
twice.  The second was that the stack that tracks includes got messed up
after the first main config file was read, so when extra config parameters
were handled, includes in these additional config files weren't handled
properly causing the parser to stop reading right after the include.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Decoding -v output.

2000-11-28 Thread Gilles Detillieux


According to Eric Bliss:
 Is there any place where I can find a listing of what each field of the 
 -v output of htdig is for and what the various values (including the 
 part where it gives the - + and *) mean?

Here's one for the FAQ...

When htdig -v spits out a line like this:

23000:35506:2:http://xxx.yyy.zz/index.html: ***-+--++***+ size = 4056

The first number is the number of documents parsed so far, the second
is the DocID for this document, and the third is the hop count of the
document (number of hops from one of the start_url documents).  After the
URL, it shows a "*" for a link in the document that it already visited (or
at least queued for retrieval), a "+" for a new link it just queued, and a
"-" for a link it rejected for any of a number of reasons.  To find out
what those reasons are, you need to run htdig with at least 3 "v" options.
If there are no "*", "+" or "-" symbols after the URL, it doesn't mean
the document was not parsed or was empty, but only that no links to other
documents were found within it.
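
If you want to pick these lines apart in a script, a rough Python sketch
along these lines should do.  The regular expression is my own guess based
on the sample line above, not something taken from the htdig sources, and
the field names are made up for this example.

```python
import re

# Pattern modelled on the sample "htdig -v" line shown above.
LINE_RE = re.compile(
    r"^(?P<parsed>\d+):(?P<doc_id>\d+):(?P<hops>\d+):"
    r"(?P<url>\S+): (?P<links>[*+-]*)\s*size = (?P<size>\d+)$"
)


def parse_line(line):
    """Split one "htdig -v" progress line into its fields.

    Returns None for lines that do not match the expected layout.
    """
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    links = match.group("links")
    return {
        "parsed": int(match.group("parsed")),   # documents parsed so far
        "doc_id": int(match.group("doc_id")),   # DocID of this document
        "hops": int(match.group("hops")),       # hop count from start_url
        "url": match.group("url"),
        "already_seen": links.count("*"),       # links visited or queued before
        "queued": links.count("+"),             # new links just queued
        "rejected": links.count("-"),           # links rejected
        "size": int(match.group("size")),
    }


sample = "23000:35506:2:http://xxx.yyy.zz/index.html: ***-+--++***+ size = 4056"
print(parse_line(sample))
```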

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Decoding -v output.

2000-11-28 Thread Gilles Detillieux


According to Eric Bliss:
 Many thanks for answering this question.  You're right, it should have been
 in the FAQ.

I just committed to CVS the answers to FAQ 5.26 (htdig -v output) and
5.27 (reasons for rejection).  They should be up on the web site within
an hour.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] powepoint and excel to html/text filter

2000-11-27 Thread Gilles Detillieux


According to Cheng-Wei Cheng:
 Re: powepoint and excel to html/text filter
 can anyone give me some pointers
 
 thanks.. 
 cheng

Have a look at the latest version of doc2html on the htdig.org web site:

http://www.htdig.org/files/contrib/parsers/README.doc2html
http://www.htdig.org/files/contrib/parsers/doc2html.tar.gz

You will need to obtain the actual conversion filters that doc2html uses,
but its documentation will tell you where you can find them.

According to David J Adams:
 Version 2.1 uses both the magic number and the MIME type to decide 
 which conversion utility to use, and is able to cope with: 
 
 MS Word (most versions including Word2 and Word for MAC) 
 MS Excel 
 MS Powerpoint 
 Wordperfect (purchase of wp2html necessary) 
 Adobe PDF 
 Postscript 
 RTF 
 
 There are a number of minor improvements, including a useful improvement 
 in the conversion of PDF files.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Same problem with ~s

2000-11-27 Thread Gilles Detillieux


According to Ing. Noel Vargas Baltodano:
 I just don't know how to make htdig check the /httpd subdirs and the
 /~username URLs.
 
 If anyone is kind enough to explain it to me AS CLEAR as possible, or
 tell me where I can get the right answer to this problem, I'd be very
 grateful.

And I'd be grateful if someone would help me to make FAQ 5.25 as clear
as possible, as I fear the current wording may be missing the mark.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] problems building htdig on cygwin

2000-11-27 Thread Gilles Detillieux


According to Geoff Hutchison:
 At 10:03 AM -0800 11/26/00, Joe R. Jah wrote:
 There is a chance that stubs.h would also require other header file(s),
 and those files require yet other files ... ad infinitum;( ;)))
 
 In short, don't hold your breath.
 
 And I have said already if someone can find me a strptime replacement 
 for the systems that don't have it (e.g. BSDI and cygwin evidently), 
 I'll use it instead.
 
 Until then, you are correct, I don't know of a way of resolving it. 
 (And as you say, including other header files might continue ad 
 infinitum, which seems silly to me.)

Remind me again, what was the problem with the strptime replacement
in 3.1.5?  I know there was a y2k bug, which I fixed almost 2 years ago,
but was there anything else?  If it was because it left other fields
uninitialised, then I think we solved that too, didn't we?

All this nonsense about finding a langinfo.h for strptime, and then
finding an nl_types.h for langinfo.h, ad infinitum, is beyond silly.
Here's why: no one has stopped to question why these headers are needed
in the first place.  These are all for NLS support, which is precisely
what we DO NOT want in htdig!

The locale support in htdig deliberately sets LC_TIME handling back to the
"C" locale specifically to avoid having the time in Last-Modified headers
and other headers parsed under the rules of other locales.  We don't want
this!  So why are we fighting to crowbar an NLS-ready strptime into the
distribution when we had one that worked without all the extra baggage?
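
To illustrate why no NLS headers are needed for this job, here is a minimal C++ sketch. It assumes the only format of interest is the RFC 1123 form commonly seen in Last-Modified headers, e.g. "Sun, 06 Nov 1994 08:49:37 GMT". The function name parse_http_date is hypothetical, and this is not the strptime replacement shipped with htdig 3.1.5; it only shows that fixed English month names can be matched against a built-in table with no langinfo.h, nl_types.h or LC_TIME lookup.

```cpp
#include <cassert>
#include <cstdio>
#include <cstring>
#include <ctime>

// Hypothetical helper (not htdig code): parse an RFC 1123 date such as
// "Sun, 06 Nov 1994 08:49:37 GMT" into a struct tm. Month names come from
// a fixed English table, so the process locale is never consulted.
bool parse_http_date(const char *text, std::tm &out)
{
    static const char *const months[12] = {
        "Jan", "Feb", "Mar", "Apr", "May", "Jun",
        "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"
    };
    assert(text != nullptr);

    char wday[4] = "";   // weekday name is read but not validated
    char mon[4] = "";
    int day = 0, year = 0, hour = 0, min = 0, sec = 0;

    // %3s reads at most three non-space characters, leaving room for '\0'.
    if (std::sscanf(text, "%3s, %d %3s %d %d:%d:%d",
                    wday, &day, mon, &year, &hour, &min, &sec) != 7)
        return false;

    int month = -1;
    for (int i = 0; i < 12; ++i)
        if (std::strcmp(mon, months[i]) == 0)
            month = i;

    if (month < 0 || day < 1 || day > 31 || year < 1900 ||
        hour < 0 || hour > 23 || min < 0 || min > 59 || sec < 0 || sec > 60)
        return false;

    out = std::tm();            // zero every field, leaving none uninitialised
    out.tm_mday = day;
    out.tm_mon = month;         // 0-based, as struct tm expects
    out.tm_year = year - 1900;  // years since 1900
    out.tm_hour = hour;
    out.tm_min = min;
    out.tm_sec = sec;
    return true;
}
```

A real replacement would also have to accept the other date formats that servers send; this sketch rejects anything that is not in the single form shown.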

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



[htdig] Re: extra_word_characters (PR#952)

2000-11-24 Thread Gilles Detillieux


According to Tomas Frydrych ([EMAIL PROTECTED]):
 Version: 3.1.5
 
 I need to add '+' to the list of valid word characters; after doing so htdig
 will index all words that contain '+' inside, but refuses to index words that
 start with '+' (and I suspect also words that end with it).

OK, I was able to reproduce the problem after all.  I had limited my tests
before to htdig only, but the problem was in htmerge.  It gives special
meaning to lines in the db.wordlist file that begin with "+", "-", and
"!", to mark document IDs that are unchanged, discarded or superseded.
Trouble is htmerge reads the wordlist assuming a valid word would never
begin with one of these, so its test for these is too liberal.  Here's
a patch to correct the problem, so that you can add any of these three
special characters to extra_word_characters and allow words that begin
with one of them.  Apply it in the htdig-3.1.5 main source directory using
"patch -p0 < this-message-file".

--- htmerge/words.cc.wordbug	Thu Feb 24 20:29:11 2000
+++ htmerge/words.cc	Fri Nov 24 09:54:27 2000
@@ -74,37 +74,40 @@ mergeWords(char *wordtmp, char *wordfile
 //
 while (fgets(buffer, sizeof(buffer), sorted))
 {
-   if (*buffer == '+')
+   //
+   // Split the line up into the word, count, location, and
+   // document id.
+   //
+   word = good_strtok(buffer, '\t');
+   pair = good_strtok(NULL, '\t');
+   if (!word || !*word || !pair || !*pair)
{
+ if (*buffer == '+')
+ {
//
// This tells us that the document hasn't changed and we
// are to reuse the old words
//
-   }
-   else if (*buffer == '-')
-   {
+ }
+ else if (*buffer == '-')
+ {
if (removeBadUrls)
{
discard_list.Add(strtok(buffer + 1, "\n"), 0);
if (verbose)
cout << "htmerge: Removing doc #" << buffer + 1 << endl;
}
-   }
-   else if (*buffer == '!')
-   {
+ }
+ else if (*buffer == '!')
+ {
discard_list.Add(strtok(buffer + 1, "\n"), 0);
if (verbose)
cout << "htmerge: doc #" << buffer + 1 <<
" has been superceeded." << endl;
+ }
}
else
{
-   //
-   // Split the line up into the word, count, location, and
-   // document id.
-   //
-   word = good_strtok(buffer, '\t');
-   pair = good_strtok(NULL, '\t');
wr.Clear();   // Reset count to 1, anchor to 0, and all that
sid = "-";
while (pair && *pair)
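
The order of tests that the patch introduces can be sketched in isolation. The following is a simplified C++ illustration, not htmerge code: classify_line and LineKind are hypothetical names, the "placeholder" field text in the usage note is made up, and the tab handling only approximates what good_strtok does in the real source. The idea is the same as in the patch: first try to read the line as a word followed by tab-separated fields, and only when that fails look at the first character for a marker.

```cpp
#include <cassert>
#include <string>

enum class LineKind { WordRecord, Unchanged, Discarded, Superseded, Unknown };

// Hypothetical classifier mirroring the order of tests in the patch.
// Assumes the trailing newline has already been removed from the line.
LineKind classify_line(const std::string &line)
{
    assert(line.find('\n') == std::string::npos);

    // A word record needs a non-empty word, a tab, and something after it.
    const std::string::size_type tab = line.find('\t');
    const bool has_word = tab != std::string::npos && tab > 0;
    const bool has_fields = has_word && tab + 1 < line.size();
    if (has_fields)
        return LineKind::WordRecord;   // the word may start with '+', '-' or '!'

    // Not a word record, so the first character is treated as a marker.
    if (line.empty())
        return LineKind::Unknown;
    switch (line[0]) {
    case '+': return LineKind::Unchanged;    // document unchanged, reuse old words
    case '-': return LineKind::Discarded;    // document to be removed
    case '!': return LineKind::Superseded;   // document replaced by a newer one
    default:  return LineKind::Unknown;
    }
}
```

With this ordering, classify_line("+plus\tplaceholder") is a word record, while classify_line("+42") is still read as the "unchanged" marker.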


-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



[htdig] Re: valid_punctuation setting (was: extra_word_characters (PR#952))

2000-11-24 Thread Gilles Detillieux


According to Tomas Frydrych:
 I do have one question though; when defining valid_punctuation, do 
 I have to include ' ' (i.e. space), or is ' ' always included, and if I 
 have to include it explicitly, where/how do I put it in the string?

No, white space characters (space, tab, newline) are treated separately
from valid_punctuation and any other punctuation characters.  The htdig
parser uses the C library function isspace() to test if a character is
a white space character, and these are usually defined by your locale,
although with any ASCII or ISO character set these will be pretty much
the standard three characters above, and perhaps a few more obscure ones.
It would not make sense to add a space to valid_punctuation, nor can you.

The valid_punctuation characters are those that are allowed within a
compound word.  Historically, a word like "post-doctoral" was indexed
only as "postdoctoral" if the "-" was in valid_punctuation.  In more
recent versions, it is indexed as "postdoctoral", "post" and "doctoral".
But you see how valid_punctuation characters have a special meaning within
a word.  They don't cause a distinct break between words the way that any
other punctuation character would, or the way that white space would.
E.g. the comma "," is not normally included in valid_punctuation so it
always breaks words apart, while the hyphen or apostrophe can appear
within a word (in English, in any case).
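
The behaviour described above for recent versions can be sketched as follows. This is a simplified C++ illustration and not the htdig parser: index_terms is a hypothetical name, and the sketch ignores minimum word length, extra_word_characters, case folding and everything else the real parser does. It only shows the three-way distinction: alphanumerics build the word, valid_punctuation characters stay inside the word, and anything else (white space, commas, and so on) ends it.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical sketch: returns the terms indexed for `text`. A compound
// such as "post-doctoral" yields "postdoctoral", "post" and "doctoral"
// when '-' is listed in valid_punctuation.
std::vector<std::string> index_terms(const std::string &text,
                                     const std::string &valid_punctuation)
{
    // White space or alphanumerics in valid_punctuation would make no sense.
    for (unsigned char p : valid_punctuation)
        assert(!std::isspace(p) && !std::isalnum(p));

    std::vector<std::string> terms;
    std::string compound;             // the word with valid punctuation removed
    std::vector<std::string> parts;   // the pieces between valid punctuation
    std::string part;

    auto flush = [&]() {
        if (!part.empty()) { parts.push_back(part); part.clear(); }
        if (!compound.empty()) {
            terms.push_back(compound);
            if (parts.size() > 1)     // only compounds also get their parts
                terms.insert(terms.end(), parts.begin(), parts.end());
        }
        compound.clear();
        parts.clear();
    };

    for (unsigned char c : text) {
        if (std::isalnum(c)) {
            compound += static_cast<char>(c);
            part += static_cast<char>(c);
        } else if (valid_punctuation.find(static_cast<char>(c)) != std::string::npos) {
            // Stays inside the word: ends the current part, not the word.
            if (!part.empty()) { parts.push_back(part); part.clear(); }
        } else {
            flush();                  // any other character breaks the word
        }
    }
    flush();
    return terms;
}
```

For example, index_terms("post-doctoral, thesis", "-'") gives "postdoctoral", "post", "doctoral", "thesis", whereas with an empty valid_punctuation the hyphen acts as an ordinary break and only "post", "doctoral", "thesis" come out.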

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] libstdc++.so.2.10.0

2000-11-23 Thread Gilles Detillieux


According to David Robley:
 On Fri, 24 Nov 2000, NSWPS Intranet Project wrote:
  Receive the following on rundig:
  
  intranet02 # rundig
  ld.so.1: /usr/pkgs/www/bin/htdig: fatal: libstdc++.so.2.10.0: open failed: No such file or directory
  Killed
  ld.so.1: /usr/pkgs/www/bin/htmerge: fatal: libstdc++.so.2.10.0: open failed: No such file or directory
  Killed
  ld.so.1: /usr/pkgs/www/bin/htnotify: fatal: libstdc++.so.2.10.0: open failed: No such file or directory
  Killed
  ld.so.1: /usr/pkgs/www/bin/htfuzzy: fatal: libstdc++.so.2.10.0: open failed: No such file or directory
  Killed
  ld.so.1: /usr/pkgs/www/bin/htfuzzy: fatal: libstdc++.so.2.10.0: open failed: No such file or directory
  Killed
  
  PLEASE HELP!!!
  
  Sean
 What OS are you running (before someone else asks)?

I'd bet it's Solaris, although the problem may occur on other platforms,
and the solution is likely the same.

Please see http://www.htdig.org/FAQ.html#q3.6

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Does htmerge remove URL from database ?

2000-11-22 Thread Gilles Detillieux


According to Olivier Korn:
 3. Once a week, htdig is called on each site with "htdig -i -c site1.conf" 
 then "htdig -i -c site2.conf", (and so on.)
 
 4. After all the sites have been htdigged, I run htmerge in sequence in 
 order to merge all the small databases into one.
 First call is "htmerge -c site1.conf", subsequent calls are "htmerge -c 
 site1.conf -m site2.conf", "htmerge -c site1.conf -m site3.conf", (and so on.)
...
 2. Now let's hear the amazing part of my story. If I do a "htmerge -c 
 site5.conf" (notice there is no -m this time.) and if I htsearch -c 
 site5.conf with "rénovation tourisme" my document is said to be found ! 
 Said in another way, the document was indexed but was certainly ripped out 
 when merging with another database.

I think after each separate htdig -i -c site#.conf you should run a
separate htmerge -c site#.conf, not just on the first site, before you
merge everything together.  Try that and see if it solves the problem.
I think the intention was that these extra merges should not have been
necessary, but this has come up before, and I think there's a problem
with merging multiple DBs when they haven't already been cleaned up by
a simple htmerge.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] different search results

2000-11-20 Thread Gilles Detillieux


According to Geoff Hutchison:
 On Mon, 20 Nov 2000, David Adams wrote:
  Or does one really get a link which when followed brings up the .PDF
  document open at the relevant page?  If so, that would be quite something,
  especially if it worked for a range of browsers.  What would be the correct
  HTML <a name="..."> tags for the anchors?
 
 This is on the right track. Basically, you can pass along information to
 Acrobat to open to a particular page. So AFAIK, it works with all browsers
 that support the Acrobat PDF plugin.

Pierre Olivier discussed the technique some months ago on this list, and
has a web page that describes it.  I forget the URL, but you'll find it
quickly with a Google.com search for "pdftodig".  There's also a little
script that implements the same capability in xpdf, for locating the right
page.  The technique involves using a cgi script URL in the anchor tag,
with the cgi script spitting out some XML for Acrobat.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] Redirection of Htdig output -- 3.20b2

2000-11-17 Thread Gilles Detillieux


According to [EMAIL PROTECTED]:
 Following line of Perl code is intended to run htdig, and send STDOUT to
 /htdig3.2b2/autoshop-online._htdig.log;
   
 system "/htdig3.2b2/bin/htdig", "-svic",
        "/htdig3.2b2/sngl/conf/autoshop-online.conf", ">",
        "/htdig3.2b2/autoshop-online._htdig.log";
 
 The execution of Htdig produces valid content in STDOUT, but it goes to 
 STDOUT itself (as opposed to the specified file).  Best I can tell, from 
 review of Perl (5.005_03) documentation, syntax of above command is valid.

While I'm no Perl expert, I've never seen "system" used in this way.  I think

  system("/htdig3.2b2/bin/htdig -svic /htdig3.2b2/sngl/conf/autoshop-online.conf > /htdig3.2b2/autoshop-online._htdig.log");

will do what you want.  The string just gets passed to the shell for
parsing, as far as I know, so you use standard sh/ksh/bash syntax in
the string.  Perhaps in the syntax you used, the ">" got passed literally
as argument 3 to the htdig program.
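
The same distinction exists outside Perl, and it may be easier to see in a language with no list form at all. Below is a minimal C++ sketch, assuming a POSIX-style shell behind std::system and arguments that need no quoting; run_logged is a hypothetical helper, not part of htdig. Because the ">" is part of the single command string, the shell interprets it as a redirection instead of handing it to the program as an argument.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical helper: run `command` through the system command processor
// with its standard output redirected to `log_path`.
// Assumes a POSIX-style shell and that neither argument needs quoting.
bool run_logged(const std::string &command, const std::string &log_path)
{
    assert(!command.empty() && !log_path.empty());

    // One string, parsed by the shell: "command > log_path".
    const std::string line = command + " > " + log_path;

    // Treat a zero status as success; non-zero means the shell or the
    // command reported a failure.
    return std::system(line.c_str()) == 0;
}
```

For instance, run_logged("echo hello", "demo.log") would leave the word "hello" in demo.log rather than printing it to the terminal.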

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] SSL Patch

2000-11-17 Thread Gilles Detillieux


According to Michael Arndt:
 When applying SSL.0 or SSL.2 (SSL.1 doesn't apply) to an htdig 3.1.5 fresh
 from the server, I get problems when trying to compile on a Linux box:
...
 Server.cc: In method `Server::Server(char *, int, int, StringList * = 0)':
 Server.cc:44: passing `const char *' as argument 1 of `String::operator =(char *)' discards qualifiers

Try replacing lines 43-44 of the patched Server.cc with the following
construct to see if it would keep your compiler happy:

String  url = "http://";
if (ssl) url = "https://";

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930



Re: [htdig] ssl patch for ht://dig

2000-11-17 Thread Gilles Detillieux


According to Jeremy Lyon:
 Gilles,
 
 Thank you so much.  That worked.  Now I have a new problem.  I am indexing
 from the local file system.  Now when I do a search everything works fine
 except the urls that are returned for the ssl sites appear like this.
 
 http://ecom.uswest:443/path
 
 It's storing as a regular http:// instead of https:// and it's cutting off
 the .com.  Any ideas.

Not a clue, but then I haven't had a good long look at the SSL patch to see
what it's doing.  You should probably ask the developer of the original SSL
patch (for 3.1.3, I think), as the current one is supposedly a straight port
of it to 3.1.5.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930


