Re: [htdig] Htmerge: Deleted, invalid

2000-07-25 Thread D . J . Adams

 
 According to David Adams:
  I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a 
  year and I have been very pleased with it.  I would say that we've given it a 
  good workout here.  The problem with the "Deleted, invalid" messages only 
  occurs with a second, relatively new search index.
 
 I guess I should have read your message before responding to Geoff's!
 
  The first index is made from a single run of htdig covering 33 servers, all in 
  the local domain, and on this week's initial dig htmerge reports 49,233 
  documents and not a single "Deleted, invalid".
  
  The second index is made from two runs of htdig covering a total 969 (yes 969 
  !) servers using a proxy.  Htmerge reports a mere 3,096 documents and 86 
  "Deleted, invalid".
  
  I have looked at the db.wordlist files (which are written to only by htdig - is 
  that right?)
 
 Yes and no.  htdig creates and writes the initial db.wordlist, then htmerge
 sorts it, merges words together, and processes flags for page removals.  It
 then rewrites this file before creating the word index database.
 
  and it would appear that htdig is flagging the pages for htmerge 
  to delete and is not finding any words in them.
  
  I can advance these theories:
  
  It is not a bug, but is due to the use of a proxy. (I use a proxy 
  because without one, a portion of the sites on any run of htdig were 
  found to be not responding or even unknown.  With a proxy, htdig appears
  to have no such problems.)
 
 Hold on there!  The problem of sites being down (unknown or not
 responding) is exactly the sort of thing that causes the "Deleted,
 invalid" situation, and I said so last week.  How did you conclude that
 htdig appears to have no such problems with a proxy, when it does indeed
 appear to be having exactly that problem?  It would make sense that if
 a site is not responding, the proxy would inform htdig of this (unless
 it happened to quietly substitute a cached copy of the requested page
 - assuming it had one), and htdig would respond the same way it would
 without a proxy.  I think this is the most likely theory.

How did I conclude that htdig is having no such problems?
Two reasons: 
1). At least one page on our main server, covered by my
http_proxy_exclude statement, is "Deleted, invalid".
2). When I do not use http_proxy then htdig -v gives clear
messages, such as "Unable to connect to server" and
"Server not responding".
With http_proxy I get no such messages, not even with htdig -vvv

Additionally:
3). I can access the pages using IE (same proxy) the same day,
no problem. 
4). One or two pages from a site may be affected while others
are not.

I have now re-run the index with htdig -i -vvv etc.  I have rather a lot of 
information to go through, but I've found nothing yet.

And that nothing is significant.  What do you make of this, the log from htmerge
includes:

Deleted, invalid: 2200/http://www.folkmania.org.uk/LeeZachinfo.htm

While the log from htdig includes this (slightly mangled by the "more" command),
which looks OK to me:

pick: www.folkmania.org.uk, # servers = 246
1226:895:2:http://www.folkmania.org.uk/LeeZachinfo.htm: Retrieval command for 
http://www.folkmania.org.uk/LeeZachinfo.htm: GET http://www.folkmania.org.uk/Lee
Zachinfo.htm HTTP/1.0
User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
Referer: http://www.folkmania.org.uk/
Host: www.folkmania.org.uk

Header line: HTTP/1.0 200 OK
Header line: Server: thttpd/2.07 02dec99
Header line: Content-Type: text/html
Header line: Date: Mon, 24 Jul 2000 03:35:01 GMT
Header line: Last-Modified: Fri, 23 Jun 2000 18:34:50 GMT
Translated Fri, 23 Jun 2000 18:34:50 GMT to 2000-06-23 18:34:50 (100)
And converted to Fri, 23 Jun 2000 18:34:50
Header line: Accept-Ranges: bytes
Header line: Content-Length: 4586
Header line: Age: 127170
Header line: X-Cache: HIT from www-cacheb.soton.ac.uk
Header line: X-Cache-Lookup: HIT from www-cacheb.soton.ac.uk:3128
Header line: X-Cache: MISS from www-cachea.soton.ac.uk
Header line: X-Cache-Lookup: MISS from www-cachea.soton.ac.uk:3128
Header line: Proxy-Connection: close
Header line: 
returnStatus = 0
Read 4586 from document
Read a total of 4586 bytes

title: LeeZachInfo
[snip]
 size = 4586

And that page is only retrieved once.

 
  It is a bug due to the use of a proxy.
  
  It is a bug which only shows when compiled under IRIX.
  
  It is a bug which only occurs when there are many different servers.
  

I can add another theory:

It is a bug when merging a second index
 - all the "Deleted, invalid" pages come from the htdig run specified
   with the htmerge -m option

This theory is easy to check out, I'll investigate tomorrow.


  I intend to re-build the second index using htdig -vvv and perhaps learn 
  something.
 
 The only sure way to rule 

Re: [htdig] Htmerge: Deleted, invalid

2000-07-25 Thread Gilles Detillieux

According to [EMAIL PROTECTED]:
 How did I conclude that htdig is having no such problems?
 Two reasons: 
   1). At least one page on our main server, covered by my
   http_proxy_exclude statement, is "Deleted, invalid".

OK, so that would suggest the problem isn't limited to proxies.

   2). When I do not use http_proxy then htdig -v gives clear
   messages, such as "Unable to connect to server" and
   "Server not responding".
   With http_proxy I get no such messages, not even with htdig -vvv
 
 Additionally:
   3). I can access the pages using IE (same proxy) the same day,
   no problem. 
   4). One or two pages from a site may be affected while others
   are not.

Right, you did mention these two points much earlier.  I was forgetting about
that.

 I have now re-run the index with htdig -i -vvv etc.  I have rather a lot of 
 information to go through, but I've found nothing yet.
 
 And that nothing is significant.  What do you make of this, the log from htmerge
 includes:
 
 Deleted, invalid: 2200/http://www.folkmania.org.uk/LeeZachinfo.htm
 
 While the log from htdig includes this (slightly mangled by "more" command), which 
looks OK to me:
 
 pick: www.folkmania.org.uk, # servers = 246
 1226:895:2:http://www.folkmania.org.uk/LeeZachinfo.htm: Retrieval command for 
http://www.folkmania.org.uk/LeeZachinfo.htm: GET http://www.folkmania.org.uk/Lee
 Zachinfo.htm HTTP/1.0
 User-Agent: htdig/3.1.5 ([EMAIL PROTECTED])
 Referer: http://www.folkmania.org.uk/
 Host: www.folkmania.org.uk
 
 Header line: HTTP/1.0 200 OK
 Header line: Server: thttpd/2.07 02dec99
 Header line: Content-Type: text/html
 Header line: Date: Mon, 24 Jul 2000 03:35:01 GMT
 Header line: Last-Modified: Fri, 23 Jun 2000 18:34:50 GMT
 Translated Fri, 23 Jun 2000 18:34:50 GMT to 2000-06-23 18:34:50 (100)
 And converted to Fri, 23 Jun 2000 18:34:50
 Header line: Accept-Ranges: bytes
 Header line: Content-Length: 4586
 Header line: Age: 127170
 Header line: X-Cache: HIT from www-cacheb.soton.ac.uk
 Header line: X-Cache-Lookup: HIT from www-cacheb.soton.ac.uk:3128
 Header line: X-Cache: MISS from www-cachea.soton.ac.uk
 Header line: X-Cache-Lookup: MISS from www-cachea.soton.ac.uk:3128
 Header line: Proxy-Connection: close
 Header line: 
 returnStatus = 0
 Read 4586 from document
 Read a total of 4586 bytes
 
 title: LeeZachInfo
 [snip]
  size = 4586

Hmm, you snipped just as it was getting interesting.  I assume that there
were lots of entries for words being indexed, tags being parsed, and such?

 I can add another theory:
 
   It is a bug when merging a second index
- all the "Deleted, invalid" pages come from the htdig run specified
  with the htmerge -m option
 
 This theory is easy to check out, I'll investigate tomorrow.

OK, this brings a question to mind.  Did you run htmerge separately
on each of the two databases created by the htdig runs, before running
htmerge to merge the two databases together?  I think that, as a minimum,
you must run htmerge after htdig to clean up the database before using
it as the -m option for a merge.  You may have to clean up the target
database too - I'm not completely certain about that, but I know it
can't hurt.
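
As a rough sketch of that sequence (dig and clean up each database on its
own, then merge), assuming two config files with the hypothetical names
main.conf and extra.conf, each pointing at its own database directory.
The run wrapper and the DRY_RUN variable are illustrative additions, not
part of ht://Dig:

```sh
#!/bin/sh
# Sketch only: main.conf and extra.conf are hypothetical config file names.
# Set DRY_RUN=1 to print the commands instead of running them.
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

build_and_merge() {
    main=$1
    extra=$2
    # Initial dig plus clean-up for each database separately...
    run htdig -i -c "$main" &&
    run htmerge -c "$main" &&
    run htdig -i -c "$extra" &&
    run htmerge -c "$extra" &&
    # ...then merge the second database into the first.
    run htmerge -c "$main" -m "$extra"
}
```

Running it with DRY_RUN=1 first shows the five commands in order without
touching any database.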

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930


To unsubscribe from the htdig mailing list, send a message to
[EMAIL PROTECTED]
You will receive a message to confirm this.




Re: [htdig] Htmerge: Deleted, invalid

2000-07-25 Thread Geoff Hutchison

On Tue, 25 Jul 2000, Gilles Detillieux wrote:

  It is a bug when merging a second index
   - all the "Deleted, invalid" pages come from the htdig run specified
 with the htmerge -m option
  
  This theory is easy to check out, I'll investigate tomorrow.
 
 you must run htmerge after htdig to clean up the database before using
 it as the -m option for a merge.  You may have to clean up the target
 database too - I'm not completely certain about that, but I know it
 can't hurt.

You're right that it can't hurt, although the code should work fine
without doing this. Still, it would certainly help the issue at hand if
we knew that these messages didn't occur when you ran htmerge on
just the database and *did* occur when you ran htmerge after merging.
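
One way to make that comparison, sketched under the assumption that both
htmerge logs were saved to files and that the lines look like the
"Deleted, invalid: <DocID>/<URL>" excerpts quoted in this thread. The
function names are made up for illustration. It compares by URL rather
than by DocID, since the IDs may differ between the two runs:

```sh
#!/bin/sh
# Print the URLs flagged "Deleted, invalid" in a saved htmerge log.
# Line format assumed from the excerpts in this thread:
#   Deleted, invalid: <DocID>/<URL>
deleted_urls() {
    sed -n 's|^Deleted, invalid: [0-9][0-9]*/||p' "$1" | sort -u
}

# Print URLs deleted in the second log but not in the first.
new_deletions() {
    # $1: log from htmerge run on the single database
    # $2: log from htmerge run with -m
    a=$(mktemp) && b=$(mktemp) || return 1
    deleted_urls "$1" > "$a"
    deleted_urls "$2" > "$b"
    comm -13 "$a" "$b"
    rm -f "$a" "$b"
}
```

An empty result would suggest the merge step adds no deletions of its own.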

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/








Re: [htdig] Htmerge: Deleted, invalid

2000-07-24 Thread David Adams

Quoting Gilles Detillieux [EMAIL PROTECTED]:

 According to David Adams:
  I use the standard MIPSpro compiler.  The script I use (thanks to my
 former 
  colleague James Hammick) to set up the Makefile is:
  
  #!/bin/sh
  CFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CFLAGS
  CPPFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CPPFLAGS
  LDFLAGS="-mips4 -L/usr/lib32 -rpath /opt/local/htdig-3.1.5/lib";
  export LDFLAGS
  ./configure --prefix=/opt/local/htdig-3.1.5 \
--with-cgi-bin-dir=/opt/local/htdig-3.1.5/cgi-bin \
--with-image-dir=/opt/local/htdig-3.1.5/graphics \
--with-search-dir=/opt/local/htdig-3.1.5/htdocs/sample
  
  A lot of that is site-specific, and the "-rpath directory" option is
 only
  needed because the compression library is not in a standard place on the 
  machine on which htdig is run.
  
  The "-woff all" option suppresses most warning messages.  I will remove
 it,
  recompile htdig and send the result directly to Gilles, it might contain a
 clue.
 
 As Sinclair mentioned, 'you need to have the 2.95.2 gcc and the latest
 gnu "make".'  I don't know that anyone has ever gotten ht://Dig to work
 with SGI's own compiler.  In fact, we got a lot of reports from folks
 who couldn't even get it to compile.
 
 If you're really determined to get to the bottom of this and make it work
 with the SGI compiler, I wish you well, but I doubt I can help much.
 I looked at the output you sent me, and didn't really see any red
 flags pointing to an obvious problem.  I know that the Serialize and
 Deserialize functions for the db.docdb records can be a tad finicky, so
 that would probably be a place to look.  There could also be problems
 with incorrect assumptions about word sizes, e.g. if the SGI compiler
 has 64-bit long ints.  I'd also look at the db.wordlist records (they're
 ASCII text) before and after htmerge, to see if htdig is actually telling
 htmerge to remove some of these documents, or if htmerge is deciding to
 do so on its own.
 
 For the time being, the ht://Dig code hasn't had much of a workout on
 non-GNU compilers, so it doesn't seem to do well on them.  If you can
 help remedy that, great.  If you want to get the package working as
 quickly and easily as possible, I'd suggest trying the GNU C and C++
 compilers.
 
 -- 
 Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
 Spinal Cord Research Centre   WWW:   
 http://www.scrc.umanitoba.ca/~grdetil
 Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
 Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930
 

I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a 
year and I have been very pleased with it.  I would say that we've given it a 
good workout here.  The problem with the "Deleted, invalid" messages only 
occurs with a second, relatively new search index.

The first index is made from a single run of htdig covering 33 servers, all in 
the local domain, and on this week's initial dig htmerge reports 49,233 
documents and not a single "Deleted, invalid".

The second index is made from two runs of htdig covering a total 969 (yes 969 
!) servers using a proxy.  Htmerge reports a mere 3,096 documents and 86 
"Deleted, invalid".

I have looked at the db.wordlist files (which are written to only by htdig - is 
that right?) and it would appear that htdig is flagging the pages for htmerge 
to delete and is not finding any words in them.

I can advance these theories:

It is not a bug, but is due to the use of a proxy. (I use a proxy 
because without one, a portion of the sites on any run of htdig were 
found to be not responding or even unknown.  With a proxy, htdig appears
to have no such problems.)

It is a bug due to the use of a proxy.

It is a bug which only shows when compiled under IRIX.

It is a bug which only occurs when there are many different servers.

I intend to re-build the second index using htdig -vvv and perhaps learn 
something.

--
David Adams
[EMAIL PROTECTED]
Computing Services
Southampton University






Re: [htdig] Htmerge: Deleted, invalid

2000-07-24 Thread Gilles Detillieux

According to Geoff Hutchison:
 
 At 10:34 AM -0500 7/19/00, Gilles Detillieux wrote:
I use the standard MIPSpro compiler.  The script I use (thanks to my former
colleague James Hammick) to set up the Makefile is:
 
 I have used SGI's compiler on quite a lot of code

Have you been able to build ht://Dig using SGI's compiler?  I may be
wrong, but I recall seeing several error reports from SGI users, and
the response was usually to use the GNU compiler.  I'm not saying SGI's
compiler is bad or buggy, just that I haven't heard of a successful
build of ht://Dig with it.  In all likelihood, it would be a problem in
the ht://Dig code that just doesn't manifest itself when built with the
GNU compiler.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930






Re: [htdig] Htmerge: Deleted, invalid

2000-07-24 Thread Gilles Detillieux

According to David Adams:
 I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a 
 year and I have been very pleased with it.  I would say that we've given it a 
 good workout here.  The problem with the "Deleted, invalid" messages only 
 occurs with a second, relatively new search index.

I guess I should have read your message before responding to Geoff's!

 The first index is made from a single run of htdig covering 33 servers, all in 
 the local domain, and on this week's initial dig htmerge reports 49,233 
 documents and not a single "Deleted, invalid".
 
 The second index is made from two runs of htdig covering a total 969 (yes 969 
 !) servers using a proxy.  Htmerge reports a mere 3,096 documents and 86 
 "Deleted, invalid".
 
 I have looked at the db.wordlist files (which are written to only by htdig - is 
 that right?)

Yes and no.  htdig creates and writes the initial db.wordlist, then htmerge
sorts it, merges words together, and processes flags for page removals.  It
then rewrites this file before creating the word index database.

 and it would appear that htdig is flagging the pages for htmerge 
 to delete and is not finding any words in them.
 
 I can advance these theories:
 
 It is not a bug, but is due to the use of a proxy. (I use a proxy 
 because without one, a portion of the sites on any run of htdig were 
 found to be not responding or even unknown.  With a proxy, htdig appears
 to have no such problems.)

Hold on there!  The problem of sites being down (unknown or not
responding) is exactly the sort of thing that causes the "Deleted,
invalid" situation, and I said so last week.  How did you conclude that
htdig appears to have no such problems with a proxy, when it does indeed
appear to be having exactly that problem?  It would make sense that if
a site is not responding, the proxy would inform htdig of this (unless
it happened to quietly substitute a cached copy of the requested page
- assuming it had one), and htdig would respond the same way it would
without a proxy.  I think this is the most likely theory.

 It is a bug due to the use of a proxy.
 
 It is a bug which only shows when compiled under IRIX.
 
 It is a bug which only occurs when there are many different servers.
 
 I intend to re-build the second index using htdig -vvv and perhaps learn 
 something.

The only sure way to rule out an SGI compiler or IRIX-specific problem
would be to run htdig on a Linux box with the same configuration and
the same proxy, and see if you get the same results.  However, based on
what you said about a portion of the sites not responding, I'd guess
this is a more likely problem.  I guess there could also be a problem
with the proxy server itself, causing it to act like a server is down
when it isn't.  You may want to try different proxies as well.  In any
case, a close look at htdig -vvv output should give some clues.
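
As a starting point for going through that output, here is a minimal
sketch that tallies the HTTP status codes in a saved htdig -vvv log. It
assumes the "Header line: HTTP/1.0 200 OK" format shown elsewhere in this
thread, and it assumes (unverified) that the proxy reports an unreachable
server as an HTTP error status rather than as a connection failure. The
function name is hypothetical:

```sh
#!/bin/sh
# Tally HTTP status codes found in a saved htdig -vvv log.
# Line format assumed from the log excerpt in this thread:
#   Header line: HTTP/1.0 200 OK
status_tally() {
    # $1: path to the saved log (hypothetical file name)
    sed -n 's|^ *Header line: HTTP/[0-9.]* \([0-9][0-9][0-9]\).*|\1|p' "$1" |
        sort | uniq -c | awk '{ print $2, $1 }'
}
```

Each output line is a status code followed by its count, so anything
other than 200 stands out quickly and can then be matched against the
"Deleted, invalid" URLs.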

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930






Re: [htdig] Htmerge: Deleted, invalid

2000-07-24 Thread Geoff Hutchison

On Mon, 24 Jul 2000, Gilles Detillieux wrote:

 Have you been able to build ht://Dig using SGI's compiler?  I may be

No.

 build of ht://Dig with it.  In all likelihood, it would be a problem in
 the ht://Dig code that just doesn't manifest itself when built with the
 GNU compiler.

Maybe. Except I sometimes have trouble building software that's *supposed*
to work with SGI's compiler (like GCC or Emacs or CVS). Yes, I'd like to
see ht://Dig compile cleanly with various native compilers. But since
there seems to be an OK workaround (if not ideal), I'm not personally
going to put much effort in this direction.

Then again, I haven't used SGI's compiler extensively. I admit readily
that I'd much rather compile GCC (or get binaries) and use them than fuss
with the native one.

In my group here at Northwestern, I'm not alone. Only SGI's Fortran
compiler is used, in part because there isn't a GNU compiler for anything
beyond f77.

--
-Geoff Hutchison
Williams Students Online
http://wso.williams.edu/







Re: [htdig] Htmerge: Deleted, invalid

2000-07-22 Thread Geoff Hutchison

At 10:34 AM -0500 7/19/00, Gilles Detillieux wrote:
   I use the standard MIPSpro compiler.  The script I use (thanks to my former
   colleague James Hammick) to set up the Makefile is:

I have used SGI's compiler on quite a lot of code







Re: [htdig] Htmerge: Deleted, invalid

2000-07-19 Thread Gilles Detillieux

According to David Adams:
 I use the standard MIPSpro compiler.  The script I use (thanks to my former 
 colleague James Hammick) to set up the Makefile is:
 
 #!/bin/sh
 CFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CFLAGS
 CPPFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CPPFLAGS
 LDFLAGS="-mips4 -L/usr/lib32 -rpath /opt/local/htdig-3.1.5/lib";
 export LDFLAGS
 ./configure --prefix=/opt/local/htdig-3.1.5 \
   --with-cgi-bin-dir=/opt/local/htdig-3.1.5/cgi-bin \
   --with-image-dir=/opt/local/htdig-3.1.5/graphics \
   --with-search-dir=/opt/local/htdig-3.1.5/htdocs/sample
 
 A lot of that is site-specific, and the "-rpath directory" option is only
 needed because the compression library is not in a standard place on the 
 machine on which htdig is run.
 
 The "-woff all" option suppresses most warning messages.  I will remove it,
 recompile htdig and send the result directly to Gilles, it might contain a clue.

As Sinclair mentioned, 'you need to have the 2.95.2 gcc and the latest
gnu "make".'  I don't know that anyone has ever gotten ht://Dig to work
with SGI's own compiler.  In fact, we got a lot of reports from folks
who couldn't even get it to compile.

If you're really determined to get to the bottom of this and make it work
with the SGI compiler, I wish you well, but I doubt I can help much.
I looked at the output you sent me, and didn't really see any red
flags pointing to an obvious problem.  I know that the Serialize and
Deserialize functions for the db.docdb records can be a tad finicky, so
that would probably be a place to look.  There could also be problems
with incorrect assumptions about word sizes, e.g. if the SGI compiler
has 64-bit long ints.  I'd also look at the db.wordlist records (they're
ASCII text) before and after htmerge, to see if htdig is actually telling
htmerge to remove some of these documents, or if htmerge is deciding to
do so on its own.
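
A minimal sketch of that before-and-after comparison, assuming a copy of
db.wordlist was saved after htdig finished and before htmerge ran. It
makes no assumption about the record format and simply lists lines that
were present before htmerge and are gone afterwards. The function name is
hypothetical:

```sh
#!/bin/sh
# List lines present in db.wordlist before htmerge but absent afterwards.
wordlist_dropped() {
    # $1: copy of db.wordlist saved after htdig, before htmerge
    # $2: db.wordlist after htmerge has rewritten it
    before=$(mktemp) && after=$(mktemp) || return 1
    sort -u "$1" > "$before"
    sort -u "$2" > "$after"
    comm -23 "$before" "$after"
    rm -f "$before" "$after"
}
```

Because htmerge sorts and merges entries, expect some noise in the
result. The point is only to narrow down where to look.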

For the time being, the ht://Dig code hasn't had much of a workout on
non-GNU compilers, so it doesn't seem to do well on them.  If you can
help remedy that, great.  If you want to get the package working as
quickly and easily as possible, I'd suggest trying the GNU C and C++
compilers.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930






Re: [htdig] Htmerge: Deleted, invalid

2000-07-18 Thread Gilles Detillieux

According to [EMAIL PROTECTED]:
   I think there is a bug in htmerge 3.1.5 which causes it to declare
   some pages as "invalid" in some cases.
  
  That may be, but I want to be sure we've ruled out every other possibility
  first.  I've never seen a bug report like this, so it would be very
  unusual if it is indeed a bug showing up in your case, but not for
  other users.  If you can find a consistent test case that fails on
  an initial dig, please provide details on your OS, version, config,
  etc. so that we can look into this further.
  
 
 IRIX 6.5, Htdig 3.1.5
 
 One of the symptoms is that there is no consistency.  Today's re-index
 reported 84 pages to be invalid.  Of these only one was from the
 http://www.tregalic.co.uk/sacred-heart/ site, and this time it was
 churchpage7.html.  And that page is *NOT* found by any search on my index,
 though I can follow links to it from other pages and browse it.
 
 I don't see how you can investigate this yet, but unless people put in
 reports like mine you will always be able to claim the "no-one else
 is having this problem". 
 
 I will continue to look for a pattern which might give a clue. 

I'm inclined to think this is a platform-specific problem.  Most of
the trouble reports we've seen about IRIX systems are from users who
can't even get htdig compiled, let alone running, so I don't think the
package has had a thorough workout under IRIX.  Which compiler did you
use to build it?

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930






Re: [htdig] Htmerge: Deleted, invalid

2000-07-18 Thread David Adams

Quoting Gilles Detillieux [EMAIL PROTECTED]:

  
  IRIX 6.5, Htdig 3.1.5
  
  One of the symptoms is that there is no consistency.  Today's re-index
  reported 84 pages to be invalid.  Of these only one was from the
  http://www.tregalic.co.uk/sacred-heart/ site, and this time it was
  churchpage7.html.  And that page is *NOT* found by any search on my index,
  though I can follow links to it from other pages and browse it.
  
  I don't see how you can investigate this yet, but unless people put in
  reports like mine you will always be able to claim the "no-one else
  is having this problem". 
  
  I will continue to look for a pattern which might give a clue. 
 
 I'm inclined to think this is a platform-specific problem.  Most of
 the trouble reports we've seen about IRIX systems are from users who
 can't even get htdig compiled, let alone running, so I don't think the
 package has had a thorough workout under IRIX.  Which compiler did you
 use to build it?
 
 -- 
 Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
 Spinal Cord Research Centre   WWW:   
 http://www.scrc.umanitoba.ca/~grdetil
 Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
 Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930
 

That is a possibility worth pursuing.

I use the standard MIPSpro compiler.  The script I use (thanks to my former 
colleague James Hammick) to set up the Makefile is:

#!/bin/sh
CFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CFLAGS
CPPFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CPPFLAGS
LDFLAGS="-mips4 -L/usr/lib32 -rpath /opt/local/htdig-3.1.5/lib";
export LDFLAGS
./configure --prefix=/opt/local/htdig-3.1.5 \
  --with-cgi-bin-dir=/opt/local/htdig-3.1.5/cgi-bin \
  --with-image-dir=/opt/local/htdig-3.1.5/graphics \
  --with-search-dir=/opt/local/htdig-3.1.5/htdocs/sample

A lot of that is site-specific, and the "-rpath directory" option is only
needed because the compression library is not in a standard place on the 
machine on which htdig is run.

The "-woff all" option suppresses most warning messages.  I will remove it,
recompile htdig and send the result directly to Gilles, it might contain a clue.

--
David Adams
[EMAIL PROTECTED]
Computing Services
Southampton University






Re: [htdig] Htmerge: Deleted, invalid

2000-07-14 Thread D . J . Adams

Sorry for the length of this!

 
 According to David Adams:
  Why does htmerge 3.1.5 flag some pages, which look OK to me, as 
  "Deleted, invalid" and not index them?
  
  This is happening not just with .html pages but also .doc and .pdf files.
  
  It happens with a simple merge following a run of htdig -i -a
  and also when two htdig runs are merged using the htmerge -m option.
 
 htmerge does this when the remove_bad_urls attribute is true, and the
 page in question is not found (404 error), the server name no longer
 exists, the server is down, or in the case of an update dig, the page
 has been updated, superseding the old document database record for it.
 In the latter case, htdig creates a new record for the updated document,
 with a new DocID, so the old one is discarded.  As this only happens in
 update digs, it wouldn't be the case during an htdig -i, so I'd look at
 the other possibilities.
 
 In any case, run both htdig and htmerge with at least two verbose options,
 and cross-reference the DocID of the "Deleted, invalid" messages to other
 messages with the same ID, to get a clearer picture of what's happening.
 
 -- 
 Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
 Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
 Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
 Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930
 
 

I've run htdig -vv followed by htmerge -vvv and I still cannot see
any reason why htmerge decides, apparently arbitrarily, that a page is
invalid.  None of the reasons given above seem to fit.

I'll take a single example: http://www.tregalic.co.uk/sacred-heart/, is
one of many in the limit_urls_to directive. 

Htdig finds http://www.tregalic.co.uk/sacred-heart/ and then
http://www.tregalic.co.uk/sacred-heart/churchpage1.html
http://www.tregalic.co.uk/sacred-heart/churchpage2.html
  ...
http://www.tregalic.co.uk/sacred-heart/churchpage7.html
amongst others.

Grepping for "churchpage" in the htmerge log I find:

htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage1.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage2.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage3.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage4.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage5.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage6.html
htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage7.html
1897/http://www.tregalic.co.uk/sacred-heart/churchpage1.html
1898/http://www.tregalic.co.uk/sacred-heart/churchpage2.html
1899/http://www.tregalic.co.uk/sacred-heart/churchpage3.html
Deleted, invalid: 1900/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
Deleted, invalid: 1901/http://www.tregalic.co.uk/sacred-heart/churchpage5.html
1902/http://www.tregalic.co.uk/sacred-heart/churchpage6.html
1903/http://www.tregalic.co.uk/sacred-heart/churchpage7.html

So I try an experiment: I reduce limit_urls_to to include only the starting URL
and http://www.tregalic.co.uk/sacred-heart/ and run htdig and htmerge.

Then htmerge reports:

htmerge: Total word count: 3806
0/http://www.soton.ac.uk/services/local/alpha.html
1/http://www.tregalic.co.uk/sacred-heart/
9/http://www.tregalic.co.uk/sacred-heart/baptism.html
2/http://www.tregalic.co.uk/sacred-heart/churchpage1.html
3/http://www.tregalic.co.uk/sacred-heart/churchpage2.html
4/http://www.tregalic.co.uk/sacred-heart/churchpage3.html
5/http://www.tregalic.co.uk/sacred-heart/churchpage4.html
6/http://www.tregalic.co.uk/sacred-heart/churchpage5.html
7/http://www.tregalic.co.uk/sacred-heart/churchpage6.html
8/http://www.tregalic.co.uk/sacred-heart/churchpage7.html
htmerge: 10
12/http://www.tregalic.co.uk/sacred-heart/information.html
11/http://www.tregalic.co.uk/sacred-heart/links.html
10/http://www.tregalic.co.uk/sacred-heart/newsletter.html

I do not accept that pages 4 and 5 just happened to be unavailable on the
first occasion and available on the second.  Nor can I see any
differences in the htdig logs for these pages.  The same sizes are
reported in both cases. 

I think there is a bug in htmerge 3.1.5 which causes it to declare
some pages as "invalid" in some cases.

-- 
 
David J Adams
[EMAIL PROTECTED]
Computing Services
University of Southampton






Re: [htdig] Htmerge: Deleted, invalid

2000-07-14 Thread Gilles Detillieux

According to [EMAIL PROTECTED]:
 I've run htdig -vv followed by htmerge -vvv and I still cannot see
 any reason why htmerge decides, apparently arbitrarily, that a page is
 invalid.  None of the reasons given above seem to fit.

If you run htdig -vv, without a -i option, and you have an existing
database, then htdig will run an update dig, not an initial dig, so it's
possible that it will reindex churchpage4.html and churchpage5.html,
but not the others.  Are you certain that these two pages don't appear
elsewhere in the htdig or htmerge logs, or for that matter that you're
starting out without an existing database?

 So I try an experiment: I reduce limit_urls_to to include only the starting URL
 and http://www.tregalic.co.uk/sacred-heart/ and run htdig and htmerge.
...
 I do not accept that pages 4 and 5 just happened to be unavailable on the
 first occasion and available on the second.  Nor can I see any
 differences in the htdig logs for these pages.  The same sizes are
 reported in both cases. 

If there's an existing database in the first case, but not the second,
that may be the cause of the discrepancy.  To be certain, use the -i
option to htdig in all test cases, and let us know if it still finds
these two pages as "invalid".
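
To double-check that a test really starts without an existing database,
something like the following sketch could be run before each dig. It
assumes the database files carry the names mentioned in this thread
(db.docdb and db.wordlist) and live in the directory set by database_dir.
The function name is hypothetical:

```sh
#!/bin/sh
# Report any database files already present in the given directory.
existing_db_files() {
    # $1: database directory (use the database_dir value from your config)
    for f in db.docdb db.wordlist; do
        [ -e "$1/$f" ] && echo "$1/$f"
    done
    return 0
}
```

No output means neither file is present in that directory.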

 I think there is a bug in htmerge 3.1.5 which causes it to declare
 some pages as "invalid" in some cases.

That may be, but I want to be sure we've ruled out every other possibility
first.  I've never seen a bug report like this, so it would be very
unusual if it is indeed a bug showing up in your case, but not for
other users.  If you can find a consistent test case that fails on
an initial dig, please provide details on your OS, version, config,
etc. so that we can look into this further.

-- 
Gilles R. Detillieux  E-mail: [EMAIL PROTECTED]
Spinal Cord Research Centre   WWW:http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:(204)789-3930






[htdig] Htmerge: Deleted, invalid

2000-07-12 Thread David Adams

Why does htmerge 3.1.5 flag some pages, which look OK to me, as 
"Deleted, invalid" and not index them?

This is happening not just with .html pages but also .doc and .pdf files.

It happens with a simple merge following a run of htdig -i -a
and also when two htdig runs are merged using the htmerge -m option.

David Adams
[EMAIL PROTECTED]
Computing Services
Southampton University

