Re: [htdig] Htmerge: Deleted, invalid
According to David Adams: I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a year and I have been very pleased with it. I would say that we've given it a good workout here. The problem with the "Deleted, invalid" messages only occurs with a second, relatively new search index. I guess I should have read your message before responding to Geoff's! The first index is made from a single run of htdig covering 33 servers, all in the local domain, and on this week's initial dig htmerge reports 49,233 documents and not a single "Deleted, invalid". The second index is made from two runs of htdig covering a total 969 (yes 969 !) servers using a proxy. Htmerge reports a mere 3,096 documents and 86 "Deleted, invalid". I have looked at the db.wordlist files (which are written to only by htdig - is that right?) Yes and no. htdig creates and writes the initial db.wordlist, then htmerge sorts it, merges words together, and processes flags for page removals. It then rewrites this file before creating the word index database. and it would appear that htdig is flagging the pages for htmerge to delete and is not finding any words in them. I can advance these theories: It is not a bug, but is due to the use of a proxy. (I use a proxy because without one, a portion of the sites on any run of htdig were found to be not responding or even unknown. With a proxy, htdig appears to have no such problems.) Hold on there! The problem of sites being down (unknown or not responding) is exactly the sort of thing that causes the "Deleted, invalid" situation, and I said so last week. How did you conclude that htdig appears to have no such problems with a proxy, when it does indeed appear to be having exactly that problem? It would make sense that if a site is not responding, the proxy would inform htdig of this (unless it happened to quietly substitute a cached copy of the requested page - assuming it had one), and htdig would respond the same way it would without a proxy. 
I think this is the most likely theory. How did I conclude that htdig is having no such problems? Two reasons: 1). At least one page on our main server, covered by my http_proxy_exclude statement, is "Deleted, invalid". 2). When I do not use http_proxy then htdig -v gives clear messages, such as "Unable to connect to server" and "Server not responding". With http_proxy I get no such messages, not even with htdig -vvv Additionally: 3). I can access the pages using IE (same proxy) the same day, no problem. 4). One or two pages from a site may be affected while others are not. I have now re-run the index with htdig -i -vvv etc. I have rather a lot of information to go through, but I've found nothing yet. And that nothing is significant. What do you make of this, the log from htmerge includes: Deleted, invalid: 2200/http://www.folkmania.org.uk/LeeZachinfo.htm While the log from htdig includes this (slightly mangled by "more" command), which looks OK to me: pick: www.folkmania.org.uk, # servers = 246 1226:895:2:http://www.folkmania.org.uk/LeeZachinfo.htm: Retrieval command for http://www.folkmania.org.uk/LeeZachinfo.htm: GET http://www.folkmania.org.uk/Lee Zachinfo.htm HTTP/1.0 User-Agent: htdig/3.1.5 ([EMAIL PROTECTED]) Referer: http://www.folkmania.org.uk/ Host: www.folkmania.org.uk Header line: HTTP/1.0 200 OK Header line: Server: thttpd/2.07 02dec99 Header line: Content-Type: text/html Header line: Date: Mon, 24 Jul 2000 03:35:01 GMT Header line: Last-Modified: Fri, 23 Jun 2000 18:34:50 GMT Translated Fri, 23 Jun 2000 18:34:50 GMT to 2000-06-23 18:34:50 (100) And converted to Fri, 23 Jun 2000 18:34:50 Header line: Accept-Ranges: bytes Header line: Content-Length: 4586 Header line: Age: 127170 Header line: X-Cache: HIT from www-cacheb.soton.ac.uk Header line: X-Cache-Lookup: HIT from www-cacheb.soton.ac.uk:3128 Header line: X-Cache: MISS from www-cachea.soton.ac.uk Header line: X-Cache-Lookup: MISS from www-cachea.soton.ac.uk:3128 Header line: Proxy-Connection: close 
Header line: returnStatus = 0 Read 4586 from document Read a total of 4586 bytes title: LeeZachInfo [snip] size = 4586 And that page is only retrieved once. It is a bug due to the use of a proxy. It is a bug which only shows when compiled under IRIX. It is a bug which only occurs when there many different servers. I can add another theory: It is a bug when merging a second index - all the "Deleted, invalid" pages come from the htdig run specified with the htmerge -m option This theory is easy to check out, I'll investigate tomorrow. I intend to re-build the second index using htdig -vvv and perhaps learn something. The only sure way to rule
Re: [htdig] Htmerge: Deleted, invalid
According to [EMAIL PROTECTED]: How did I conclude that htdig is having no such problems? Two reasons: 1). At least one page on our main server, covered by my http_proxy_exclude statement, is "Deleted, invalid". OK, so would suggest the problem isn't limited to proxies. 2). When I do not use http_proxy then htdig -v gives clear messages, such as "Unable to connect to server" and "Server not responding". With http_proxy I get no such messages, not even with htdig -vvv Additionally: 3). I can access the pages using IE (same proxy) the same day, no problem. 4). One or two pages from a site may be affected while others are not. Right, you did mention these two points much earlier. I was forgetting about that. I have now re-run the index with htdig -i -vvv etc. I have rather a lot of information to go through, but I've found nothing yet. And that nothing is significant. What do you make of this, the log from htmerge includes: Deleted, invalid: 2200/http://www.folkmania.org.uk/LeeZachinfo.htm While the log from htdig includes this (slightly mangled by "more" command), which looks OK to me: pick: www.folkmania.org.uk, # servers = 246 1226:895:2:http://www.folkmania.org.uk/LeeZachinfo.htm: Retrieval command for http://www.folkmania.org.uk/LeeZachinfo.htm: GET http://www.folkmania.org.uk/Lee Zachinfo.htm HTTP/1.0 User-Agent: htdig/3.1.5 ([EMAIL PROTECTED]) Referer: http://www.folkmania.org.uk/ Host: www.folkmania.org.uk Header line: HTTP/1.0 200 OK Header line: Server: thttpd/2.07 02dec99 Header line: Content-Type: text/html Header line: Date: Mon, 24 Jul 2000 03:35:01 GMT Header line: Last-Modified: Fri, 23 Jun 2000 18:34:50 GMT Translated Fri, 23 Jun 2000 18:34:50 GMT to 2000-06-23 18:34:50 (100) And converted to Fri, 23 Jun 2000 18:34:50 Header line: Accept-Ranges: bytes Header line: Content-Length: 4586 Header line: Age: 127170 Header line: X-Cache: HIT from www-cacheb.soton.ac.uk Header line: X-Cache-Lookup: HIT from www-cacheb.soton.ac.uk:3128 Header line: X-Cache: 
MISS from www-cachea.soton.ac.uk Header line: X-Cache-Lookup: MISS from www-cachea.soton.ac.uk:3128 Header line: Proxy-Connection: close Header line: returnStatus = 0 Read 4586 from document Read a total of 4586 bytes title: LeeZachInfo [snip] size = 4586 Hmm, you snipped just as it was getting interesting. I assume that there were lots of entries for words being indexed, tags being parsed, and such? I can add another theory: It is a bug when merging a second index - all the "Deleted, invalid" pages come from the htdig run specified with the htmerge -m option This theory is easy to check out, I'll investigate tomorrow. OK, this brings a question to mind. Did you run htmerge separately on each of the two databases created by the htdig runs, before running htmerge to merge the two databases together? I think that, as a minimum, you must run htmerge after htdig to clean up the database before using it as the -m option for a merge. You may have to clean up the target database too - I'm not completely certain about that, but I know it can't hurt. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this.
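The order of operations suggested above (run htmerge on each database by itself before using one of them as the -m source) could be scripted roughly as follows. This is only a sketch: the config file names are placeholders, the run helper and its DRY_RUN switch are inventions of this example, and it assumes the usual 3.1.x options (htdig -i and -c, htmerge -c and -m).

```shell
#!/bin/sh
# Sketch: dig and clean two databases separately, then merge the second
# into the first. Config file names passed in are placeholders (assumptions).

# run CMD...: execute the command, or only print it when DRY_RUN=1
# (DRY_RUN is a convenience invented for this example, not an ht://Dig feature)
run() {
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# clean_then_merge TARGET_CONF SOURCE_CONF
clean_then_merge() {
    target=$1
    source_conf=$2
    for conf in "$target" "$source_conf"; do
        run htdig -i -c "$conf" || return 1     # initial dig, not an update dig
        run htmerge -c "$conf" || return 1      # clean up this database on its own
    done
    run htmerge -c "$target" -m "$source_conf"  # merge the source into the target
}
```

Calling clean_then_merge first.conf second.conf with DRY_RUN=1 set prints the five commands without running them, which is a cheap way to check the order before committing to a long dig.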
Re: [htdig] Htmerge: Deleted, invalid
On Tue, 25 Jul 2000, Gilles Detillieux wrote:

It is a bug when merging a second index - all the "Deleted, invalid" pages come from the htdig run specified with the htmerge -m option This theory is easy to check out, I'll investigate tomorrow.

you must run htmerge after htdig to clean up the database before using it as the -m option for a merge. You may have to clean up the target database too - I'm not completely certain about that, but I know it can't hurt.

You're right that it can't hurt, though the code should work fine without doing this. It would certainly help the issue at hand, however, if we knew that these messages didn't occur if you ran htmerge on just the database and *did* occur if you ran htmerge after merging.

-- -Geoff Hutchison Williams Students Online http://wso.williams.edu/
Re: [htdig] Htmerge: Deleted, invalid
Quoting Gilles Detillieux [EMAIL PROTECTED]: According to David Adams: I use the standard MIPSpro compiler. The script I use (thanks to my former collegeaue James Hammick) to setup the Makefile is: #!/bin/sh CFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CFLAGS CPPFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CPPFLAGS LDFLAGS="-mips4 -L/usr/lib32 -rpath /opt/local/htdig-3.1.5/lib"; export LDFLAGS ./configure --prefix=/opt/local/htdig-3.1.5 \ --with-cgi-bin-dir=/opt/local/htdig-3.1.5/cgi-bin \ --with-image-dir=/opt/local/htdig-3.1.5/graphics \ --with-search-dir=/opt/local/htdig-3.1.5/htdocs/sample A lot of that is site-specific, and the "-rpath directory" option is only needed because the compression library is not in a standard place on the machine on which htdig is run. The "-woff all" option suppresses most warning messages. I will remove it, recompile htdig and send the result directly to Gilles, it might contain a clue. As Sinclair mentioned, 'you need to have the 2.95.2 gcc and the latest gnu "make".' I don't know that anyone has ever gotten ht://Dig to work with SGI's own compiler. If fact, we got a lot of reports from folks who couldn't even get it to compile. If you're really determined to get to the bottom of this and make it work with the SGI compiler, I wish you well, but I doubt I can help much. I looked at the output you sent me, and didn't really see any red flags pointing to an obvious problem. I know that the Serialize and Deserialize functions for the db.docdb records can be a tad finicky, so that would probably be a place to look. There could also be problems with incorrect assumptions about word sizes, e.g. if the SGI compiler has 64-bit long ints. I'd also look at the db.wordlist records (they're ASCII text) before and after htmerge, to see if htdig is actually telling htmerge to remove some of these documents, or if htmerge is deciding to do so on its own. 
For the time being, the ht://Dig code hasn't had much of a workout on non-GNU compilers, so it doesn't seem to do well on them. If you can help remedy that, great. If you want to get the package working as quickly and easily as possible, I'd suggest trying the GNU C and C++ compilers. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a year and I have been very pleased with it. I would say that we've given it a good workout here. The problem with the "Deleted, invalid" messages only occurs with a second, relatively new search index. The first index is made from a single run of htdig covering 33 servers, all in the local domain, and on this week's initial dig htmerge reports 49,233 documents and not a single "Deleted, invalid". The second index is made from two runs of htdig covering a total 969 (yes 969 !) servers using a proxy. Htmerge reports a mere 3,096 documents and 86 "Deleted, invalid". I have looked at the db.wordlist files (which are written to only by htdig - is that right?) and it would appear that htdig is flagging the pages for htmerge to delete and is not finding any words in them. I can advance these theories: It is not a bug, but is due to the use of a proxy. (I use a proxy because without one, a portion of the sites on any run of htdig were found to be not responding or even unknown. With a proxy, htdig appears to have no such problems.) It is a bug due to the use of a proxy. It is a bug which only shows when compiled under IRIX. It is a bug which only occurs when there many different servers. I intend to re-build the second index using htdig -vvv and perhaps learn something. 
-- David Adams [EMAIL PROTECTED] Computing Services Southampton University
Re: [htdig] Htmerge: Deleted, invalid
According to Geoff Hutchison:

At 10:34 AM -0500 7/19/00, Gilles Detillieux wrote: I use the standard MIPSpro compiler. The script I use (thanks to my former colleague James Hammick) to set up the Makefile is:

I have used SGI's compiler on quite a lot of code

Have you been able to build ht://Dig using SGI's compiler? I may be wrong, but I recall seeing several error reports from SGI users, and the response was usually to use the GNU compiler. I'm not saying SGI's compiler is bad or buggy, just that I haven't heard of a successful build of ht://Dig with it. In all likelihood, it would be a problem in the ht://Dig code that just doesn't manifest itself when built with the GNU compiler.

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] Htmerge: Deleted, invalid
According to David Adams: I have been using htdig (3.1.2 and then 3.1.5) on an IRIX system for about a year and I have been very pleased with it. I would say that we've given it a good workout here. The problem with the "Deleted, invalid" messages only occurs with a second, relatively new search index. I guess I should have read your message before responding to Geoff's! The first index is made from a single run of htdig covering 33 servers, all in the local domain, and on this week's initial dig htmerge reports 49,233 documents and not a single "Deleted, invalid". The second index is made from two runs of htdig covering a total 969 (yes 969 !) servers using a proxy. Htmerge reports a mere 3,096 documents and 86 "Deleted, invalid". I have looked at the db.wordlist files (which are written to only by htdig - is that right?) Yes and no. htdig creates and writes the initial db.wordlist, then htmerge sorts it, merges words together, and processes flags for page removals. It then rewrites this file before creating the word index database. and it would appear that htdig is flagging the pages for htmerge to delete and is not finding any words in them. I can advance these theories: It is not a bug, but is due to the use of a proxy. (I use a proxy because without one, a portion of the sites on any run of htdig were found to be not responding or even unknown. With a proxy, htdig appears to have no such problems.) Hold on there! The problem of sites being down (unknown or not responding) is exactly the sort of thing that causes the "Deleted, invalid" situation, and I said so last week. How did you conclude that htdig appears to have no such problems with a proxy, when it does indeed appear to be having exactly that problem? It would make sense that if a site is not responding, the proxy would inform htdig of this (unless it happened to quietly substitute a cached copy of the requested page - assuming it had one), and htdig would respond the same way it would without a proxy. 
I think this is the most likely theory.

It is a bug due to the use of a proxy. It is a bug which only shows when compiled under IRIX. It is a bug which only occurs when there are many different servers. I intend to re-build the second index using htdig -vvv and perhaps learn something.

The only sure way to rule out an SGI compiler or IRIX-specific problem would be to run htdig on a Linux box with the same configuration and the same proxy, and see if you get the same results. However, based on what you said about a portion of the sites not responding, I'd guess this is a more likely problem. I guess there could also be a problem with the proxy server itself, causing it to act like a server is down when it isn't. You may want to try different proxies as well. In any case, a close look at htdig -vvv output should give some clues.

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] Htmerge: Deleted, invalid
On Mon, 24 Jul 2000, Gilles Detillieux wrote:

Have you been able to build ht://Dig using SGI's compiler? I may be

No.

build of ht://Dig with it. In all likelihood, it would be a problem in the ht://Dig code that just doesn't manifest itself when built with the GNU compiler.

Maybe. Except I sometimes have trouble building software that's *supposed* to work with SGI's compiler (like GCC or Emacs or CVS). Yes, I'd like to see ht://Dig compile cleanly with various native compilers. But since there seems to be an OK workaround (if not ideal), I'm not personally going to put much effort in this direction. Then again, I haven't used SGI's compiler extensively. I admit readily that I'd much rather compile GCC (or get binaries) and use them than fuss with the native one. In my group here at Northwestern, I'm not alone. Only SGI's Fortran compiler is used, in part because there isn't a GNU compiler for anything beyond f77.

-- -Geoff Hutchison Williams Students Online http://wso.williams.edu/
Re: [htdig] Htmerge: Deleted, invalid
At 10:34 AM -0500 7/19/00, Gilles Detillieux wrote: I use the standard MIPSpro compiler. The script I use (thanks to my former colleague James Hammick) to set up the Makefile is:

I have used SGI's compiler on quite a lot of code
Re: [htdig] Htmerge: Deleted, invalid
According to David Adams:

I use the standard MIPSpro compiler. The script I use (thanks to my former colleague James Hammick) to set up the Makefile is:

    #!/bin/sh
    CFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CFLAGS
    CPPFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CPPFLAGS
    LDFLAGS="-mips4 -L/usr/lib32 -rpath /opt/local/htdig-3.1.5/lib"; export LDFLAGS
    ./configure --prefix=/opt/local/htdig-3.1.5 \
        --with-cgi-bin-dir=/opt/local/htdig-3.1.5/cgi-bin \
        --with-image-dir=/opt/local/htdig-3.1.5/graphics \
        --with-search-dir=/opt/local/htdig-3.1.5/htdocs/sample

A lot of that is site-specific, and the "-rpath directory" option is only needed because the compression library is not in a standard place on the machine on which htdig is run. The "-woff all" option suppresses most warning messages. I will remove it, recompile htdig and send the result directly to Gilles, it might contain a clue.

As Sinclair mentioned, 'you need to have the 2.95.2 gcc and the latest gnu "make".' I don't know that anyone has ever gotten ht://Dig to work with SGI's own compiler. In fact, we got a lot of reports from folks who couldn't even get it to compile. If you're really determined to get to the bottom of this and make it work with the SGI compiler, I wish you well, but I doubt I can help much.

I looked at the output you sent me, and didn't really see any red flags pointing to an obvious problem. I know that the Serialize and Deserialize functions for the db.docdb records can be a tad finicky, so that would probably be a place to look. There could also be problems with incorrect assumptions about word sizes, e.g. if the SGI compiler has 64-bit long ints. I'd also look at the db.wordlist records (they're ASCII text) before and after htmerge, to see if htdig is actually telling htmerge to remove some of these documents, or if htmerge is deciding to do so on its own.

For the time being, the ht://Dig code hasn't had much of a workout on non-GNU compilers, so it doesn't seem to do well on them. If you can help remedy that, great. If you want to get the package working as quickly and easily as possible, I'd suggest trying the GNU C and C++ compilers.

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
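The before-and-after comparison of db.wordlist suggested above might be set up along these lines. This is a rough sketch, not part of ht://Dig: the function names and the ".before-htmerge" suffix are made up for illustration, the database directory is whatever your config points at, and nothing is assumed about the record format beyond its being line-oriented text.

```shell
#!/bin/sh
# Sketch only. The directory argument and the ".before-htmerge" suffix
# are assumptions of this example.

# snapshot_wordlist DB_DIR: keep a copy of db.wordlist as htdig left it.
# Run this after htdig and before htmerge.
snapshot_wordlist() {
    cp "$1/db.wordlist" "$1/db.wordlist.before-htmerge"
}

# wordlist_dropped DB_DIR: print the lines that were in the snapshot
# but are no longer in db.wordlist after htmerge has rewritten it.
wordlist_dropped() {
    before=$(mktemp) || return 1
    after=$(mktemp) || { rm -f "$before"; return 1; }
    sort "$1/db.wordlist.before-htmerge" > "$before"
    sort "$1/db.wordlist" > "$after"
    comm -23 "$before" "$after"   # lines unique to the "before" copy
    rm -f "$before" "$after"
}
```

Since htmerge sorts and merges word records when it rewrites the file, expect the output to be noisy; it is a starting point for spotting which documents' records vanished, not a precise report.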
Re: [htdig] Htmerge: Deleted, invalid
According to [EMAIL PROTECTED]:

I think there is a bug in htmerge 3.1.5 which causes it to declare some pages as "invalid" in some cases.

That may be, but I want to be sure we've ruled out every other possibility first. I've never seen a bug report like this, so it would be very unusual if it is indeed a bug showing up in your case, but not for other users. If you can find a consistent test case that fails on an initial dig, please provide details on your OS, version, config, etc. so that we can look into this further.

IRIX 6.5, Htdig 3.1.5 One of the symptoms is that there is no consistency. Today's re-index reported 84 pages to be invalid. Of these only one was from the http://www.tregalic.co.uk/sacred-heart/ site, and this time it was churchpage7.html. And that page is *NOT* found by any search on my index, though I can follow links to it from other pages and browse it. I don't see how you can investigate this yet, but unless people put in reports like mine you will always be able to claim that "no-one else is having this problem". I will continue to look for a pattern which might give a clue.

I'm inclined to think this is a platform-specific problem. Most of the trouble reports we've seen about IRIX systems are from users who can't even get htdig compiled, let alone running, so I don't think the package has had a thorough workout under IRIX. Which compiler did you use to build it?

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
Re: [htdig] Htmerge: Deleted, invalid
Quoting Gilles Detillieux [EMAIL PROTECTED]:

IRIX 6.5, Htdig 3.1.5 One of the symptoms is that there is no consistency. Today's re-index reported 84 pages to be invalid. Of these only one was from the http://www.tregalic.co.uk/sacred-heart/ site, and this time it was churchpage7.html. And that page is *NOT* found by any search on my index, though I can follow links to it from other pages and browse it. I don't see how you can investigate this yet, but unless people put in reports like mine you will always be able to claim that "no-one else is having this problem". I will continue to look for a pattern which might give a clue.

I'm inclined to think this is a platform-specific problem. Most of the trouble reports we've seen about IRIX systems are from users who can't even get htdig compiled, let alone running, so I don't think the package has had a thorough workout under IRIX. Which compiler did you use to build it?

-- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW: http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930

That is a possibility worth pursuing. I use the standard MIPSpro compiler. The script I use (thanks to my former colleague James Hammick) to set up the Makefile is:

    #!/bin/sh
    CFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CFLAGS
    CPPFLAGS="-woff all -O2 -mips4 -n32 -DHAVE_ALLOCA_H" ; export CPPFLAGS
    LDFLAGS="-mips4 -L/usr/lib32 -rpath /opt/local/htdig-3.1.5/lib"; export LDFLAGS
    ./configure --prefix=/opt/local/htdig-3.1.5 \
        --with-cgi-bin-dir=/opt/local/htdig-3.1.5/cgi-bin \
        --with-image-dir=/opt/local/htdig-3.1.5/graphics \
        --with-search-dir=/opt/local/htdig-3.1.5/htdocs/sample

A lot of that is site-specific, and the "-rpath directory" option is only needed because the compression library is not in a standard place on the machine on which htdig is run. The "-woff all" option suppresses most warning messages. I will remove it, recompile htdig and send the result directly to Gilles; it might contain a clue.

-- David Adams [EMAIL PROTECTED] Computing Services Southampton University
Re: [htdig] Htmerge: Deleted, invalid
Sorry for the length of this! According to David Adams: Why does htmerge 3.1.5 flag some pages, which look OK to me, as "Deleted, invalid" and not index them? This is happening not just with .html pages but also .doc and .pdf files. It happens with a simple merge following a run of htdig -i -a and also when two htdig runs are merged using the htdig -m option. htmerge does this when the remove_bad_urls attribute is true, and the page in question is not found (404 error), the server name no longer exists, the server is down, or in the case of an update dig, the page has been updated, superceding the old document database record for it. In the latter case, htdig creates a new record for the updated document, with a new DocID, so the old one is discarded. As this only happens in update digs, it wouldn't be the case during an htdig -i, so I'd look at the other possibilities. In any case, run both htdig and htmerge with at least two verbose options, and cross-reference the DocID of the "Deleted, invalid" messages to other messages with the same ID, to get a clearer picture of what's happening. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930 I've run htdig -vv followed by htmerge -vvv and I still cannot see any reason why htmerge decides, apparently arbitrarily, that a page is invalid. None of the reasons given above seem to fit. I'll take a single example: http://www.tregalic.co.uk/sacred-heart/, is one of many in the limit_urls_to directive. Htdig finds http://www.tregalic.co.uk/sacred-heart/ and then http://www.tregalic.co.uk/sacred-heart/churchpage1.html http://www.tregalic.co.uk/sacred-heart/churchpage2.html ... http://www.tregalic.co.uk/sacred-heart/churchpage7.html amongst others. 
Grepping for "churchpage" in the htmerge log I find: htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage1.html htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage2.html htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage3.html htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage4.html htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage5.html htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage6.html htmerge: Merged URL: http://www.tregalic.co.uk/sacred-heart/churchpage7.html 1897/http://www.tregalic.co.uk/sacred-heart/churchpage1.html 1898/http://www.tregalic.co.uk/sacred-heart/churchpage2.html 1899/http://www.tregalic.co.uk/sacred-heart/churchpage3.html Deleted, invalid: 1900/http://www.tregalic.co.uk/sacred-heart/churchpage4.html Deleted, invalid: 1901/http://www.tregalic.co.uk/sacred-heart/churchpage5.html 1902/http://www.tregalic.co.uk/sacred-heart/churchpage6.html 1903/http://www.tregalic.co.uk/sacred-heart/churchpage7.html So I try an experiment: I reduce limit_urls_to include only the starting URL and http://www.tregalic.co.uk/sacred-heart/ and run htdig htmerge. 
Then htmerge reports: htmerge: Total word count: 3806 0/http://www.soton.ac.uk/services/local/alpha.html 1/http://www.tregalic.co.uk/sacred-heart/ 9/http://www.tregalic.co.uk/sacred-heart/baptism.html 2/http://www.tregalic.co.uk/sacred-heart/churchpage1.html 3/http://www.tregalic.co.uk/sacred-heart/churchpage2.html 4/http://www.tregalic.co.uk/sacred-heart/churchpage3.html 5/http://www.tregalic.co.uk/sacred-heart/churchpage4.html 6/http://www.tregalic.co.uk/sacred-heart/churchpage5.html 7/http://www.tregalic.co.uk/sacred-heart/churchpage6.html 8/http://www.tregalic.co.uk/sacred-heart/churchpage7.html htmerge: 10 12/http://www.tregalic.co.uk/sacred-heart/information.html 11/http://www.tregalic.co.uk/sacred-heart/links.html 10/http://www.tregalic.co.uk/sacred-heart/newsletter.html I do not accept that pages 4 5 just happened to unavailable on the first occasion and available on the second. Nor can I see any differences in the htdig logs for these pages. The same sizes are reported in both cases. I think there is a bug in htmerge 3.1.5 which causes it to declare some pages as "invalid" in some cases. -- David J Adams [EMAIL PROTECTED] Computing Services University of Southampton To unsubscribe from the htdig mailing list, send a message to [EMAIL PROTECTED] You will receive a message to confirm this.
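The cross-referencing suggested earlier in the thread (match each "Deleted, invalid" entry against what htdig logged for the same page) can be automated with a couple of small shell functions. This is a sketch under stated assumptions: the function names are invented here, the log line shapes are taken only from the excerpts quoted in this thread, and it assumes one entry per line in the real log files. It matches on the URL, not the DocID, since the htdig excerpts shown do not obviously carry the same number.

```shell
#!/bin/sh
# Sketch only: log formats are inferred from the excerpts in this thread.

# deleted_entries HTMERGE_LOG: print "DocID URL" for each line of the form
#   Deleted, invalid: 1900/http://host/path
deleted_entries() {
    sed -n 's|^.*Deleted, invalid: \([0-9][0-9]*\)/\(.*\)$|\1 \2|p' "$1"
}

# xref_deleted HTMERGE_LOG HTDIG_LOG: for each deleted page, show every
# line of the htdig log (with line numbers) that mentions its URL.
xref_deleted() {
    deleted_entries "$1" | while read -r id url; do
        echo "== DocID $id  $url"
        grep -n -F -- "$url" "$2" || echo "   (no mention in htdig log)"
    done
}
```

Running xref_deleted htmerge.log htdig.log (both file names are placeholders) gives one section per deleted page, which makes it easier to see whether a page was fetched once, fetched twice, or never mentioned by htdig at all.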
Re: [htdig] Htmerge: Deleted, invalid
According to [EMAIL PROTECTED]: I've run htdig -vv followed by htmerge -vvv and I still cannot see any reason why htmerge decides, apparently arbitrarily, that a page is invalid. None of the reasons given above seem to fit. If you run htdig -vv, without a -i option, and you have an existing database, then htdig will run an update dig, not an initial dig, so it's possible that it will reindex churchpage4.html and churchpage5.html, but not the others. Are you certain that these two pages don't appear elsewhere in the htdig or htmerge logs, or for that matter that you're starting out without an existing database? So I try an experiment: I reduce limit_urls_to include only the starting URL and http://www.tregalic.co.uk/sacred-heart/ and run htdig htmerge. ... I do not accept that pages 4 5 just happened to unavailable on the first occasion and available on the second. Nor can I see any differences in the htdig logs for these pages. The same sizes are reported in both cases. If there's an existing database in the first case, but not the second, that may be the cause of the discrepancy. To be certain, use the -i option to htdig in all test cases, and let us know if it still finds these two pages as "invalid". I think there is a bug in htmerge 3.1.5 which causes it to declare some pages as "invalid" in some cases. That may be, but I want to be sure we've ruled out every other possibility first. I've never seen a bug report like this, so it would be very unusual if it is indeed a bug showing up in your case, but not for other users. If you can find a consistent test case that fails on an initial dig, please provide details on your OS, version, config, etc. so that we can look into this further. -- Gilles R. Detillieux E-mail: [EMAIL PROTECTED] Spinal Cord Research Centre WWW:http://www.scrc.umanitoba.ca/~grdetil Dept. Physiology, U. 
of Manitoba Phone: (204)789-3766 Winnipeg, MB R3E 3J7 (Canada) Fax:(204)789-3930
[htdig] Htmerge: Deleted, invalid
Why does htmerge 3.1.5 flag some pages, which look OK to me, as "Deleted, invalid" and not index them? This is happening not just with .html pages but also .doc and .pdf files. It happens with a simple merge following a run of htdig -i -a and also when two htdig runs are merged using the htmerge -m option.

David Adams [EMAIL PROTECTED] Computing Services Southampton University