According to Frank Richter:
> > Could you set up a configuration file that digs only this document, e.g.:
> >
> > start_url: http://www.tu-chemnitz.de/wirtschaft/bwl2/download/portrait.doc
> >
> > and then run htdig with -vvvvvv, using this configuration, and your
> > current parse_word_doc.pl script. I'd like more info about what's
> > happening prior to the core dump.
>
> I did it, see attached file. You see many many binary data...
> Of course a workaround is to change the external parser to avoid such
> garbage, but htdig should be robust enough...

The log file you sent me unfortunately didn't tell me much, but I did
manage to reproduce the problem. I realised, when I saw how big the
portrait.doc file was, that my htdig was truncating it. I increased
max_doc_size to 2000000, and sure enough, htdig dumped core on your
document. In looking at your stack backtrace previously, I was so
focused on the garbage words that got_word was getting, that I failed
to realise the problem was the value for heading, which was way out of
range, and was being used, unchecked, as an array subscript.

The problem you reported seems to be different than the one Jesse had,
which I still can't reproduce, but I hope that with this patch, and my
earlier fixes to ExternalParser.cc, it'll solve that problem too!

Here's the patch for your problem, Frank. Now, instead of getting a
core dump, you'll get a whole bunch of External parser error messages.
For the sake of defensive programming, Retriever::got_word() should
probably still be fixed to check "heading" before using it as a
subscript, but I decided to put a check in ExternalParser.cc so the
error can be reported there.

--- ./htdig/ExternalParser.cc.wordbug   Tue Feb  9 18:26:08 1999
+++ ./htdig/ExternalParser.cc   Fri Feb 12 12:22:52 1999
@@ -148,6 +148,7 @@
     String      line;
     char        *token1, *token2, *token3;
+    int         loc, hd;
     URL         url;
     while (readLine(input, line))
     {
@@ -164,8 +165,10 @@
                 token2 = strtok(0, "\t");
                 if (token2 != NULL)
                     token3 = strtok(0, "\t");
-                if (token1 != NULL && token2 != NULL && token3 != NULL)
-                    retriever.got_word(token1, atoi(token2), atoi(token3));
+                if (token1 != NULL && token2 != NULL && token3 != NULL &&
+                        (loc = atoi(token2)) >= 0 && loc <= 1000 &&
+                        (hd = atoi(token3)) >= 0 && hd < 12)
+                    retriever.got_word(token1, loc, hd);
                 else
                     cerr<< "External parser error in line:"<<line<<"\n";
                 break;

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
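
For anyone who wants to try the same kind of range check outside of
htdig, here is a minimal standalone sketch. It is not htdig code: the
names WordRecord, parse_int_in_range and parse_word_record are made up
for illustration, and it assumes the word record has already been
reduced to "word<TAB>location<TAB>heading". The limits (0..1000 for the
location, 0..11 for the heading) are the ones used in the patch above.
One deliberate difference: the patch uses atoi(), which returns 0 for
non-numeric input, whereas this sketch uses strtol() so that garbage
fields are rejected rather than silently read as 0.

```cpp
#include <cerrno>
#include <cstdlib>
#include <sstream>
#include <string>

// Hypothetical record type, for illustration only. htdig itself passes
// these three values straight to Retriever::got_word().
struct WordRecord {
    std::string word;
    int location;
    int heading;
};

// Parse a decimal integer and require lo <= value <= hi.
// Unlike atoi(), this rejects empty, non-numeric and overflowing fields.
static bool parse_int_in_range(const std::string& text, long lo, long hi,
                               int& out)
{
    if (text.empty())
        return false;
    errno = 0;
    char* end = 0;
    long value = std::strtol(text.c_str(), &end, 10);
    if (errno != 0 || *end != '\0' || value < lo || value > hi)
        return false;
    out = static_cast<int>(value);
    return true;
}

// Split "word<TAB>location<TAB>heading" and validate both numbers.
// Returns false (leaving the caller to report an error) if any field is
// missing or out of range. The bounds are taken from the patch above.
static bool parse_word_record(const std::string& fields, WordRecord& rec)
{
    std::istringstream in(fields);
    std::string word, loc, hd;
    if (!std::getline(in, word, '\t') ||
        !std::getline(in, loc, '\t') ||
        !std::getline(in, hd, '\t'))
        return false;
    if (word.empty())
        return false;

    int location = 0, heading = 0;
    if (!parse_int_in_range(loc, 0, 1000, location) ||
        !parse_int_in_range(hd, 0, 11, heading))
        return false;

    rec.word = word;
    rec.location = location;
    rec.heading = heading;
    return true;
}
```

A caller would only hand the record on (for example, to something like
got_word) when parse_word_record() returns true, and print an "External
parser error" style message otherwise. Doing the check at the parsing
step, as the patch does, means the offending input line is still at
hand when the error is reported.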
