According to Tsai, Jin:
> The HtDig indexing process seems to be ignore all of PDF, Word, Excel
> documents if the page sources are SSL-enabled as https and been handled by
> the external handler. The html pages seems to be indexed correctly even if
> they are from https web server.
...
> external_protocols: https /usr/local/bin/handler.pl
>
> However, all of PDF, Word, Excel, and PPT documents are indexed correctly if
> the page sources are via http, which is handled by the internal HtDig
> indexing handler.

I'd be interested in knowing how your handler.pl script emits the "t"
record for Content-Type, which htdig relies on to identify each document's
file type. E.g., for a PDF, it should emit a header record like this:

t:	application/pdf

with one tab and no spaces between the record type and the MIME type. If
it's not doing that, that may be the cause of your problem.

> The log of all PDF, Word, Excel, etc documents is recorded in
> /var/log/doc2html.log, and it shows no evidence of any documents been
> indexed if they are from https://. In addition, htstat -u shows no PDF,
> Word documents been indexed from any https web server.
>
> The HtDig search engine is version 3.2.0-1.b4.0.72 and is running on RedHat
> Linux v.7.2 with kernel 2.4.9-31.

The htdig-3.2.0-1.b4.0.72 RPM was built with a late October 2001 snapshot
of 3.2.0b4, which had a problem with how the external transport handler
managed the access time object. I don't know if that could lead to the
problem you report, but if handler.pl is functioning correctly, it may be
worth rebuilding with a more recent snapshot.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
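
P.S. To make the Content-Type record described above concrete, here is a minimal sketch in Python rather than the Perl of handler.pl. It only builds and checks the "t" record as I described it ("t:", one tab, then the MIME type). The function names and the extension map are hypothetical, made up for illustration, and falling back to application/octet-stream is my assumption, not something htdig requires.

```python
import posixpath
from urllib.parse import urlsplit

# Hypothetical extension-to-MIME map covering the document types in this thread.
MIME_BY_EXT = {
    ".pdf": "application/pdf",
    ".doc": "application/msword",
    ".xls": "application/vnd.ms-excel",
    ".ppt": "application/vnd.ms-powerpoint",
    ".html": "text/html",
}


def content_type_record(url, default="application/octet-stream"):
    """Build the Content-Type record: 't:', exactly one tab, then the MIME type."""
    ext = posixpath.splitext(urlsplit(url).path)[1].lower()
    # The fallback type is an assumption made for this sketch.
    return "t:\t" + MIME_BY_EXT.get(ext, default)


def is_well_formed_t_record(line):
    """Return True if the line is 't:' + one tab + a MIME type with no spaces."""
    parts = line.rstrip("\r\n").split("\t")
    if len(parts) != 2:  # zero tabs, or more than one
        return False
    tag, mime = parts
    return tag == "t:" and "/" in mime and " " not in mime


if __name__ == "__main__":
    print(content_type_record("https://www.example.com/docs/report.pdf"))
```

A real handler would more sensibly pass through the Content-Type that the https server reports instead of guessing from the file extension; the map is only here to keep the sketch self-contained. Running the handler by hand and feeding its header lines to a check like is_well_formed_t_record is a quick way to spot a space where the tab should be.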

