Hi,
The HtDig indexing process seems to ignore all PDF, Word, and Excel
documents when the pages are served over SSL (https) and are fetched by
the external protocol handler. HTML pages seem to be indexed correctly
even when they come from an https web server.
Below is the relevant part of the /etc/htdig.conf configuration file:

external_parsers: application/rtf->text/html /usr/local/bin/doc2html.pl \
    text/rtf->text/html /usr/local/bin/doc2html.pl \
    application/pdf->text/html /usr/local/bin/doc2html.pl \
    application/postscript->text/html /usr/local/bin/doc2html.pl \
    application/msword->text/html /usr/local/bin/doc2html.pl \
    application/msexcel->text/html /usr/local/bin/doc2html.pl \
    application/vnd.ms-excel->text/html /usr/local/bin/doc2html.pl \
    application/vnd.ms-powerpoint->text/html /usr/local/bin/doc2html.pl \
    application/x-shockwave-flash->text/html /usr/local/bin/doc2html.pl \
    application/x-shockwave-flash2-preview->text/html /usr/local/bin/doc2html.pl

external_protocols: https /usr/local/bin/handler.pl

However, all PDF, Word, Excel, and PPT documents are indexed correctly
when the pages are served over http, which is handled by HtDig's
internal transport rather than the external handler.
Conversions of PDF, Word, Excel, etc. documents are logged in /var/log/doc2html.log, and that log shows no evidence of any document from an https:// URL being converted or indexed. In addition, htstat -u lists no PDF or Word documents indexed from any https web server.
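
One possible way to narrow this down, offered as an untested sketch rather than a confirmed diagnosis: the external_parsers entries are keyed on MIME type, so it may be worth checking which Content-Type the https server reports for the skipped documents. The Python script below is hypothetical and not part of HtDig. It sends a HEAD request for each URL given on the command line and reports whether the returned type matches one of the types in the external_parsers list above. It only shows what the web server sends, not what /usr/local/bin/handler.pl passes on to htdig, and it assumes a Python build with SSL support and a certificate that Python will accept.

```python
import sys
import urllib.request

# MIME types copied from the external_parsers list in /etc/htdig.conf above.
PARSER_TYPES = {
    "application/rtf",
    "text/rtf",
    "application/pdf",
    "application/postscript",
    "application/msword",
    "application/msexcel",
    "application/vnd.ms-excel",
    "application/vnd.ms-powerpoint",
    "application/x-shockwave-flash",
    "application/x-shockwave-flash2-preview",
}


def normalize(content_type):
    """Drop parameters such as '; charset=...' and lowercase the type."""
    return content_type.split(";", 1)[0].strip().lower()


def has_parser(content_type):
    """Return True if the reported type matches an external_parsers entry."""
    return normalize(content_type) in PARSER_TYPES


def reported_type(url):
    """Ask the server (via a HEAD request) which Content-Type it sends."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request) as response:
        return response.headers.get("Content-Type", "")


if __name__ == "__main__":
    # Usage (hypothetical file name): python check_types.py https://host/doc.pdf
    for url in sys.argv[1:]:
        ctype = reported_type(url)
        verdict = "parser configured" if has_parser(ctype) else "NO parser match"
        print(f"{url}\t{ctype or '(none)'}\t{verdict}")
```

If a document shows "NO parser match" over https but the expected type over http, the difference in Content-Type would be a plausible place to look next. If the types match in both cases, the handler's own output is the more likely suspect.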
The HtDig search engine is version 3.2.0-1.b4.0.72 and is running on RedHat Linux v.7.2 with kernel 2.4.9-31.
I would appreciate it if anyone could share ideas or a workaround for this https indexing issue. Thank you.

