Hi,

The HtDig indexing process seems to ignore all PDF, Word, and Excel documents when the pages are served over SSL (https) and fetched by the external protocol handler.  HTML pages seem to be indexed correctly even when they come from an https web server.

Listed below is part of the /etc/htdig.conf configuration file:


external_parsers:       application/rtf->text/html /usr/local/bin/doc2html.pl \
                        text/rtf->text/html /usr/local/bin/doc2html.pl \
                        application/pdf->text/html /usr/local/bin/doc2html.pl \
                        application/postscript->text/html /usr/local/bin/doc2html.pl \
                        application/msword->text/html /usr/local/bin/doc2html.pl \
                        application/msexcel->text/html /usr/local/bin/doc2html.pl \
                        application/vnd.ms-excel->text/html /usr/local/bin/doc2html.pl \
                        application/vnd.ms-powerpoint->text/html /usr/local/bin/doc2html.pl \
                        application/x-shockwave-flash->text/html /usr/local/bin/doc2html.pl \
                        application/x-shockwave-flash2-preview->text/html /usr/local/bin/doc2html.pl

external_protocols:     https /usr/local/bin/handler.pl
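
As a rough way to double-check the mapping above, here is a minimal Python sketch that reads an external_parsers attribute and reports which converter, if any, is configured for a given Content-Type value. The function names (parse_external_parsers, converter_for) and the shortened sample text are hypothetical and only for illustration; this is not part of HtDig and does not reproduce how HtDig itself matches content types.

```python
import re


def parse_external_parsers(conf_text):
    """Return {source_mime: (target_mime, command)} from htdig.conf text.

    Hypothetical helper for illustration; it only understands the
    'from->to program' pair layout shown in the configuration above.
    """
    # Join backslash-continued lines into one logical line.
    joined = re.sub(r"\\[ \t]*\n", " ", conf_text)
    mapping = {}
    for line in joined.splitlines():
        line = line.strip()
        if not line.startswith("external_parsers:"):
            continue
        tokens = line.split(":", 1)[1].split()
        # Tokens come in pairs: "from->to", then the program path.
        for spec, command in zip(tokens[0::2], tokens[1::2]):
            source, _, target = spec.partition("->")
            mapping[source.lower()] = (target or None, command)
    return mapping


def converter_for(content_type, mapping):
    """Look up the converter for a Content-Type header value, if any."""
    # Drop parameters such as "; charset=..." before the lookup.
    mime = content_type.split(";", 1)[0].strip().lower()
    return mapping.get(mime)


# Shortened sample based on the configuration quoted above.
SAMPLE = """\
external_parsers:       application/pdf->text/html /usr/local/bin/doc2html.pl \\
                        application/msword->text/html /usr/local/bin/doc2html.pl

external_protocols:     https /usr/local/bin/handler.pl
"""

if __name__ == "__main__":
    table = parse_external_parsers(SAMPLE)
    for mime in ("application/pdf", "text/plain"):
        print(mime, "->", converter_for(mime, table))
```

Feeding it the Content-Type that the https server actually returns for a PDF or Word URL would show whether that type has an entry in the list at all; whether a mismatch there is what causes the problem is only a guess.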

However, all PDF, Word, Excel, and PPT documents are indexed correctly when the pages are served via http, which is handled by HtDig's internal indexing handler.

Processing of PDF, Word, Excel, etc. documents is logged in /var/log/doc2html.log, and the log shows no evidence of any documents being indexed when they come from https:// URLs.  In addition, htstat -u shows no PDF or Word documents indexed from any https web server.

The HtDig search engine is version 3.2.0-1.b4.0.72, running on Red Hat Linux 7.2 with kernel 2.4.9-31.

I would appreciate it if anyone could share ideas and/or a workaround for this https indexing issue.  Thank you.

Best Regards,
 
Jin Tsai
Florida Hospital, MIS
Phone: (407) 303-9539
