I think there are two separate problems here.

1) For PowerPoint documents your server is sending the MIME type "application/powerpoint" (see the Content-Type header in your .ppt output), whereas the more usual type is "application/vnd.ms-powerpoint".

This is easily fixed. First, make sure that your external_parsers: statement includes "application/powerpoint". Secondly, modify doc2html.pl: change

   $mime_type = "application/vnd.ms-powerpoint";
to
   $mime_type = "application/vnd.ms-powerpoint|application/powerpoint";
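
For the first step, the external_parsers entry in htdig.conf might look something like the sketch below. The script path /usr/local/bin/doc2html.pl is an assumption for illustration only; use wherever doc2html.pl is actually installed, and keep whatever other MIME types you already list.

```
# Sketch only: the script path is hypothetical.
# The last line is the addition for the non-standard PowerPoint type.
external_parsers: application/msword->text/html /usr/local/bin/doc2html.pl \
                  application/vnd.ms-excel->text/html /usr/local/bin/doc2html.pl \
                  application/vnd.ms-powerpoint->text/html /usr/local/bin/doc2html.pl \
                  application/powerpoint->text/html /usr/local/bin/doc2html.pl
```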

2) "HTTP/1.1 401 Authorization Required" may be the result of a .htaccess file in one or more directories. Failing that, look closely at the web server's configuration files.

If "Authorization Required" is actually wanted, then it is possible to supply htdig with a username and password; see the online manuals.
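
As a rough illustration, credentials for HTTP Basic authentication can be given with the authorization attribute in htdig.conf. The username and password below are placeholders, not real values.

```
# Placeholder credentials: replace with the real ones for the "BTeV" realm.
# htdig sends these with each HTTP request using the Basic scheme.
authorization: myusername:mypassword
```

The same username:password pair can instead be passed on htdig's command line with the -u option.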

If there is a genuine security requirement on these files, then you have to decide whether they should be in a publicly accessible search index which contains excerpts from them.

David Adams
Corporate Information Services
Information Systems Services
University of Southampton

----- Original Message -----
From: "Jay Moore" <[EMAIL PROTECTED]>
To: <htdig-general@lists.sourceforge.net>
Sent: Tuesday, January 04, 2005 8:06 PM
Subject: [htdig] indexing problems for .ppt, .xls, .doc



I am using htdig-3.2.0-2.011302 (which I believe is htdig-3.2.0b6 ?) and am having trouble indexing Microsoft documents (.ppt, .doc, .xls). I think I have the htdig.conf set up correctly and am attempting to use doc2html.pl to parse documents.

The ps and pdf's are accessed locally and indexed with no problems. The Microsoft-type documents cannot be locally accessed (which is expected, I have read), but problems arise when accessing them via http. The ppt documents seemed to be accessed in that the file size is determined and the mime type disclosed, but they are not indexed. As for the .doc and .xls documents, the mime type cannot be deciphered and there is an authorization problem even though the authorization is correctly specified in the config file (it works for the other document types).

I have defined the external parser as doc2html.pl in the config file for all these mime types and have specified within doc2html.pl the paths to the different conversion routines (pstotext, pdf2html.pl, ppthtml, xlhtml, catdoc.)

Examples of the rundig -vvv output for each document type are below. Thanks to anyone for any ideas about what's wrong!

--Jay

1) For .pdf's :
pick: www-btev.fnal.gov, # servers = 1
> www-btev.fnal.gov supports HTTP persistent connections (infinite)
0:2:0:http://www-btev.fnal.gov/DocDB/0025/002515/001/UnivCyber.pdf: Trying local files
found existing file /www/BTEV/html/DocDB/0025/002515/001/UnivCyber.pdf
Read 8192 from document
Read 8192 from document
...


2) For .ps's :
pick: www-btev.fnal.gov, # servers = 1
> www-btev.fnal.gov supports HTTP persistent connections (infinite)
0:2:0:http://www-btev.fnal.gov/DocDB/0017/001714/002/ewv_proceedings.ps: Trying local files
found existing file /www/BTEV/html/DocDB/0017/001714/002/ewv_proceedings.ps
Read 8192 from document
Read 8192 from document
...


3) For .ppt's :
pick: www-btev.fnal.gov, # servers = 1
> www-btev.fnal.gov supports HTTP persistent connections (infinite)
0:2:0:http://www-btev.fnal.gov/DocDB/0025/002515/001/UnivCyber.ppt: Trying local files
found existing file /www/BTEV/html/DocDB/0025/002515/001/UnivCyber.ppt
Local retrieval failed, trying HTTP
Making HTTP request on http://www-btev.fnal.gov/DocDB/0025/002515/001/UnivCyber.ppt
Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 17 Nov 2004 15:45:50 GMT
Header line: Server: Apache/1.3.31 (Unix) mod_jk/1.2.6 PHP/4.1.2 mod_fastcgi/2.4.2 mod_ssl/2.8.19 OpenSSL/0.9.6e
Header line: Last-Modified: Wed, 11 Feb 2004 17:11:08 GMT
Header line: ETag: "93803a-6b9c00-402a622c"
Header line: Accept-Ranges: bytes
Header line: Content-Length: 7052288
Header line: Content-Type: application/powerpoint
Request time: 0 secs
size = 7052288
pick: www-btev.fnal.gov, # servers = 1
> www-btev.fnal.gov supports HTTP persistent connections (infinite)
ht://dig End Time: Wed Nov 17 09:45:51 2004
ID: 2 URL: http://www-btev.fnal.gov/DocDB/0025/002515/001/UnivCyber.ppt


4) For .xls's and .doc's :
> www-btev.fnal.gov supports HTTP persistent connections (infinite)
0:2:0:http://www-btev.fnal.gov/DocDB/0028/002818/001/Summary_04.xls: Trying local files
found existing file /www/BTEV/html/DocDB/0028/002818/001/Summary_04.xls
Local retrieval failed, trying HTTP
Making HTTP request on http://www-btev.fnal.gov/DocDB/0028/002818/001/Summary_04.xls
Header line: HTTP/1.1 401 Authorization Required
Header line: Date: Wed, 17 Nov 2004 15:40:07 GMT
Header line: Server: Apache/1.3.31 (Unix) mod_jk/1.2.6 PHP/4.1.2 mod_fastcgi/2.4.2 mod_ssl/2.8.19 OpenSSL/0.9.6e
Header line: WWW-Authenticate: Basic realm="BTeV"
Header line: Transfer-Encoding: chunked
Header line: Content-Type: text/html; charset=iso-8859-1
Request time: 0 secs
not authorized
pick: www-btev.fnal.gov, # servers = 1
> www-btev.fnal.gov supports HTTP persistent connections (infinite)
ht://dig End Time: Wed Nov 17 09:40:07 2004
Deleted, not found: ID: 2 URL: http://www-btev.fnal.gov/DocDB/0028/002818/001/Summary_04.xls





------------------------------------------------------- The SF.Net email is sponsored by: Beat the post-holiday blues Get a FREE limited edition SourceForge.net t-shirt from ThinkGeek. It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt _______________________________________________ ht://Dig general mailing list: <htdig-general@lists.sourceforge.net> ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html List information (subscribe/unsubscribe, etc.) https://lists.sourceforge.net/lists/listinfo/htdig-general



