Howdy:

I now have a simple custom search form defined, with one selection box for topic area 
(using the exclude parameter to htsearch), and one for document (report) type using 
the restrict parameter.  It seems to work the way I want it to now, except for a 
couple of issues.  I'm using this custom search on a tree of M$ Word documents.  I also 
use HtDig on a large HTML tree, but I don't have any of the following problems with 
that one.

1) I often get one or more directory indices (i.e., a URL that points to a directory) 
returned in the search results.  Sometimes they come out with a high score (at the 
top) and sometimes a low score (at the bottom).  I believe this is a result of using 
"doc" as a hidden search term, because the directory index is seen by htsearch as just 
another document (with a list of *.doc files).  This was my boss's idea so he could get 
search results without entering any search terms (just select from the select boxes on 
the form).  I said this was probably not a good idea...  Any work-around tips or other 
suggestions?  I want Apache to generate directory indices (in general) but I don't know 
of a way to turn that off for a given set of sub-directories.  Can anything in HtDig 
help me?

2) Using catdoc to convert the doc files to text, I sometimes get binary garbage in 
the long-form results.  Sometimes it's just a few characters, sometimes it's a *very* 
long string of garbage.  Here is an example of the former:

Word Document AR502-05.DOC 
    �������� � PROJECT/TASK : TPS/502 REPORT NO. : AR502-05
    ^^^^^^^^^^

I'd *really* like to get rid of these annoying garbage characters; I'm about to try a 
newer version of wv (wordview, whatever) to see if it helps.  The funny thing is, it 
only happens on some Word docs.  Most are converted fine (i.e., without the garbage).  
Anybody have any tips for this one?
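
For what it's worth, the only stopgap I can think of is to scrub the converted text 
before it gets indexed.  Here is a rough sketch in Python, assuming the "garbage" is 
just control characters plus the Unicode replacement character (the boxes shown 
above).  The name strip_garbage is my own invention, not part of catdoc, wv or HtDig, 
and I haven't worked out how it would hook into the conversion step:

```python
import re

# Assumption: "garbage" means ASCII control characters (other than tab,
# newline and carriage return) plus U+FFFD, the replacement character that
# shows up when bytes cannot be decoded.
_GARBAGE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f\ufffd]+")


def strip_garbage(text: str) -> str:
    """Hypothetical helper: blank out garbage runs, then tidy the spacing."""
    # Replace each run of garbage characters with a single space.
    cleaned = _GARBAGE.sub(" ", text)
    # Collapse the runs of spaces/tabs this leaves behind, and trim the ends.
    return re.sub(r"[ \t]{2,}", " ", cleaned).strip()
```

Run on the excerpt line above, this should leave only the part starting at 
"PROJECT/TASK".  It would only hide the symptom, though; it does nothing about why 
some docs convert badly in the first place.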

Thanks in advance for any suggestions, Steve

****************************************************************
Stephen L. Arnold                        Senior Systems Engineer
VAFB IV&V Activity                email:  [EMAIL PROTECTED]
ENSCO Inc.                            www:  http://www.ensco.com
P.O. Box 5488                                voice: 805.606.8838
Vandenberg AFB, CA  93437                      fax: 805.734.4779
                         
with Std.Disclaimer;  use Std.Disclaimer;
****************************************************************
 

