According to Wendt, Trevor:
> The mime-type (Content-Type) that's being returned for somefile.doc is
> "application/msword" - off of an IIS4 box. 
> 
> I have "application/msword" configured in my htdig .conf file under: 
>    external_parsers: \
>      application/pdf->text/html /export/home/htdig-3.1.6/scripts/doc2html/pdf2html.pl \
>      application/postscript->text/html /export/home/htdig-3.1.6/scripts/doc2html/doc2html.pl \
>      application/msword->text/html /export/home/htdig-3.1.6/scripts/doc2html/doc2html.pl
> 
> Please see attached doc2html.pl.txt for configuration settings there - all
> appear to be setup correctly to me. 
> 
> Still receiving the same error, on all MS Word (95, 97 & 2000) documents:
>   5:5:1:http://www.domain.com/somefile.doc: !  UNABLE to convert
> 
> I did a little more testing and tried the same document but in different
> formats. One in Word95, one in Word 97 (which is the same as 2000), and
> Word97 RTF.  The Word 97 RTF worked great, but the other *.doc files all
> produced the same error (! UNABLE to convert). 

Well, all your settings in doc2html.pl appear correct to me too.  However,
if you're getting "UNABLE to convert", it's because doc2html.pl can't find a
conversion method that matches both the Content-Type and the magic number of
your document.  You clearly have a few methods that define a Content-Type of
application/msword, so the mismatch for somefile.doc must be in the magic
number.  Your Word (wp2html) method defines...

    $magic = '^\320\317\021\340';

What do you see when you do "od -b somefile.doc | head -1"?  If the first
4 bytes of the file aren't 320 317 021 340, then you'll need to edit the
$magic definition to allow one or more alternate magic numbers to match
those in your Word documents.  E.g.:

    $magic = '^\320\317\021\340|^\333\245-\000|^\3767\000#\000\000\000\000';
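
If you want to check a handful of files quickly, here is a minimal sketch in
Python (not part of doc2html.pl, which is Perl) that tests the start of a file
against the same alternation and prints the leading bytes in octal, the way
"od -b" shows them.  The function names are made up for illustration.

```python
import re

# Same alternation as the $magic example above.  In a bytes pattern, a
# three-digit octal escape such as \320 stands for a single raw byte.
MAGIC = re.compile(
    rb'^\320\317\021\340|^\333\245-\000|^\3767\000#\000\000\000\000'
)


def matches_magic(head: bytes) -> bool:
    """Return True if the leading bytes match one of the magic numbers."""
    return MAGIC.match(head) is not None


def octal_dump(head: bytes, count: int = 4) -> str:
    """Format the first `count` bytes as 3-digit octal, like od -b."""
    return " ".join(f"{b:03o}" for b in head[:count])


def check_file(path: str) -> None:
    """Print the leading bytes of a file and whether they match MAGIC."""
    with open(path, "rb") as fh:
        head = fh.read(16)
    verdict = "match" if matches_magic(head) else "NO match"
    print(f"{path}: {octal_dump(head)} -> {verdict}")
```

Calling check_file("somefile.doc") on each problem document should show
whether its first bytes are covered by the pattern or need another alternate.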

The reason for these magic numbers is so that doc2html.pl can do further
selections, when Content-Type alone isn't enough.  Most web servers will
tag any .doc file as application/msword, even though some .doc files
are in fact WP, RTF, or ASCII documents.  You may also need different
conversion filters for different variations or versions of a given file
type.  The magic number tests allow you to do this further weeding out,
but they can be a pain to set up correctly, as not all file formats have
consistent and predictable magic numbers or strings.
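
To make the two-stage selection concrete, here is a rough sketch in Python of
the idea, not the actual doc2html.pl code.  The method table, its names, and
the RTF magic string are assumptions for illustration only.  A method is
chosen only when both its Content-Type pattern and its magic pattern match.

```python
import re
from typing import NamedTuple, Optional


class Method(NamedTuple):
    name: str                  # hypothetical label for a conversion method
    content_type: re.Pattern   # matched against the server's Content-Type
    magic: re.Pattern          # matched against the first bytes of the file


# Hypothetical table; the real methods are configured in doc2html.pl.
METHODS = [
    Method("word", re.compile(r"application/msword"),
           re.compile(rb"^\320\317\021\340")),
    Method("rtf", re.compile(r"application/(msword|rtf)"),
           re.compile(rb"^\{\\rtf")),
]


def pick_method(content_type: str, head: bytes) -> Optional[str]:
    """Return the first method matching both tests, or None if none does."""
    for method in METHODS:
        if method.content_type.search(content_type) and method.magic.search(head):
            return method.name
    return None  # corresponds to the "UNABLE to convert" case
```

With a table like this, an RTF file served as application/msword is still
picked up by the "rtf" entry, while a .doc file whose leading bytes match no
magic pattern falls through to None, which is the situation you're seeing.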

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)


_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>