Word Doc:
$ od -b /export/home/htdig-3.1.6/scripts/doc2html/IntranetROI.doc | head -1
0000000 320 317 021 340 241 261 032 341 000 000 000 000 000 000 000 000

The magic number matches when the file is on the local box (which is
Solaris), but the file itself is served from an NT/IIS 4.0 box. I didn't
think that would cause a problem, but for kicks I downloaded hod, a nice
little octal dump program for Windows, and the dump output matches on NT as
well.

Since the Word RTF is working, here's the od output from it. 
RTF Doc: 
$ od -b /export/home/htdig-3.1.6/scripts/doc2html/IntranetROI_wo*.doc | head -1
0000000 173 134 162 164 146 061 134 141 156 163 151 134 141 156 163 151
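As a quick sanity check on the dump above, the octal values can be turned
back into characters. This is only a sketch; octal_to_text is a hypothetical
helper name, not something from doc2html.pl or htdig:

```shell
# Decode octal byte values (as printed by "od -b") back into characters.
# octal_to_text is a hypothetical helper, not part of doc2html.pl.
octal_to_text() {
    for b in "$@"; do
        # Build a printf escape such as \173 from each octal value.
        printf "\\$b"
    done
}

# The 16 bytes from the RTF dump above.
octal_to_text 173 134 162 164 146 061 134 141 156 163 151 134 141 156 163 151
# prints: {\rtf1\ansi\ansi
```

So the RTF file starts with the text "{\rtf1", while the Word file starts
with binary bytes.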

As of now, I have not modified anything in my doc2html.pl file since my last
email. 

Any other ideas? I do appreciate all the help! 

-Trevor



-----Original Message-----
From: Gilles Detillieux [mailto:[EMAIL PROTECTED]]
Sent: Friday, September 06, 2002 11:26 AM
To: Wendt, Trevor
Cc: [EMAIL PROTECTED]
Subject: Re: [htdig] htdig & wp2html problems


According to Wendt, Trevor:
> The mime-type (Content-Type) that's being returned for somefile.doc is
> "application/msword" - off of an IIS4 box. 
> 
> I have "application/msword" configured in my htdig .conf file under: 
>    external_parsers: \
>      application/pdf->text/html /export/home/htdig-3.1.6/scripts/doc2html/pdf2html.pl \
>      application/postscript->text/html /export/home/htdig-3.1.6/scripts/doc2html/doc2html.pl \
>      application/msword->text/html /export/home/htdig-3.1.6/scripts/doc2html/doc2html.pl
> 
> Please see attached doc2html.pl.txt for configuration settings there - all
> appear to be set up correctly to me. 
> 
> Still receiving the same error on all MS Word (95, 97 & 2000) documents:
>   5:5:1:http://www.domain.com/somefile.doc: !  UNABLE to convert
> 
> I did a little more testing and tried the same document but in different
> formats. One in Word95, one in Word 97 (which is the same as 2000), and
> Word97 RTF.  The Word 97 RTF worked great, but the other *.doc files all
> produced the same error (! UNABLE to convert). 

Well, all your settings in doc2html.pl appear correct to me too.  However,
if you're getting "UNABLE to convert", it's because it can't find a
conversion method that matches both the Content-Type and magic number of
your document.  Obviously you have a few methods that define a Content-Type
of application/msword, so the mismatch for somefile.doc must be the magic
number.  Your Word (wp2html) method defines...

    $magic = '^\320\317\021\340';

What do you see when you do "od -b somefile.doc | head -1"?  If the first
4 bytes of the file aren't 320 317 021 340, then you'll need to edit the
$magic definition to allow one or more alternate magic numbers to match
those in your Word documents.  E.g.:

    $magic = '^\320\317\021\340|^\333\245-\000|^\3767\000#\000\000\000\000';

The reason for these magic numbers is so that doc2html.pl can do further
selections, when Content-Type alone isn't enough.  Most web servers will
tag any .doc file as application/msword, even though some .doc files
are in fact WP, RTF, or ASCII documents.  You may also need different
conversion filters for different variations or versions of a given file
type.  The magic number tests allow you to do this further weeding out,
but it can be a pain to set up correctly, as not all file formats have
consistent and predictable magic numbers or strings.
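
To illustrate the kind of weeding out described above, here is a minimal
shell sketch that classifies a .doc file by its first four bytes. It is not
how doc2html.pl does it (that script matches a Perl regular expression in
$magic); first4 and classify are hypothetical helper names, and only the
two byte sequences quoted in this thread are handled:

```shell
#!/bin/sh
# Print the first 4 bytes of a file as space-separated octal values.
# first4 is a hypothetical helper, not part of doc2html.pl.
first4() {
    od -A n -t o1 -N 4 "$1" | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

# Classify a file by its leading bytes, independent of its Content-Type.
classify() {
    case "$(first4 "$1")" in
        "320 317 021 340") echo "ole2-word" ;;  # same bytes as the Word $magic
        "173 134 162 164") echo "rtf" ;;        # the characters {\rt
        *)                 echo "unknown" ;;
    esac
}

# Usage (the file name is only an example):
#   classify somefile.doc
```

A file served as application/msword but reported as "rtf" or "unknown" here
would need a different conversion method, or an alternate magic number.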

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)

