According to Tim Cleary:
> Gilles:
>
> I had a question about your perl script, conv_doc_PL that I found on htdig.
>
> I hope that it is all right to send you an email directly, my apologies if
> not.

It's always better to post questions like this to htdig-general. As you've
no doubt found out, there are many more users than just me who can provide
answers.

> Basically, I have a set of PDF, Excel/XLS, and Powerpoint/PPT files that I
> want to be able to index.
>
> I have installed (OS X) 3 converters (PDFtoHTML, XLHTML, and PPTHTML) in
> usr/local/bin. I tried just having these pass files directly from htdig but
> there are periodic errors with individual files that stop the whole indexing
> process.
>
> I don't know anything about perl, but here was my naïve plan:
>
> Take your perl script and replace out my 3 converters in the top part of
> your script (e.g,. Instead of word perfect filter, put in Excel)
>
> Use the logic you have which sends it to a different converter by guessing
> that the first 8 bytes of PDF will contain "PDF" (as you have in your
> script), XLS for Excel, PPT for Powerpoint.

No, it's not that simple, I'm afraid. There's no unified standard for
identification strings in various file formats, so you can't just patch in
guesses like that and expect them to work. MS Office in particular doesn't
make it easy to distinguish between different file types by means of an
identification string -- they all seem to use the same sequence of bytes --
so you need to also resort to using the Content-Type argument (which the
server returns usually based on the file name's extension).

> For each of these, paste in the command line equivalents in where you create
> the convertercommand to be passed. For example, with xlhtml, you have to
> pass -stdout for it to spit out the conversion vs. the default of just
> creating a file.
>
> Does this seem like a reasonable approach?
>
> Thanks for any suggestions you might have.

I don't think conv_doc.pl is the best starting point for your project.
I'd recommend either doc2html.pl (see http://www.htdig.org/FAQ.html#q4.8),
or the scripts that Martin Allert just sent in earlier today.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/
Dept. Physiology, U. of Manitoba  Winnipeg, MB  R3E 3J7  (Canada)
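
To illustrate the point about identification strings, here is a minimal sketch in Python (not Perl, and not taken from conv_doc.pl or doc2html.pl). It assumes the PDF marker sits at the very start of the file, and that legacy Office files begin with the shared OLE2 container signature, which is why the Content-Type has to break the tie. The function name, the returned labels, and the Content-Type table are all hypothetical and chosen only for illustration.

```python
# Sketch only: every name here is illustrative, not part of any ht://Dig script.

PDF_MAGIC = b"%PDF-"
# Legacy MS Office files (.doc, .xls, .ppt) share one OLE2 container signature.
OLE2_MAGIC = bytes.fromhex("D0CF11E0A1B11AE1")

# Content-Type values as a web server would typically report them (assumed set).
OLE2_BY_CONTENT_TYPE = {
    "application/msword": "word",
    "application/vnd.ms-excel": "excel",
    "application/vnd.ms-powerpoint": "powerpoint",
}


def classify(header: bytes, content_type: str) -> str:
    """Return a document kind, or "unknown" if it cannot be determined."""
    # Drop parameters such as "; charset=..." and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()
    if header.startswith(PDF_MAGIC):
        return "pdf"
    if header.startswith(OLE2_MAGIC):
        # The leading bytes cannot say which Office application wrote the file,
        # so fall back on what the server claimed.
        return OLE2_BY_CONTENT_TYPE.get(mime, "unknown")
    return "unknown"
```

A wrapper script could read the first few bytes of the file, call classify() with the Content-Type it was given, and only then decide which converter to run. Returning "unknown" rather than guessing gives the wrapper a chance to skip the file instead of handing it to the wrong converter.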

