Martin,

I am a bit surprised by this, but perhaps I am just fortunate that the latest formats haven't yet appeared on the web sites I index.

PDF
The latest version of xpdf is 3.00 and that claims to cope with PDF 1.5. It certainly seems to work with those I have encountered. Do you get an error message with PDF 1.6 files or does it fail silently?

PPT
pptHtml seems to extract some text from the PowerPoint files I find on searching. However, I discovered that it does not extract text beyond the first embedded image. It also frequently produces the error message:

travel: cole: No such file or directory

Does anyone know what that means?

DOC
I believe that the format of Word documents has not changed since Word97, so most of the available converters should be fine. Using doc2html would enable you to cope with .doc files which are in fact plain text or RTF files, or even WordPerfect documents.

Excel
I don't search these; I decided that a full text index of spreadsheets was a step too far. I believe there are Perl modules for reading spreadsheets which may be suitable. There is also at least one free utility which converts Excel files to .csv files, which is bundled with the catdoc utility.

Flash files
I try to extract links from Shockwave Flash files. The utility swfdump (aka swfparse) only handles Flash files version 1 to 5. For version 6 (Flash MX) files I use a Java application called JGenerator. (So to parse a Flash MX file htdig calls a Perl script (doc2html.pl) which calls a second Perl script which calls a Java application which reads the file and sends output back up the line to htdig!)

David Adams
Corporate Information Services
Information Systems Services
University of Southampton

----- Original Message -----
From: "Martin Allert" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, August 31, 2004 7:34 AM
Subject: [htdig] office document converters

Dear Ladies and Gentlemen,

We have a major problem using ht://Dig with sites mostly hosting MS Office documents, e.g. a fileserver or a document archive browsable through the web. We are using ht://Dig 3.1.6 SSL on Solaris 8 / 9 systems.

I think you discovered the following as well:

- Office document types & versions are changing rapidly.
- Most OpenSource native binary converters are no longer being developed, or soon will not be.

This leads to the problem that those OpenSource converters crash with a segfault, or cause a significantly high load on the servers because the subprocesses don't return on some non-parsable documents. This causes the indexing process to hang and stop. Unfortunately, this will put ht://Dig out of action if these document types can't be converted to HTML and thus indexed. I can't make continuous test runs and constantly extend the exclusion list, as this would also mean cutting off ht://Dig.

We used the following converters:

- pdf: Xpdf
  Xpdf can open most PDF files, but not those from Acrobat 5 to 6 (PDF format 1.5/1.6).
- ppt: ppthtml
  This tool has not been developed since '98 and can only process 97/98 PowerPoint files.
- doc: wvware
  Fortunately, this works quite well.
- xls: xlhtml
  See ppt, but development has stalled since 04-13-02.

They worked fine until Office 2000 came out.

I know there is doc2html.pl (or whatever it is called), but you have to tell doc2html.pl which native converter to use, and that leads back to the beginning. :))

Personally, I am on the verge of giving up. What type of converters do you use? What experience have you folks on the list had? Can you tell me some URLs where to look for better converters? I don't have any problem with buying some <convert-my-world> software, if it is cheaper than HTML-Transit. :) Or I will turn to using Lucene (http://www.jguru.com/faq/Lucene) instead.

Yours sincerely,
Martin Allert

--
--------------------------------------------------------
arago AG, Institut fuer komplexes Datenmanagement
Am Niddatal 3, 60488 Frankfurt/Main, [EMAIL PROTECTED]
Tel.
069/405680, Fax 069/40568111, http://www.arago.de
--------------------------------------------------------
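
One way to contain the failure described in the original message (converter subprocesses that segfault or never return, hanging the whole indexing run) is to launch each converter through a small wrapper that enforces a time limit. The following is only a minimal sketch, not part of ht://Dig or doc2html.pl: the function name run_converter and the 60-second default are assumptions made for illustration, and the wrapper would still have to be wired into whatever script ht://Dig calls as its external parser.

```python
import subprocess
import sys
from typing import List, Optional


def run_converter(cmd: List[str], timeout_s: float = 60.0) -> Optional[str]:
    """Run an external converter command and return its stdout as text.

    Returns None if the converter cannot be started, exceeds the time
    limit, or exits with a non-zero status (a crash such as a segfault
    shows up as a negative return code). The name and the default
    timeout are illustrative assumptions.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.PIPE,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except subprocess.TimeoutExpired:
        # The converter hung on an unparsable document; skip the file.
        return None
    except OSError:
        # The converter binary is missing or not executable.
        return None
    if result.returncode != 0:
        return None
    return result.stdout.decode("utf-8", errors="replace")


if __name__ == "__main__":
    # Example: xpdf's pdftotext writes to stdout when the output name is "-".
    # The input path comes from the command line.
    if len(sys.argv) > 1:
        text = run_converter(["pdftotext", sys.argv[1], "-"], timeout_s=30)
        if text is None:
            sys.stderr.write("conversion failed or timed out; skipping\n")
        else:
            sys.stdout.write(text)
```

With a wrapper of this kind a bad document costs at most the timeout and is then skipped, so the exclusion list does not have to be extended by hand after every failed run.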