Dear Ladies and Gentlemen,

We have a major problem using ht://Dig with sites that mostly host MS Office documents, e.g. a file server or a document archive browsable through the web. We are using ht://Dig 3.1.6 SSL on Solaris 8 / 9 systems.

I think you have discovered the following as well:

- Office document types and versions are changing rapidly.
- Most open-source native binary converters are no longer being developed, or soon will not be.

As a result, those open-source converters either crash with a segfault or put a very high load on the servers, because the subprocesses do not return when they hit documents they cannot parse. This causes the indexing process to hang and stop. Unfortunately, this puts ht://Dig out of action: if these document types can't be converted to HTML, they can't be indexed. I can't keep making test runs and constantly extending the exclusion list, as that would effectively cut ht://Dig off as well.

We used the following converters:

- pdf: Xpdf
  Xpdf can open most PDF files, but not those from Acrobat 5 to 6 (PDF format 1.5/1.6).
- ppt: ppthtml
  This tool has not been developed since '98 and can only process PowerPoint 97/98 files.
- doc: wvware
  Fortunately, this works quite well.
- xls: xlhtml
  See ppt, but development has been stalled since 04-13-02.

They worked fine until Office 2000 came out.

I know there is doc2html.pl (or whatever it is called), but you have to tell doc2html.pl which native converter to use, and that leads back to the original problem. :))

Personally, I am on the verge of giving up. What converters do you use? What experience have you folks on the list had? Can you point me to some URLs where I can look for better converters? I have no problem buying some <convert-my-world> software, if it is cheaper than HTML-Transit. :) Otherwise I will turn to Lucene (http://www.jguru.com/faq/Lucene) instead.
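To make the failure mode above concrete: the indexing run stalls because a converter subprocess either never returns or dies with a segfault. A possible stopgap, sketched here only as an illustration, is to run each converter under a time limit and treat a timeout or a non-zero exit as "no output". The function name `run_converter` and the 30-second default are assumptions chosen for this example, not part of ht://Dig or of any of the converters named above, and the sketch assumes Python 3.5 or later.

```python
import subprocess


def run_converter(cmd, timeout_s=30):
    """Run an external converter command (hypothetical helper).

    cmd is the converter's argument list, e.g. [converter_path, input_file].
    Returns the converter's stdout as bytes, or None if the converter
    timed out, exited with a non-zero status (on POSIX a segfault shows
    up as a negative return code), or could not be started at all.
    """
    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # arbitrary limit, chosen for illustration
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child process before raising this.
        return None
    except OSError:
        # Converter binary missing or not executable.
        return None
    if result.returncode != 0:
        return None
    return result.stdout
```

A caller could skip any document for which this returns None, so that one bad file costs at most `timeout_s` seconds instead of hanging the whole run. It does not fix the converters themselves, and a converter that spawns its own child processes might need extra handling.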

Yours sincerely,
Martin Allert

--
--------------------------------------------------------
arago AG, Institut fuer komplexes Datenmanagement
Am Niddatal 3, 60488 Frankfurt/Main, [EMAIL PROTECTED]
Tel. 069/405680, Fax 069/40568111, http://www.arago.de
--------------------------------------------------------