Dear Ladies and Gentlemen,


We have a major problem using ht://Dig with sites mostly hosting 
MS Office documents, e.g. a fileserver or a document archive browsable 
through the web. We are using ht://Dig 3.1.6 SSL on Solaris 8 / 9 systems.

I think you have noticed the following as well:

- Office document types & versions are changing rapidly.
- Most open-source native binary converters are no longer being developed,
  or soon will not be.

This leads to the problem that those open-source converters either crash
with a segfault or put a significantly high load on the servers, because
the subprocesses don't return when they hit non-parsable documents. This
causes the indexing process to hang and stop.
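
One way to contain a crashing or hanging converter is to run it through a
small wrapper that enforces a time limit and falls back to an empty HTML
page on failure. The following is only a minimal sketch under assumptions:
the names convert_with_timeout and FALLBACK_HTML and the 60 second limit
are hypothetical, and it assumes the converter writes its HTML to stdout.

```python
import subprocess
import sys

# Hypothetical fallback: an empty page, so the indexer gets valid HTML
# instead of waiting forever on a document the converter cannot parse.
FALLBACK_HTML = b"<html><body></body></html>\n"


def convert_with_timeout(command, timeout_s=60):
    """Run a converter command and return its stdout as bytes.

    Returns FALLBACK_HTML if the command is empty, cannot be started,
    exceeds the time limit, or exits with a non-zero status (which
    includes being killed by a signal such as SIGSEGV).
    """
    if not command:
        return FALLBACK_HTML
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except (subprocess.TimeoutExpired, OSError):
        return FALLBACK_HTML
    if result.returncode != 0:
        return FALLBACK_HTML
    return result.stdout


if __name__ == "__main__":
    # Usage: wrapper.py <converter> <args...>
    sys.stdout.buffer.write(convert_with_timeout(sys.argv[1:]))
```

With something like this in front of each converter, a bad document would
be indexed as an empty page instead of stopping the whole run. How the
wrapper is hooked into ht://Dig depends on the local external_parsers
setup, which is not shown here.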

Unfortunately, this puts ht://Dig out of action if these document types
can't be converted to HTML and thus indexed. I can't make continuous
test runs and constantly extend the exclusion list, as this would also
mean cutting off ht://Dig.

We used the following converters:

- pdf:  Xpdf
  Xpdf can open most PDF files, but not those from Acrobat 5 to 6.
  (PDF Format 1.5/1.6)

- ppt:  ppthtml 
  This tool has not been developed since '98 and can only process
  PowerPoint 97/98 files.

- doc:  wvware
  Fortunately, this works quite well.

- xls:  xlhtml
  See ppt, but development has been stalled since 04-13-02.

They worked fine until Office 2000 came out.

I know there is doc2html.pl (or whatever it is called), but you have to
tell doc2html.pl which native converter to use, and that leads back to the
original problem. :)) Personally, I am on the verge of giving up.

What type of converters do you use? What experiences have you folks on the
list had? Can you point me to some URLs where I can look for better
converters? I have no problem buying some <convert-my-world> software, if
it is cheaper than HTML-Transit. :)

Otherwise I will switch to Lucene (http://www.jguru.com/faq/Lucene) instead.

Yours sincerely,

Martin Allert

-- 

--------------------------------------------------------
 arago AG, Institut fuer komplexes Datenmanagement
 Am Niddatal 3, 60488 Frankfurt/Main, [EMAIL PROTECTED]
 Tel. 069/405680, Fax 069/40568111, http://www.arago.de
--------------------------------------------------------
