Martin,

I am a bit surprised by this, but perhaps I am just fortunate that the
latest formats haven't yet appeared on the web sites I index.

PDF
The latest version of xpdf is 3.00 and that claims to cope with PDF 1.5.  It
certainly seems to work with those I have encountered.  Do you get an error
message with PDF 1.6 files or does it fail silently?
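
To see which PDF versions a site actually serves before blaming the converter, the version can be read from the file header. The following is only a rough sketch in Python: the function name pdf_header_version is made up for illustration, and searching the first 1024 bytes is a lenient assumption (the header normally sits at the very start of the file). Note also that from PDF 1.4 onwards the document catalog may carry a Version entry that overrides the header, which this sketch does not look at.

```python
import re

def pdf_header_version(path):
    """Return the version from a PDF header (e.g. "1.5"), or None.

    Hypothetical helper: it only inspects the "%PDF-x.y" header and
    ignores any /Version entry in the document catalog.
    """
    with open(path, "rb") as f:
        head = f.read(1024)  # assumption: header lies within the first 1024 bytes
    match = re.search(rb"%PDF-(\d+\.\d+)", head)
    return match.group(1).decode("ascii") if match else None
```

Running this over the files that fail would show whether the failures really line up with a particular PDF version.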

PPT
pptHtml seems to extract some text from the PowerPoint files I find on
searching.  However, I discovered that it does not extract text beyond the
first embedded image.  It also frequently produces the error message:

    travel: cole: No such file or directory

Does anyone know what that means?

DOC
I believe that the format of Word documents has not changed since Word97, so
most of the available converters should be fine.  Using doc2html would
enable you to cope with .doc files which are in fact plain text or RTF
files, or even WordPerfect documents.
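
The idea of checking what a .doc file really contains, rather than trusting its extension, can be sketched roughly as below. This is not doc2html's actual code, only an illustration in Python: the names MAGIC and sniff_type are invented here, the list of signatures is deliberately short, and the plain-text test (pure ASCII) is a crude assumption.

```python
# Leading "magic" bytes for a few formats that turn up with a .doc extension.
MAGIC = [
    (b"\xd0\xcf\x11\xe0\xa1\xb1\x1a\xe1", "ole2"),  # OLE2 container (Word 97 and later .doc)
    (b"{\\rtf", "rtf"),                              # Rich Text Format
    (b"%PDF-", "pdf"),                               # PDF
    (b"\xffWPC", "wordperfect"),                     # WordPerfect
]

def sniff_type(data):
    """Guess a file type from its first bytes (hypothetical helper)."""
    if not data:
        return "unknown"
    for magic, name in MAGIC:
        if data.startswith(magic):
            return name
    try:
        data.decode("ascii")  # crude assumption: pure ASCII means plain text
    except UnicodeDecodeError:
        return "unknown"
    return "text"
```

A wrapper script could read the first few hundred bytes of each file, call sniff_type on them, and pick the converter from the result instead of from the file name.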

Excel
I don't search these; I decided that a full text index of spreadsheets was a
step too far.  I believe there are Perl modules for reading spreadsheets
which may be suitable.  There is also at least one free utility, bundled
with catdoc, which converts Excel files to .csv files.

Flash files
I try to extract links from Shockwave Flash files.  The utility swfdump (aka
swfparse) only handles Flash files version 1 to 5.  For version 6 (Flash MX)
files I use a Java application called JGenerator.  (So to parse a Flash MX
file, htdig calls a Perl script (doc2html.pl), which calls a second Perl
script, which calls a Java application, which reads the file and sends its
output back up the line to htdig!)

David Adams
Corporate Information Services
Information Systems Services
University of Southampton

----- Original Message ----- 
From: "Martin Allert" <[EMAIL PROTECTED]>
To: <[EMAIL PROTECTED]>
Sent: Tuesday, August 31, 2004 7:34 AM
Subject: [htdig] office document converters


   Dear Ladies and Gentlemen,


We have a major problem using ht://Dig with sites mostly hosting
MS Office documents, e.g. a fileserver or a document archive browsable
through the web. We are using ht://Dig 3.1.6 SSL on Solaris 8 / 9 systems.

I think you discovered the following as well:

- Office document types & versions are changing rapidly.
- Most OpenSource native binary converters are no longer being developed,
  or soon will not be.

This leads to the problem that those OpenSource converters either crash with
a segfault or put a significantly high load on the servers, because the
subprocesses never return when they hit documents they cannot parse. This
causes the indexing process to hang and stop.

Unfortunately, this will put ht://Dig out of action if these document types
can't be converted to HTML and thus indexed. I can't make continuous test
runs and constantly extend the exclusion list, as this would also amount to
cutting off ht://Dig.
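
One general way to stop a hung or crashing converter from stalling the whole indexing run is to start it through a small wrapper that enforces a time limit and treats any failure as "no output". The sketch below is only an illustration in Python: run_converter is a hypothetical name, the 30-second default is an arbitrary assumption, and nothing here is part of ht://Dig or of any particular converter.

```python
import subprocess

def run_converter(argv, timeout=30):
    """Run a converter command and return its stdout bytes, or None on failure.

    Hypothetical wrapper: a timeout, a crash, a non-zero exit status or a
    missing executable all produce None, so the caller can skip the document.
    """
    try:
        result = subprocess.run(
            argv,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            timeout=timeout,  # subprocess.run kills the child when this expires
        )
    except subprocess.TimeoutExpired:
        return None  # converter did not return in time
    except OSError:
        return None  # converter could not be started at all
    if result.returncode != 0:
        return None  # error exit, or death by signal (negative code on POSIX)
    return result.stdout
```

With something like this in front of each converter, a document that cannot be parsed costs at most the timeout and is skipped, rather than hanging the indexing process.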

We used the following converters:

- pdf: Xpdf
  Xpdf can open most PDF files, but not those from Acrobat 5 to 6.
  (PDF Format 1.5/1.6)

- ppt: ppthtml
  This tool has not been developed since '98 and can only process
  97/98 PowerPoint files.

- doc: wvware
  Fortunately, this works quite well.

- xls: xlhtml
  As with ppthtml, but development has been stalled since 04-13-02.

They worked fine until Office 2000 came out.

I know there is doc2html.pl (or whatever it is called), but you have to
tell doc2html.pl which native converter to use, and that leads back to
where we started. :)) Personally, I am on the verge of giving up.

What types of converters do you use? What experience have you folks on the
list had? Can you tell me some URLs where I can look for better converters?
I have no problem with buying some <convert-my-world>-software, if it
is cheaper than HTML-Transit. :)

Otherwise I will turn to Lucene (http://www.jguru.com/faq/Lucene) instead.

Yours sincerely,

Martin Allert

-- 

--------------------------------------------------------
 arago AG, Institut fuer komplexes Datenmanagement
 Am Niddatal 3, 60488 Frankfurt/Main, [EMAIL PROTECTED]
 Tel. 069/405680, Fax 069/40568111, http://www.arago.de
--------------------------------------------------------



_______________________________________________
ht://Dig general mailing list: <[EMAIL PROTECTED]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general
