Martin,

You might ask whether it is really necessary to index spreadsheets. My argument for ignoring them is that it is better if the search engine finds the web page with the link to the spreadsheet rather than taking the user directly to the spreadsheet.

As regards .ppt files, I used to notice ppthtml running in very large processes when we ran htdig under Solaris. I can't recall ever having to intervene and kill them, but I did add the command

    limit vmemory 200m

to the rundig script to limit the memory those processes could use. When we moved to RedHat Linux the problem went away. However, if somebody knows of a better converter for PowerPoint then I too would like to hear about it.

David Adams
Corporate Information Services
Information Systems Services
University of Southampton

----- Original Message -----
From: "Martin Allert" <[EMAIL PROTECTED]>
To: "David Adams" <[EMAIL PROTECTED]>
Cc: <[EMAIL PROTECTED]>
Sent: Tuesday, August 31, 2004 2:10 PM
Subject: Re: [htdig] office document converters

Hi David,

On Tue, Aug 31, 2004 at 02:02:35PM +0100, David Adams wrote:
> Martin,
>
> This is a joke, yes?
>
> If not, please note that the Lucene FAQs make it clear that it is equally
> dependent on external parsers.

That's what a colleague recommended to me: to look in the Lucene FAQs to see whether there is any alternative to the already mentioned parsers. Honestly, I didn't take a look before writing my email. :(

The fact is: indexing some web tree with the mentioned ppthtml, xlhtml or xpdf takes ten times longer, with a load of 10 on a dual-processor Sun V480 with 4 GB of RAM. When indexing only .doc and .html files, rundig completes in about 30 minutes. I find hanging ppthtml and xlhtml processes, consuming nearly 95% CPU and about 1 GB of RAM for each document. Of course, those processes don't come back and have to be killed... :(

> We use wp2html to convert Word documents and it's fine, but we bought it only
> because we needed to convert WordPerfect documents (not that we get many!)
>
> David Adams

Yours,
Martin

--
arago AG, Institut fuer komplexes Datenmanagement
Am Niddatal 3, 60488 Frankfurt/Main, [EMAIL PROTECTED]
Tel. 069/405680, Fax 069/40568111, http://www.arago.de

_______________________________________________
ht://Dig general mailing list: <[EMAIL PROTECTED]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general
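
The two remedies discussed in this thread (capping the memory a converter may use, and killing converters that hang) can be combined in a small wrapper. What follows is only a sketch under stated assumptions, not something either poster used: it swaps the csh `limit vmemory` builtin for Python's `resource.setrlimit` on a POSIX system, and the function name `run_converter`, the 200 MB default and the 60-second timeout are all illustrative choices.

```python
import resource
import subprocess


def run_converter(cmd, mem_bytes=200 * 1024 * 1024, timeout_s=60):
    """Run an external converter with a memory cap and a wall-clock timeout.

    Returns the converter's stdout as bytes, or None if the converter could
    not be started, exited with an error, or exceeded the timeout.
    The name and the default limits are hypothetical, chosen for illustration.
    """

    def cap_memory():
        # Runs in the child process just before exec. RLIMIT_AS caps the
        # address space, which is roughly what `limit vmemory` does in csh.
        resource.setrlimit(resource.RLIMIT_AS, (mem_bytes, mem_bytes))

    try:
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.DEVNULL,
            preexec_fn=cap_memory,
            timeout=timeout_s,  # the child is killed when this expires
        )
    except (OSError, subprocess.SubprocessError):
        # Covers a missing binary, a failed setrlimit, and TimeoutExpired.
        return None

    return result.stdout if result.returncode == 0 else None


# Example call (file name is hypothetical):
#   html = run_converter(["ppthtml", "slides.ppt"])
#   if html is None: skip the document instead of stalling the whole run
```

The point of returning None rather than raising is that one bad document is skipped while the rest of the indexing run continues, instead of leaving a runaway process to be killed by hand.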