----- Original Message ----- From: "F. Spitzer, GEOSYSTEMS" <[EMAIL PROTECTED]>
To: "David Adams" <[EMAIL PROTECTED]>
Sent: Wednesday, January 19, 2005 11:08 AM
Subject: Re: [htdig] Indexing large Powerpoints
Hi David,
thanks for your answer. You were right. Changing the value in doc2html.pl did solve the problem! Great.
By the way: I am working on SUSE 9.2 and was able to index .ppt files of over 90 MB.
Thanks a lot!!
Cheers Fritz
Kind regards
Fritz Spitzer, Head of Training and Systems Integration
-------------------------------------------------------------------- GEOSYSTEMS GmbH Riesstraße 10, D-82110 Germering, GERMANY www.geosystems.de
E: [EMAIL PROTECTED] T: +49-(0)89-89 43 43 -0 (Ext. -20) F: +49-(0)89-89 43 43 99
Subscribe to our newsletter to stay up to date: www.geosystems.de/newsletter
David Adams wrote:
How are you using htdig to index .ppt files? Recent versions of doc2html.pl have a default input limit of 20 Mbytes and will not try to convert any larger file. Just increase the limit in the doc2html.pl script.
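A minimal sketch for locating that limit, assuming it appears in the script as the same literal that the error message in the original post reports (20000000). The function name and the example path are hypothetical, and your copy of doc2html.pl may write the value differently, in which case the search finds nothing:

```shell
# Hypothetical helper: print the numbered lines of a script that contain
# the limit value reported in the error message (default 20000000).
find_limit_lines() {
    script="$1"
    limit="${2:-20000000}"
    grep -n -- "$limit" "$script"
}

# Example call (the path is an assumption; use your own install location):
# find_limit_lines /usr/local/bin/doc2html.pl
```

Once the line is found, the value can be raised by hand in an editor, which is the change described above.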
I have found that ppthtml 0.4 from www.xlhtml.org (now relocated to http://chicago.sourceforge.net/xlhtml), which is what I use, does not always succeed in extracting text after the first embedded image.
I have not found problems with ppthtml on RedHat Linux, but on Solaris the process size could be very large. With .ppt files over 20 Mbytes I doubt it would run.
David Adams Corporate Information Services Information Systems Services University of Southampton
----- Original Message ----- From: "F. Spitzer, GEOSYSTEMS" <[EMAIL PROTECTED]>
To: <htdig-general@lists.sourceforge.net>
Sent: Wednesday, January 19, 2005 6:48 AM
Subject: [htdig] Indexing large Powerpoints
Good morning List!
I have one problem to solve. Maybe you can help me?
We have a large PowerPoint collection (more than 250 files), so I want htdig to build an index that lets users search for keywords.
Things are working so far. Htdig does its job quite well. The only problem I still have is with .ppt files larger than 20 MB. Unfortunately, nearly 50% of the files are larger than 20 MB.
I set max_doc_size to 80000000 (80 MB, the size of the largest .ppt). But running htdig produces the following output: Input file size of 45956608 at or above 20000000 limit.
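To see which files would trip this message, a short sketch can list every .ppt at or above a given byte limit. The function name is hypothetical, and the default of 20000000 is taken from the output quoted above:

```shell
# Hypothetical helper: list .ppt files whose size is at or above a byte
# limit (default 20000000, the value in the quoted message).
list_oversized_ppt() {
    dir="$1"
    limit="${2:-20000000}"
    # find's "-size +Nc" matches files strictly larger than N bytes,
    # so search for limit-1 to get "at or above" the limit.
    find "$dir" -type f -name '*.ppt' -size +"$((limit - 1))"c
}

# Example call (the directory is an assumption):
# list_oversized_ppt /srv/presentations
```

The match on `*.ppt` is case-sensitive, so files named `.PPT` would need a second pattern.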
It seems to me that there is another limit in htdig that ignores the value set by max_doc_size.
How can I overcome this limitation?
I thought about writing a shell script that converts the .ppt files to HTML before running htdig. Htdig would then use the HTML files to build the index. Using url_part_aliases during database creation and during the search would map the location of each HTML document back to the location of the original .ppt.
Has anybody done this before? Or, even better, is there another solution to my problem?
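The pre-conversion idea could be sketched roughly as follows. This is only an illustration under stated assumptions: the function name and the CONVERTER variable are hypothetical, and the converter (ppthtml by default) is assumed to take a file name as its argument and write HTML to standard output.

```shell
# Sketch: convert every .ppt under SRC to an .html file under DST,
# keeping the same directory layout. Pass SRC without a trailing slash.
convert_ppt_tree() {
    src="$1"
    dst="$2"
    # Assumption: the converter reads the file named in its first argument
    # and writes HTML to standard output.
    converter="${CONVERTER:-ppthtml}"
    find "$src" -type f -name '*.ppt' | while IFS= read -r ppt; do
        rel="${ppt#"$src"/}"            # path relative to SRC
        out="$dst/${rel%.ppt}.html"     # same relative path, .html suffix
        mkdir -p "$(dirname "$out")"
        # Report a failed conversion but continue with the other files.
        "$converter" "$ppt" > "$out" || echo "conversion failed: $ppt" >&2
    done
}

# Example call (both directories are assumptions):
# convert_ppt_tree /srv/presentations /srv/presentations-html
```

Htdig would then be pointed at the destination tree, with url_part_aliases handling the mapping back to the original .ppt locations as described in the post.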
Thanks a lot for your help. Any hints are welcome.
Cheers Fritz
------------------------------------------------------- The SF.Net email is sponsored by: Beat the post-holiday blues Get a FREE limited edition SourceForge.net t-shirt from ThinkGeek. It's fun and FREE -- well, almost....http://www.thinkgeek.com/sfshirt _______________________________________________ ht://Dig general mailing list: <htdig-general@lists.sourceforge.net> ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html List information (subscribe/unsubscribe, etc.) https://lists.sourceforge.net/lists/listinfo/htdig-general