Since no one asked for it explicitly (at least on the mailing list), I for one would be interested in your code, Marc, if you still want to share it.

Thanks,
Sébastien
www.sebastienlorion.com

On 12/11/06, Marc Brooks <[EMAIL PROTECTED]> wrote:
On 12/10/06, Jon Rothlander <[EMAIL PROTECTED]> wrote:
> I think that is what I want to do. I just want something that will convert
> it to text. I was just thinking that if in a .Net app you can easily open
> the Word doc and then save it back out as a Text file...

Having been there, done that, and regretted it, let me share.

I worked on a project[1] that used to extract resumes from Word/WordPerfect/etc. documents via automation so we could pass them through an expert system to extract the information. The WinWord process constantly crashed and locked up the service. We tried several commercial conversion tools (including several that were supposed to be usable for batch conversion or server-based setups), and none of them worked.

Then I hit on the radical idea that "if it's good enough for index-server[2], it's good enough for me" and used the installed IFilter drivers to suck out the text of any file for which we had an IFilter driver (and dude, are there tons of them available for free).

I wrote a little COM component in C++ that simply defers to the shell to load the correct driver, ignores all the "formatting" information, and keeps the text, which is returned as a BSTR. Optionally, you can ask it to "clean the text" to normalize the Unicode encodings and morph digit-like characters into actual digits.

If you are interested, I can post the source for this... it is still in service to this day and it really works well.

[1] http://www.sendouts.com
[2] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/hh/indexsrv/ixufilt_94fm.asp

IFilters:
http://www.adobe.com/support/downloads/8122.htm
http://www.corel.com/support/ftpsite/pub/wordperfect/wpwin/8/cwps8.htm#
http://www.adobe.com/support/downloads/8126.htm
http://www.cad-company.nl/ifilter/
http://www.microsoft.com/sharepoint/techinfo/reskit/RTF_Filter.asp
http://www.microsoft.com/sharepoint/techinfo/reskit/XML_Filter.asp
http://www.naa.gov.au/Search/srchadm/help/default.htm#Top
http://www.mp3machine.com/software/MP3_Ifilter/=

--
"I am Dyslexic of Borg. Resistors are fertile. Prepare to have your ass laminated." -- Dan Nitschke

Marc C. Brooks
http://musingmarc.blogspot.com

===================================
This list is hosted by DevelopMentor(r)  http://www.develop.com

View archives and manage your subscription(s) at http://discuss.develop.com
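
For anyone wondering what the optional "clean the text" step mentioned above might involve, here is a minimal sketch of just the digit part: mapping digit-like Unicode characters to plain ASCII digits. This is not Marc's code. The names kDigitBlockStarts, ToAsciiDigit and CleanDigits are made up for illustration, the list of digit blocks is a small hand-picked sample rather than a complete table, and the text is assumed to be already decoded to UTF-32 (the real component returns a BSTR, which is UTF-16).

```cpp
#include <array>
#include <string>

// Hypothetical table (not from the original component): the first code point
// of a few Unicode decimal-digit blocks. Each block holds the ten digits 0-9
// in consecutive code points. A real implementation would cover more scripts.
constexpr std::array<char32_t, 4> kDigitBlockStarts = {
    0x0660,  // ARABIC-INDIC DIGIT ZERO
    0x06F0,  // EXTENDED ARABIC-INDIC DIGIT ZERO
    0x0966,  // DEVANAGARI DIGIT ZERO
    0xFF10,  // FULLWIDTH DIGIT ZERO
};

// Return the ASCII digit for a code point that falls inside one of the
// known digit blocks; return any other code point unchanged.
inline char32_t ToAsciiDigit(char32_t c) {
    for (char32_t start : kDigitBlockStarts) {
        if (c >= start && c < start + 10) {
            return static_cast<char32_t>(U'0' + (c - start));
        }
    }
    return c;
}

// Apply ToAsciiDigit to every code point of a UTF-32 string.
inline std::u32string CleanDigits(const std::u32string& text) {
    std::u32string out;
    out.reserve(text.size());
    for (char32_t c : text) {
        out.push_back(ToAsciiDigit(c));
    }
    return out;
}
```

Calling CleanDigits on a string containing fullwidth or Arabic-Indic digits returns the same string with those characters replaced by '0' through '9'; everything else passes through untouched. All four blocks listed here sit in the Basic Multilingual Plane, so the same range check should carry over to UTF-16 text without surrogate handling, though that would stop being true for digit blocks above U+FFFF.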