Another option I've used is to convert the Word documents to XML and then parse the XML. The .doc-to-.xml conversion can be scripted with the Word.Application object, and the XML can then be parsed however you like. I had a lot of files that were very similar in format; I found the process somewhat painful, but it worked.
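A minimal sketch of the parsing half in Python, assuming the files were saved in Word 2003's XML format, where text runs sit in `w:t` elements inside `w:p` paragraphs under the namespace shown. The function name `extract_paragraphs` and the sample document are made up for illustration, and nested constructs such as text boxes are not handled.

```python
import xml.etree.ElementTree as ET

# Namespace of Word 2003 XML documents (assumption: adjust it if your
# converted files use a different schema).
W_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def extract_paragraphs(xml_text):
    """Return the text of each non-empty paragraph in a Word 2003 XML string."""
    root = ET.fromstring(xml_text)
    paragraphs = []
    for p in root.iter("{%s}p" % W_NS):
        # A paragraph's text is split across runs; join every w:t inside it.
        text = "".join(t.text or "" for t in p.iter("{%s}t" % W_NS))
        if text:
            paragraphs.append(text)
    return paragraphs


# Hypothetical, hand-written sample standing in for a converted document.
SAMPLE = """<?xml version="1.0"?>
<w:wordDocument xmlns:w="http://schemas.microsoft.com/office/word/2003/wordml">
  <w:body>
    <w:p><w:r><w:t>Jane Doe</w:t></w:r></w:p>
    <w:p><w:r><w:t>Senior </w:t></w:r><w:r><w:t>Developer</w:t></w:r></w:p>
    <w:p/>
  </w:body>
</w:wordDocument>"""

if __name__ == "__main__":
    for line in extract_paragraphs(SAMPLE):
        print(line)
```

Because formatting is ignored and only the `w:t` content is kept, files that share a layout come out as the same sequence of paragraphs, which is what made the similar-format batch workable.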
Bob

-------------- Original message ----------------------
From: Steve Welborn <[EMAIL PROTECTED]>
> You could try the Automation/Server idea. MS makes it easy to use, but like
> most here I've had nothing but nightmares with it. Automation with Word is a
> memory hog; the majority of the time the instances remain in memory
> despite whatever measures you take to close them, not to mention the crashes
> that have occurred or could occur.
>
> But from what you described of your use, I would probably go with a
> Service as well. I would just be sure to double-check that it gets out of
> memory when done.
>
> Good luck.
> Steve
>
> -----Original Message-----
> From: Discussion of advanced .NET topics.
> [mailto:[EMAIL PROTECTED] On Behalf Of Marc Brooks
> Sent: Monday, December 11, 2006 12:22 AM
> To: ADVANCED-DOTNET@DISCUSS.DEVELOP.COM
> Subject: Re: [ADVANCED-DOTNET] Does anyone know how to read a Word document
> in .Net 2003?
>
> On 12/10/06, Jon Rothlander <[EMAIL PROTECTED]> wrote:
> > I think that is what I want to do. I just want something that will convert
> > it to text. I was just thinking that if in a .Net app you can easily open
> > the Word doc and then save it back out as a Text file...
>
> Having been there, done that, and regretted it, let me share. I
> worked on a project[1] that used to extract resumes in Word/Word
> Perfect/etc. documents via automation so we could pass them through an
> expert system to extract the information. The WinWord process
> constantly crashed and locked the service.
>
> Eventually, after trying several commercial conversion tools
> (including several supposed to be used in batch conversion or
> server-based setups), nothing was working.
>
> Then I hit on the radical idea that "if it's good enough for
> index-server[2], it's good enough for me" and used the installed
> IFilter drivers to suck out the text of any file we had an IFilter
> driver for (and dude, are there tons of them available for free). I wrote
> a little COM component in C++ that simply defers to the shell to load
> the correct driver, ignores all the "formatting" information,
> and keeps the text, which is returned as a BSTR. Optionally, you can
> ask it to "clean the text" to normalize the Unicode encodings and
> morph digit-like characters to actual digits.
>
> If you are interested, I can post the source for this... it is still
> in service to this day and it really works well.
>
> [1] http://www.sendouts.com
> [2] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/hh/indexsrv/ixufilt_94fm.asp
>
> IFilters:
> http://www.adobe.com/support/downloads/8122.htm
> http://www.corel.com/support/ftpsite/pub/wordperfect/wpwin/8/cwps8.htm#
> http://www.adobe.com/support/downloads/8126.htm
> http://www.cad-company.nl/ifilter/
> http://www.microsoft.com/sharepoint/techinfo/reskit/RTF_Filter.asp
> http://www.microsoft.com/sharepoint/techinfo/reskit/XML_Filter.asp
> http://www.naa.gov.au/Search/srchadm/help/default.htm#Top
> http://www.mp3machine.com/software/MP3_Ifilter/=
>
> --
> "I am Dyslexic of Borg. Resistors are fertile. Prepare to have your
> ass laminated." -- Dan Nitschke
>
> Marc C. Brooks
> http://musingmarc.blogspot.com
>
> ===================================
> This list is hosted by DevelopMentor® http://www.develop.com
>
> View archives and manage your subscription(s) at http://discuss.develop.com