Since no one has asked for it explicitly (at least on the mailing
list), I for one would be interested in your code, Marc, if you still
want to share it.

Thanks,

Sébastien
www.sebastienlorion.com

On 12/11/06, Marc Brooks <[EMAIL PROTECTED]> wrote:
On 12/10/06, Jon Rothlander <[EMAIL PROTECTED]> wrote:
> I think that is what I want to do.  I just want something that will convert
> it to text.  I was just thinking that if in a .Net app you can easily open
> the Word doc and then save it back out as a Text file...

Having been there, done that, and regretted it, let me share.  I
worked on a project[1] that used to extract resumes from Word,
WordPerfect, etc. documents via automation so we could pass them
through an expert system to extract the information. The WinWord
process constantly crashed and locked the service.

Eventually we tried several commercial conversion tools (including
several that were supposed to be usable in batch conversion or
server-based setups), and none of them worked.

Then I hit on the radical idea that "if it's good enough for
index-server[2], it's good enough for me" and used the installed
IFilter drivers to suck the text out of any file we had an IFilter
driver for (and dude, are there tons of them available for free). I
wrote a little COM component in C++ that simply defers to the shell to
load the correct driver, ignores all the "formatting" information, and
keeps the text, which is returned as a BSTR.  Optionally, you can ask
it to "clean the text", which normalizes the Unicode encodings and
morphs digit-like characters into actual digits.
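
To make the optional "clean the text" step concrete, here is a rough
sketch in Python. It is not the C++ component described above, whose
source is not shown here; the function name clean_text and the choice
of NFKC normalization are assumptions made purely for illustration.

```python
import unicodedata


def clean_text(text: str) -> str:
    """Hypothetical cleanup pass (an illustration, not the original code).

    1. NFKC-normalize, which folds compatibility forms such as
       full-width digits and ligatures into their plain equivalents.
    2. Map any remaining Unicode decimal digits (for example
       Arabic-Indic digits) to ASCII '0'-'9'.
    """
    normalized = unicodedata.normalize("NFKC", text)
    cleaned = []
    for ch in normalized:
        # decimal() returns the digit value, or the default (None) when
        # the character is not a decimal digit.
        value = unicodedata.decimal(ch, None)
        cleaned.append(str(value) if value is not None else ch)
    return "".join(cleaned)
```

Calling clean_text on text that mixes full-width and Arabic-Indic
digits should give back plain ASCII digits, with everything else left
as NFKC produced it.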

If you are interested, I can post the source for this... it is still
in service to this day and it really works well.

[1] http://www.sendouts.com
[2] http://msdn.microsoft.com/library/default.asp?url=/library/en-us/indexsrv/hh/indexsrv/ixufilt_94fm.asp

IFilters:
http://www.adobe.com/support/downloads/8122.htm
http://www.corel.com/support/ftpsite/pub/wordperfect/wpwin/8/cwps8.htm#
http://www.adobe.com/support/downloads/8126.htm
http://www.cad-company.nl/ifilter/
http://www.microsoft.com/sharepoint/techinfo/reskit/RTF_Filter.asp
http://www.microsoft.com/sharepoint/techinfo/reskit/XML_Filter.asp
http://www.naa.gov.au/Search/srchadm/help/default.htm#Top
http://www.mp3machine.com/software/MP3_Ifilter/=

--
"I am Dyslexic of Borg. Resistors are fertile. Prepare to have your
ass laminated." -- Dan Nitschke

Marc C. Brooks
http://musingmarc.blogspot.com

===================================
This list is hosted by DevelopMentor(r)  http://www.develop.com

View archives and manage your subscription(s) at http://discuss.develop.com
