I totally agree with you: IFilter is the right way to go. In the past
I've found this reference a good starting point:
http://www.codeproject.com/csharp/IFilter.asp

I've used it to gather text from PDF documents, but it applies to Word
documents as well. Be aware that some IFilter components are not
reentrant (most notably Adobe's), so in a multithreaded environment
(like ASP.NET) you will need a workaround to use them safely. By the
way, MS Word's IFilter should work well.
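
One possible workaround for the reentrancy problem is to serialize every
call into the filter behind a single lock, so that only one thread is
ever inside the component. The sketch below shows the pattern in Python
purely as an illustration: extract_text is a hypothetical stand-in for
whatever actually drives the IFilter, and in .NET the same idea would be
a lock statement around the call. It trades throughput for safety.

```python
import threading

# One process-wide lock guarding the non-reentrant component.
_filter_lock = threading.Lock()


def extract_text(path):
    # Hypothetical stand-in for the real IFilter-driven extraction.
    return "text of " + path


def extract_text_serialized(path, extractor=extract_text):
    # Only one thread at a time may be inside the extractor.
    with _filter_lock:
        return extractor(path)
```

Every request thread would then call extract_text_serialized instead of
touching the filter directly.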

HTH,

Efran Cobisi
http://www.cobisi.com

Marc Brooks wrote:
Are there any issues if I just do a rename of the Word doc from
file.doc to file.txt, then open the file as a text document and parse
it for the data I need?  I know that the Word document format is not in
straight ASCII text, but it appears that the data itself is.

That is TOTALLY wrong, no offense... the Word document format is
actually a structured-storage document composed of a tree of elements
and each element is a list of text snippets (some used, some old
noise) in a nonlinear linked list. If you simply do a "strings" on the
file, you'll end up with a lot of unrelated text in apparently random
order.  Some of that text can even be from another unrelated document,
or prior versions of the document (or template it was horked from).
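
To make the "strings" point concrete, here is a rough sketch of what
such a scan does: it collects runs of printable bytes in file order,
with no idea which fragments belong to the current text. The helper
name printable_runs and the sample bytes are invented for illustration;
they are not taken from a real .doc file.

```python
import re


def printable_runs(data, min_len=4):
    # Return every run of at least min_len printable ASCII bytes, in file order.
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


# Made-up bytes standing in for a binary document: live text, binary noise,
# and a leftover fragment from an earlier revision.
sample = b"\x00\x01Invoice total: 42\x00\xff\xfeOLD DRAFT do not send\x00\x02ab\x00"
print(printable_runs(sample))
# → ['Invoice total: 42', 'OLD DRAFT do not send']
```

The scan cannot tell the current text from the stale fragment, which is
exactly the problem with treating the file as plain text.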

Seriously, if you want the text from a doc file, use IFilter. If you
need a .NET version, just say so.

--
"I am Dyslexic of Borg. Resistors are fertile. Prepare to have your
ass laminated." -- Dan Nitschke

Marc C. Brooks
http://musingmarc.blogspot.com

===================================
This list is hosted by DevelopMentor®  http://www.develop.com

View archives and manage your subscription(s) at
http://discuss.develop.com
