Re: Text extraction from ms word doc

Karl Wettin Mon, 11 Jan 2010 13:12:39 -0800

Have you tried antiword?

http://www.winfield.demon.nl/



      karl

On 11 Jan 2010, at 21:04, maxSchlein wrote:


I was looking for an option for Text extraction from a word doc.

Currently I am using POI; however, when there is a table in the doc,
for each column POI brings back a non-printable character. The
whitespace analyzer is not filtering out this character, so whatever
word or phrase is the last word or phrase within a table column is not
found during searching. That is, if the word dog is the only word in a
column, a search for the word dog would return nothing, because the
term that was indexed was "dog" followed by that character.

I can create a filter to fix this, using Apache's
StringUtils.isAsciiPrintable, but I would rather not.
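
For illustration, a minimal sketch of such a cleanup step in plain Java,
applied to the extracted text before it is handed to the analyzer. The
class and method names are hypothetical, and U+0007 in the example is
only an assumed stand-in for the cell marker, which the message does not
name. It uses Character.isISOControl rather than
StringUtils.isAsciiPrintable so that non-ASCII letters are kept; that is
a swapped-in choice, not something the original message proposes.

```java
public class ControlCharStripper {

    // Hypothetical helper: replaces every control character that is not
    // ordinary whitespace with a single space, so that a whitespace-based
    // tokenizer still sees neighbouring cells as separate tokens.
    public static String strip(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (Character.isISOControl(c) && !Character.isWhitespace(c)) {
                out.append(' ');   // control character: replace with a space
            } else {
                out.append(c);     // printable text, tabs, newlines: keep
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed example input: two table cells, each ending in U+0007.
        String extracted = "dog\u0007cat\u0007";
        System.out.println(strip(extracted).trim()); // prints "dog cat"
    }
}
```

Replacing the character with a space, rather than deleting it, keeps the
last word of one cell from running into the first word of the next.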

Any and all help is welcome and thanked.
--
View this message in context: 
http://old.nabble.com/Text-extraction-from-ms-word-doc-tp27116739p27116739.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-user-h...@lucene.apache.org


