We could also fix WhitespaceAnalyzer to filter that character out? (Or you could make your own analyzer to do so...).
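For the "do it yourself" route, here is a rough sketch of cleaning the extracted text before it reaches the analyzer. It uses only the plain JDK, not any Lucene or POI API. The class name ControlCharCleaner is made up for illustration. The offending character did not survive in the quoted message, so this assumes it is a non-whitespace control character such as U+0007.

```java
// Hypothetical helper (not part of Lucene or POI): cleans text extracted
// from a document before it is handed to an analyzer.
final class ControlCharCleaner {
    private ControlCharCleaner() {}

    /**
     * Replaces every ISO control character that Java does not already
     * treat as whitespace with a single space, so that a whitespace-based
     * tokenizer will split on it. Tabs, newlines, etc. are left untouched.
     */
    static String clean(String text) {
        StringBuilder out = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            if (Character.isISOControl(cp) && !Character.isWhitespace(cp)) {
                out.append(' ');
            } else {
                out.appendCodePoint(cp);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumption: the table cell ends with a BEL control character.
        String cell = "dog\u0007";
        System.out.println("[" + clean(cell) + "]");
    }
}
```

Replacing with a space rather than deleting keeps words from adjacent cells from running together; the trailing space is then dropped by whitespace tokenization.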
You could also try asking on the tika-user list whether Tika has a solution for mapping "extended" whitespace characters...

Mike

On Mon, Jan 11, 2010 at 3:04 PM, maxSchlein <m_schl...@hotmail.com> wrote:
>
> I was looking for an option for Text extraction from a word doc.
>
> Currently I am using POI; however, when there is a table in the doc, for
> each column POI brings back a . The whitespace analyzer is not filtering
> out this character. So whatever word or phrase that is the last word or
> phrase within a table column is not found during searching. That is, if the
> word dog is the only word in a column, a search for the word dog would
> return nothing because the word that was indexed was "dog ".
>
> I can create a filter to fix this, using Apache's
> StringUtils.isAsciiPrintable, but I would rather not.
>
> Any and all help is welcome and thanked.
> --
> View this message in context:
> http://old.nabble.com/Text-extraction-from-ms-word-doc-tp27116739p27116739.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscr...@lucene.apache.org
> For additional commands, e-mail: java-user-h...@lucene.apache.org
>
>