On Thu, Jul 30, 2009 at 7:12 PM, <oh...@cox.net> wrote:
> I was wondering if there is a list of special characters for the standard
> analyzer?
>
> What I mean by "special" is characters that the analyzer considers break
> characters.
> For example, if I have something like "foo=something", apparently the analyzer
> considers this as two terms, "foo" and "something".

Hi Jim,

This is what I could find in the docs. StandardAnalyzer uses StandardTokenizer:

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html

* Splits words at punctuation characters, removing punctuation. However, a
  dot that's not followed by whitespace is considered part of a token.
* Splits words at hyphens, unless there's a number in the token, in which case
  the whole token is interpreted as a product number and is not split.
* Recognizes email addresses and internet hostnames as one token.

By the first rule, the "=" in "foo=something" is treated as punctuation, which
would explain why you get the two terms "foo" and "something".

Also, these are the stop words, i.e. the tokens that will be removed:

    public static final String[] ENGLISH_STOP_WORDS = {
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"
    };

Thanks,
Phil
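
P.S. If it helps to see the effect, here is a rough sketch in plain Java. It is not Lucene's StandardTokenizer and it does not call any Lucene API. It only assumes the first rule above (split at punctuation) plus lowercasing and the stop word list, and it ignores the special cases for dots, hyphens, email addresses and hostnames. The class name RoughStandardAnalysis and the helper roughTerms are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

public class RoughStandardAnalysis {
    // Stop words copied from the ENGLISH_STOP_WORDS list quoted above.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    // Hypothetical helper: splits on any run of characters that are not
    // letters or digits, lowercases each piece, and drops stop words.
    // This is a simplification, not the real tokenizer grammar.
    static List<String> roughTerms(String text) {
        List<String> terms = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            String term = piece.toLowerCase(Locale.ROOT);
            // split() can yield an empty first piece when the text
            // starts with a separator, so skip empty strings.
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(roughTerms("foo=something"));
        System.out.println(roughTerms("The value of foo=something"));
    }
}
```

With this sketch, "foo=something" comes out as the two terms "foo" and "something", which matches what you are seeing. To check the real behaviour, run the same strings through StandardAnalyzer itself and print the tokens it produces.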