Re: Is there a list of "special" characters for standard analyzer?

Phil Whelan Thu, 30 Jul 2009 21:36:13 -0700

On Thu, Jul 30, 2009 at 7:12 PM, <[email protected]> wrote:
> I was wondering if there is a list of special characters for the standard 
> analyzer?
>
> What I mean by "special" is characters that the analyzer considers break 
> characters.
> For example, if I have something like "foo=something", apparently the analyzer
> considers this as two terms, "foo" and "something".


Hi Jim,

This is what I could find in the docs...

StandardAnalyzer uses StandardTokenizer

http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/analysis/standard/StandardTokenizer.html
* Splits words at punctuation characters, removing punctuation.
However, a dot that's not followed by whitespace is considered part of
a token.
* Splits words at hyphens, unless there's a number in the token, in
which case the whole token is interpreted as a product number and is
not split.
* Recognizes email addresses and internet hostnames as one token.

Also, these are the stop words that will be removed from the token stream:

  public static final String[] ENGLISH_STOP_WORDS = {
    "a", "an", "and", "are", "as", "at", "be", "but", "by",
    "for", "if", "in", "into", "is", "it",
    "no", "not", "of", "on", "or", "such",
    "that", "the", "their", "then", "there", "these",
    "they", "this", "to", "was", "will", "with"
  };
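
To see how the splitting and the stop list interact, here is a rough sketch in
plain Java. It is not the StandardTokenizer grammar: it simply splits on any
run of characters that are not letters or digits, lowercases each piece, and
drops the stop words listed above. The class name StopWordSketch and the
method analyze are made up for illustration, and the sketch ignores the
special cases for dots, hyphens, email addresses and hostnames.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene.
class StopWordSketch {
    // Same words as the ENGLISH_STOP_WORDS array quoted above.
    static final Set<String> STOP_WORDS = new HashSet<String>(Arrays.asList(
        "a", "an", "and", "are", "as", "at", "be", "but", "by",
        "for", "if", "in", "into", "is", "it",
        "no", "not", "of", "on", "or", "such",
        "that", "the", "their", "then", "there", "these",
        "they", "this", "to", "was", "will", "with"));

    // Assumption: any non-letter, non-digit character is a break character.
    // The real tokenizer is more selective than this.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<String>();
        for (String piece : text.split("[^\\p{L}\\p{N}]+")) {
            String term = piece.toLowerCase(Locale.ROOT);
            // Skip empty pieces (leading delimiters) and stop words.
            if (term.length() > 0 && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(analyze("foo=something"));
        System.out.println(analyze("The value of foo is not set"));
    }
}
```

With this sketch, "foo=something" comes out as the two terms "foo" and
"something", which matches what Jim observed. To check what the real analyzer
does with a given string, run it through StandardAnalyzer and print the tokens.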

Thanks,
Phil

