Re: StandardAnalyzer question

Mark Miller Fri, 21 Jul 2006 12:51:14 -0700

I do not believe so. If you look above that definition in the grammar, you
will see that #P is only used when matching a NUM: a host IP, a phone number,
etc. You would be removing the ability to recognize a "_" while picking out
those tokens. It would still be parsed when tokenizing an EMAIL as well. I
don't think this is the behavior you want.
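
To make the point about #P concrete, here is a minimal sketch in plain Java
regular expressions. It is not Lucene code. It assumes a cut-down NUM rule of
the form ALPHANUM P HAS_DIGIT | HAS_DIGIT P ALPHANUM (check your copy of
StandardTokenizer.jj for the full rule), and it uses ASCII-only letters and
digits. The class name, the method name and the sample strings are made up
for illustration.

```java
import java.util.regex.Pattern;

// Illustration only: a regex stand-in for a simplified NUM rule, not the real grammar.
class NumPatternSketch {
    // ASCII-only approximations of ALPHANUM and HAS_DIGIT (assumption).
    static final String ALPHANUM = "[A-Za-z0-9]+";
    static final String HAS_DIGIT = "[A-Za-z0-9]*[0-9][A-Za-z0-9]*";

    // True if the whole text matches the simplified NUM rule for the given P character class.
    static boolean isSingleNum(String pClass, String text) {
        String num = ALPHANUM + pClass + HAS_DIGIT + "|" + HAS_DIGIT + pClass + ALPHANUM;
        return Pattern.matches(num, text);
    }

    public static void main(String[] args) {
        String pWithUnderscore = "[_\\-/.,]";   // mirrors <#P: ("_"|"-"|"/"|"."|",") >
        String pWithoutUnderscore = "[\\-/.,]"; // the same class after deleting "_"

        System.out.println(isSingleNum(pWithUnderscore, "part_42"));    // true
        System.out.println(isSingleNum(pWithoutUnderscore, "part_42")); // false
        System.out.println(isSingleNum(pWithoutUnderscore, "part-42")); // true
    }
}
```

Under these assumptions, deleting "_" from #P only stops strings such as
"part_42" from matching as one NUM. It does nothing for plain words joined by
"_", which never go through #P in the first place.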


- Mark

On 7/21/06, Ngo, Anh (ISS Southfield) <[EMAIL PROTECTED]> wrote:



What is the #LETTER definition in StandardTokenizer.jj?


I saw:

| <#P: ("_"|"-"|"/"|"."|",") >
| <#HAS_DIGIT:                                    // at least one digit
    (<LETTER>|<DIGIT>)*
    <DIGIT>
    (<LETTER>|<DIGIT>)*
  >


Should I remove "_" and recompile the source code?

Sincerely,


Anh Ngo

-----Original Message-----
From: Daniel Naber [mailto:[EMAIL PROTECTED]
Sent: Friday, July 21, 2006 2:49 PM
To: [EMAIL PROTECTED]
Subject: Re: StandardAnalyzer question

On Friday 21 July 2006 16:16, Ngo, Anh (ISS Southfield) wrote:

> The Lucene 2.0.0 StandardAnalyzer treats the "_" (underscore) as a
> token separator. Is there a way I can make StandardAnalyzer not split
> on "_" or any other given characters?

You need to add "_" to the #LETTER definition in StandardTokenizer.jj,
then rebuild StandardTokenizer.java using the appropriate ant task.
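
As a rough picture of what that change does, here is a minimal sketch in plain
Java regular expressions. It is not Lucene code and does not touch the
grammar. It assumes an ALPHANUM token is simply a run of LETTER or DIGIT
characters and uses ASCII-only letters. The class name, the method name and
the sample text are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: mimics ALPHANUM = (LETTER | DIGIT)+ with a regex.
class LetterClassSketch {
    // Collects every maximal run of characters that are in letterClass or are digits.
    static List<String> tokens(String letterClass, String text) {
        Pattern alphanum = Pattern.compile("(?:" + letterClass + "|[0-9])+");
        Matcher m = alphanum.matcher(text);
        List<String> out = new ArrayList<>();
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        // LETTER without "_": the underscore splits the word.
        System.out.println(tokens("[A-Za-z]", "foo_bar baz"));  // [foo, bar, baz]
        // LETTER with "_" added: the underscore stays inside the token.
        System.out.println(tokens("[A-Za-z_]", "foo_bar baz")); // [foo_bar, baz]
    }
}
```

The real tokenizer has more rules than this (NUM, EMAIL, HOST and so on), so
treat the output as a picture of the ALPHANUM case only.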

Regards
Daniel

--
http://www.danielnaber.de

---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]



