It won't do what I need. I may have something like: "All-In-One is located in 92226-4446 and has an E-A-R"

I want it to be tokenized as follows:

all one located 92226 4446 E-A-R

Right now it is being tokenized like this:

all one located 92226-4446 E-A-R

-----Original Message-----
From: Erick Erickson [mailto:[EMAIL PROTECTED]
Sent: Thursday, January 11, 2007 6:11 PM
To: java-user@lucene.apache.org
Subject: Re: Modifying StandardAnalyzer

Would it be simpler just to modify the input with a regex rather than risk
messing with StandardAnalyzer? Or wouldn't that do what you need?

On 1/11/07, Van Nguyen <[EMAIL PROTECTED]> wrote:
>
> Hi,
>
> I need to modify the StandardAnalyzer so that it will tokenize zip codes
> that look like this:
>
> 92626-2646
>
> I think the part I need to modify is in here - specifically:
>
> <HAS_DIGIT> <P> <ALPHANUM>
>
> // floating point, serial, model numbers, ip addresses, etc.
> // every other segment must have at least one digit
> | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
>        | <HAS_DIGIT> <P> <ALPHANUM>
>        | <HAS_DIGIT> <M>
>        | <HAS_DIGIT> (<P> <HAS_DIGIT>)+ <M>
>        | <LETTER> (<P> <LETTER>)+
>        | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>        | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>        | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
>        | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
>        )
>   >
>
> Is there a way to keep that line so that the StandardAnalyzer works as
> is - but tokenize anything that looks like
>
> (<HAS_DIGITS> <P>) | (<HAS_DIGITS> <P> <HAS_DIGITS>) or even better:
>
> (<DIGIT><DIGIT><DIGIT><DIGIT><DIGIT><P>) |
> (<DIGIT><DIGIT><DIGIT><DIGIT><DIGIT><P><DIGIT><DIGIT><DIGIT><DIGIT>) - I
> have zip codes that look like 92626, 92626-, and 92626-2646
>
> I've tried adding both lines to the "SKIP" section - but to no
> avail.
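
For reference, the regex preprocessing suggested in the quoted reply could look roughly like the sketch below. It rewrites only the hyphen in 5-digit and ZIP+4 style codes (92626-, 92626-2646) to a space before the text is handed to the analyzer, so other hyphenated terms such as E-A-R are left alone. The class name ZipPreprocessor and the method name splitZipHyphen are made up for illustration; nothing here is part of Lucene, and it has not been checked how StandardAnalyzer treats the rest of the sentence afterwards.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: cleans up zip codes in raw text
// before that text is passed to an analyzer.
final class ZipPreprocessor {

    // Five digits not preceded by another digit, then a hyphen that is
    // followed by exactly four digits, by whitespace, or by end of input.
    private static final Pattern ZIP_HYPHEN =
            Pattern.compile("(?<!\\d)(\\d{5})-(?=\\d{4}(?!\\d)|\\s|$)");

    private ZipPreprocessor() {
    }

    // Replaces the hyphen of a matching zip code with a single space.
    static String splitZipHyphen(String text) {
        return ZIP_HYPHEN.matcher(text).replaceAll("$1 ");
    }
}
```

With this, splitZipHyphen("All-In-One is located in 92226-4446 and has an E-A-R") gives back "All-In-One is located in 92226 4446 and has an E-A-R", and a bare 92626 passes through unchanged. The same call would need to be applied to query text as well as indexed text so that both sides see the same tokens.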