Would it be simpler just to modify the input with a regex rather than risk messing with StandardAnalyzer? Or wouldn't that do what you need?
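Something along these lines is what I have in mind. It is only a rough sketch, and it assumes the goal is for 92626, 92626-, and 92626-2646 to all produce a searchable 92626 token; if you want something else, the replacement strings would need to change. The class name ZipPreprocessor and the method normalize are made up for illustration and are not part of Lucene. The idea is to run the text through it before handing it to the unmodified StandardAnalyzer.

```java
import java.util.regex.Pattern;

// Hypothetical helper (not a Lucene class): rewrites ZIP-like strings
// before the text reaches the analyzer.
final class ZipPreprocessor {

    // ZIP+4: five digits, a hyphen, four digits, not part of a longer digit run.
    private static final Pattern ZIP_PLUS4 =
            Pattern.compile("(?<!\\d)(\\d{5})-(\\d{4})(?!\\d)");

    // Five digits followed by a dangling hyphen, e.g. "92626-".
    private static final Pattern ZIP_DANGLING =
            Pattern.compile("(?<!\\d)(\\d{5})-(?!\\w)");

    private ZipPreprocessor() {}

    // Assumption: splitting ZIP+4 into two parts is the desired behaviour.
    static String normalize(String text) {
        // "92626-2646" becomes "92626 2646"
        String result = ZIP_PLUS4.matcher(text).replaceAll("$1 $2");
        // "92626-" becomes "92626"
        return ZIP_DANGLING.matcher(result).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(normalize("Costa Mesa, CA 92626-2646"));
    }
}
```

A plain 92626 passes through untouched, and digit runs that are not exactly five digits before the hyphen are left alone, so serial or model numbers should still be handled by the existing NUM rule.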
On 1/11/07, Van Nguyen <[EMAIL PROTECTED]> wrote:
Hi,

I need to modify the StandardAnalyzer so that it will tokenize zip codes that look like this: 92626-2646

I think the part I need to modify is in here, specifically: <HAS_DIGIT> <P> <ALPHANUM>

    // floating point, serial, model numbers, ip addresses, etc.
    // every other segment must have at least one digit
    | <NUM: (<ALPHANUM> <P> <HAS_DIGIT>
           | <HAS_DIGIT> <P> <ALPHANUM>
           | <HAS_DIGIT> <M>
           | <HAS_DIGIT> (<P> <HAS_DIGIT>)+ <M>
           | <LETTER> (<P> <LETTER>)+
           | <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
           | <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
           | <ALPHANUM> <P> <HAS_DIGIT> (<P> <ALPHANUM> <P> <HAS_DIGIT>)+
           | <HAS_DIGIT> <P> <ALPHANUM> (<P> <HAS_DIGIT> <P> <ALPHANUM>)+
            )
      >

Is there a way to keep that line so that the StandardAnalyzer works as is, but tokenize anything that looks like

    (<HAS_DIGIT> <P>) | (<HAS_DIGIT> <P> <HAS_DIGIT>)

or even better:

    (<DIGIT><DIGIT><DIGIT><DIGIT><DIGIT><P>) | (<DIGIT><DIGIT><DIGIT><DIGIT><DIGIT><P><DIGIT><DIGIT><DIGIT><DIGIT>)

I have zip codes that look like 92626, 92626-, and 92626-2646.

I've tried adding both lines to the "SKIP" section, but to no avail.