Re: Incomprehensible (to me) tokenizing behavior

Doug Cutting Mon, 30 Dec 2002 14:13:48 -0800

Terry Steichen wrote:
> PS: Is this kind of thing (and more importantly, any other similar
> design issues) documented any place?


This one is described in the source code, with the comment:

  // floating point, serial, model numbers, ip addresses, etc.
  // every other segment must have at least one digit
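
To make the comment concrete, here is a rough Java sketch of the "every other segment" rule. It is not the JavaCC grammar itself. The class and method names are made up for illustration, the punctuation set (_ - / . ,) and the ASCII-only segments are assumptions, and it only checks a whole string rather than reproducing the tokenizer's longest-match behavior.

```java
import java.util.regex.Pattern;

// Hypothetical illustration of the rule, not code from Lucene.
class SegmentRule {
    // Assumed set of punctuation characters that may join segments.
    private static final Pattern PUNCT = Pattern.compile("[_\\-/.,]");
    // Simplification: segments are ASCII letters and digits only.
    private static final Pattern SEGMENT = Pattern.compile("[A-Za-z0-9]+");

    static boolean hasDigit(String s) {
        for (int i = 0; i < s.length(); i++) {
            if (Character.isDigit(s.charAt(i))) {
                return true;
            }
        }
        return false;
    }

    // True if every segment at index start, start + 2, ... contains a digit.
    static boolean everyOtherHasDigit(String[] segments, int start) {
        for (int i = start; i < segments.length; i += 2) {
            if (!hasDigit(segments[i])) {
                return false;
            }
        }
        return true;
    }

    // True if the whole string would be kept as one number-like token:
    // at least two segments, and either the even-indexed or the
    // odd-indexed segments all contain at least one digit.
    static boolean staysTogether(String text) {
        String[] segments = PUNCT.split(text, -1);
        if (segments.length < 2) {
            return false;
        }
        for (String segment : segments) {
            if (!SEGMENT.matcher(segment).matches()) {
                return false;
            }
        }
        return everyOtherHasDigit(segments, 0) || everyOtherHasDigit(segments, 1);
    }

    public static void main(String[] args) {
        System.out.println(staysTogether("x.y"));         // false: no digits anywhere
        System.out.println(staysTogether("x.1"));         // true: second segment has a digit
        System.out.println(staysTogether("192.168.0.1")); // true: every segment has a digit
    }
}
```

Under these assumptions, "x.1" is kept together while "x.y" is not, which is the kind of digit-dependent difference the comment describes.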

> PSS: What is the simplest way to alter this behavior to one that parses the
> same regardless of the presence or absence of numeric characters?

According to:

http://jakarta.apache.org/lucene/docs/api/org/apache/lucene/analysis/standard/StandardTokenizer.html

"Many applications have specific tokenizer needs. If this tokenizer does not suit your application, please consider copying this source code directory to your project and maintaining your own grammar-based tokenizer."

You need to copy StandardTokenizer.jj, change its package statement, add some import statements, add a JavaCC task to your build.xml, and, finally, modify the clause following the above comment.

Doug

