[ 
http://issues.apache.org/jira/browse/LUCENE-438?page=comments#action_12330250 ] 

Yonik Seeley commented on LUCENE-438:
-------------------------------------

Mostly to convey information across TokenFilters, and the single type string 
isn't sufficient.
For example, I'd like to have an int or long that can be used as 32 or 64 
independent flags.
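
To make the flag idea concrete, here is a minimal sketch of an int used as independent bit flags. The class name TokenFlags and every constant in it are hypothetical, chosen only for illustration; nothing here is part of Lucene.

```java
// Hypothetical flag constants and helpers; none of these names exist in Lucene.
final class TokenFlags {
    static final int PROTECTED    = 1 << 0; // do not change this token at all
    static final int NO_STEM      = 1 << 1; // stemmers should skip this token
    static final int NO_LOWERCASE = 1 << 2; // lowercasing filters should skip it
    static final int SPLIT_PART   = 1 << 3; // token was split from a larger token

    private TokenFlags() {}

    // Returns flags with every bit in mask turned on.
    static int set(int flags, int mask) {
        return flags | mask;
    }

    // Returns flags with every bit in mask turned off.
    static int clear(int flags, int mask) {
        return flags & ~mask;
    }

    // True only if every bit in mask is on in flags.
    static boolean isSet(int flags, int mask) {
        return (flags & mask) == mask;
    }
}
```

Each bit can be set, cleared and tested without disturbing the others, which is what lets one filter record several independent facts about a token in a single field.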

In general, having attributes you can dynamically attach to tokens allows you 
to decompose token filters into more basic functions and thus gives greater 
power to filter chains.

Some use cases I can think of:
 - conditionals... mark tokens in one filter and conditionally act on them in 
another.

 - decouple the marking of tokens from the transformation of tokens... one 
could have a TokenMatcherFilter that would tag certain tokens that matched a 
regex, for example.

 - protected tokens: mark certain words as "do not change", "do not stem", "do 
not lowercase" for instance.

 - mark tokens that are split from a larger token (for example when a camelCase 
filter splits "fooBar" into "foo Bar") so they may be treated differently by 
other filters
 
 - performance (hey it comes for free).  You can do things like 
StandardTokenFilter, which checks the type of the token and doesn't have to 
re-parse every single token.
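
As a rough sketch of the first three use cases, the code below marks tokens in one stage and acts on the mark in another. It deliberately does not use Lucene's classes: Tok, markMatching and lowerCaseUnlessProtected are hypothetical stand-ins, and the acronym regex is just an assumed example of something worth protecting.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.regex.Pattern;

// Hypothetical stand-ins for a token and two filter stages; not Lucene API.
final class MarkThenActSketch {
    static final int PROTECTED = 1; // assumed meaning: do not change this token

    static final class Tok {
        String text;
        int flags;
        Tok(String text) { this.text = text; }
    }

    // Stage 1: only tags tokens whose whole text matches the pattern.
    static List<Tok> markMatching(List<Tok> in, Pattern p, int flag) {
        for (Tok t : in) {
            if (p.matcher(t.text).matches()) {
                t.flags |= flag;
            }
        }
        return in;
    }

    // Stage 2: transforms tokens, but skips any token tagged as PROTECTED.
    static List<Tok> lowerCaseUnlessProtected(List<Tok> in) {
        for (Tok t : in) {
            if ((t.flags & PROTECTED) == 0) {
                t.text = t.text.toLowerCase(Locale.ROOT);
            }
        }
        return in;
    }

    // Runs both stages over the given words and returns the resulting texts.
    static List<String> run(String... words) {
        List<Tok> toks = new ArrayList<>();
        for (String w : words) {
            toks.add(new Tok(w));
        }
        Pattern acronym = Pattern.compile("[A-Z]{2,}"); // assumed example rule
        lowerCaseUnlessProtected(markMatching(toks, acronym, PROTECTED));
        List<String> out = new ArrayList<>();
        for (Tok t : toks) {
            out.add(t.text);
        }
        return out;
    }
}
```

The lowercasing stage knows nothing about regexes, and the marking stage knows nothing about lowercasing; the flag is the only thing they share.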

I've already had to implement TokenFilter functionality (protected tokens, 
token splitting and combining) where I've had to stuff more functionality 
than I'd like into a single filter because of the inability of one filter to 
provide more info to another.

So I think there's a strong case for being able to dynamically add attributes 
(set bit flags) on a token.  I planned on subclassing Token to achieve that.  
But because I don't know what other people may need/want in the future, making 
it so one can provide extensions to Token via inheritance seems like a good 
thing.
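
A sketch of what such a subclass might look like, assuming Token were no longer final. BaseToken below is a hypothetical stand-in for a non-final Token, not the real class, and FlaggedToken with its methods is likewise made up for illustration.

```java
// Hypothetical sketch of extending a non-final token class with bit flags.
final class SubclassSketch {
    static final int PROTECTED = 1; // assumed flag: downstream filters leave it alone

    // Stand-in for a Token class that permits subclassing.
    static class BaseToken {
        private final String termText;
        private final int startOffset;
        private final int endOffset;

        BaseToken(String termText, int startOffset, int endOffset) {
            this.termText = termText;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }

        String termText() { return termText; }
        int startOffset() { return startOffset; }
        int endOffset() { return endOffset; }
    }

    // The extension: same token data plus an int of independent flags.
    static class FlaggedToken extends BaseToken {
        private int flags;

        FlaggedToken(String termText, int startOffset, int endOffset) {
            super(termText, startOffset, endOffset);
        }

        void setFlag(int mask) { flags |= mask; }
        boolean hasFlag(int mask) { return (flags & mask) == mask; }
    }

    // A filter written against the base type can still honor the flag,
    // and treats plain tokens as unflagged.
    static boolean isProtected(BaseToken t) {
        return t instanceof FlaggedToken && ((FlaggedToken) t).hasFlag(PROTECTED);
    }
}
```

Filters that don't care about flags keep working with the base type unchanged, which is the appeal of doing this through inheritance rather than by changing Token itself.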



> add Token.setTermText(), remove final
> -------------------------------------
>
>          Key: LUCENE-438
>          URL: http://issues.apache.org/jira/browse/LUCENE-438
>      Project: Lucene - Java
>         Type: Improvement
>     Versions: CVS Nightly - Specify date in submission
>     Reporter: Yonik Seeley
>     Priority: Minor
>  Attachments: yonik_Token.txt
>
> The Token class should be more friendly to classes not in its package:
>   1) add setTermText()
>   2) remove final from class and toString()
>   3) add clone()
> Support for (1):
>   TokenFilters in the same package as Token are able to do things like 
>    "t.termText = t.termText.toLowerCase();" which is more efficient, but more 
> importantly less error prone.  Without the ability to change *only* the term 
> text, a new Token must be created, and one must remember to set all the 
> properties correctly.  This exact issue caused this bug:
> http://issues.apache.org/jira/browse/LUCENE-437
> Support for (2):
>   Removing final allows one to subclass Token.  I didn't see any performance 
> impact after removing final.
> I can go into more detail on why I want to subclass Token if anyone is 
> interested.
> Support for (3):
>   - support for a synonym TokenFilter, where one needs to make two tokens 
> from one (same args that support (1), and esp important if instance is a 
> subclass of Token).
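
The arguments for (1) and (3) in the quoted description can be sketched as follows. Tok is a hypothetical stand-in rather than the real Token class, its fields are assumed for illustration, and giving the synonym a position increment of 0 is an assumption made for this sketch.

```java
import java.util.Locale;

// Hypothetical sketch of setTermText() and clone(); Tok is not Lucene's Token.
final class TermTextAndCloneSketch {
    static class Tok implements Cloneable {
        private String termText;
        private final int startOffset;
        private final int endOffset;
        private final String type;
        private int positionIncrement = 1;

        Tok(String termText, int startOffset, int endOffset, String type) {
            this.termText = termText;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
            this.type = type;
        }

        String termText() { return termText; }
        int startOffset() { return startOffset; }
        int endOffset() { return endOffset; }
        String type() { return type; }
        int positionIncrement() { return positionIncrement; }

        void setTermText(String termText) { this.termText = termText; }
        void setPositionIncrement(int inc) { this.positionIncrement = inc; }

        // Copies every field, including any added by a subclass.
        @Override
        public Tok clone() {
            try {
                return (Tok) super.clone();
            } catch (CloneNotSupportedException e) {
                throw new AssertionError(e); // unreachable: Tok is Cloneable
            }
        }
    }

    // (1): change only the text; offsets, type and increment stay untouched.
    static Tok lowerCase(Tok t) {
        t.setTermText(t.termText().toLowerCase(Locale.ROOT));
        return t;
    }

    // (3): make a second token from the first, differing only where intended.
    static Tok synonymOf(Tok original, String synonym) {
        Tok copy = original.clone();
        copy.setTermText(synonym);
        copy.setPositionIncrement(0); // assumed: synonym shares the position
        return copy;
    }
}
```

Because neither method rebuilds a token from scratch, there is no list of properties to remember to copy, which is the error-proneness the description refers to.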


