As an alternative to writing a custom tokenizer, you can use the built-in
PatternTypingFilter, which does exactly this: it sets a token's type based
on whether the token matches a regex.

https://lucene.apache.org/core/9_1_0/analysis/common/org/apache/lucene/analysis/pattern/PatternTypingFilter.html
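
If you do keep the custom tokenizer, one likely cause of the behaviour you
describe is that the quoted code tests the pattern against input.toString()
rather than against the current token. Assuming input is the tokenizer's
Reader (as in Lucene's Tokenizer), its default toString() is usually something
like "java.io.StringReader@1b6d3586", which find() will always match, so every
token gets the new type. Below is a minimal plain-Java sketch of the per-token
check, with no Lucene dependency; the class name TokenTypeSketch and the helper
typeFor are hypothetical names used only for illustration.

```java
import java.util.regex.Pattern;

class TokenTypeSketch {
    // Compile once; the quoted code recompiles the pattern for every token.
    private static final Pattern ALNUM = Pattern.compile("\\p{Alnum}+");

    // Hypothetical helper: decides the type from the text of a single token.
    static String typeFor(CharSequence term) {
        // matches() requires the whole token to fit the pattern;
        // find() would accept any token that merely contains one
        // alphanumeric character.
        return ALNUM.matcher(term).matches() ? "some_random_type" : "word";
    }

    public static void main(String[] args) {
        for (String term : new String[] {"testing", "abc123", "--", "foo-bar"}) {
            System.out.println(term + " -> " + typeFor(term));
        }
    }
}
```

Inside incrementToken() the equivalent would probably be a call along the
lines of typeAttribute.setType(typeFor(termAtt)), since CharTermAttribute is a
CharSequence; treat that as a sketch against your own field names rather than
tested code.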

On Tue, May 3, 2022 at 3:47 AM dishant sharma
<[email protected]> wrote:
>
> I am creating a custom Pattern Tokenizer to change the type of the generated
> tokens. My incrementToken() function looks like the code below:
>
> public boolean incrementToken() {
>     if (index >= str.length()) return false;
>     clearAttributes();
>     if (group >= 0) {
>
>         // match a specific group
>         while (matcher.find()) {
>             index = matcher.start(group);
>             final int endIndex = matcher.end(group);
>             if (index == endIndex) continue;
>             termAtt.setEmpty().append(str, index, endIndex);
>             offsetAtt.setOffset(correctOffset(index), correctOffset(endIndex));
>             //Changing Token Type based on the pattern matcher
>             Pattern pattern = Pattern.compile("\\p{Alnum}+");
>             Matcher matcher = pattern.matcher(input.toString());
>             boolean matchFound = matcher.find();
>             if (matchFound) {
>                 typeAttribute.setType("some_random_type".toLowerCase());
>             }
>             return true;
>         }
>     }
> }
>
> I want to change the type of a generated token whenever the token matches a
> particular regex, by setting it through the typeAttribute. Here I am using the
> pattern "\p{Alnum}+", so whenever a token is alphanumeric, its type should be
> changed.
>
> Currently, I am getting the token as:
>
> "tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7, 
> "type" : "word", "position" : 0 }, ]
>
> I want the above token to be like:
>
> "tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7, 
> "type" : "some_random_type", "position" : 0 }, ]
>
> Since the token matches with the pattern "\p{Alnum}+", the type of the token 
> should be changed to the type specified inside the "typeAttribute.setType."
>
> But my code emits every token with the type "some_random_type". Even a token
> that does not match the pattern "\p{Alnum}+" gets the type
> "some_random_type".
>
> How can I make only the tokens that match the pattern "\p{Alnum}+" get the
> type "some_random_type"?
