Hi,
you pass input.toString() to the matcher. That is the entire source
character stream to be tokenized, not the current token, so I think this
leads to the result you saw.
If you'd like to match the pattern against a specific token (a substring of
the input), I think you want to give that substring of the input string
to the matcher, just as you do for termAtt.append() in your code.
Also, I'd suggest including "^" and "$" in your regex to avoid
unintentional matches.
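
A minimal sketch of that idea, using only java.util.regex so it runs without Lucene. The class TokenTypeSketch and the method typeFor() are hypothetical names made up for illustration; str, index and endIndex mirror the variables in your incrementToken(), and "word" is assumed as the default type because that is what your current output shows.

```java
import java.util.regex.Pattern;

// Hypothetical helper, not part of Lucene: picks a type from the token text only.
final class TokenTypeSketch {
    // Compiled once instead of on every token; "^" and "$" anchor the
    // pattern so the whole token has to be alphanumeric.
    private static final Pattern ALNUM = Pattern.compile("^\\p{Alnum}+$");

    static final String DEFAULT_TYPE = "word";             // assumed default type
    static final String CUSTOM_TYPE = "some_random_type";  // type name from the question

    // Match only the substring [index, endIndex) of the input, i.e. the token.
    static String typeFor(CharSequence str, int index, int endIndex) {
        CharSequence token = str.subSequence(index, endIndex);
        return ALNUM.matcher(token).find() ? CUSTOM_TYPE : DEFAULT_TYPE;
    }

    public static void main(String[] args) {
        String str = "testing foo-bar";
        System.out.println(typeFor(str, 0, 7));   // token "testing"
        System.out.println(typeFor(str, 8, 15));  // token "foo-bar"
    }
}
```

Inside the tokenizer this would amount to calling typeAttribute.setType(...) only when the match on the token substring succeeds. Since clearAttributes() resets the type at the start of incrementToken(), non-matching tokens should keep the default type.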

Tomoko


On Tue, May 3, 2022 at 16:47, dishant sharma <[email protected]> wrote:

> I am creating a custom Pattern Tokenizer to change the type of the
> generated tokens. My incrementToken() function looks like the code below:
>
> public boolean incrementToken() {
>     if (index >= str.length()) return false;
>     clearAttributes();
>     if (group >= 0) {
>
>         // match a specific group
>         while (matcher.find()) {
>             index = matcher.start(group);
>             final int endIndex = matcher.end(group);
>             if (index == endIndex) continue;
>             termAtt.setEmpty().append(str, index, endIndex);
>             offsetAtt.setOffset(correctOffset(index),
>                 correctOffset(endIndex));
>             //Changing Token Type based on the pattern matcher
>             Pattern pattern = Pattern.compile("\\p{Alnum}+");
>             Matcher matcher = pattern.matcher(input.toString());
>             boolean matchFound = matcher.find();
>             if (matchFound) {
>                 typeAttribute.setType("some_random_type".toLowerCase());
>             }
>             return true;
>         }
>     }
> }
>
> I'm trying to change the type of a generated token, via typeAttribute,
> whenever the token matches a particular regex. Here, I am using
> the pattern "\p{Alnum}+", so whenever there is an alphanumeric token, its
> type should be changed.
>
> Currently, I am getting the token as:
>
> "tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7,
> "type" : "word", "position" : 0 }, ]
>
> I want the above token to be like:
>
> "tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7,
> "type" : "some_random_type", "position" : 0 }, ]
>
> Since the token matches the pattern "\p{Alnum}+", its type should be
> changed to the type passed to typeAttribute.setType().
>
> But my code assigns the type "some_random_type" to every token. Even a
> token that does not match the pattern "\p{Alnum}+" gets the type
> "some_random_type".
>
> How can I make only the tokens that match the pattern "\p{Alnum}+" get
> the type "some_random_type"?
>
