Hi,

you pass input.toString() to the matcher. That is the entire source character stream to be tokenized, not the current token, and I think this leads to the result you saw: the pattern is checked against the whole input on every call, so every token gets the new type. If you'd like to match the pattern against a specific token (a substring of the input), I think you want to give that substring to the matcher, the same way your code already passes it to termAtt.append(). I'd also suggest including "^" and "$" in your regex to avoid unintentional matches.
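
To make the suggestion above concrete, here is a minimal sketch in plain Java, without Lucene, of deciding the type from the token substring only. The class name TokenTypeSketch, the helper typeFor() and the sample input are hypothetical and only for illustration; "word" is taken from the output shown in your message. It uses Matcher.matches(), which requires the whole token to match and so has the same effect here as adding "^" and "$" to the pattern.

```java
import java.util.regex.Pattern;

public class TokenTypeSketch {
    // Compiled once and reused, instead of once per token.
    private static final Pattern ALNUM = Pattern.compile("\\p{Alnum}+");

    public static final String DEFAULT_TYPE = "word";
    public static final String CUSTOM_TYPE = "some_random_type";

    /**
     * Hypothetical helper: picks the type by looking only at the token
     * str[start, end), never at the rest of the input.
     */
    public static String typeFor(CharSequence str, int start, int end) {
        CharSequence token = str.subSequence(start, end);
        // matches() succeeds only if the entire token is alphanumeric.
        return ALNUM.matcher(token).matches() ? CUSTOM_TYPE : DEFAULT_TYPE;
    }

    public static void main(String[] args) {
        String str = "testing -- 123";
        // Token boundaries are hard-coded here; in a tokenizer they would
        // come from the outer matcher (index and endIndex).
        int[][] spans = {{0, 7}, {8, 10}, {11, 14}};
        for (int[] span : spans) {
            String token = str.substring(span[0], span[1]);
            System.out.println(token + " -> " + typeFor(str, span[0], span[1]));
        }
    }
}
```

In your incrementToken(), the equivalent call would be something like typeAttribute.setType(typeFor(str, index, endIndex)), placed after the term and offsets are set. I have not run this inside a tokenizer, so please treat it as a starting point.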

Tomoko

On Tue, May 3, 2022 at 16:47, dishant sharma <[email protected]> wrote:

> I am creating a custom Pattern Tokenizer to change the type of the
> generated tokens. My incrementToken() function looks like the below code:
>
> public boolean incrementToken() {
>   if (index >= str.length()) return false;
>   clearAttributes();
>   if (group >= 0) {
>
>     // match a specific group
>     while (matcher.find()) {
>       index = matcher.start(group);
>       final int endIndex = matcher.end(group);
>       if (index == endIndex) continue;
>       termAtt.setEmpty().append(str, index, endIndex);
>       offsetAtt.setOffset(correctOffset(index),
>           correctOffset(endIndex));
>       //Changing Token Type based on the pattern matcher
>       Pattern pattern = Pattern.compile("\\p{Alnum}+");
>       Matcher matcher = pattern.matcher(input.toString());
>       boolean matchFound = matcher.find();
>       if (matchFound) {
>         typeAttribute.setType("some_random_type".toLowerCase());
>       }
>       return true;
>     }
>   }
> }
>
> I'm trying to change the type of the generated tokens based on the
> condition that whenever the token encounters a particular regex, using the
> typeAttribute, the type of the token should be changed. Here, I am using
> the pattern "\p{Alnum}+", so whenever there is an alphanumeric token, its
> type should be changed.
>
> Currently, I am getting the token as:
>
> "tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7,
> "type" : "word", "position" : 0 }, ]
>
> I want the above token to be like:
>
> "tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7,
> "type" : "some_random_type", "position" : 0 }, ]
>
> Since the token matches the pattern "\p{Alnum}+", the type of the
> token should be changed to the type specified inside
> "typeAttribute.setType".
>
> But the code that I have written is spitting out all the tokens with the
> type "some_random_type". Even if a token does not match the pattern
> "\p{Alnum}+", it is also getting the type "some_random_type".
>
> How can I make only the tokens that match the pattern "\p{Alnum}+" get
> the type "some_random_type"?
