As an alternative to writing a custom tokenizer, you can use the built-in PatternTypingFilter, which does exactly this: it sets a token's type based on whether the token matches a regex.
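For what it's worth, the per-token decision can be sketched in plain Java without any Lucene classes. This is only an illustration under a few assumptions: the class TokenTypeSketch and the helper typeFor are hypothetical names, "word" stands in for the default type, and `input` in the quoted code is assumed to be the tokenizer's Reader. The sketch decides the type from the token text itself with matches(). It also shows why find() on input.toString() is likely to succeed for every token: the string being searched is the Reader's description, not the token.

```java
import java.io.Reader;
import java.io.StringReader;
import java.util.regex.Pattern;

public class TokenTypeSketch {
    // Compile once; the quoted code recompiles the pattern for every token.
    private static final Pattern ALNUM = Pattern.compile("\\p{Alnum}+");

    // Hypothetical helper: choose the type from the token text itself.
    // matches() requires the whole token to be alphanumeric.
    public static String typeFor(CharSequence term) {
        return ALNUM.matcher(term).matches() ? "some_random_type" : "word";
    }

    public static void main(String[] args) {
        for (String term : new String[] {"testing", "foo-bar", "42", "..."}) {
            System.out.println(term + " -> " + typeFor(term));
        }

        // A Reader's toString() describes the object (class name plus hash code),
        // so searching it with find() says nothing about the current token.
        Reader input = new StringReader("...");
        System.out.println(ALNUM.matcher(input.toString()).find());
    }
}
```

Inside a real tokenizer or filter, the same check would run against the term attribute's contents rather than a plain String.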
https://lucene.apache.org/core/9_1_0/analysis/common/org/apache/lucene/analysis/pattern/PatternTypingFilter.html

On Tue, May 3, 2022 at 3:47 AM dishant sharma wrote:
>
> I am creating a custom Pattern Tokenizer to change the type of the generated
> tokens. My incrementToken() function looks like the below code:
>
>   public boolean incrementToken() {
>     if (index >= str.length()) return false;
>     clearAttributes();
>     if (group >= 0) {
>
>       // match a specific group
>       while (matcher.find()) {
>         index = matcher.start(group);
>         final int endIndex = matcher.end(group);
>         if (index == endIndex) continue;
>         termAtt.setEmpty().append(str, index, endIndex);
>         offsetAtt.setOffset(correctOffset(index),
>             correctOffset(endIndex));
>         //Changing Token Type based on the pattern matcher
>         Pattern pattern = Pattern.compile("\\p{Alnum}+");
>         Matcher matcher = pattern.matcher(input.toString());
>         boolean matchFound = matcher.find();
>         if (matchFound) {
>           typeAttribute.setType("some_random_type".toLowerCase());
>         }
>         return true;
>       }
>     }
>   }
>
> I'm trying to change the type of the generated tokens based on the condition
> that whenever the token encounters a particular regex, using the
> typeAttribute, the type of the token should be changed. Here, I am using the
> pattern "\p{Alnum}+", so whenever there is an alphanumeric token, its type
> should be changed.
>
> Currently, I am getting the token as:
>
> "tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7,
> "type" : "word", "position" : 0 }, ]
>
> I want the above token to be like:
>
> "tokens" : [ { "token" : "testing", "start_offset" : 0, "end_offset" : 7,
> "type" : "some_random_type", "position" : 0 }, ]
>
> Since the token matches the pattern "\p{Alnum}+", the type of the token
> should be changed to the type specified inside "typeAttribute.setType".
>
> But the code that I have written is spitting out all the tokens with the type
> "some_random_type". Even a token that does not match the pattern
> "\p{Alnum}+" is getting the type "some_random_type".
>
> How can I make only the specific tokens that match the pattern "\p{Alnum}+"
> get the type "some_random_type"?
