[ https://issues.apache.org/jira/browse/LUCENE-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524054#comment-17524054 ]
Dishant Sharma commented on LUCENE-10522:
-----------------------------------------

Can I use the OffsetAttribute somewhere in the code of PatternCaptureGroupTokenFilter.java and set the start and end offsets to those of the match found in the input string?

> issue with pattern capture group token filter
> ---------------------------------------------
>
>                 Key: LUCENE-10522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10522
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Dishant Sharma
>            Priority: Critical
>
> |The default pattern capture token filter in Elasticsearch gives the same
> start and end offsets for each generated token: those of the input string.
> Is there any way I can change the start and end offsets of each generated
> token to the positions at which it is found in the input string? The issue
> I'm currently facing is that, when highlighting, it highlights the entire
> string instead of the match.|
> The code inside my token filter factory file is:
>
> {{package pl.allegro.tech.elasticsearch.index.analysis.pl;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.pattern.PatternCaptureGroupTokenFilter;
> import org.elasticsearch.common.settings.Settings;
> import org.elasticsearch.env.Environment;
> import org.elasticsearch.index.IndexSettings;
> import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
> import java.util.regex.Pattern;
> public class PuAlPuTokenFilterFactory extends AbstractTokenFilterFactory {
>     public PuAlPuTokenFilterFactory(IndexSettings indexSettings, Environment
>             environment, String name, Settings settings) {
>         super(indexSettings, name, settings);
>     }
>     @Override
>     public TokenStream create(TokenStream tokenStream) {
>         return new PatternCaptureGroupTokenFilter(tokenStream, true,
>             Pattern.compile("(?<![^\\p{Alnum}\\p{Punct}])(\\p{Punct}\\p{Alnum}+\\p{Punct})"));
>     }
> }}}
>
> I have multiple such token filter files inside my
> code containing the same code as above, but with a different pattern passed
> to the PatternCaptureGroupTokenFilter constructor in each file. Each pattern
> is used to get a different set of tokens as per my requirement.
> I am using Lucene's default PatternCaptureGroupTokenFilter.
> I am not using any mapping, but I am using the index settings below as per
> my use case:
> "settings" : {
>   "analysis" : {
>     "analyzer" : {
>       "special_analyzer" : {
>         "tokenizer" : "whitespace",
>         "filter" : [ "url-filter-1", "url-filter-2", "url-filter-3",
>           "url-filter-4", "url-filter-5", "url-filter-6", "url-filter-7",
>           "url-filter-8", "url-filter-9", "url-filter-10", "url-filter-11",
>           "unique" ]
>       }
>     }
>   }
> }
>
> I am getting all the tokens using the regexes that I have created; the only
> issue is that all the tokens have the same start and end offsets as the
> input string.
> I am using the pattern token filter along with the whitespace tokenizer.
> Suppose I have a text: "Website url is [https://www.google.com/]"
> Then, the desired tokens are:
> Website, url, is, [https://www.google.com/], https, www, google, com, https:,
> https:/, [https://|https:], /www, .google, .com, [www|http://www/]., google.,
> com/, www [google.com|http://google.com/] etc.
> I am getting all these tokens through my regexes; the only issue is with the
> offsets. Suppose the start and end offsets of the entire URL
> "[https://www.google.com/]" are 0 and 23; then 0 and 23 are reported for all
> the generated tokens.
> However, my use case relies on highlighting: I need to highlight all the
> generated tokens inside the text.
> But the issue here is that, instead of highlighting only the match inside
> the text, it highlights the entire input text.|

--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
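
For illustration, the offset arithmetic the comment asks about can be sketched in plain Java, without Lucene on the classpath. Given a token and its start offset in the input text, each capture group's absolute offsets are the token's start plus the group's `Matcher.start(g)` and `Matcher.end(g)`. The names `CaptureOffsets`, `Span` and `capture` are hypothetical, and the pattern `(\p{Alnum}+)` is a simplified stand-in, not one of the patterns from the issue. In a custom token filter these two numbers would presumably be passed to `OffsetAttribute.setOffset(int, int)`; whether the stock PatternCaptureGroupTokenFilter can be changed this way is exactly the open question here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Lucene or Elasticsearch.
final class CaptureOffsets {

    /** One captured sub-token with absolute offsets into the original input text. */
    static final class Span {
        final String text;
        final int start; // inclusive
        final int end;   // exclusive

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + " [" + start + "," + end + ")";
        }
    }

    /**
     * Finds every capture-group match inside {@code token} and shifts the
     * group's local offsets by {@code tokenStart}, the token's start offset
     * in the full input text.
     */
    static List<Span> capture(String token, int tokenStart, Pattern pattern) {
        List<Span> spans = new ArrayList<>();
        Matcher m = pattern.matcher(token);
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.start(g) < 0) {
                    continue; // this group did not participate in the match
                }
                spans.add(new Span(m.group(g), tokenStart + m.start(g), tokenStart + m.end(g)));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Website url is https://www.google.com/";
        String token = "https://www.google.com/";
        // A whitespace tokenizer would report this as the token's start offset.
        int tokenStart = text.indexOf(token);
        for (Span s : capture(token, tokenStart, Pattern.compile("(\\p{Alnum}+)"))) {
            System.out.println(s);
        }
    }
}
```

With offsets computed this way, a highlighter working from `start` and `end` would mark only the captured substring, for example `text.substring(span.start, span.end)`, rather than the whole whitespace-delimited token.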