[jira] [Commented] (LUCENE-10522) issue with pattern capture group token filter

Dishant Sharma (Jira) Mon, 18 Apr 2022 22:05:07 -0700


    [ 
https://issues.apache.org/jira/browse/LUCENE-10522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17524054#comment-17524054
 ]


Dishant Sharma commented on LUCENE-10522:
-----------------------------------------

Can I use the OffsetAttribute somewhere in PatternCaptureGroupTokenFilter.java 
to set each token's start and end offsets to those of the match found in the 
input string?
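
To make the question concrete, here is a minimal sketch of the offset arithmetic involved, using only java.util.regex. The class MatchOffsets, the nested Span type, and the method captureSpans are hypothetical names for illustration and are not part of Lucene. Inside a real TokenFilter the computed values would presumably be passed to OffsetAttribute.setOffset(start, end); whether changing offsets there is safe for indexing is a separate question this sketch does not answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MatchOffsets {
    /** One capture-group match with absolute [start, end) offsets in the original text. */
    public static final class Span {
        public final String text;
        public final int start;
        public final int end;

        Span(String text, int start, int end) {
            this.text = text;
            this.start = start;
            this.end = end;
        }

        @Override
        public String toString() {
            return text + " [" + start + ", " + end + ")";
        }
    }

    /**
     * Finds every capture-group match in one token and shifts the match
     * positions by the token's own start offset.
     */
    public static List<Span> captureSpans(String token, int tokenStart, Pattern pattern) {
        List<Span> spans = new ArrayList<>();
        Matcher m = pattern.matcher(token);
        while (m.find()) {
            for (int g = 1; g <= m.groupCount(); g++) {
                if (m.start(g) < 0) {
                    continue; // this group did not take part in the match
                }
                spans.add(new Span(m.group(g), tokenStart + m.start(g), tokenStart + m.end(g)));
            }
        }
        return spans;
    }

    public static void main(String[] args) {
        // Assumption: the whitespace tokenizer emitted this token starting at offset 15.
        Pattern alnumRuns = Pattern.compile("(\\p{Alnum}+)");
        for (Span s : captureSpans("https://www.google.com/", 15, alnumRuns)) {
            System.out.println(s);
        }
    }
}
```

The token start offset is added to Matcher.start(group) and Matcher.end(group), so each sub-token gets the position of its own match rather than that of the whole token.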

> issue with pattern capture group token filter
> ---------------------------------------------
>
>                 Key: LUCENE-10522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-10522
>             Project: Lucene - Core
>          Issue Type: Task
>            Reporter: Dishant Sharma
>            Priority: Critical
>
> |The default pattern capture token filter in Elasticsearch gives every 
> generated token the same start and end offsets: those of the input string. 
> Is there any way to change the start and end offsets of each generated token 
> to the positions at which it is found in the input string? The issue I am 
> currently facing is that highlighting marks the entire string instead of 
> just the match.|
> The code inside my token filter factory file is:
>  
> {{package pl.allegro.tech.elasticsearch.index.analysis.pl;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.pattern.PatternCaptureGroupTokenFilter;
> import org.elasticsearch.common.settings.Settings;
> import org.elasticsearch.env.Environment;
> import org.elasticsearch.index.IndexSettings;
> import org.elasticsearch.index.analysis.AbstractTokenFilterFactory;
> import java.util.regex.Pattern;
> public class PuAlPuTokenFilterFactory extends AbstractTokenFilterFactory {
>     public PuAlPuTokenFilterFactory(IndexSettings indexSettings, Environment environment,
>                                     String name, Settings settings) {
>         super(indexSettings, name, settings);
>     }
>     @Override
>     public TokenStream create(TokenStream tokenStream) {
>         return new PatternCaptureGroupTokenFilter(tokenStream, true,
>             Pattern.compile("(?<![^\\p{Alnum}\\p{Punct}])(\\p{Punct}\\p{Alnum}+\\p{Punct})"));
>     }
> }}}
>  
> I have multiple such token filter files in my code. Each contains the same 
> code as above but passes a different pattern to the 
> PatternCaptureGroupTokenFilter constructor, so that each pattern produces a 
> different set of tokens as per my requirement.
> I am using Lucene's default PatternCaptureGroupTokenFilter.
> I am not using any mapping, but I am using the index settings below as per my 
> use case:
> "settings" : {
>       "analysis" : {
>          "analyzer" : {
>             "special_analyzer" : {
>                "tokenizer" : "whitespace",
>                "filter" : [ "url-filter-1", "url-filter-2", "url-filter-3", 
> "url-filter-4", "url-filter-5", "url-filter-6", "url-filter-7", 
> "url-filter-8", "url-filter-9", "url-filter-10", "url-filter-11", "unique" ]
>             }
>          }
>       }
>    }
>  
> I am getting all the tokens using the regexes that I have created; the only 
> issue is that all the tokens have the same start and end offsets as the 
> input string.
> I am using the pattern token filter along with the whitespace tokenizer. 
> Suppose I have the text: "Website url is https://www.google.com/"
> Then the desired tokens are:
> Website, url, is, https://www.google.com/, https, www, google, com, https:, 
> https:/, https://, /www, .google, .com, www., google., 
> com/, www google.com etc.
> I am getting all these tokens through my regexes; the only issue is with the 
> offsets. Suppose the start and end offsets of the entire URL 
> "https://www.google.com/" are 0 and 23; then it gives 0 and 23 for all 
> the generated tokens.
> My use case relies on highlighting, which should highlight each generated 
> token inside the text. Instead of highlighting only the match, it 
> highlights the entire input text.
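
The effect described above can be illustrated with a small stand-alone sketch. It is not Elasticsearch's highlighter: HighlightDemo and highlight are hypothetical names, and the `<em>` tags are only assumed as markers. In the full example text the URL starts at offset 15 rather than 0, so the whole-token offsets here are 15 and 38.

```java
public class HighlightDemo {
    /** Wraps the [start, end) slice of text in <em> tags. */
    public static String highlight(String text, int start, int end) {
        return text.substring(0, start)
            + "<em>" + text.substring(start, end) + "</em>"
            + text.substring(end);
    }

    public static void main(String[] args) {
        String text = "Website url is https://www.google.com/";

        // Offsets of the whole whitespace token, which every generated token currently reports.
        int tokenStart = text.indexOf("https://");
        int tokenEnd = text.length();
        System.out.println(highlight(text, tokenStart, tokenEnd));
        // prints: Website url is <em>https://www.google.com/</em>

        // Offsets of the match itself, which is what the highlighting use case needs.
        int matchStart = text.indexOf("google");
        int matchEnd = matchStart + "google".length();
        System.out.println(highlight(text, matchStart, matchEnd));
        // prints: Website url is https://www.<em>google</em>.com/
    }
}
```

With identical offsets on every generated token, a hit on "google" can only mark the span the token reports, which is the whole URL.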



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
