Atul created LUCENE-8181:
-----------------------------

             Summary: WordDelimiterTokenFilter does not generate all tokens appropriately
                 Key: LUCENE-8181
                 URL: https://issues.apache.org/jira/browse/LUCENE-8181
             Project: Lucene - Core
          Issue Type: Bug
          Components: modules/analysis
    Affects Versions: 7.2.1
         Environment: *Steps to reproduce:*

*1. Create index*

PUT testindex
{
  "settings": {
    "index": {
      "number_of_shards": 2,
      "number_of_replicas": 2
    },
    "analysis": {
      "filter": {
        "wordDelimiter": {
          "type": "word_delimiter",
          "generate_word_parts": "true",
          "generate_number_parts": "true",
          "catenate_words": "false",
          "catenate_numbers": "false",
          "catenate_all": "false",
          "split_on_case_change": "true",
          "preserve_original": "true",
          "split_on_numerics": "true",
          "stem_english_possessive": "true"
        }
      },
      "analyzer": {
        "content_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "asciifolding",
            "wordDelimiter",
            "lowercase"
          ]
        }
      }
    }
  }
}
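For reference, the same analyzer can be sketched at the Lucene level with CustomAnalyzer. This is a sketch under assumptions: the factory names ("whitespace", "asciifolding", "worddelimiter", "lowercase") and the camelCase WordDelimiterFilterFactory parameter keys below are my assumed mapping of the Elasticsearch options onto Lucene 7.2's analysis-common module, and the class name ContentAnalyzerEquivalent is only for illustration:

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.custom.CustomAnalyzer;

public class ContentAnalyzerEquivalent {
  // Builds a Lucene-level equivalent of the "content_analyzer" settings above.
  // Assumption: the ES word_delimiter options map 1:1 onto the camelCase
  // parameters of WordDelimiterFilterFactory in analysis-common (Lucene 7.2).
  static Analyzer build() throws Exception {
    return CustomAnalyzer.builder()
        .withTokenizer("whitespace")
        .addTokenFilter("asciifolding")
        .addTokenFilter("worddelimiter",
            "generateWordParts", "1",
            "generateNumberParts", "1",
            "catenateWords", "0",
            "catenateNumbers", "0",
            "catenateAll", "0",
            "splitOnCaseChange", "1",
            "preserveOriginal", "1",
            "splitOnNumerics", "1",
            "stemEnglishPossessive", "1")
        .addTokenFilter("lowercase")
        .build();
  }
}
{code}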
*2. Analyze text*

POST testindex/_analyze
{
  "analyzer": "content_analyzer",
  "text": "ElasticSearch.TestProject"
}

*The following tokens are generated:*

{ "token": "elasticsearch.testproject", "start_offset": 0, "end_offset": 25, "type": "word", "position": 0 },
{ "token": "elastic", "start_offset": 0, "end_offset": 7, "type": "word", "position": 0 },
{ "token": "search", "start_offset": 7, "end_offset": 13, "type": "word", "position": 1 },
{ "token": "test", "start_offset": 14, "end_offset": 18, "type": "word", "position": 2 },
{ "token": "project", "start_offset": 18, "end_offset": 25, "type": "word", "position": 3 }

*Expected result:* In addition to the tokens above, elasticsearch and testproject should also be generated, so that the phrase query "elasticsearch testproject" matches as well.

*Another example:* the text *"Super-Duper-0-AutoCoder"* run through the analyzer above generates the token *autocoder*, while the text *"Super-Duper-AutoCoder"* does NOT generate the token *autocoder*.

            Reporter: Atul

When using the word delimiter token filter, some expected tokens are not generated. When I analyze the text "ElasticSearch.TestProject", I expect the tokens elastic, search, test, project, elasticsearch, testproject, and elasticsearch.testproject to be generated, since split_on_case_change and split_on_numerics are enabled, preserve_original is true, and a whitespace tokenizer is used. But I actually only see the following tokens: elasticsearch.testproject, elastic, search, test, project.
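To take Elasticsearch out of the picture, here is a minimal Lucene-level sketch of the same chain, assuming the word_delimiter filter above corresponds to org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter with the matching flags (class and flag names are from Lucene 7.2; the class name WordDelimiterRepro is only for illustration):

{code:java}
import java.io.StringReader;

import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;

public class WordDelimiterRepro {
  public static void main(String[] args) throws Exception {
    // Flags mirroring the index settings above; all catenate_* options are off.
    // WordDelimiterFilter is deprecated in 7.x but is what the ES
    // word_delimiter filter is assumed to use here.
    int flags = WordDelimiterFilter.GENERATE_WORD_PARTS
        | WordDelimiterFilter.GENERATE_NUMBER_PARTS
        | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE
        | WordDelimiterFilter.SPLIT_ON_NUMERICS
        | WordDelimiterFilter.PRESERVE_ORIGINAL
        | WordDelimiterFilter.STEM_ENGLISH_POSSESSIVE;

    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("ElasticSearch.TestProject"));

    // Same chain as content_analyzer: whitespace -> asciifolding
    // -> wordDelimiter -> lowercase.
    TokenStream ts = new LowerCaseFilter(
        new WordDelimiterFilter(new ASCIIFoldingFilter(tokenizer), flags, null));

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    OffsetAttribute offset = ts.addAttribute(OffsetAttribute.class);
    PositionIncrementAttribute posIncr = ts.addAttribute(PositionIncrementAttribute.class);

    ts.reset();
    int position = -1;
    while (ts.incrementToken()) {
      position += posIncr.getPositionIncrement();
      System.out.println(term + " start=" + offset.startOffset()
          + " end=" + offset.endOffset() + " position=" + position);
    }
    ts.end();
    ts.close();
  }
}
{code}

If I read the catenation options correctly, adding WordDelimiterFilter.CATENATE_WORDS to the flags should make the filter also emit elasticsearch and testproject for this input; the report here is that the configuration above was expected to produce them even without catenation.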