[ https://issues.apache.org/jira/browse/LUCENE-8181?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16394729#comment-16394729 ]

Robin Stocker commented on LUCENE-8181:
---------------------------------------

I think this is the intended behavior of the filter at the moment. Having said 
that, it would be really useful for analyzing source code to have an option to 
generate those additional tokens.

Another interesting example to consider:
{code:java}
FooBar.baz_qux
{code}
In this case, being able to produce the following tokens would be _really_ 
useful:

{{foo}}, {{bar}}, {{baz}}, {{qux}}, {{foobar}}, {{baz_qux}}, {{foobar.baz_qux}}
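
A quick way to see what the filter emits for that input today is to run WordDelimiterGraphFilter directly. A minimal sketch (the class name and flag combination below are just illustrative, not a proposed API):
{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class WdgfTokens {
  public static void main(String[] args) throws Exception {
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader("FooBar.baz_qux"));

    // Closest existing options: split on case changes and delimiters,
    // keep the original token, and catenate word parts.
    int flags = WordDelimiterGraphFilter.GENERATE_WORD_PARTS
        | WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE
        | WordDelimiterGraphFilter.CATENATE_WORDS
        | WordDelimiterGraphFilter.PRESERVE_ORIGINAL;

    TokenStream ts = new WordDelimiterGraphFilter(tokenizer, flags, null); // null = no protected words
    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    while (ts.incrementToken()) {
      System.out.println(term);
    }
    ts.end();
    ts.close();
  }
}
{code}
As far as I can tell, CATENATE_WORDS joins the maximal run of word parts, so this prints the four word parts, the catenation of all of them, and the preserved original, but never the partial catenations {{foobar}} and {{baz_qux}} listed above.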

> WordDelimiterTokenFilter does not generate all tokens appropriately
> -------------------------------------------------------------------
>
>                 Key: LUCENE-8181
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8181
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 7.2.1
>         Environment: *Steps to reproduce*:
> *1. Create index*
> {code}
> PUT testindex
> {
>   "settings": {
>     "index": {
>       "number_of_shards": 2,
>       "number_of_replicas": 2
>     },
>     "analysis": {
>       "filter": {
>         "wordDelimiter": {
>           "type": "word_delimiter",
>           "generate_word_parts": "true",
>           "generate_number_parts": "true",
>           "catenate_words": "false",
>           "catenate_numbers": "false",
>           "catenate_all": "false",
>           "split_on_case_change": "true",
>           "preserve_original": "true",
>           "split_on_numerics": "true",
>           "stem_english_possessive": "true"
>         }
>       },
>       "analyzer": {
>         "content_analyzer": {
>           "type": "custom",
>           "tokenizer": "whitespace",
>           "filter": [
>             "asciifolding",
>             "wordDelimiter",
>             "lowercase"
>           ]
>         }
>       }
>     }
>   }
> }
> {code}
> *2. Analyze text*
> {code}
> POST testindex/_analyze
> {
>   "analyzer": "content_analyzer",
>   "text": "ElasticSearch.TestProject"
> }
> {code}
> *The following tokens are generated:*
> {code}
> [
>   { "token": "elasticsearch.testproject", "start_offset": 0,  "end_offset": 25, "type": "word", "position": 0 },
>   { "token": "elastic",                   "start_offset": 0,  "end_offset": 7,  "type": "word", "position": 0 },
>   { "token": "search",                    "start_offset": 7,  "end_offset": 13, "type": "word", "position": 1 },
>   { "token": "test",                      "start_offset": 14, "end_offset": 18, "type": "word", "position": 2 },
>   { "token": "project",                   "start_offset": 18, "end_offset": 25, "type": "word", "position": 3 }
> ]
> {code}
> *Expected result:*
> In addition to the tokens above, *elasticsearch* and *testproject* should also be
> generated, so that the phrase query "elasticsearch testproject" matches.
> *Another example:*
> With the above analyzer, the text *"Super-Duper-0-AutoCoder"* generates the token
> *autocoder*, while the text *"Super-Duper-AutoCoder"* does NOT.
>            Reporter: Atul
>            Priority: Major
>
> When using the word delimiter token filter, some expected tokens are not generated.
> When I analyze the text "ElasticSearch.TestProject", I expect the tokens elastic,
> search, test, project, elasticsearch, testproject, and elasticsearch.testproject to
> be generated, since split_on_case_change and split_on_numerics are on, the tokenizer
> is whitespace, and preserve_original is true.
> But actually I only see the following tokens:
> elasticsearch.testproject, elastic, search, test, project
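
For reference, the reported behavior can be reproduced in plain Lucene, without Elasticsearch. Below is a minimal sketch of an equivalent analysis chain, using WordDelimiterGraphFilter (the replacement for the deprecated WordDelimiterFilter) with the same options the reporter enabled; class names and flags are from the Lucene 7.x analysis module, and the chain is illustrative:
{code:java}
import java.io.StringReader;
import org.apache.lucene.analysis.LowerCaseFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.ASCIIFoldingFilter;
import org.apache.lucene.analysis.miscellaneous.WordDelimiterGraphFilter;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class ReproduceLucene8181 {

  // Same options as the reporter's word_delimiter settings (all catenate_* off).
  private static final int FLAGS =
      WordDelimiterGraphFilter.GENERATE_WORD_PARTS
          | WordDelimiterGraphFilter.GENERATE_NUMBER_PARTS
          | WordDelimiterGraphFilter.SPLIT_ON_CASE_CHANGE
          | WordDelimiterGraphFilter.SPLIT_ON_NUMERICS
          | WordDelimiterGraphFilter.PRESERVE_ORIGINAL
          | WordDelimiterGraphFilter.STEM_ENGLISH_POSSESSIVE;

  public static void main(String[] args) throws Exception {
    analyze("ElasticSearch.TestProject");
    analyze("Super-Duper-0-AutoCoder");
    analyze("Super-Duper-AutoCoder");
  }

  private static void analyze(String text) throws Exception {
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    tokenizer.setReader(new StringReader(text));

    // whitespace tokenizer -> asciifolding -> wordDelimiter -> lowercase
    TokenStream ts = new ASCIIFoldingFilter(tokenizer);
    ts = new WordDelimiterGraphFilter(ts, FLAGS, null); // null = no protected words
    ts = new LowerCaseFilter(ts);

    CharTermAttribute term = ts.addAttribute(CharTermAttribute.class);
    ts.reset();
    System.out.print(text + ":");
    while (ts.incrementToken()) {
      System.out.print(" " + term);
    }
    System.out.println();
    ts.end();
    ts.close();
  }
}
{code}
With these options the filter emits the individual word parts plus, via PRESERVE_ORIGINAL, the original token; as far as I can tell, the CATENATE_* options would only add the full catenation (e.g. {{elasticsearchtestproject}}), so nothing currently produces the partial catenations {{elasticsearch}} and {{testproject}} that the reporter expects.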


