Adrien,

Regarding the boosting issue: I have a field "text" and I'm applying a
query-time boost like

fields=["text^30"]

Assume I have a doc like {text:"new delhi to goa"}. Now if I query for
"delhi to goa", only the term goa gets the boost (goa^30, as you can see
in the explain output above), but I expected delhi to be boosted as well
(delhi^30), which is not happening. Is it that goa is not analyzed and is
therefore treated as a term, while delhi, since it is processed by the
analyzer, is not treated as a term?
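To illustrate what I suspect is happening, here is a toy sketch in plain
Python (not Elasticsearch internals; the edge-ngram-style analyzer and the
function names are my own approximation). Splitting the query on whitespace
before analyzing each word yields exactly the four terms that show up in the
explain output (del, delh, delhi, goa), while analyzing the whole string
would also produce the multi-word grams:

```python
def edge_ngrams(text, min_len=3):
    """Edge n-grams of a string, e.g. 'delhi' -> ['del', 'delh', 'delhi']."""
    return [text[:i] for i in range(min_len, len(text) + 1)]

def analyze_whole(query):
    # Analyzer sees the full string: multi-word grams like 'delhi to' appear.
    tokens = []
    words = query.split()
    for start in range(len(words)):
        suffix = " ".join(words[start:])
        tokens.extend(edge_ngrams(suffix))
    return tokens

def analyze_per_word(query):
    # query_string-style: split on whitespace first, then analyze each word.
    tokens = []
    for word in query.split():
        tokens.extend(edge_ngrams(word))
    return tokens

whole = analyze_whole("delhi to goa")
per_word = analyze_per_word("delhi to goa")
print(per_word)                             # the terms the parser scores
print(sorted(set(whole) - set(per_word)))   # tokens lost to the whitespace split
```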

Thanks




On Wed, Feb 5, 2014 at 9:26 PM, Adrien Grand <[email protected]> wrote:

> Hi,
>
> Indeed, query_string splits on whitespace before applying the analyzer.
> You could try the match query[1], which doesn't have this flaw, or the new
> simple_query_string[2], which has the ability to disable the whitespace
> operator (just provide a list of flags that doesn't contain WHITESPACE).
>
> However I didn't understand your boosting issue, what query did you send
> to Elasticsearch?
>
> [1]
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
> [2]
> http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html#_simple_query_string_syntax
>
>
> On Wed, Feb 5, 2014 at 4:47 AM, coder <[email protected]> wrote:
>
>> I started using the explain API for query_string, and in the process I
>> think I found a bug (I don't know whether it really is a bug or intended
>> behaviour of query_string). This is going to be a long post, so please be
>> patient with me.
>>
>> I'm using a doc: {name:"new delhi to goa", st:"goa"}
>> Running the analyze API with my index analyzer gave me these tokens:
>>
>> {
>>   "tokens" : [ {
>>     "token" : "new",
>>     "start_offset" : 0,
>>     "end_offset" : 3,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new",
>>     "start_offset" : 0,
>>     "end_offset" : 9,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new ",
>>     "start_offset" : 0,
>>     "end_offset" : 9,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new d",
>>     "start_offset" : 0,
>>     "end_offset" : 9,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new de",
>>     "start_offset" : 0,
>>     "end_offset" : 9,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new del",
>>     "start_offset" : 0,
>>     "end_offset" : 9,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delh",
>>     "start_offset" : 0,
>>     "end_offset" : 9,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delhi",
>>     "start_offset" : 0,
>>     "end_offset" : 9,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new ",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new d",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new de",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new del",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delh",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delhi",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delhi ",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delhi t",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delhi to",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new",
>>     "start_offset" : 0,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new ",
>>     "start_offset" : 0,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new d",
>>     "start_offset" : 0,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new de",
>>     "start_offset" : 0,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new del",
>>     "start_offset" : 0,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delh",
>>     "start_offset" : 0,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delhi",
>>     "start_offset" : 0,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delhi ",
>>     "start_offset" : 0,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delhi t",
>>     "start_offset" : 0,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delhi to",
>>     "start_offset" : 0,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delhi to ",
>>     "start_offset" : 0,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delhi to g",
>>     "start_offset" : 0,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delhi to go",
>>     "start_offset" : 0,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "new delhi to goa",
>>     "start_offset" : 0,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "del",
>>     "start_offset" : 4,
>>     "end_offset" : 9,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delh",
>>     "start_offset" : 4,
>>     "end_offset" : 9,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delhi",
>>     "start_offset" : 4,
>>     "end_offset" : 9,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "del",
>>     "start_offset" : 4,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delh",
>>     "start_offset" : 4,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delhi",
>>     "start_offset" : 4,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delhi ",
>>     "start_offset" : 4,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delhi t",
>>     "start_offset" : 4,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delhi to",
>>     "start_offset" : 4,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "del",
>>     "start_offset" : 4,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delh",
>>     "start_offset" : 4,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delhi",
>>     "start_offset" : 4,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delhi ",
>>     "start_offset" : 4,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delhi t",
>>     "start_offset" : 4,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delhi to",
>>     "start_offset" : 4,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delhi to ",
>>     "start_offset" : 4,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delhi to g",
>>     "start_offset" : 4,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delhi to go",
>>     "start_offset" : 4,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "delhi to goa",
>>     "start_offset" : 4,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "to ",
>>     "start_offset" : 10,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 3
>>   }, {
>>     "token" : "to g",
>>     "start_offset" : 10,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 3
>>   }, {
>>     "token" : "to go",
>>     "start_offset" : 10,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 3
>>   }, {
>>     "token" : "to goa",
>>     "start_offset" : 10,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 3
>>   }, {
>>     "token" : "goa",
>>     "start_offset" : 13,
>>     "end_offset" : 16,
>>     "type" : "word",
>>     "position" : 4
>>   } ]
>> }
>>
>> Now, if I query for "delhi to goa", I get this from the search_analyzer:
>>
>> {
>>   "tokens" : [ {
>>     "token" : "del",
>>     "start_offset" : 0,
>>     "end_offset" : 5,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delh",
>>     "start_offset" : 0,
>>     "end_offset" : 5,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delhi",
>>     "start_offset" : 0,
>>     "end_offset" : 5,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "del",
>>     "start_offset" : 0,
>>     "end_offset" : 8,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delh",
>>     "start_offset" : 0,
>>     "end_offset" : 8,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delhi",
>>     "start_offset" : 0,
>>     "end_offset" : 8,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delhi ",
>>     "start_offset" : 0,
>>     "end_offset" : 8,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delhi t",
>>     "start_offset" : 0,
>>     "end_offset" : 8,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delhi to",
>>     "start_offset" : 0,
>>     "end_offset" : 8,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "del",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delh",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delhi",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delhi ",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delhi t",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delhi to",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delhi to ",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delhi to g",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delhi to go",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "delhi to goa",
>>     "start_offset" : 0,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 1
>>   }, {
>>     "token" : "to ",
>>     "start_offset" : 6,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "to g",
>>     "start_offset" : 6,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "to go",
>>     "start_offset" : 6,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "to goa",
>>     "start_offset" : 6,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 2
>>   }, {
>>     "token" : "goa",
>>     "start_offset" : 9,
>>     "end_offset" : 12,
>>     "type" : "word",
>>     "position" : 3
>>   } ]
>> }
>>
>> On using the explain API, I get the following:
>>
>> {text=new delhi to goa,boostFactor=9.820192307,po=9.82}
>> 510.39673 = custom score, product of:
>>   510.39673 = script score function: composed of:
>>     510.39673 = sum of:
>>       371.12375 = max of:
>>         371.12375 = sum of:
>>           104.61707 = weight(text:del in 1003990) [PerFieldSimilarity], result of:
>>             104.61707 = score(doc=1003990,freq=5.0 = termFreq=5.0), product of:
>>               0.43576795 = queryWeight, product of:
>>                 5.368244 = idf(docFreq=53067, maxDocs=4187328)
>>                 0.08117513 = queryNorm
>>               240.0752 = fieldWeight in 1003990, product of:
>>                 2.236068 = tf(freq=5.0), with freq of:
>>                   5.0 = termFreq=5.0
>>                 5.368244 = idf(docFreq=53067, maxDocs=4187328)
>>                 20.0 = fieldNorm(doc=1003990)
>>           133.24011 = weight(text:delh in 1003990) [PerFieldSimilarity], result of:
>>             133.24011 = score(doc=1003990,freq=5.0 = termFreq=5.0), product of:
>>               0.49178073 = queryWeight, product of:
>>                 6.058268 = idf(docFreq=26616, maxDocs=4187328)
>>                 0.08117513 = queryNorm
>>               270.934 = fieldWeight in 1003990, product of:
>>                 2.236068 = tf(freq=5.0), with freq of:
>>                   5.0 = termFreq=5.0
>>                 6.058268 = idf(docFreq=26616, maxDocs=4187328)
>>                 20.0 = fieldNorm(doc=1003990)
>>           133.26657 = weight(text:delhi in 1003990) [PerFieldSimilarity], result of:
>>             133.26657 = score(doc=1003990,freq=5.0 = termFreq=5.0), product of:
>>               0.49182954 = queryWeight, product of:
>>                 6.0588694 = idf(docFreq=26600, maxDocs=4187328)
>>                 0.08117513 = queryNorm
>>               270.96088 = fieldWeight in 1003990, product of:
>>                 2.236068 = tf(freq=5.0), with freq of:
>>                   5.0 = termFreq=5.0
>>                 6.0588694 = idf(docFreq=26600, maxDocs=4187328)
>>                 20.0 = fieldNorm(doc=1003990)
>>       139.27298 = max of:
>>         139.27298 = weight(text:goa^20.0 in 1003990) [PerFieldSimilarity], result of:
>>           139.27298 = score(doc=1003990,freq=3.0 = termFreq=3.0), product of:
>>             0.5712808 = queryWeight, product of:
>>               20.0 = boost
>>               7.037633 = idf(docFreq=9995, maxDocs=4187328)
>>               0.004058757 = queryNorm
>>             243.79076 = fieldWeight in 1003990, product of:
>>               1.7320508 = tf(freq=3.0), with freq of:
>>                 3.0 = termFreq=3.0
>>               7.037633 = idf(docFreq=9995, maxDocs=4187328)
>>               20.0 = fieldNorm(doc=1003990)
>>   1.0 = queryBoost
>>
>> Though the explain output above shows results for:
>> del
>> delh
>> delhi
>> goa
>>
>> it does not show results for the other tokens generated by my search
>> analyzer. Why is that?
>>
>> I have read that query_string uses the Lucene query parser by default, so
>> my guess is that query_string is applying a whitespace tokenizer on top of
>> the tokens generated by my search analyzer; am I correct? How can I make
>> query_string calculate a score for all of the tokens generated by the
>> search_analyzer? Please correct me if I am wrong.
>>
>> There is one more thing I noticed: I'm using a query-time boost on one of
>> my doc fields, but it is not working the way I thought it would. In the
>> explain output above you can see that there is a boost associated with goa
>> but not with delhi, though both goa and delhi are present in the original
>> doc. My guess is that query_string applies the boost only when a term of
>> the user-typed string passes through unchanged by the analyzer: in the
>> example above, goa is kept as-is while delhi is analyzed further. Am I
>> correct?
>>
>> Awaiting your reply!
>>
>> Thanks
>>
>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/10dd24df-fe87-430d-8433-73df1acb1d0c%40googlegroups.com
>> .
>> For more options, visit https://groups.google.com/groups/opt_out.
>>
>
>
>
> --
> Adrien Grand
>
>
