Re: Relevancy sorting of result returned

chee hoo lum Wed, 09 Apr 2014 02:05:18 -0700

Thanks, Ivan!


On Tue, Apr 8, 2014 at 1:09 PM, Ivan Brusic <[email protected]> wrote:

> I do not think most users would expect the results in that order. The
> character length does not provide relevance for most cases. Why is a
> shorter word more relevant? I would say that most would rank "Happy
> Together" higher since word proximity is a helpful metric. Happy should
> rank first due to the length norm.
>
> You can always play around with the function score, but I would rather deal
> with non-dynamic metrics at indexing time.
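
To illustrate the "non-dynamic metric at indexing time" idea above, here is a minimal Python sketch that adds a character-count field to each document before it is sent to the index. The field name DISPLAY_NAME_LENGTH and the helper with_name_length are hypothetical names chosen for this example; they do not appear anywhere in the thread.

```python
def with_name_length(doc, source_field="DISPLAY_NAME",
                     length_field="DISPLAY_NAME_LENGTH"):
    """Return a copy of doc with the character count of source_field added.

    length_field is a hypothetical field name used only for illustration.
    """
    enriched = dict(doc)
    enriched[length_field] = len(doc.get(source_field, ""))
    return enriched


# Sample titles taken from the thread.
docs = [
    {"DISPLAY_NAME": "Happy"},
    {"DISPLAY_NAME": "Be Happy"},
    {"DISPLAY_NAME": "Happy Together"},
]

# Enrich every document before handing it to whatever indexing client is in use.
enriched_docs = [with_name_length(d) for d in docs]
for d in enriched_docs:
    print(d)
```

Because the length is fixed when the document is written, nothing extra has to be computed per query.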
>
> --
> Ivan
>
>
> On Mon, Apr 7, 2014 at 8:23 AM, chee hoo lum <[email protected]> wrote:
>
>> Hi Ivan,
>>
>> Hmm... This seems like a viable workaround, but I wanted to know whether
>> there are any other ways to do it.
>> This doesn't seem like a unique problem, I guess, as most users will expect
>> results with similar scores to be sorted (when performing a search) in the
>> following order:
>>
>> 1. Happy
>> 2. Be Happy
>> 3. Be Happy
>> 4. Happy Together
>>
>> It is live data in production. I have 180k documents residing in 5 shards
>> across 5 nodes, with one replica each. Even with 180k documents I still
>> have this ordering issue, coupled with an inconsistency issue because
>> results are fetched from the primary and the replica intermittently.
>> Therefore I need to use
>> /media/_search?pretty=&search_type=dfs_query_then_fetch&preference=_primary
>> to solve the inconsistency, and I am now left with this sorting to solve.
>>
>> Thanks.
>>
>>
>>
>>  On Mon, Apr 7, 2014 at 7:13 AM, Ivan Brusic <[email protected]> wrote:
>>
>>>  You can index the number of characters in your string into a new field
>>> and then do a secondary sort on this field.
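
As a rough sketch of the secondary sort suggested above, the Python below builds a search body that orders hits by score first and breaks ties with a length field. It assumes a numeric field holding the character count of DISPLAY_NAME has already been indexed; the name DISPLAY_NAME_LENGTH and the helper build_search_body are hypothetical.

```python
import json


def build_search_body(text, length_field="DISPLAY_NAME_LENGTH"):
    """Build a search body: sort by relevance, then by shortest name.

    length_field is a hypothetical, pre-indexed numeric field.
    """
    return {
        "query": {
            "multi_match": {"query": text, "fields": ["DISPLAY_NAME"]}
        },
        "sort": [
            "_score",                            # primary: relevance, highest first
            {length_field: {"order": "asc"}},    # tie-break: fewer characters first
        ],
    }


body = build_search_body("happy")
print(json.dumps(body, indent=2))
```

With this ordering, documents that share a score (such as "Be Happy" and "Happy Ways") would be ranked by character count instead of in arbitrary order.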
>>>
>>> Are you testing against real data or only against some test set? The
>>> Lucene scoring model will improve with the addition of more documents. As
>>> more documents are added, the term frequencies and inverse document
>>> frequencies start to diverge and contribute more to the scoring. You will
>>> not have many documents with the same score.
>>>
>>> --
>>> Ivan
>>>
>>>
>>> On Sun, Apr 6, 2014 at 12:38 AM, <[email protected]> wrote:
>>>
>>>>
>>>> Hi Ivan,
>>>>
>>>> Because I want results with the same score sorted this way:
>>>>
>>>> 1. Be happy
>>>> 2. Be happy
>>>> 3. Happy ways
>>>>
>>>> Currently they are sorted as:
>>>> 1. Be happy
>>>> 2. Happy ways
>>>> 3. Be happy
>>>>
>>>> This is because they all return the same score. Any suggestions?
>>>>
>>>> Thanks
>>>>
>>>> On 6 Apr, 2014, at 4:24 am, Ivan Brusic <[email protected]> wrote:
>>>>
>>>> Lucene will indeed, by default, give a higher score to shorter text,
>>>> but the "shortness" is the number of tokens, not the number of characters.
>>>> In your last example, each field has two tokens, so the length is the same.
>>>> The term frequency is also the same for each document ("happy" appears
>>>> once) and the inverse document frequency is the same (always the case with
>>>> single term queries), so the score will be exactly the same for every
>>>> document. Why should the scoring be any different?
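
The point about tokens versus characters can be sketched in Python. This assumes the classic Lucene length norm of 1/sqrt(number of tokens), stored with very low precision; the truncation below is a simplified stand-in for that lossy storage (it ignores underflow and overflow handling), and whitespace splitting is only an approximation of the standard analyzer.

```python
import math
import struct


def quantized_length_norm(num_tokens):
    """Approximate the stored field norm for a field with num_tokens tokens.

    Simplified sketch: compute 1/sqrt(num_tokens) as a 32-bit float and keep
    only the sign, exponent and top two mantissa bits.
    """
    raw = 1.0 / math.sqrt(num_tokens)
    bits = struct.unpack(">I", struct.pack(">f", raw))[0]
    bits &= ~((1 << 21) - 1)  # drop the low 21 mantissa bits
    return struct.unpack(">f", struct.pack(">I", bits))[0]


names = ["Happy", "Be Happy", "Happy Ways", "Happy Together"]

# Whitespace splitting stands in for the analyzer here (an assumption).
norms = {name: quantized_length_norm(len(name.lower().split())) for name in names}

for name, norm in norms.items():
    print(f"{name!r}: {len(name)} chars, {len(name.split())} tokens, norm {norm}")
```

Every two-token title gets the same norm regardless of its character count, which matches the fieldNorm of 0.625 reported in the explain output quoted further down the thread.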
>>>>
>>>> Cheers,
>>>>
>>>> Ivan
>>>>
>>>>
>>>>
>>>> On Fri, Apr 4, 2014 at 10:31 PM, chee hoo lum <[email protected]> wrote:
>>>>
>>>>> Hi Ivan,
>>>>>
>>>>> Since I am not sure how an analyzer with stopwords can be set in the
>>>>> query itself, I tried setting stopwords to "_none_" via the index
>>>>> settings and the mapping:
>>>>>
>>>>> *Index settings: *
>>>>>
>>>>> {
>>>>>     "jdbc_dev": {
>>>>>         "settings": {
>>>>>             "index.analysis.analyzer.string_lowercase.filter": "lowercase",
>>>>>             "index.number_of_replicas": "1",
>>>>>             "index.analysis.analyzer.string_lowercase.tokenizer": "keyword",
>>>>>             "index.number_of_shards": "5",
>>>>>             "index.version.created": "900199",
>>>>>             "index.analysis.analyzer.standard.type": "standard",
>>>>>             "index.analysis.analyzer.standard.stopwords": "_none_"
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>>>>
>>>>> *Type Mapping :*
>>>>>
>>>>> {
>>>>>     "media": {
>>>>>         "properties": {
>>>>>             "AUDIO": {
>>>>>                 "type": "string"
>>>>>             },
>>>>>             ....
>>>>>             "DISPLAY_NAME": {
>>>>>                 "type": "string",
>>>>>                 "analyzer": "standard"
>>>>>             },
>>>>>             ....
>>>>>         }
>>>>>     }
>>>>> }
>>>>>
>>>>>
>>>>> *Query : *
>>>>>
>>>>> /media/_search?pretty=&search_type=dfs_query_then_fetch&preference=_primary
>>>>>
>>>>> {
>>>>>   "from" : 0,
>>>>>   "size" : 100,
>>>>>   "explain" : true,
>>>>>   "query" : {
>>>>>     "filtered" : {
>>>>>       "query" : {
>>>>>         "multi_match" : {
>>>>>           "query" : "happy",
>>>>>           "fields" : [ "DISPLAY_NAME" ]
>>>>>         }
>>>>>       },
>>>>>       "filter" : {
>>>>>         "query" : {
>>>>>           "bool" : {
>>>>>             "must" : {
>>>>>               "term" : {
>>>>>                 "CHANNEL_ID" : "1"
>>>>>               }
>>>>>             }
>>>>>           }
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>   }
>>>>> }
>>>>>
>>>>>
>>>>> *Result : *
>>>>>
>>>>> 1)
>>>>> "_shard": 4,
>>>>> "_node": "xsGVhtTnThaG57_mJdMtxg",
>>>>> "_index": "jdbc_dev",
>>>>> "_type": "media",
>>>>> "_id": "127413",
>>>>> "_score": 6.614289,
>>>>> "_source": {
>>>>>     "DISPLAY_NAME": "Be Happy",
>>>>>     ....
>>>>> },
>>>>> "_explanation": {
>>>>>     "value": 6.614289,
>>>>>     "description": "weight(DISPLAY_NAME:happy in 6485) [PerFieldSimilarity], result of:",
>>>>>     "details": [
>>>>>         {
>>>>>             "value": 6.614289,
>>>>>             "description": "fieldWeight in 6485, product of:",
>>>>>             "details": [
>>>>>                 {
>>>>>                     "value": 1,
>>>>>                     "description": "tf(freq=1.0), with freq of:",
>>>>>                     "details": [
>>>>>                         {
>>>>>                             "value": 1,
>>>>>                             "description": "termFreq=1.0"
>>>>>                         }
>>>>>                     ]
>>>>>                 },
>>>>>                 {
>>>>>                     "value": 10.582862,
>>>>>                     "description": "idf(docFreq=93, maxDocs=1364306)"
>>>>>                 },
>>>>>                 {
>>>>>                     "value": 0.625,
>>>>>                     "description": "fieldNorm(doc=6485)"
>>>>>                 }
>>>>>             ]
>>>>>         }
>>>>>     ]
>>>>> }
>>>>>
>>>>>
>>>>> 2)
>>>>> "_shard": 4,
>>>>> "_node": "UOjX2lxhR6mzfjHHmTm3cQ",
>>>>> "_index": "jdbc_dev",
>>>>> "_type": "media",
>>>>> "_id": "72253",
>>>>> "_score": 6.614289,
>>>>> "_source": {
>>>>>     "DISPLAY_NAME": "Happy Ways",
>>>>>     ....
>>>>> },
>>>>> "_explanation": {
>>>>>     "value": 6.614289,
>>>>>     "description": "weight(DISPLAY_NAME:happy in 1102) [PerFieldSimilarity], result of:",
>>>>>     "details": [
>>>>>         {
>>>>>             "value": 6.614289,
>>>>>             "description": "fieldWeight in 1102, product of:",
>>>>>             "details": [
>>>>>                 {
>>>>>                     "value": 1,
>>>>>                     "description": "tf(freq=1.0), with freq of:",
>>>>>                     "details": [
>>>>>                         {
>>>>>                             "value": 1,
>>>>>                             "description": "termFreq=1.0"
>>>>>                         }
>>>>>                     ]
>>>>>                 },
>>>>>                 {
>>>>>                     "value": 10.582862,
>>>>>                     "description": "idf(docFreq=93, maxDocs=1364306)"
>>>>>                 },
>>>>>                 {
>>>>>                     "value": 0.625,
>>>>>                     "description": "fieldNorm(doc=1102)"
>>>>>                 }
>>>>>             ]
>>>>>         }
>>>>>     ]
>>>>> }
>>>>>
>>>>>
>>>>> 3)
>>>>> "_shard": 4,
>>>>> "_node": "UOjX2lxhR6mzfjHHmTm3cQ",
>>>>> "_index": "jdbc_dev",
>>>>> "_type": "media",
>>>>> "_id": "127413",
>>>>> "_score": 6.614289,
>>>>> "_source": {
>>>>>     "DISPLAY_NAME": "Be Happy",
>>>>>     ....
>>>>> },
>>>>> "_explanation": {
>>>>>     "value": 6.614289,
>>>>>     "description": "weight(DISPLAY_NAME:happy in 7277) [PerFieldSimilarity], result of:",
>>>>>     "details": [
>>>>>         {
>>>>>             "value": 6.614289,
>>>>>             "description": "fieldWeight in 7277, product of:",
>>>>>             "details": [
>>>>>                 {
>>>>>                     "value": 1,
>>>>>                     "description": "tf(freq=1.0), with freq of:",
>>>>>                     "details": [
>>>>>                         {
>>>>>                             "value": 1,
>>>>>                             "description": "termFreq=1.0"
>>>>>                         }
>>>>>                     ]
>>>>>                 },
>>>>>                 {
>>>>>                     "value": 10.582862,
>>>>>                     "description": "idf(docFreq=93, maxDocs=1364306)"
>>>>>                 },
>>>>>                 {
>>>>>                     "value": 0.625,
>>>>>                     "description": "fieldNorm(doc=7277)"
>>>>>                 }
>>>>>             ]
>>>>>         }
>>>>>     ]
>>>>> }
>>>>>
>>>>>
>>>>> Notice that items 1, 2 and 3 all have the same score, 6.614289, even
>>>>> though the DISPLAY_NAME values differ:
>>>>> 1) Be Happy
>>>>> 2) Happy Ways
>>>>> 3) Be Happy
>>>>>
>>>>> It looks like the number of characters (the length) is not taken into
>>>>> consideration when the score is computed. I remember that somewhere the
>>>>> documentation indicates that by default the algorithm should give a
>>>>> higher score to documents that have shorter text in the searched field;
>>>>> however, this doesn't seem to be the case. Also, I didn't manually
>>>>> disable norms.
>>>>>
>>>>> Any suggestions on how I could work around this issue?
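
The identical scores can be reproduced from the numbers in the explain output above. The Python sketch below assumes the classic Lucene TF-IDF formulas, tf = sqrt(freq) and idf = 1 + ln(maxDocs / (docFreq + 1)); the function names are made up for this example, and the fieldNorm is copied from the output instead of being recomputed.

```python
import math


def classic_tf(freq):
    """Term-frequency factor under the assumed classic TF-IDF similarity."""
    return math.sqrt(freq)


def classic_idf(doc_freq, max_docs):
    """Inverse document frequency under the assumed classic TF-IDF similarity."""
    return 1.0 + math.log(max_docs / (doc_freq + 1))


# Values copied from the explain output quoted in this message.
tf = classic_tf(1.0)               # termFreq=1.0
idf = classic_idf(93, 1364306)     # idf(docFreq=93, maxDocs=1364306)
field_norm = 0.625                 # fieldNorm, identical for all three hits

# fieldWeight is the product of the three factors.
score = tf * idf * field_norm
print(f"idf={idf:.6f} score={score:.6f}")
```

None of the three factors depends on how many characters the title has, so "Be Happy" and "Happy Ways" end up with the same product.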
>>>>>
>>>>>
>>>>>
>>>>>  --
>>>> You received this message because you are subscribed to a topic in the
>>>> Google Groups "elasticsearch" group.
>>>> To unsubscribe from this topic, visit
>>>> https://groups.google.com/d/topic/elasticsearch/RXuuSlkDSyA/unsubscribe
>>>> .
>>>> To unsubscribe from this group and all its topics, send an email to
>>>> [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQCUB82B31DijLb9PNdrHmEzXP5JUWUepUp%3DDwSES9t%3DcQ%40mail.gmail.com
>>>> .
>>>>
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>
>>
>>
>>
>> --
>> Regards,
>>
>> Chee Hoo
>>
>



-- 
Regards,

Chee Hoo

