Thanks, Ivan!

On Tue, Apr 8, 2014 at 1:09 PM, Ivan Brusic <[email protected]> wrote:
> I do not think most users would expect the results in that order. The
> character length does not provide relevance for most cases. Why is a
> shorter word more relevant? I would say that most would rank "Happy
> Together" higher since word proximity is a helpful metric. Happy should
> rank first due to the length norm.
>
> You can always play around with the function score, but I would rather
> deal with non-dynamic metrics at indexing time.
>
> --
> Ivan
>
>
> On Mon, Apr 7, 2014 at 8:23 AM, chee hoo lum <[email protected]> wrote:
>
>> Hi Ivan,
>>
>> Hmm... This seems like a viable workaround, but I wanted to know whether
>> there are any other ways to do it. I guess this isn't a unique problem,
>> as most users will expect search results to be sorted by similarity in
>> the following order:
>>
>> 1. Happy
>> 2. Be Happy
>> 3. Be Happy
>> 4. Happy Together
>>
>> It is live data in production. I have 180k documents residing in 5 shards
>> across 5 nodes, with one replica each. Even with 180k documents I still
>> have this similarity order issue, coupled with an inconsistency issue
>> because results are fetched from the primary and the replica
>> intermittently. Therefore I need to use
>> /media/_search?pretty=&search_type=dfs_query_then_fetch&preference=_primary
>> to solve the inconsistency, and am now left with this sorting to solve.
>>
>> Thanks.
>>
>>
>>
>> On Mon, Apr 7, 2014 at 7:13 AM, Ivan Brusic <[email protected]> wrote:
>>
>>> You can index the number of characters in your string into a new field
>>> and then do a secondary sort on this field.
>>>
>>> Are you testing against real data or only against some test set? The
>>> Lucene scoring model will improve with the addition of more documents. As
>>> more documents are added, the term frequencies and inverse document
>>> frequencies start to diverge and contribute more to the scoring.
>>> You will not have many documents with the same score.
>>>
>>> --
>>> Ivan
>>>
>>>
>>> On Sun, Apr 6, 2014 at 12:38 AM, <[email protected]> wrote:
>>>
>>>> Hi Ivan,
>>>>
>>>> Because I wanted the similar results sorted in this way:
>>>>
>>>> 1. Be happy
>>>> 2. Be happy
>>>> 3. Happy ways
>>>>
>>>> Currently they are sorted:
>>>> 1. Be happy
>>>> 2. Happy ways
>>>> 3. Be happy
>>>>
>>>> This is because they all return the same score. Any suggestions?
>>>>
>>>> Thanks
>>>>
>>>> On 6 Apr, 2014, at 4:24 am, Ivan Brusic <[email protected]> wrote:
>>>>
>>>> Lucene will indeed, by default, give a higher score to shorter text,
>>>> but the "shortness" is the number of tokens, not the number of characters.
>>>> In your last example, each field has two tokens, so the length is the same.
>>>> The term frequency is also the same for each document ("happy" appears
>>>> once) and the inverse document frequency is the same (always the case with
>>>> single term queries), so the score will be exactly the same for every
>>>> document. Why should the scoring be any different?
>>>>
>>>> Cheers,
>>>>
>>>> Ivan
>>>>
>>>>
>>>>
>>>> On Fri, Apr 4, 2014 at 10:31 PM, chee hoo lum <[email protected]> wrote:
>>>>
>>>>> Hi Ivan,
>>>>>
>>>>> I am not sure how an analyzer with stopwords can be set in the query
>>>>> itself.
>>>>> I tried to set stopwords="_none_" via the index settings and the
>>>>> type mapping:
>>>>>
>>>>> *Index settings:*
>>>>>
>>>>> {
>>>>>   "jdbc_dev": {
>>>>>     "settings": {
>>>>>       "index.analysis.analyzer.string_lowercase.filter": "lowercase",
>>>>>       "index.number_of_replicas": "1",
>>>>>       "index.analysis.analyzer.string_lowercase.tokenizer": "keyword",
>>>>>       "index.number_of_shards": "5",
>>>>>       "index.version.created": "900199",
>>>>>       "index.analysis.analyzer.standard.type": "standard",
>>>>>       "index.analysis.analyzer.standard.stopwords": "_none_"
>>>>>     }
>>>>>   }
>>>>> }
>>>>>
>>>>>
>>>>> *Type mapping:*
>>>>>
>>>>> {
>>>>>   "media": {
>>>>>     "properties": {
>>>>>       "AUDIO": {
>>>>>         "type": "string"
>>>>>       },
>>>>>       ....
>>>>>       "DISPLAY_NAME": {
>>>>>         "type": "string",
>>>>>         "analyzer": "standard"
>>>>>       },
>>>>>       ....
>>>>>     }
>>>>>   }
>>>>> }
>>>>>
>>>>>
>>>>> *Query:*
>>>>>
>>>>> /media/_search?pretty=&search_type=dfs_query_then_fetch&preference=_primary
>>>>>
>>>>> {
>>>>>   "from": 0,
>>>>>   "size": 100,
>>>>>   "explain": true,
>>>>>   "query": {
>>>>>     "filtered": {
>>>>>       "query": {
>>>>>         "multi_match": {
>>>>>           "query": "happy",
>>>>>           "fields": [ "DISPLAY_NAME" ]
>>>>>         }
>>>>>       },
>>>>>       "filter": {
>>>>>         "query": {
>>>>>           "bool": {
>>>>>             "must": {
>>>>>               "term": {
>>>>>                 "CHANNEL_ID": "1"
>>>>>               }
>>>>>             }
>>>>>           }
>>>>>         }
>>>>>       }
>>>>>     }
>>>>>   }
>>>>> }
>>>>>
>>>>>
>>>>> *Result:*
>>>>>
>>>>> 1)
>>>>> "_shard": 4,
>>>>> "_node": "xsGVhtTnThaG57_mJdMtxg",
>>>>> "_index": "jdbc_dev",
>>>>> "_type": "media",
>>>>> "_id": "127413",
>>>>> "_score": 6.614289,
>>>>> "_source": {
>>>>>   "DISPLAY_NAME": "Be Happy",
>>>>>   ....
>>>>> },
>>>>> "_explanation": {
>>>>>   "value": 6.614289,
>>>>>   "description": "weight(DISPLAY_NAME:happy in 6485) [PerFieldSimilarity], result of:",
>>>>>   "details": [
>>>>>     {
>>>>>       "value": 6.614289,
>>>>>       "description": "fieldWeight in 6485, product of:",
>>>>>       "details": [
>>>>>         {
>>>>>           "value": 1,
>>>>>           "description": "tf(freq=1.0), with freq of:",
>>>>>           "details": [
>>>>>             { "value": 1, "description": "termFreq=1.0" }
>>>>>           ]
>>>>>         },
>>>>>         { "value": 10.582862, "description": "idf(docFreq=93, maxDocs=1364306)" },
>>>>>         { "value": 0.625, "description": "fieldNorm(doc=6485)" }
>>>>>       ]
>>>>>     }
>>>>>   ]
>>>>> }
>>>>>
>>>>>
>>>>> 2)
>>>>> "_shard": 4,
>>>>> "_node": "UOjX2lxhR6mzfjHHmTm3cQ",
>>>>> "_index": "jdbc_dev",
>>>>> "_type": "media",
>>>>> "_id": "72253",
>>>>> "_score": 6.614289,
>>>>> "_source": {
>>>>>   "DISPLAY_NAME": "Happy Ways",
>>>>>   ....
>>>>> },
>>>>> "_explanation": {
>>>>>   "value": 6.614289,
>>>>>   "description": "weight(DISPLAY_NAME:happy in 1102) [PerFieldSimilarity], result of:",
>>>>>   "details": [
>>>>>     {
>>>>>       "value": 6.614289,
>>>>>       "description": "fieldWeight in 1102, product of:",
>>>>>       "details": [
>>>>>         {
>>>>>           "value": 1,
>>>>>           "description": "tf(freq=1.0), with freq of:",
>>>>>           "details": [
>>>>>             { "value": 1, "description": "termFreq=1.0" }
>>>>>           ]
>>>>>         },
>>>>>         { "value": 10.582862, "description": "idf(docFreq=93, maxDocs=1364306)" },
>>>>>         { "value": 0.625, "description": "fieldNorm(doc=1102)" }
>>>>>       ]
>>>>>     }
>>>>>   ]
>>>>> }
>>>>>
>>>>>
>>>>> 3)
>>>>> "_shard": 4,
>>>>> "_node": "UOjX2lxhR6mzfjHHmTm3cQ",
>>>>> "_index": "jdbc_dev",
>>>>> "_type": "media",
>>>>> "_id": "127413",
>>>>> "_score": 6.614289,
>>>>> "_source": {
>>>>>   "DISPLAY_NAME": "Be Happy",
>>>>>   ....
>>>>> },
>>>>> "_explanation": {
>>>>>   "value": 6.614289,
>>>>>   "description": "weight(DISPLAY_NAME:happy in 7277) [PerFieldSimilarity], result of:",
>>>>>   "details": [
>>>>>     {
>>>>>       "value": 6.614289,
>>>>>       "description": "fieldWeight in 7277, product of:",
>>>>>       "details": [
>>>>>         {
>>>>>           "value": 1,
>>>>>           "description": "tf(freq=1.0), with freq of:",
>>>>>           "details": [
>>>>>             { "value": 1, "description": "termFreq=1.0" }
>>>>>           ]
>>>>>         },
>>>>>         { "value": 10.582862, "description": "idf(docFreq=93, maxDocs=1364306)" },
>>>>>         { "value": 0.625, "description": "fieldNorm(doc=7277)" }
>>>>>       ]
>>>>>     }
>>>>>   ]
>>>>> }
>>>>>
>>>>>
>>>>> Notice that items 1, 2 and 3 all have the same score, *6.614289*, even
>>>>> though the DISPLAY_NAME values differ:
>>>>> 1) Be Happy
>>>>> 2) Happy Ways
>>>>> 3) Be Happy
>>>>>
>>>>> It looks like the number of characters (the length) is not taken into
>>>>> consideration when the score is computed. I remember the documentation
>>>>> saying somewhere that, by default, the algorithm should give a higher
>>>>> score to documents that have shorter text in the searched field, but
>>>>> this doesn't seem to be the case. Also, I didn't manually disable norms.
>>>>>
>>>>> Any suggestions on how I could work around this issue?
>>>>>
>>>>>
>>>>>
>>>>> --
>>>> You received this message because you are subscribed to a topic in the
>>>> Google Groups "elasticsearch" group.
>>>> To unsubscribe from this topic, visit
>>>> https://groups.google.com/d/topic/elasticsearch/RXuuSlkDSyA/unsubscribe.
>>>> To unsubscribe from this group and all its topics, send an email to
>>>> [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/elasticsearch/CALY%3DcQCUB82B31DijLb9PNdrHmEzXP5JUWUepUp%3DDwSES9t%3DcQ%40mail.gmail.com.
>>>>
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>>> --
>>>> You received this message because you are subscribed to the Google
>>>> Groups "elasticsearch" group.
>>>> To unsubscribe from this group and stop receiving emails from it, send
>>>> an email to [email protected].
>>>> To view this discussion on the web visit
>>>> https://groups.google.com/d/msgid/elasticsearch/589762DE-B343-470F-AC1D-C78119FCFB04%40gmail.com.
>>>>
>>>> For more options, visit https://groups.google.com/d/optout.
>>>>
>>
>> --
>> Regards,
>>
>> Chee Hoo
>>

--
Regards,

Chee Hoo

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to [email protected].
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/CAGS0%2Bg-TX7F8KLCkbE1-1W6G_hJfeG7XSHorW%2B_6wkQtx8GKhw%40mail.gmail.com.
For more options, visit https://groups.google.com/d/optout.
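
A minimal sketch of the workaround Ivan suggests in the thread: index the character count of the name into its own field, then use it as a secondary sort after `_score`. The field name `DISPLAY_NAME_LENGTH` and all helper names are assumptions made for illustration and appear nowhere in the thread. The request body is a simplified form of the query quoted in the thread (the filter is reduced to a plain term filter). `rank_hits` only imitates locally the ordering the sort clause is expected to produce, so the idea can be checked without a cluster.

```python
import json

# Hypothetical field name holding the character count; not from the thread.
LENGTH_FIELD = "DISPLAY_NAME_LENGTH"


def add_length_field(doc):
    """Return a copy of doc with the character count of DISPLAY_NAME added.

    Meant to run at indexing time, before the document is sent to the index.
    """
    enriched = dict(doc)
    enriched[LENGTH_FIELD] = len(doc["DISPLAY_NAME"])
    return enriched


def build_search_body(text, channel_id):
    """Build a search request: relevance first, shorter names break ties."""
    return {
        "from": 0,
        "size": 100,
        "query": {
            "filtered": {
                "query": {
                    "multi_match": {"query": text, "fields": ["DISPLAY_NAME"]}
                },
                "filter": {"term": {"CHANNEL_ID": channel_id}},
            }
        },
        "sort": [
            {"_score": {"order": "desc"}},    # primary: relevance score
            {LENGTH_FIELD: {"order": "asc"}},  # secondary: fewer characters first
        ],
    }


def rank_hits(hits):
    """Imitate the sort clause locally: score descending, then length ascending."""
    return sorted(
        hits, key=lambda h: (-h["_score"], h["_source"][LENGTH_FIELD])
    )


if __name__ == "__main__":
    # The three tied hits from the thread, all scored 6.614289.
    names = ["Be Happy", "Happy Ways", "Be Happy"]
    hits = [
        {"_score": 6.614289, "_source": add_length_field({"DISPLAY_NAME": n})}
        for n in names
    ]
    for hit in rank_hits(hits):
        print(hit["_source"]["DISPLAY_NAME"])
    print(json.dumps(build_search_body("happy", "1")["sort"]))
```

The secondary key only matters when scores tie exactly, which is the case for the three hits in the thread (same tf, idf and fieldNorm). With equal scores the two "Be Happy" entries (8 characters) sort ahead of "Happy Ways" (10 characters), which is the order asked for in the thread.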
