Re: Python Elasticsearch query not returning the expected results when running subsequent calls

G Kerekes Mon, 13 Jan 2014 06:24:50 -0800

Sorry Honza, I tried to anonymize my code somewhat, but I was not
consistent. Basically, I always query the same index, and my filter terms
are also consistent (original = original_amit2 and tns_survey_data =
survey_data).
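
One way to keep the filter terms from drifting apart between calls is to build the request body in a single place and reuse it for both the count (size 0) and the fetch (size 20). This is only a minimal sketch: `build_body` is a hypothetical helper, not part of the elasticsearch client, and the field name `file` and its values are taken from the code quoted in this thread.

```python
def build_body(file_values, size, occur="should"):
    """Build a search body with one term clause per value of the `file` field.

    `occur` is the bool clause to use ("must", "should" or "must_not"),
    mirroring the queries quoted in this thread.
    """
    clauses = [{"term": {"file": value}} for value in file_values]
    return {"query": {"bool": {occur: clauses}}, "size": size}


# Same query, different sizes: one body for the total, one for the documents.
count_body = build_body(["original"], size=0, occur="must")
fetch_body = build_body(["original"], size=20, occur="must")

# Assumed usage, following the call style of the quoted code:
# es.search('survey_data', 'primary', body=count_body)
print(count_body)
print(fetch_body)
```

Because both bodies come from the same function, a typo in a term or clause would show up in both calls rather than in just one of them.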



2014/1/13 G Kerekes <[email protected]>

> Just noticed some typos in my code, please see the fixed one below (the
> queried index and filter terms were not consistent)
>
>
> On Monday, January 13, 2014 2:08:25 PM UTC, G Kerekes wrote:
>
>> Hi Honza,
>>
>> This is my "full" code:
>>
>> from elasticsearch import Elasticsearch
>> import json
>> import pandas as pd
>> import numpy as np
>> import os
>>
>>
>>
>> ### create the connection to the ES
>> es = Elasticsearch("host:port", timeout=600, max_retries=10, revival_delay=0)
>>
>>
>> ############################################################
>> ####### READ IN THE ORIGINAL SURVEY DATA ###################
>> ############################################################
>>
>> origall = es.search('survey_data' ,'primary',
>>                    body = {"query":
>>                         {"bool":
>>                             {"must":
>>                                 [{
>>                                     "term": {"file": "original"}
>>                                 }]
>>                                 }
>>                         }
>>                         ,"size" : "0"}
>>                     )
>>
>> total_o = origall['hits']['total']
>>
>> origall_o = es.search('survey_data','primary',
>>                    body = {"query":
>>                         {"bool":
>>                             {"must":
>>                                 [{
>>                                     "term": {"file": "original"}
>>                                 }]
>>                                 }
>>                         }
>>                         ,"size" : 20
>>
>>                     }
>> )
>>
>>
>> ## pull out the list of hits (to be turned into a data frame)
>> orig_dict = origall_o['hits']['hits']
>>
>>
>> ############################################################
>> ####### READ IN THE NEW SURVEY DATA ########################
>> ############################################################
>>
>>
>> ### get the documents
>> newall = es.search('survey_data','primary',
>>                        {"query":
>>                            {
>>                           "bool":
>>                               {
>>                              "should":[
>>                                 {
>>                                    "term":{
>>                                       "file":"destinationqc22"
>>                                    }
>>                                 },
>>                                 {
>>                                    "term":{
>>                                       "file":"destinationqc33"
>>                                    }
>>                                 },
>>                                 {
>>                                    "term":{
>>                                       "file":"destinationqc44"
>>                                    }
>>                                 }
>>                              ]
>>                           }
>>                        }
>>                         ,"size" : "0"
>>                     }
>>  )
>>
>> total_n = newall['hits']['total']
>>
>> newall_n = es.search('survey_data','primary',
>>                        {"query":
>>                            {
>>                           "bool":
>>                               {
>>                              "should":[
>>                                 {
>>                                    "term":{
>>                                       "file":"destinationqc22"
>>                                    }
>>                                 },
>>                                 {
>>                                    "term":{
>>                                       "file":"destinationqc33"
>>                                    }
>>                                 },
>>                                 {
>>                                    "term":{
>>                                       "file":"destinationqc44"
>>                                    }
>>                                 }
>>                              ]
>>                           }
>>                        }
>>                         ,"size" : 20
>>                     }
>>  )
>>
>>
>> ## pull out the list of hits (to be turned into a data frame)
>> new_dict = newall_n['hits']['hits']
>>
>> ##
>>
>> print(origall_o)
>> print(newall_n)
>>
>> print orig_dict
>>
>> print new_dict
>>
>> And when I run it I get this:
>>
>> >>> print(origall_o)
>> {u'hits': {u'hits': [], u'total': 110950, u'max_score': 0.7038795},
>> u'_shards': {u'successful': 3, u'failed': 0, u'total': 3}, u'took': 15,
>> u'timed_out': False}
>> >>> print(newall_n)
>> {u'hits': {u'hits': [], u'total': 110950, u'max_score': 0.7038795},
>> u'_shards': {u'successful': 3, u'failed': 0, u'total': 3}, u'took': 15,
>> u'timed_out': False}
>> >>>
>> >>> print orig_dict
>> []
>> >>>
>> >>> print new_dict
>> []
>> >>>
>>
>>
>> And what I would expect is:
>> origall_o total is correct (110k hits)
>> newall_n total should be 84k, not sure why it has the same 110k as for
>> the origall_o
>>
>> And for the orig_dict and new_dict I would expect to see those 20
>> documents that I query.
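
The gap between the reported total and the documents actually returned can be checked with a small helper. This is a sketch under assumptions: `summarize` and `to_records` are hypothetical names, the response layout (`hits.total` as a plain number, each hit carrying `_source`) follows the output printed above, and the populated response is made up purely for illustration.

```python
def summarize(response):
    """Return the reported total next to the number of hits actually returned."""
    hits = response["hits"]
    return {"total": hits["total"], "returned": len(hits["hits"])}


def to_records(response):
    """Collect the `_source` document of every returned hit."""
    return [hit["_source"] for hit in response["hits"]["hits"]]


# Response copied from the output above: a total is reported, but no hits.
empty = {"hits": {"hits": [], "total": 110950, "max_score": 0.7038795}}

# Made-up response, only to illustrate the shape of a populated result.
populated = {
    "hits": {
        "total": 2,
        "hits": [
            {"_id": "1", "_source": {"file": "original"}},
            {"_id": "2", "_source": {"file": "original"}},
        ],
    }
}

print(summarize(empty))
print(to_records(populated))
```

A list of records like the one from `to_records` is the kind of input that could then be handed to `pandas.DataFrame`.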
>>
>> Many thanks for your help.
>>
>>
>> Geza
>>
>>
>>
>> On Monday, January 13, 2014 12:16:53 PM UTC, Honza Král wrote:
>>>
>>> Hi Geza,
>>>
>>> I don't understand what you mean by re-running, can you post the
>>> complete code?
>>>
>>> When you do a search with size: 20, can you just print the result of
>>> the search method and see if that data is there?
>>>
>>> As a side note, it looks like you are trying to filter out some data.
>>> While this works with a query, you will get much better performance
>>> when using a filtered query with a filter instead of a query.
>>>
>>> Honza
>>>
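
A sketch of what the filtered-query suggestion might look like as a request body. It assumes the `filtered` query available in the Elasticsearch versions current at the time of this thread (later major versions removed it in favour of a `bool` query with a `filter` clause). `filtered_body` is a hypothetical helper, and the field and value are borrowed from the code in this thread.

```python
def filtered_body(field, value, size):
    """Wrap a term filter in a `filtered` query that matches all documents."""
    return {
        "query": {
            "filtered": {
                "query": {"match_all": {}},
                "filter": {"term": {field: value}},
            }
        },
        "size": size,
    }


body = filtered_body("file", "original", 20)

# Assumed usage, following the call style of the quoted code:
# es.search('survey_data', 'primary', body=body)
print(body)
```

The term condition moves from the scoring part of the request into the filter, which is the change the side note describes.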
>>> On Mon, Jan 13, 2014 at 10:38 AM, G Kerekes <[email protected]> wrote:
>>> > Hello,
>>> >
>>> > I am querying an elasticsearch index from python. Issue 1 is that when I
>>> > change my query and rerun it, my objects in Python don't get refreshed
>>> > according to my modified query. Issue 2 is that even if I see that I got
>>> > some hits, no data comes through at all (eg I see I've got 85k hits, but
>>> > when I put it in a dictionary, it is blank).
>>> >
>>> > from elasticsearch import Elasticsearch
>>> >
>>> > es = Elasticsearch("host:port", timeout=600, max_retries=10,
>>> > revival_delay=0)
>>> >
>>> >
>>> > origall = es.search('esdata' ,'primary',
>>> >                 {"query":
>>> >                     {"bool":
>>> >                         {"must_not":
>>> >                             [{
>>> >                                 "term": {"file": "original"}
>>> >                             }]
>>> >                             }
>>> >                     }
>>> >                     ,"size" : "0"}
>>> >                 )
>>> >
>>> > total_o = origall['hits']['total']
>>> >
>>> > At this stage for total_o I get 110k, which is correct. Then I rerun my
>>> > query after changing the size=0 to size=20, and if I want to have a look at
>>> > these 20 hits, I get nothing for this:
>>> >
>>> > orig = origall['hits']['hits']
>>> > print(orig)
>>> >
>>> > Then I go back to my original query and change the must_not to must. In this
>>> > way I should get 85k hits, but after rerunning it I still get 110k in
>>> > total_o.
>>> >
>>> > It is quite random when it works and when it doesn't. Sometimes I get my
>>> > expected 85k hits, but then this gets stuck and when I change my query back
>>> > to get the 110k, it would still be 85k. Also sometimes I get data in my orig
>>> > = origall['hits']['hits'], but then let's say I change the size in my query
>>> > to 0, rerun it and the origall['hits']['hits'] will still give me back the
>>> > data.
>>> >
>>> > I use Anaconda, but tried also in Pycharm and the default Python IDLE, these
>>> > behave the same. Tried to create separate ES connections for all my queries,
>>> > doesn't help. Played around with cache, but no luck.
>>> >
>>> > I'm running it on a 64 bit, Windows 7 machine.
>>> >
>>> > Any idea what I'm doing wrong? Many thanks,
>>> >
>>> > Geza
>>> >
>>> > --
>>> > You received this message because you are subscribed to the Google Groups
>>> > "elasticsearch" group.
>>> > To unsubscribe from this group and stop receiving emails from it, send an
>>> > email to [email protected].
>>> > To view this discussion on the web visit
>>> > https://groups.google.com/d/msgid/elasticsearch/adf4f92a-59f3-4189-ab87-8a2c13de7022%40googlegroups.com.
>>> > For more options, visit https://groups.google.com/groups/opt_out.
>>>

