Hi, I have updated the gist with a file in bulk index format. I also split the loading phase from the testing phase, so you can run the test multiple times in a row, and I added a README.md that explains how to run the test.
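For readers unfamiliar with the bulk format: each document is preceded by an action line, and every line is a single JSON object. Below is a minimal sketch of that conversion, not the script from the gist. The index name `aggsbug` and the type name `doc` are placeholders chosen for illustration, and the sample document only borrows the `actor.displayName` field mentioned in the thread.

```python
import json


def to_bulk_lines(docs, index, doc_type):
    """Yield bulk-format lines: one action line, then one source line, per document."""
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    for doc in docs:
        yield action
        # json.dumps escapes embedded quotes and newlines,
        # so each document stays on a single line.
        yield json.dumps(doc)


def bulk_body(docs, index, doc_type):
    """Join the lines into one request body; the bulk API expects a trailing newline."""
    return "\n".join(to_bulk_lines(docs, index, doc_type)) + "\n"


if __name__ == "__main__":
    # Hypothetical sample document, for illustration only.
    sample = [{"actor": {"displayName": "totaltrafficbos"}, "body": 'first line\nsays "hi"'}]
    print(bulk_body(sample, "aggsbug", "doc"), end="")
```

The resulting text can be written to a file and posted to the `_bulk` endpoint, for example with `curl --data-binary @file`.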
I'm also creating a bug report, as requested here: http://www.elasticsearch.org/blog/0-90-11-1-0-0-rc2-released/.

On Wednesday, February 5, 2014 9:49:40 AM UTC+1, Jörg Prante wrote:
>
> Sorry, but your file at https://gist.github.com/8803745.git is broken,
> it contains invalid JSON, so it can not be processed.
>
> It would be helpful to provide a script with escaped JSON in bulk format.
>
> From what I suspect, you do not use keyword analyzer for faceting/agg'ing,
> so you will get all kinds of unwanted results. If that explains your
> fluctuating aggs results, I can not tell. It is rather uncommon to use
> "facets" and "aggs" side by side.
>
> Jörg
>
>
>
> On Tue, Feb 4, 2014 at 3:01 PM, Nils Dijk <[email protected]> wrote:
>
>> To follow up,
>>
>> I have a contained test suite at https://gist.github.com/thanodnl/8803745 for
>> this problem. It contains two files:
>>
>> 1. aggsbug.sh
>> 2. aggsbug.json
>>
>> The .json file contains ~1M newline-separated documents to load into the
>> database; I was not able to create a curl request to load them directly
>> into the index.
>> The .sh file (https://gist.github.com/thanodnl/8803745/raw/aggsbug.sh)
>> contains the instructions for recreating this behavior.
>>
>> I have run these against the following versions:
>>
>> 1. 1.0.0.Beta2
>> 2. 1.0.0.RC1
>> 3. 1.0.0-SNAPSHOT as compiled from the git 1.0 branch on commit
>> 0f8b41ffad9b5ecdfd543d7c73edcf404e6fc763
>>
>> When run on 1.0.0.Beta2 it gives the same output consistently when I run
>> the _search over and over again.
>> When run on 1.0.0.RC1 it gives me multiple different outcomes,
>> comparable to the numbers I posted earlier in the thread.
>> When run on 1.0.0-SNAPSHOT it behaves the same as on 1.0.0.RC1.
>>
>> That it still was working on 1.0.0.Beta2 proves to me that it is a bug
>> that got into RC1. I could not find any related ticket on the issues page
>> of the github repository. Hopefully this is enough information to recreate
>> the problem.
>>
>> The json file is quite big and could cause problems when you open the gist
>> in a browser. A local clone of the gist will work best:
>> $ git clone https://gist.github.com/8803745.git
>>
>> I do not really know how to move on from here. Do you want me to open an
>> issue for this problem at github.com/elasticsearch/elasticsearch? It
>> would be nice to fix this problem before a release of 1.0.0 since that is
>> the first release containing the aggregations for analytics.
>>
>> On Tuesday, February 4, 2014 12:31:10 PM UTC+1, Nils Dijk wrote:
>>
>>> I've loaded the same dataset in ES 1.0.0.Beta2 with the same index
>>> configuration as in the topic start.
>>>
>>> However, now the numbers are consistent if I call the same aggregation
>>> multiple times in a row AND the numbers match the numbers of the facets.
>>> This leads me to the conclusion that something broke between Beta2 and RC1!
>>>
>>> I would like to test this on master, but I could not find any nightly
>>> builds of elasticsearch. Is there a location where they are stored or
>>> should I compile it myself?
>>>
>>> On Friday, January 31, 2014 6:43:07 PM UTC+1, Nils Dijk wrote:
>>>>
>>>> Hi Binh Ly,
>>>>
>>>> Thanks for the response.
>>>>
>>>> I'm aware that the numbers are not exact (hence the link to issue #1305
>>>> in my initial post), and have been advocating slightly incorrect numbers
>>>> with my colleagues and customers for some time already to prepare them for
>>>> the moment we provide analytics with ES. But what bothers me is that they
>>>> are *inconsistent*.
>>>>
>>>> If you look at my gist you see that I ran the same aggs 3 times right
>>>> after each other. If we just look at the top item we see the following
>>>> results:
>>>>
>>>> 1. { "key": "totaltrafficbos", "doc_count": 2880 }
>>>> 2. { "key": "totaltrafficbos", "doc_count": 2552 }
>>>> 3. { "key": "totaltrafficbos", "doc_count": 2179 }
>>>>
>>>> These results are taken within seconds without any change to the number of
>>>> documents in the index. If I run them even more you see that it rotates
>>>> between a handful of numbers. Is this also behavior one would expect
>>>> from the aggs? And if so, why do the facets show the same number over and
>>>> over again?
>>>>
>>>> Anyway, I will try to work myself through the aggs code this weekend to
>>>> get a better hang of what we could do with it, and what not.
>>>>
>>>> -- Nils
>>>>
>>>> On Friday, January 31, 2014 6:18:43 PM UTC+1, Binh Ly wrote:
>>>>>
>>>>> Nils,
>>>>>
>>>>> This is just the nature of splitting data around in shards. Actually
>>>>> the terms facet has the same limitations (i.e. it will also give
>>>>> "approximate counts"). Neither the terms facet nor the terms aggregation
>>>>> is better or worse than the other - they are both approximations (using
>>>>> different implementations). It is correct that if you put all your data
>>>>> in 1 shard, then all the counts are exact. If you need to shard, you can
>>>>> increase the "shard_size" parameter inside the terms aggregation to
>>>>> "improve accuracy". Play with that number until it suits your purposes,
>>>>> but the important thing is that they are just approximations the more
>>>>> documents you have in the index - so just don't expect absolute numbers
>>>>> from them if you have more than 1 shard.
>>>>>
>>>>> {
>>>>>   "size": 0,
>>>>>   "aggs": {
>>>>>     "a": {
>>>>>       "terms": {
>>>>>         "field": "actor.displayName",
>>>>>         "shard_size": 10000
>>>>>       }
>>>>>     }
>>>>>   }
>>>>> }
>>>>>
>>>> --
>> You received this message because you are subscribed to the Google Groups
>> "elasticsearch" group.
>> To unsubscribe from this group and stop receiving emails from it, send an
>> email to [email protected].
>> To view this discussion on the web visit
>> https://groups.google.com/d/msgid/elasticsearch/fb421a29-8923-4188-9363-03682fec71ab%40googlegroups.com.
>>
>> For more options, visit https://groups.google.com/groups/opt_out.
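
As a footnote to Binh Ly's explanation quoted above, here is a toy model of why per-shard top-N lists produce approximate counts and why a larger "shard_size" helps. It is only a sketch under simplifying assumptions: the function name `terms_agg` and the two hand-made shards are invented for illustration, and this is not how Elasticsearch implements the aggregation. The model is deterministic, so it illustrates the approximation only, not the run-to-run fluctuation reported in this thread.

```python
from collections import Counter


def terms_agg(shards, size, shard_size):
    """Merge each shard's top `shard_size` term counts and return the overall top `size`."""
    merged = Counter()
    for shard in shards:
        # Each shard reports only its own top `shard_size` terms;
        # anything below that cut-off is invisible to the merge step.
        for term, count in Counter(shard).most_common(shard_size):
            merged[term] += count
    return merged.most_common(size)


# Two hypothetical shards. True totals: a=6, b=8, c=6.
shards = [
    ["a"] * 5 + ["b"] * 4 + ["c"] * 1,
    ["c"] * 5 + ["b"] * 4 + ["a"] * 1,
]

# With shard_size=2, each shard drops its rarest term, so "a" and "c" are undercounted.
print(sorted(terms_agg(shards, size=3, shard_size=2)))  # [('a', 5), ('b', 8), ('c', 5)]

# With shard_size=3, every shard reports all of its terms and the counts are exact.
print(sorted(terms_agg(shards, size=3, shard_size=3)))  # [('a', 6), ('b', 8), ('c', 6)]
```

Because the same input always yields the same output here, an approximation of this kind would not by itself explain three different counts for the same query on an unchanged index.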
