Sorry, but your file at https://gist.github.com/8803745.git is broken: it contains invalid JSON, so it cannot be processed.
It would be helpful to provide a script with escaped JSON in bulk format.

From what I suspect, you do not use the keyword analyzer for faceting/agg'ing, so you will get all kinds of unwanted results. Whether that explains your fluctuating aggs results, I cannot tell.

It is rather uncommon to use "facets" and "aggs" side by side.

Jörg

On Tue, Feb 4, 2014 at 3:01 PM, Nils Dijk <[email protected]> wrote:

> To follow up,
>
> I have a contained test suite at https://gist.github.com/thanodnl/8803745
> for this problem. It contains two files:
>
>    1. aggsbug.sh
>    2. aggsbug.json
>
> The .json file contains ~1M newline-separated documents to load into the
> database; I was not able to create a curl request to load them directly
> into the index.
> The .sh file (https://gist.github.com/thanodnl/8803745/raw/aggsbug.sh)
> contains the instructions for recreating this behavior.
>
> I have run these against the following versions:
>
>    1. 1.0.0.Beta2
>    2. 1.0.0.RC1
>    3. 1.0.0-SNAPSHOT as compiled from the git 1.0 branch on commit
>    0f8b41ffad9b5ecdfd543d7c73edcf404e6fc763
>
> When run on 1.0.0.Beta2 it gives the same output consistently when I run
> the _search over and over again.
> When run on 1.0.0.RC1 it gives me multiple different outcomes
> comparable to the numbers I posted earlier in the thread.
> When run on 1.0.0-SNAPSHOT it behaves the same as on 1.0.0.RC1.
>
> That it was still working on 1.0.0.Beta2 proves to me that it is a bug
> that got into RC1. I could not find any related ticket on the issues page
> of the github repository. Hopefully this is enough information to recreate
> the problem.
>
> The json file is quite big and could cause problems when you open the gist
> in a browser. A local clone of the gist will work best:
> $ git clone https://gist.github.com/8803745.git
>
> I do not really know how to move on from here. Do you want me to open an
> issue for this problem at github.com/elasticsearch/elasticsearch?
> It would be nice to fix this problem before a release of 1.0.0 since that is
> the first release containing the aggregations for analytics.
>
> On Tuesday, February 4, 2014 12:31:10 PM UTC+1, Nils Dijk wrote:
>
>> I've loaded the same dataset in ES 1.0.0.Beta2 with the same index
>> configuration as in the topic start.
>>
>> However, now the numbers are consistent if I call the same aggregation
>> multiple times in a row AND the numbers match the numbers of the facets.
>> This leads me to the conclusion that something broke between Beta2 and RC1!
>>
>> I would like to test this on master, but I could not find any nightly
>> builds of elasticsearch. Is there a location where they are stored or
>> should I compile it myself?
>>
>> On Friday, January 31, 2014 6:43:07 PM UTC+1, Nils Dijk wrote:
>>>
>>> Hi Binh Ly,
>>>
>>> Thanks for the response.
>>>
>>> I'm aware that the numbers are not exact (hence the link to issue #1305
>>> in my initial post), and have been advocating slightly incorrect numbers
>>> with my colleagues and customers for some time already to prepare them for
>>> the moment we provide analytics with ES. But what bothers me is that they
>>> are *inconsistent*.
>>>
>>> If you look at my gist you see that I ran the same aggs 3 times right
>>> after each other. If we just look at the top item we see the following
>>> results:
>>>
>>>    1. { "key": "totaltrafficbos", "doc_count": 2880 }
>>>    2. { "key": "totaltrafficbos", "doc_count": 2552 }
>>>    3. { "key": "totaltrafficbos", "doc_count": 2179 }
>>>
>>> These results were taken within seconds without any change to the number of
>>> documents in the index. If I run them even more you see that it rotates
>>> between a handful of numbers. Is this also behavior one would expect from
>>> the aggs? And if so, why do the facets show the same number over and over
>>> again?
>>>
>>> Anyway, I will try to work my way through the aggs code this weekend to get
>>> a better grasp of what we could do with it, and what not.
>>>
>>> -- Nils
>>>
>>> On Friday, January 31, 2014 6:18:43 PM UTC+1, Binh Ly wrote:
>>>>
>>>> Nils,
>>>>
>>>> This is just the nature of splitting data around in shards. Actually
>>>> the terms facet has the same limitations (i.e. it will also give
>>>> "approximate counts"). Neither the terms facet nor the terms aggregation is
>>>> better or worse than the other - they are both approximations (using
>>>> different implementations). It is correct that if you put all your data in
>>>> 1 shard, then all the counts are exact. If you need to shard, you can
>>>> increase the "shard_size" parameter inside the terms aggregation to
>>>> "improve accuracy". Play with that number until it suits your purposes, but
>>>> the important thing is that they are just approximations, more so the more
>>>> documents you have in the index - so just don't expect absolute numbers
>>>> from them if you have more than 1 shard.
>>>>
>>>> {
>>>>   "size": 0,
>>>>   "aggs": {
>>>>     "a": {
>>>>       "terms": {
>>>>         "field": "actor.displayName",
>>>>         "shard_size": 10000
>>>>       }
>>>>     }
>>>>   }
>>>> }
>>>>
>>> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/fb421a29-8923-4188-9363-03682fec71ab%40googlegroups.com
> For more options, visit https://groups.google.com/groups/opt_out.
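
Binh Ly's point about per-shard approximation and "shard_size" can be illustrated with a small simulation. This is only a sketch under stated assumptions, not Elasticsearch's actual implementation: the function name `sharded_top_terms`, the three made-up shards and the tie-breaking rule are all hypothetical. Each shard reports only its own top `shard_size` terms, and the coordinating step sums whatever it receives.

```python
from collections import Counter


def sharded_top_terms(shards, size, shard_size):
    """Merge per-shard top lists (hypothetical model of a terms aggregation).

    Each shard reports only its `shard_size` most frequent terms; the
    reported counts are summed and the global top `size` is returned.
    Ties are broken by term name so the output is deterministic.
    """
    merged = Counter()
    for shard in shards:
        counts = Counter(shard)
        top = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:shard_size]
        for term, n in top:
            merged[term] += n
    return sorted(merged.items(), key=lambda kv: (-kv[1], kv[0]))[:size]


# Made-up data: three shards, one term per document.
shards = [
    ["a"] * 5 + ["b"] * 4 + ["c"] * 3,
    ["c"] * 5 + ["b"] * 4 + ["a"] * 1,
    ["b"] * 5 + ["c"] * 4 + ["a"] * 3,
]

# Every shard reports all of its terms, so the counts are exact.
exact = sharded_top_terms(shards, size=3, shard_size=3)
# Every shard drops its least frequent term, so "a" and "c" are under-counted.
approx = sharded_top_terms(shards, size=3, shard_size=2)

print(exact)   # [('b', 13), ('c', 12), ('a', 9)]
print(approx)  # [('b', 13), ('c', 9), ('a', 5)]
```

In this sketch the error is identical on every run over unchanged data, which is the distinction Nils draws between counts that are inexact and counts that are inconsistent.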

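
Jörg's request for "escaped JSON in bulk format" could be met with a small converter along these lines. It is a minimal sketch, assuming the gist's aggsbug.json holds one JSON document per line; the index name "aggsbug", the type name "doc" and the helper `to_bulk_lines` are hypothetical and not taken from the thread. The bulk API expects an action line followed by the document source, each on its own line, with a trailing newline at the end of the body.

```python
import json


def to_bulk_lines(doc_lines, index, doc_type):
    """Yield bulk-API lines: an action line, then the document source."""
    # Hypothetical index/type names are passed in by the caller.
    action = json.dumps({"index": {"_index": index, "_type": doc_type}})
    for raw in doc_lines:
        raw = raw.strip()
        if not raw:
            continue  # skip blank lines between documents
        doc = json.loads(raw)  # raises ValueError if a line is not valid JSON
        yield action
        yield json.dumps(doc)  # re-serialised on one line with valid escaping


# Made-up sample document using the field queried in the thread.
sample = ['{"actor": {"displayName": "totaltrafficbos"}}', ""]
body = "\n".join(to_bulk_lines(sample, "aggsbug", "doc")) + "\n"
print(body, end="")
```

Parsing and re-serialising each line is a deliberate choice here: a document with invalid JSON fails loudly before anything is sent, rather than producing a bulk body the server rejects.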