Hello,
I'm working on a corpus of approximately 10 million documents. The
issue I'm running into right now is that the top-scoring documents that
come back from my query are essentially all the same result. I'm trying to
find a way to get back only unique results.
I've looked into modeling the data differently with nested objects or
parent-child relationships, but neither layout seems to fit the bill. The
nested model won't work because some of the documents have too many closely
related objects. On the flip side there are also too many unique documents
for the parent-child relationship to fit.
I then tried the "top hits" aggregation, and it does exactly what I'm looking
for, except that the query runs approximately 30x slower with the aggregation
than without it. Are there known performance issues with "top hits"? Any
ideas on what else I could use for these queries? Here's
the aggregation piece:
"aggs": {
  "top-fingerprints": {
    "terms": {
      "field": "fingerprint",
      "size": 50
    },
    "aggs": {
      "top_tag_hits": {
        "top_hits": {
          "size": 1,
          "_source": {
            "include": [
              "title"
            ]
          }
        }
      }
    }
  }
}
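In case it helps to reproduce this, here is a minimal sketch of how the aggregation above could be combined with a query into one request body. The helper name `build_request` and the `match_all` placeholder query are assumptions for illustration only; my real query is different and is not shown here. The script only builds and prints the JSON body, it does not talk to a cluster.

```python
import json


def build_request(query, field="fingerprint", buckets=50):
    """Attach the terms + top_hits aggregation from this post to a query.

    `build_request` is a hypothetical helper, not part of any client library.
    """
    return {
        "query": query,
        "aggs": {
            "top-fingerprints": {
                # One bucket per distinct fingerprint value.
                "terms": {"field": field, "size": buckets},
                "aggs": {
                    "top_tag_hits": {
                        # Keep only the single best hit inside each bucket.
                        "top_hits": {
                            "size": 1,
                            "_source": {"include": ["title"]},
                        }
                    }
                },
            }
        },
    }


# Placeholder query (assumption): substitute the real query here.
body = build_request({"match_all": {}})
print(json.dumps(body, indent=2))
```

Timing the same query with and without the "aggs" key is how I arrived at the 30x figure above.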
Thanks,
Michael