Hi, We are storing a large number of mail messages in ES, each with multiple fields: 600 million+ messages across 3 ES nodes.
We have a custom algorithm that works on a batch of messages and correlates them based on fields and other message semantics. The final result is a set of message groups, similar to what field collapsing returns. Currently we fetch 100K+ messages from ES and apply this logic on the client to return the final results to the user. The algorithm can't be modeled using aggregations.

Obviously this approach does not scale if we want to process, say, 100 million messages in one run and return results within a few minutes. The messages are large and partitioned across a few ES nodes. We want to maintain data locality while processing, so that we do not download large amounts of data from ES over the network.

Is there any way to execute some code over the shards from within ES? It is fine if it runs as part of postFilter as well. What options are available before we start thinking about Hadoop/Spark using the es-hadoop library?

Solr seems to have an (experimental) plugin hook for this kind of custom processing: https://cwiki.apache.org/confluence/display/solr/AnalyticsQuery+API

Thanks,
Ram
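
To make the ask more concrete, here is a minimal sketch of the shard-local pattern described above, in plain Python with no ES calls. Everything in it is an assumption for illustration: the shards are simulated as in-memory lists, and `correlation_key`, `process_shard` and `merge_partials` are hypothetical names. The subject-based key is only a stand-in for the real correlation logic, which is not shown in this post.

```python
from collections import defaultdict


def correlation_key(msg):
    """Hypothetical stand-in for the real correlation logic:
    the subject with any leading 're:' / 'fwd:' prefixes removed."""
    subject = msg["subject"].strip().lower()
    while subject.startswith(("re:", "fwd:")):
        subject = subject.split(":", 1)[1].strip()
    return subject


def process_shard(messages):
    """Would run next to the data: reads full messages locally,
    but returns only a small summary (key -> list of message ids)."""
    groups = defaultdict(list)
    for msg in messages:
        groups[correlation_key(msg)].append(msg["id"])
    return dict(groups)


def merge_partials(partials):
    """Would run on the coordinating side: combines the per-shard summaries."""
    merged = defaultdict(list)
    for partial in partials:
        for key, ids in partial.items():
            merged[key].extend(ids)
    return {key: sorted(ids) for key, ids in merged.items()}


# Two simulated shards; the large bodies never leave process_shard.
shards = [
    [
        {"id": 1, "subject": "Quarterly report", "body": "..."},
        {"id": 2, "subject": "RE: Quarterly report", "body": "..."},
    ],
    [
        {"id": 3, "subject": "Fwd: Re: quarterly report", "body": "..."},
        {"id": 4, "subject": "Lunch?", "body": "..."},
    ],
]

result = merge_partials(process_shard(shard) for shard in shards)
print(result)
```

The point of the split is that only message ids and keys cross the network, not the message bodies. It only works as-is when the correlation can be expressed as per-shard partial results that merge cleanly, which may not hold for every algorithm.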
