Hi,

We are storing a large number of mail messages in ES, each with multiple 
fields: 600+ million messages across 3 ES nodes.

We have a custom algorithm that works on a batch of messages and correlates 
them based on fields and other message semantics. 
The final result is a set of message groups, similar to what field 
collapsing returns. 

Currently we fetch 100K+ messages from ES and apply this logic on the client 
to return the final results to the user. The algorithm can't be modeled 
using aggregations. 
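
To make the current approach concrete, here is a minimal sketch of the 
fetch-then-group pattern. The real correlation logic is not shown in this 
post, so `correlation_key` (same sender plus normalized subject) and the 
`sender`/`subject` field names are purely hypothetical stand-ins, and the 
in-memory list stands in for the messages pulled from ES.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Message = Dict[str, str]


def correlation_key(msg: Message) -> Tuple[str, str]:
    """Hypothetical stand-in for the custom logic: sender + normalized subject."""
    subject = msg.get("subject", "").strip().lower()
    # Strip leading reply/forward prefixes so "Re: Hello" matches "Hello".
    while subject.startswith(("re:", "fwd:")):
        subject = subject.split(":", 1)[1].strip()
    return (msg.get("sender", ""), subject)


def group_messages(messages: Iterable[Message]) -> List[List[Message]]:
    """Group already-fetched messages on the client by their correlation key."""
    groups = defaultdict(list)
    for msg in messages:
        groups[correlation_key(msg)].append(msg)
    return list(groups.values())


# Stand-in for a batch fetched from ES (assumed field names).
fetched = [
    {"sender": "a@example.com", "subject": "Hello"},
    {"sender": "a@example.com", "subject": "Re: hello"},
    {"sender": "b@example.com", "subject": "Hello"},
]
groups = group_messages(fetched)
```

Every message has to cross the network before any grouping happens, which 
is the part that stops scaling.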

Obviously this approach does not scale if, say, we want to run 100M 
messages through this processing and return results within a few minutes. 
The messages are large and partitioned across a few ES nodes. We want to 
maintain data locality while processing, so that we do not have to download 
large amounts of data from ES over the network.
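
The shape of computation we are after can be sketched as below. This is 
not an ES API, only plain Python simulating shards with lists. It also 
assumes the algorithm can be split into a per-shard step plus a merge step, 
which may not hold for the real logic; `thread_key`, `id`, `shard_partial` 
and `merge_partials` are hypothetical names used for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List


def shard_partial(messages: Iterable[dict]) -> Dict[str, List[str]]:
    """Would run next to the data: reduce large messages to key -> message ids."""
    partial = defaultdict(list)
    for msg in messages:
        # Only the small key and id leave the shard, never the message body.
        partial[msg["thread_key"]].append(msg["id"])
    return dict(partial)


def merge_partials(partials: Iterable[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Would run on the coordinating side: combine the per-shard groups."""
    merged = defaultdict(list)
    for partial in partials:
        for key, ids in partial.items():
            merged[key].extend(ids)
    return dict(merged)


# Two simulated shards; "body" stands for the large payload that stays local.
shards = [
    [
        {"id": "1", "thread_key": "t1", "body": "large payload"},
        {"id": "2", "thread_key": "t2", "body": "large payload"},
    ],
    [{"id": "3", "thread_key": "t1", "body": "large payload"}],
]
merged = merge_partials(shard_partial(shard) for shard in shards)
```

Only keys and ids travel over the network in this sketch, so the transfer 
cost depends on the number of groups rather than on message size.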

Is there any way to execute custom code over the shards from within ES? It 
would be fine if this ran as part of a postFilter as well. What options are 
available before we start looking at Hadoop/Spark with the es-hadoop library? 

Solr seems to have an (experimental) plugin hook for this kind of custom 
processing: 
https://cwiki.apache.org/confluence/display/solr/AnalyticsQuery+API

Thanks,
Ram

