Thanks Jürgen for the detailed reply.

- We have considered doing some processing at indexing time for what is 
known up front, such as type and metadata extraction. We kept this limited 
because of the impact on index size and rely on on-demand analytics for 
anything further. Another aspect is building a pipeline: right now we 
process one entry at a time, so we are not looking at a window of messages 
for complex processing, but yes, this will need to be built in the near 
future.

- Thanks, we will certainly consider the Cassandra option, but right now we 
are going with ES as the message store. We would need to handle 
synchronization the moment we add one more store.

- The current average is ~5KB per message, which comes to 3+ TB spread 
across 3 nodes. Since the data is split across a few categorized indexes, 
this is working fine for now.
We will be adding a few more nodes and replication for a few indexes. We 
are using 16GB RAM nodes with quad-core CPUs.
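
For reference, the rough arithmetic behind that figure, as a small Python 
sketch. The numbers are the approximate ones from this thread; treating KB 
and TB as decimal units, and the 5-node / 1-replica case, are assumptions 
for illustration only and not our actual plan.

```python
# Back-of-the-envelope sizing; figures are the approximate ones from this thread.
MESSAGES = 600_000_000   # "600 Millions+" messages
AVG_BYTES = 5 * 1000     # ~5KB average, assumed to be decimal kilobytes
NODES = 3


def total_terabytes(messages: int, avg_bytes: int) -> float:
    """Raw primary data in decimal TB (no replicas, no index overhead)."""
    return messages * avg_bytes / 1e12


def per_node_terabytes(total_tb: float, nodes: int, replicas: int = 0) -> float:
    """Data per node if shards are spread evenly; each replica adds a full copy."""
    return total_tb * (1 + replicas) / nodes


total = total_terabytes(MESSAGES, AVG_BYTES)
print(f"total: {total:.1f} TB")
print(f"per node today: {per_node_terabytes(total, NODES):.1f} TB")
# Hypothetical future layout: 5 nodes, one replica of everything (an upper
# bound, since we only plan to replicate a few indexes).
print(f"per node, 5 nodes + 1 replica: {per_node_terabytes(total, 5, replicas=1):.1f} TB")
```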

- Yes, aggregations are being used in the application for all normal stats, 
and with some group-by logic as well. That is working perfectly.

- Yes, we are looking at Spark and the related deployment complexity. I 
have a different opinion here, though. ES/Solr are now eating into the 
NoSQL space and are being used for more use cases than just search. 
Search-based analytics is becoming prominent as well. We don't need a 
full-fledged MR layer in ES/Solr, but something simpler: some custom 
processing that can check each relevant entry and then decide what results 
should be returned to the user. Since the candidate list depends on the 
user query, this needs to be done on demand.
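
To make the idea concrete, here is a rough sketch of the shape I have in 
mind, in plain Python. This is not an ES or Solr API. The function names, 
the message fields and the subject-based `correlate_key` are all made up 
for illustration, and the toy key merely stands in for our real correlation 
logic (which cannot be expressed as an aggregation). The point is only that 
each shard looks at its own candidates and ships back a small summary 
instead of full messages.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

Message = Dict[str, str]


def correlate_key(msg: Message) -> str:
    """Hypothetical stand-in for the real correlation logic:
    group messages by subject with reply/forward prefixes stripped."""
    subject = msg.get("subject", "").strip().lower()
    while subject.startswith(("re:", "fwd:")):
        subject = subject.split(":", 1)[1].strip()
    return subject


def process_shard(messages: Iterable[Message]) -> Dict[str, List[str]]:
    """Would run next to the data: visit each candidate once and
    keep only message ids per group, not the message bodies."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for msg in messages:
        groups[correlate_key(msg)].append(msg["id"])
    return dict(groups)


def merge(partials: Iterable[Dict[str, List[str]]]) -> Dict[str, List[str]]:
    """Would run on the coordinating side: combine per-shard summaries."""
    merged: Dict[str, List[str]] = defaultdict(list)
    for partial in partials:
        for key, ids in partial.items():
            merged[key].extend(ids)
    return dict(merged)


# Two in-memory lists stand in for the query candidates held by two shards.
shard_a = [{"id": "1", "subject": "Quarterly report"},
           {"id": "2", "subject": "Re: Quarterly report"}]
shard_b = [{"id": "3", "subject": "FWD: Re: quarterly report"},
           {"id": "4", "subject": "Lunch?"}]

result = merge(process_shard(s) for s in (shard_a, shard_b))
print(result)
```

Only the small per-group id lists cross the network here, which is the data 
locality we are after. Correlations that span shards would still have to be 
resolved in the merge step.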

Thanks,
Ram



On Friday, December 12, 2014 10:34:46 AM UTC+5:30, Ramchandra Phadake wrote:
>
> Hi,
>
> We are storing lots of mail messages in ES with multiple fields. 600 
> Millions+ messages across 3 ES nodes.
>
> There is a custom algorithm which works on batch of messages to correlate 
> based on fields & other message semantics. 
> Final result involves groups of messages returned similar to say field 
> collapsing type results. 
>
> Currently we fetch 100K+ messages from ES & apply this logic to return 
> final results to user. The algo can't be modeled using aggregations. 
>
> Obviously this is not scalable approach if say we want to process 100 M 
> messages as part of this processing & return results in few mins. The 
> messages are large & partitioned across few ES nodes. We want to maintain 
> data locality while processing so as not to download lots of data from ES 
> over network.
>
> Any way to execute some code over shards from within ES, fine if done as 
> part of postFilter as well. What are options available before thinking 
> about Hadoop/Spark using es-hadoop library? 
>
> Solr seems to be having such a plugin hook(experimental) for custom 
> processing. 
> https://cwiki.apache.org/confluence/display/solr/AnalyticsQuery+API
>
> Thanks,
> Ram
>
>
>

