Hi Ram, we have built something similar for a compliance analytics application. Consider the following:
- The feeding pipeline should perform as much tagging, extraction, enrichment, and classification as possible; the results will be indexed. That usually takes care of the computationally intensive tasks (e.g., complex entity extraction, relationship extraction) and prepares for later analytics by providing proper entities to work on. As messages usually don't change (i.e., once indexed, you will keep them unchanged for the rest of their lifetime), spending a bit more compute time during feeding is fine.

- You don't have to store the original message contents in Elasticsearch. Try Apache Cassandra and only index a message id in Elasticsearch that can be used to retrieve the original message from Cassandra, or simply from a file store (in the case of compliance/e-discovery, it tends to be an immutable file store). In our application, the relevant meta-data is only about 60% of the source volume, so storing the original messages elsewhere requires only about 38% of the Elasticsearch storage needed for both (60 units of meta-data instead of 160 units for meta-data plus originals). See the first sketch after this list.

- Your queries may become complex, but you can scale with more replicas and nodes, or simply more RAM as necessary. Unless you're talking about SMS messages, three nodes seems tight.

- If you need to do some query-time analytics, fetch the candidate records and use aggregations where possible. Aggregations may not do the entire job, but they can help find the candidates. You may want to run a first query to obtain just the aggregations without any result hits, and then run one or more queries to get the actual candidate sets (see the second sketch below). Querying should be considered "cheap", so having multiple queries is fine.

- Then do the extra analytics on the query result set you obtained. For this purpose, you should look into Apache Spark for fast in-memory processing of this data set, especially if you really have a number of small, parallel jobs with a significant divergence of run-times (see the third sketch below). As the scaling properties of Elasticsearch retrieval and of the post-query processing will most likely be quite different, I would not recommend using any form of plug-in for Elasticsearch (or Solr).

- If I take the dimensioning from my application and calculate it for 600 M e-mail messages (an average size of 10 kB excluding attachments, plus derived meta-data of approx. another 6 kB of text), I get around 10 TB of raw data: 600 M x 16 kB = 9.6 TB. Three nodes seem to be a bit short for this application. I don't know about the RAM and CPU sizing in your case, but you should consider going to a definitely larger number of nodes.
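To make the split-storage idea concrete, here is a minimal Python sketch. The index name, table schema, and field names are assumptions for illustration only, not taken from our application:

from elasticsearch import Elasticsearch
from cassandra.cluster import Cluster

es = Elasticsearch(["http://localhost:9200"])
session = Cluster(["127.0.0.1"]).connect("mailstore")

def ingest(msg_id, raw_body, meta):
    # Store the immutable original exactly once in Cassandra...
    session.execute(
        "INSERT INTO messages (id, raw) VALUES (%s, %s)",
        (msg_id, raw_body))
    # ...and index only the enriched meta-data plus the id in ES.
    es.index(index="mail-meta", id=msg_id, body=meta)

def fetch_original(msg_id):
    # Resolve a search hit back to the full message via its id.
    row = session.execute(
        "SELECT raw FROM messages WHERE id = %s", (msg_id,)).one()
    return row.raw if row else None

The meta document would carry whatever the feeding pipeline extracted (entities, relationships, classifications), so ES only ever sees what you actually query on.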
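The aggregations-first query pattern could look like this, reusing the es client from the previous sketch; the thread_id field and the bucket-selection criterion are made up for the example:

# Phase 1: aggregations only, no result hits (size 0).
buckets = es.search(index="mail-meta", body={
    "size": 0,
    "aggs": {"threads": {"terms": {"field": "thread_id", "size": 1000}}}
})["aggregations"]["threads"]["buckets"]

# Phase 2: one cheap follow-up query per interesting bucket
# to collect the actual candidate sets.
candidate_sets = []
for bucket in (b for b in buckets if b["doc_count"] > 1):
    hits = es.search(index="mail-meta", body={
        "query": {"term": {"thread_id": bucket["key"]}},
        "size": 10000
    })["hits"]["hits"]
    candidate_sets.append(hits)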
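And for the post-query analytics, a rough Spark sketch; correlate() stands in for your custom batch algorithm, and the match_all query is again just a placeholder:

from elasticsearch.helpers import scan
from pyspark import SparkContext

def correlate(messages):
    # Custom grouping logic over one partition of candidate
    # messages; replace with the real batch algorithm.
    yield list(messages)

# Pull the candidate set out of ES with the scroll helper...
candidates = [hit["_source"]
              for hit in scan(es, index="mail-meta",
                              query={"query": {"match_all": {}}})]

# ...and fan the small, divergent jobs out across Spark workers.
sc = SparkContext(appName="mail-correlation")
groups = (sc.parallelize(candidates, numSlices=64)
            .mapPartitions(correlate)
            .collect())

This keeps the ES cluster sized for retrieval and the Spark cluster sized for the analytics, which is exactly why I would avoid an in-process plug-in.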
Some thoughts... your mileage may vary :-)

Best regards,
--Jürgen

On 12.12.2014 06:04, Ramchandra Phadake wrote:
> Hi,
>
> We are storing lots of mail messages in ES with multiple fields: 600
> million+ messages across 3 ES nodes.
>
> There is a custom algorithm which works on batches of messages to
> correlate them based on fields and other message semantics. The final
> result involves groups of messages, similar to field-collapsing-style
> results.
>
> Currently we fetch 100K+ messages from ES and apply this logic to
> return final results to the user. The algorithm can't be modeled using
> aggregations.
>
> Obviously this is not a scalable approach if, say, we want to process
> 100 M messages as part of this processing and return results in a few
> minutes. The messages are large and partitioned across a few ES nodes.
> We want to maintain data locality while processing, so as not to
> download lots of data from ES over the network.
>
> Is there any way to execute some code over shards from within ES? It
> would be fine if done as part of a postFilter as well. What options
> are available before thinking about Hadoop/Spark using the es-hadoop
> library?
>
> Solr seems to have such a plugin hook (experimental) for custom
> processing:
> https://cwiki.apache.org/confluence/display/solr/AnalyticsQuery+API
>
> Thanks,
> Ram
