Hi Ram, we have built something similar for a compliance analytics application. Consider the following:
- The feeding pipeline should perform as much tagging, extraction, enrichment, and classification as possible; the results will be indexed. That usually takes care of the computationally intensive tasks (e.g., complex entity extraction, relationship extraction) and prepares for later analytics by providing proper entities to work on. As messages usually don't change (i.e., once indexed, you will keep them unchanged for the rest of their lifetime), spending a bit more compute time during feeding is fine.

- You don't have to store the original message contents in Elasticsearch. Try Apache Cassandra and only index a message id in Elasticsearch that can be used to retrieve the original message from Cassandra, or simply from a file store (in the case of compliance/e-discovery, it tends to be an immutable file store). In our application, the relevant meta-data is only about 60% of the source volume, so storing the original messages elsewhere requires only about 38% of the Elasticsearch storage needed for both (60 units of meta-data instead of 160 units for meta-data plus originals). See the first sketch after this list.

- Your queries may become complex, but you can scale with more replicas and nodes, or simply more RAM as necessary. Unless you're talking about SMS messages, three nodes seems tight.

- If you need to do some query-time analytics, fetch the candidate records and use aggregations where possible. Aggregations may not do the entire job, but they can help find the candidates. You may want to run a first query to obtain just the aggregations without any result hits, and then run one or more queries to get the actual candidate sets (see the second sketch below). Querying should be considered "cheap", so having multiple queries is fine.

- Then do the extra analytics on the query result set you obtained. For this purpose, you should look into Apache Spark for fast in-memory processing of this data set, especially if you really have a number of small, parallel jobs with a significant divergence of run-times (see the third sketch below). As the scaling properties of Elasticsearch retrieval and of the post-query processing will most likely be quite different, I would not recommend using any form of plug-in for Elasticsearch (or Solr).

- If I take the dimensioning from my application and calculate it for 600 M e-mail messages (an average size of 10 kB excluding attachments, plus derived meta-data of approx. another 6 kB of text), I get around 10 TB of raw data: 600 M x 16 kB = 9.6 TB. Three nodes seem to be a bit short for this application. I don't know about the RAM and CPU sizing in your case, but you should consider going to a definitely larger number of nodes.
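To make the split-storage idea concrete, here is a minimal Python sketch. The index name, table schema, and field names are assumptions for illustration only, not taken from our application:

from elasticsearch import Elasticsearch
from cassandra.cluster import Cluster

es = Elasticsearch(["http://localhost:9200"])
session = Cluster(["127.0.0.1"]).connect("mailstore")

def ingest(msg_id, raw_body, meta):
    # Store the immutable original exactly once in Cassandra...
    session.execute(
        "INSERT INTO messages (id, raw) VALUES (%s, %s)",
        (msg_id, raw_body))
    # ...and index only the enriched meta-data plus the id in ES.
    es.index(index="mail-meta", id=msg_id, body=meta)

def fetch_original(msg_id):
    # Resolve a search hit back to the full message via its id.
    row = session.execute(
        "SELECT raw FROM messages WHERE id = %s", (msg_id,)).one()
    return row.raw if row else None

The meta document would carry whatever the feeding pipeline extracted (entities, relationships, classifications), so ES only ever sees what you actually query on.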
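The aggregations-first query pattern could look like this, reusing the es client from the previous sketch; the thread_id field and the bucket-selection criterion are made up for the example:

# Phase 1: aggregations only, no result hits (size 0).
buckets = es.search(index="mail-meta", body={
    "size": 0,
    "aggs": {"threads": {"terms": {"field": "thread_id", "size": 1000}}}
})["aggregations"]["threads"]["buckets"]

# Phase 2: one cheap follow-up query per interesting bucket
# to collect the actual candidate sets.
candidate_sets = []
for bucket in (b for b in buckets if b["doc_count"] > 1):
    hits = es.search(index="mail-meta", body={
        "query": {"term": {"thread_id": bucket["key"]}},
        "size": 10000
    })["hits"]["hits"]
    candidate_sets.append(hits)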
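And for the post-query analytics, a rough Spark sketch; correlate() stands in for your custom batch algorithm, and the match_all query is again just a placeholder:

from elasticsearch.helpers import scan
from pyspark import SparkContext

def correlate(messages):
    # Custom grouping logic over one partition of candidate
    # messages; replace with the real batch algorithm.
    yield list(messages)

# Pull the candidate set out of ES with the scroll helper...
candidates = [hit["_source"]
              for hit in scan(es, index="mail-meta",
                              query={"query": {"match_all": {}}})]

# ...and fan the small, divergent jobs out across Spark workers.
sc = SparkContext(appName="mail-correlation")
groups = (sc.parallelize(candidates, numSlices=64)
            .mapPartitions(correlate)
            .collect())

This keeps the ES cluster sized for retrieval and the Spark cluster sized for the analytics, which is exactly why I would avoid an in-process plug-in.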
Some thoughts... your mileage may vary :-)

Best regards,
--Jürgen

On 12.12.2014 06:04, Ramchandra Phadake wrote:
> Hi,
>
> We are storing lots of mail messages in ES with multiple fields: 600
> million+ messages across 3 ES nodes.
>
> There is a custom algorithm which works on batches of messages to
> correlate them based on fields and other message semantics. The final
> result involves groups of messages, similar to field-collapsing-style
> results.
>
> Currently we fetch 100K+ messages from ES and apply this logic to
> return final results to the user. The algorithm can't be modeled using
> aggregations.
>
> Obviously this is not a scalable approach if, say, we want to process
> 100 M messages as part of this processing and return results in a few
> minutes. The messages are large and partitioned across a few ES nodes.
> We want to maintain data locality while processing, so as not to
> download lots of data from ES over the network.
>
> Is there any way to execute some code over shards from within ES? It
> would be fine if done as part of a postFilter as well. What options
> are available before thinking about Hadoop/Spark using the es-hadoop
> library?
>
> Solr seems to have such a plugin hook (experimental) for custom
> processing:
> https://cwiki.apache.org/confluence/display/solr/AnalyticsQuery+API
>
> Thanks,
> Ram
