Re: ES-Hadoop data query liveness "near real-time" quantified?

Costin Leau Fri, 01 Aug 2014 03:10:37 -0700

Using the real-time computing (RTC) definitions, ES, Hadoop, the JVM and
the popular OSes themselves are "soft"/near real-time systems - so if you
are coming from a hard/firm RT system, you can safely assume that
everything (and again, not just ES) is "soft". As a tangent, very few
systems are hard RT (ES is neither a nuclear plant nor a pacemaker).

As ES-Hadoop is just a connector for ES, the real-time aspect of ES
directly influences es-hadoop. I don't have any numbers at hand;
however, there are some aspects that you need to be aware of.

Much of the real-time behaviour when it comes to _search_ is handled
through the refresh API [1]. So when data is ingested into the system,
depending on your index settings (how many replicas, what the
replication process is - sync vs async, all vs n/2+1), the amount of
data ingested and your hardware, your data might become searchable
faster or slower. There are so many non-standardized variables here
that the only way to find out is to do your own benchmark, which is
what we recommend: take a typical box, set it up, hammer it with data
and you get a baseline. Based on that, you can figure out how big your
cluster needs to be and, as a side effect, how much budget you have
left to improve performance (by throwing more hardware at it).
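
A minimal sketch of such a baseline measurement, in Python. The two
callables, index_doc and is_searchable, are hypothetical placeholders
that you would wire to your own client (for example, an index request
and a search for the document id); nothing here is an Elasticsearch or
es-hadoop API, and the function name and defaults are assumptions made
for illustration.

```python
import time
from typing import Callable, Optional


def measure_visibility_lag(
    index_doc: Callable[[str], None],
    is_searchable: Callable[[str], bool],
    doc_id: str,
    poll_interval: float = 0.01,
    timeout: float = 30.0,
) -> Optional[float]:
    """Index one document, then poll search until the document shows up.

    Returns the seconds elapsed between the index call returning and the
    first successful search, or None if the timeout is reached first.
    Both callables are hypothetical hooks supplied by the caller.
    """
    index_doc(doc_id)
    start = time.monotonic()
    deadline = start + timeout
    while time.monotonic() < deadline:
        if is_searchable(doc_id):
            return time.monotonic() - start
        time.sleep(poll_interval)
    return None
```

The resolution of the result is bounded by poll_interval, and the
polling itself adds load, so treat the number as a rough baseline for
one box rather than a guarantee. Repeating the measurement under your
real ingest volume is what makes it meaningful.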

Note however that get operations are performed in real-time and are
not affected by refresh [2] - in other words data lookup is
instantaneous vs search that can be delayed (as mentioned above).
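
To make the get-versus-search distinction concrete, here is a toy
in-memory model in Python. It is only an illustration of the behaviour
described above, not how Elasticsearch is implemented; the ToyIndex
class and its method names are invented for this sketch.

```python
class ToyIndex:
    """Toy model only: search sees refreshed documents, get sees all writes."""

    def __init__(self):
        self._pending = {}  # written, but not yet refreshed
        self._visible = {}  # refreshed, so visible to search

    def index(self, doc_id, doc):
        # A write lands in the pending buffer first.
        self._pending[doc_id] = doc

    def refresh(self):
        # A refresh makes all pending documents visible to search.
        self._visible.update(self._pending)
        self._pending.clear()

    def get(self, doc_id):
        # Lookup by id consults pending writes too, so it is not delayed.
        if doc_id in self._pending:
            return self._pending[doc_id]
        return self._visible.get(doc_id)

    def search(self, predicate):
        # Search only considers documents that have been refreshed.
        return [doc for doc in self._visible.values() if predicate(doc)]
```

In this model, a document indexed a moment ago is returned by get right
away, while search returns it only after refresh has run.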

Hope this helps,

[1] 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-refresh.html
[2] 
http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-get.html

On Fri, Aug 1, 2014 at 12:36 PM, Pierre WP <[email protected]> wrote:
> I have a question about what "near real-time" means exactly, in a quantified
> way, when described this way on the ES-hadoop home page:
>
>> We are happy to report that es-hadoop is being used in multiple
>> data-intensive environments; in a recent example, a large financial
>> institute that stores all of their raw access logs in Hadoop – billions of
>> documents – has been using es-hadoop to index the data into Elasticsearch
>> and then visualize it using Kibana. This approach allowed the customer to
>> have near real-time visibility into their data through Kibana
>
>
> (http://www.elasticsearch.org/blog/es-hadoop-2-0-g/)
>
> I've been burned in the past by people throwing around the term "real-time"
> in sloppy ways when what they really meant was update lag of many minutes.
> (Coming from the hardware world we have a different way of using the term
> "real-time" =D)
>
> I'm not saying that's the case here, I'm just asking for numerical
> clarification. Naturally I assume it depends on the volume of data flow, the
> server equipment, and the configuration settings. I've done about half an
> hour of general searching without any definitive answers. Hopefully someone
> either knows or can point me to a good resource.
>
> -- Pierre
>
> --
> You received this message because you are subscribed to the Google Groups
> "elasticsearch" group.
> To unsubscribe from this group and stop receiving emails from it, send an
> email to [email protected].
> To view this discussion on the web visit
> https://groups.google.com/d/msgid/elasticsearch/6f40a541-19b6-4a86-b02c-b07b1e3b17b3%40googlegroups.com.
> For more options, visit https://groups.google.com/d/optout.

