Going by the real-time computing (RTC) definitions, ES, Hadoop, the JVM and the popular operating systems themselves are "soft"/near real-time systems, so if you are coming from a hard/firm RT system, you can safely assume that everything (and again, not just ES) is "soft". As a tangent, very few systems are hard RT (ES is neither a nuclear plant nor a pacemaker).
As es-hadoop is just a connector for ES, the real-time behaviour of ES directly determines that of es-hadoop. I don't have any numbers at hand, but there are some aspects that you need to be aware of.

Much of the real-time behaviour when it comes to _search_ is handled through the refresh API [1]. So when data is ingested into the system, your data might become searchable faster or slower depending on your index settings (how many replicas, what the replication process is: sync vs async, all vs n/2+1), the amount of data ingested and your hardware.

There are so many non-standardized variables here that the only way to find out is to do your own benchmark, which is what we recommend: take a typical box, set it up, hammer it with data and you get a baseline. Based on that, you can figure out how big your cluster needs to be and, as a side effect, how much budget you have left to improve performance (by throwing more hardware at it).

Note however that get operations are performed in real-time and are not affected by refresh [2]. In other words, a lookup by id sees the data immediately, whereas search can be delayed (as mentioned above).

Hope this helps,

[1] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/indices-refresh.html
[2] http://www.elasticsearch.org/guide/en/elasticsearch/reference/current/docs-get.html

On Fri, Aug 1, 2014 at 12:36 PM, Pierre WP wrote:
> I have a question about what "near real-time" means exactly, in a quantified
> way, when described this way on the ES-hadoop home page:
>
>> We are happy to report that es-hadoop is being used in multiple
>> data-intensive environments; in a recent example, a large financial
>> institute that stores all of their raw access logs in Hadoop – billions of
>> documents – has been using es-hadoop to index the data into Elasticsearch
>> and then visualize it using Kibana.
>> This approach allowed the customer to
>> have near real-time visibility into their data through Kibana
>
> (http://www.elasticsearch.org/blog/es-hadoop-2-0-g/)
>
> I've been burned in the past by people throwing around the term "real-time"
> in sloppy ways when what they really meant was update lag of many minutes.
> (Coming from the hardware world we have a different way of using the term
> "real-time" =D)
>
> I'm not saying that's the case here, I'm just asking for numerical
> clarification. Naturally I assume it depends on the volume of data flow, the
> server equipment, and the configuration settings. I've done about half an
> hour of general searching without any definitive answers. Hopefully someone
> either knows or can point me to a good resource.
>
> -- Pierre
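
For the benchmark suggested above, here is a minimal Python sketch of one way to put a number on "near real-time" for your own setup: index a probe document, then poll search until it shows up, and report the elapsed time. This is an illustration under assumptions, not an official tool. The function names (measure_visibility_lag, make_es_callables) are made up for this example, and the URL layout (/{index}/_doc/{id}, /{index}/_search) is assumed from recent ES releases; older releases use /{index}/{type}/{id}, so adjust to your version. Keep in mind that search visibility is normally bounded by the periodic refresh (the index.refresh_interval setting, 1s by default as far as I know), and that the measurement cannot be finer than the polling interval.

```python
import json
import time
import urllib.parse
import urllib.request
from typing import Callable, Tuple


def measure_visibility_lag(
    index_doc: Callable[[], None],
    is_searchable: Callable[[], bool],
    poll_interval: float = 0.05,
    timeout: float = 30.0,
    clock: Callable[[], float] = time.monotonic,
    sleep: Callable[[float], None] = time.sleep,
) -> float:
    """Return the seconds between indexing a doc and first seeing it in search.

    Raises TimeoutError if the doc is still not searchable after `timeout`.
    The clock and sleep functions are injectable so the loop can be tested
    without a cluster.
    """
    start = clock()
    index_doc()
    while True:
        if is_searchable():
            return clock() - start
        if clock() - start >= timeout:
            raise TimeoutError("document not searchable within timeout")
        sleep(poll_interval)


def make_es_callables(
    base_url: str, index: str, doc_id: str
) -> Tuple[Callable[[], None], Callable[[], bool]]:
    """Build (index_doc, is_searchable) closures that talk to ES over HTTP.

    ASSUMPTION: the URL layout matches recent ES releases; this helper is
    hypothetical and only shows where the real calls would go.
    """

    def index_doc() -> None:
        req = urllib.request.Request(
            f"{base_url}/{index}/_doc/{doc_id}",
            data=json.dumps({"probe": doc_id}).encode("utf-8"),
            headers={"Content-Type": "application/json"},
            method="PUT",
        )
        urllib.request.urlopen(req).close()

    def is_searchable() -> bool:
        # A search by id only returns the doc once a refresh has happened.
        query = urllib.parse.urlencode({"q": f"_id:{doc_id}"})
        url = f"{base_url}/{index}/_search?{query}"
        with urllib.request.urlopen(url) as resp:
            return len(json.load(resp)["hits"]["hits"]) > 0

    return index_doc, is_searchable
```

Against a test cluster you would call something like measure_visibility_lag(*make_es_callables("http://localhost:9200", "bench", "probe1")), repeat it with fresh ids while the cluster is under your typical ingest load, and look at the distribution rather than a single value. The host, index name and id here are placeholders.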
