Hello Fred,

I have clusters as large as 200 billion documents / 130TB. Sharing
experiences on that would require a book, but here are a couple of quick
things that jumped out at me.

1. Do not go the huge-server route. Elasticsearch works best when you
scale it horizontally. The 64GB machines are a much better option.

2. If I understand correctly, you're routing an entire month's data to a
single shard? By doing that you're directing all activity on that shard to
a single machine, or a small set of machines if you have replicas. That has
to be much slower than using a monthly index with a reasonable number of
shards to spread the load across the cluster (see the index-creation sketch
after this list). Routing a month to one shard also creates fairly large
shards, and if you have month-to-month variation in data rates you'll end
up with "lumpy" shard sizes, which will definitely cause issues if you ever
run your cluster low on disk space.

3. Get off of ES 1.3 as fast as you can. 8TB spread across 37 machines is
very low density; as you push more data in, you don't want to be on ES 1.3.

4. If you're not already using doc_values, start looking into it now (see
the mapping sketch below). Managing heap memory is, let's be nice and call
it, "a challenge", and fielddata can eat heap in ways that will make your
head spin.
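
For what it's worth, here's a rough sketch of the monthly-index approach
from point 2, using the Python elasticsearch client against an ES 1.x-era
cluster. The index name "events-2015.04" and the shard/replica counts are
just illustrative, not a sizing recommendation:

    # Rough sketch: one index per month, sharded across the cluster, instead
    # of routing a whole month's documents to a single shard.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()  # defaults to localhost:9200

    index_name = "events-2015.04"  # illustrative monthly index name
    es.indices.create(
        index=index_name,
        body={
            "settings": {
                "number_of_shards": 8,   # one month's load spread over 8 shards
                "number_of_replicas": 1,
            }
        },
    )

    # No custom routing: documents are hashed across all 8 shards, so indexing
    # and query load for the month lands on many machines instead of one.
    es.index(index=index_name, doc_type="event", body={"message": "hello"})

Queries for a given month then hit only that month's index, so you keep the
one-month-per-query locality without funnelling everything through a single
shard.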
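
And for point 4, a minimal sketch of what enabling doc_values looks like in
an ES 1.x mapping. The field names are made up, and doc_values applies to
not_analyzed strings, numerics and dates, not to analyzed text:

    # Rough sketch: enable doc_values on fields you sort or aggregate on, so
    # those operations read disk-backed doc values instead of loading
    # fielddata onto the heap.
    from elasticsearch import Elasticsearch

    es = Elasticsearch()

    es.indices.put_mapping(
        index="events-2015.04",   # illustrative index / type names
        doc_type="event",
        body={
            "event": {
                "properties": {
                    "client_id": {
                        "type": "string",
                        "index": "not_analyzed",
                        "doc_values": True,
                    },
                    "published_at": {"type": "date", "doc_values": True},
                }
            }
        },
    )

Keep in mind doc_values are written at index time, so segments indexed
before the mapping change won't have them; you'd need to reindex existing
data to get the heap savings there.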



Kimbro Staken


On Wed, Apr 22, 2015 at 1:14 AM, <fdevilla...@synthesio.com> wrote:

> Hi list,
>
> I've been using ES in production since 0.17.6, with clusters of up to 64
> virtual machines and 20TB of data (including 3 replicas). We're now
> thinking about pushing things a bit further, and I wondered if people here
> have similar experience / needs to ours.
>
> Our current index is 1.1 billion unique documents and 8TB of data
> (including 1 replica) on 37 physical machines (32 data nodes, 3 master
> nodes and 2 nodes dedicated to HTTP requests) with ES 1.3 (an upgrade to
> 1.5 is already planned). We're indexing about 2,500 new documents / second
> and everything's fine so far.
>
> Our goal is to index (and search) about 30 billion more documents (the
> backdata) + about 200 million new documents each month.
>
> Our company provides analytics dashboards to its clients, and they mostly
> browse their data on a monthly scale, so we're routing documents by month.
> Each shard ends up between 200 and 250GB. The index is made of 128 shards,
> which gives about 10 years of data at 1 month per shard. Considering what
> we already have, we should reach 240TB of data (and counting) with a
> single replica after we index all our backdata.
>
> So, my questions here:
>
> - Does anyone here have the same use case / amount of data as we do?
>
> - Is ES the right technology to do realtime, lightspeed queries (filtered
> queries and high-cardinality aggregations) on such an amount of data?
>
> - What are the traps to avoid? Is it better to add lots of medium
> machines (12-core Xeon E5-1650 v2, 64GB RAM, 1.8TB SAS 15k hard drives) or
> a few huge machines with petabytes of RAM, terabytes of SSD and multiple
> ES processes?
>
> Any feedback on a similar situation is much appreciated.
>
> Have a nice day,
> Fred
>
