Running ES at scale is all about balance and sizing things right. Like the
three bears: not too big, not too small, just right. Big boxes will mostly be
wasted, and boxes that are too small will have you hitting limits too soon.
Because Java loses compressed object pointers once the heap goes above roughly
30GB, the best size for a node right now seems to be around 64GB of RAM. More
RAM will be used by the OS cache and is definitely a good thing, but when you
start to survey hardware options and weigh other factors like power, rack
density and per-node cost, big boxes don't balance out.
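
A rough sketch of that sizing rule in Python (the 31GB cutoff is the usual
compressed-oops threshold and the half-RAM split is a common rule of thumb,
not anything specific to one setup):

    def pick_heap_gb(node_ram_gb):
        # Cap the JVM heap at half of RAM, and never more than ~31GB so
        # compressed object pointers stay enabled; the rest of RAM is left
        # to the OS page cache, which Lucene leans on heavily.
        return min(node_ram_gb // 2, 31)

    print(pick_heap_gb(64))   # -> 31
    print(pick_heap_gb(128))  # -> 31; the extra RAM only helps the OS cache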

The 50GB-per-shard number is just that: per shard. An individual node can
(and should) hold dozens of shards. Larger shards will work too, but when a
node crashes, recovering a larger number of 50GB shards is much faster than
recovering a smaller number of 200GB shards, especially in a large cluster.
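
As a back-of-the-envelope sketch (assuming you have a rough idea how much
data an index will hold):

    import math

    def shard_count(expected_index_gb, target_shard_gb=50):
        # Aim for roughly target_shard_gb per shard so a failed node recovers
        # many small shards in parallel instead of a few very large ones.
        return max(1, math.ceil(expected_index_gb / target_shard_gb))

    print(shard_count(400))                       # -> 8 shards of ~50GB each
    print(shard_count(400, target_shard_gb=200))  # -> 2 big shards, slower to recover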

Kimbro Staken

On Wed, Apr 22, 2015 at 6:04 PM, Jack Park <jackp...@topicquests.org> wrote:

> That starts to argue for lots of smaller servers, maybe even with smaller
> SSDs. Say, a low-power i3 with 16 or 32GB of RAM and a 128GB SSD. Is that
> right?
>
> On Wed, Apr 22, 2015 at 2:56 PM, Mark Walkom <markwal...@gmail.com> wrote:
>
>> If you are using time series data then you should be using time series
>> indices. As Fred pointed out, routing an entire month's worth of data to a
>> single shard is not going to scale.
>>
>> Also, we recommend that you keep shard size below 50GB; this helps with
>> recovery and distribution. There is also a hard limit of about 2 billion
>> docs per shard in the underlying Lucene engine, and if you hit it you may
>> lose data.
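>>
>> A minimal sketch of what that looks like with the official Python client
>> (elasticsearch-py); the "events-*" naming scheme and the "timestamp" field
>> are made-up examples, not anything from Fred's actual setup:
>>
>>     from datetime import datetime
>>     from elasticsearch import Elasticsearch
>>
>>     es = Elasticsearch(["localhost:9200"])
>>
>>     def index_event(doc):
>>         # One index per month, e.g. "events-2015.04"; writes for the
>>         # current month spread over that index's shards across the
>>         # cluster instead of being routed to a single shard.
>>         ts = datetime.strptime(doc["timestamp"], "%Y-%m-%dT%H:%M:%S")
>>         es.index(index="events-" + ts.strftime("%Y.%m"),
>>                  doc_type="event", body=doc)
>>
>>     index_event({"timestamp": "2015-04-22T14:56:00",
>>                  "client": "acme", "value": 42})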
>>
>> On 23 April 2015 at 03:12, Kimbro Staken <ksta...@kstaken.com> wrote:
>>
>>> Hello Fred,
>>>
>>> I have clusters as large as 200 billion documents / 130TB. Sharing
>>> experiences on that would require a book, but a couple of quick things
>>> jumped out at me.
>>>
>>> 1. Do not go the huge-server route. Elasticsearch works best when you
>>> scale it horizontally. The 64GB route is a much better option.
>>>
>>> 2. If I understand correctly, you're routing an entire month's data to a
>>> single shard? By doing that you're directing all activity on that shard to
>>> a single machine, or a small set of machines if you have replicas. That
>>> has to be much slower than something like a monthly index with a
>>> reasonable number of shards to spread the load across the cluster (there's
>>> a rough sketch of that at the end of this list). It also creates shard
>>> sizes that are fairly large, and if you have month-to-month variation in
>>> data rates you'll end up with "lumpy" shard sizes, which will definitely
>>> cause issues if you ever run your cluster low on disk space.
>>>
>>> 3. Get off of ES 1.3 as fast as you can. 8TB spread across 37 machines is
>>> very low density; as you push more data in, you don't want to be on ES
>>> 1.3.
>>>
>>> 4. If you're not already using doc_values, start looking into them now.
>>> Managing heap memory is, let's be nice and call it, "a challenge", and
>>> fielddata can eat heap in ways that will make your head spin. The sketch
>>> below shows the kind of mapping that enables doc_values.
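>>>
>>> As a rough illustration of 2 and 4 together, here's an index template via
>>> the Python client that gives every monthly index a fixed shard count and
>>> doc_values on the fields you aggregate on. The "events-*" pattern, the
>>> field names and the shard count are just placeholders:
>>>
>>>     from elasticsearch import Elasticsearch
>>>
>>>     es = Elasticsearch(["localhost:9200"])
>>>
>>>     es.indices.put_template(name="events-monthly", body={
>>>         "template": "events-*",      # applied to every new monthly index
>>>         "settings": {
>>>             "number_of_shards": 8,   # derive from expected monthly volume
>>>             "number_of_replicas": 1
>>>         },
>>>         "mappings": {
>>>             "event": {
>>>                 "properties": {
>>>                     # doc_values keeps fielddata on disk instead of heap;
>>>                     # on ES 1.x a string needs to be not_analyzed for it
>>>                     "client": {"type": "string", "index": "not_analyzed",
>>>                                "doc_values": True},
>>>                     "timestamp": {"type": "date", "doc_values": True}
>>>                 }
>>>             }
>>>         }
>>>     })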
>>>
>>>
>>>
>>> Kimbro Staken
>>>
>>>
>>> On Wed, Apr 22, 2015 at 1:14 AM, <fdevilla...@synthesio.com> wrote:
>>>
>>>> Hi list,
>>>>
>>>> I've been using ES in production since 0.17.6, with clusters of up to 64
>>>> virtual machines and 20TB of data (including 3 replicas). We're now
>>>> thinking about pushing things a bit further, and I wondered if people here
>>>> have similar experience / needs.
>>>>
>>>> Our current index is 1.1 billion unique documents and 8TB of data
>>>> (including 1 replica) on 37 physical machines (32 data nodes, 3 master
>>>> nodes and 2 nodes dedicated to HTTP requests) with ES 1.3 (an upgrade to
>>>> 1.5 is already planned). We're indexing about 2,500 new documents per
>>>> second and everything's fine so far.
>>>>
>>>> Our goal is to index (and search) about 30 billion more documents (the
>>>> backdata) plus about 200 million new documents each month.
>>>>
>>>> Our company provides analytics dashboards to its clients, and they mostly
>>>> browse their data on a monthly scale, so we're routing documents by month.
>>>> Each shard ends up between 200 and 250GB. The index is made of 128 shards,
>>>> which gives about 10 years of data at 1 month per shard. Considering what
>>>> we already have, we should reach 240TB of data (and counting) with a
>>>> single replica after we index all our backdata.
>>>>
>>>> So, my questions here:
>>>>
>>>> - Does anyone here have a similar use case / amount of data?
>>>>
>>>> - Is ES the right technology for realtime, lightspeed queries (filtered
>>>> queries and high-cardinality aggregations) on this amount of data?
>>>>
>>>> - What are the traps to avoid? Is it better to add lots of medium
>>>> machines (12-core Xeon E5-1650 v2, 64GB RAM, 1.8TB SAS 15k hard drives) or
>>>> a few huge machines with petabytes of RAM, terabytes of SSD and multiple
>>>> ES processes?
>>>>
>>>> Any feedback on a similar situation is much appreciated.
>>>>
>>>> Have a nice day,
>>>> Fred
>>>>
>>>
>>
>
