If you are using time series data then you should be using time series
indices. As Fred pointed out, routing an entire month's worth of data to a
single shard is not going to scale.
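
Something like the sketch below is what I mean: derive one index per month
from each document's timestamp and let every monthly index have its own
shards, rather than routing a whole month to one shard. This is untested and
the index name, type, and URL are just placeholders:

    from datetime import datetime
    import requests

    ES_URL = "http://localhost:9200"   # placeholder, point at your own cluster

    def index_doc(doc, doc_id):
        # e.g. "events-2015.04" -- one index per month, each with its own shards
        month = datetime.utcfromtimestamp(doc["timestamp"]).strftime("%Y.%m")
        resp = requests.put("%s/events-%s/doc/%s" % (ES_URL, month, doc_id),
                            json=doc)
        resp.raise_for_status()

    index_doc({"timestamp": 1429660800, "user": "fred", "value": 42}, "1")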

Also, we recommend that you keep shard size below 50GB; this helps with
recovery and shard distribution. There is also a hard limit of 2 billion docs
per shard in the underlying Lucene engine; if you hit it you may lose data.
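
For scale, here is a rough back-of-the-envelope using only the numbers already
in this thread, so treat it as a sketch rather than a sizing recommendation:

    total_data_tb = 240      # projected total incl. 1 replica, from Fred's mail
    target_shard_gb = 50     # recommended upper bound per shard
    shards_needed = total_data_tb * 1024 / float(target_shard_gb)
    print(shards_needed)     # ~4915 shards across the whole cluster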

On 23 April 2015 at 03:12, Kimbro Staken <ksta...@kstaken.com> wrote:

> Hello Fred,
>
> I have clusters as large as 200 billion documents / 130TB. Sharing
> experiences on that would require a book, but a couple of quick things
> jumped out at me.
>
> 1. Do not go the huge-server route. Elasticsearch works best when you
> scale it horizontally. The 64GB route is a much better option.
>
> 2. If I understand correctly, you're routing an entire month's data to a
> single shard? By doing that you're directing all activity on that shard to
> a single machine, or a small set of machines if you have replicas. That has
> to be much slower than using something like a monthly index with a
> reasonable number of shards to spread that load across the cluster. It
> also creates fairly large shards, and if you have month-to-month variation
> in data rates you'll end up with "lumpy" shard sizes, which will definitely
> cause issues if you ever run your cluster low on disk space.
>
> 3. Get off of ES 1.3 as fast as you can. 8TB spread across 37 machines is
> very low density; as you push more data in, you don't want to be on ES 1.3.
>
> 4. If you're not already using doc_values, start looking into it now.
> Managing heap memory is, let's be nice and call it, "a challenge", and
> fielddata can eat heap in ways that will make your head spin.
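>
> A minimal mapping sketch for enabling doc_values on a not_analyzed string
> field (the field, type, and index names here are made up for illustration):
>
>     import requests
>
>     # On ES 1.x doc_values works for not_analyzed string and numeric fields;
>     # it keeps fielddata on disk instead of in heap.
>     mapping = {
>         "doc": {
>             "properties": {
>                 "client_id": {"type": "string", "index": "not_analyzed",
>                               "doc_values": True}
>             }
>         }
>     }
>     requests.put("http://localhost:9200/events-2015.04/_mapping/doc",
>                  json=mapping)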
>
>
>
> Kimbro Staken
>
>
> On Wed, Apr 22, 2015 at 1:14 AM, <fdevilla...@synthesio.com> wrote:
>
>> Hi list,
>>
>> I've been using ES in production since 0.17.6, with clusters of up to 64
>> virtual machines and 20TB of data (including 3 replicas). We're now thinking
>> about pushing things a bit further, and I wondered if people here had
>> similar experiences / needs to ours.
>>
>> Our current index is 1.1 billion unique documents and 8TB of data (including 1
>> replica) on 37 physical machines (32 data nodes, 3 master nodes and 2 nodes
>> dedicated to HTTP requests) with ES 1.3 (an upgrade to 1.5 is already planned).
>> We're indexing about 2500 new documents / second and everything's fine so
>> far.
>>
>> Our goal is to index (and search) about 30 billion more documents (the
>> backdata) + about 200 million new documents each month.
>>
>> Our company provides analytics dashboards to its clients, and they
>> mostly browse their data on a monthly scale, so we're routing documents
>> by month. Each shard is between 200 and 250GB. The index is made of 128
>> shards, which gives about 10 years of data at 1 month per shard.
>> Considering what we already have, we should reach 240TB of data (and
>> counting) with a single replica after we index all our backdata.
>>
>> So, my questions here:
>>
>> - Does anyone here have the same use case / amount of data as we do?
>>
>> - Is ES the right technology to do realtime, lightspeed queries (filtered
>> queries and high-cardinality aggregations) on such an amount of data?
>>
>> - What were the traps to avoid? Is it better to add lots of medium
>> machines (12-core Xeon E5-1650 v2, 64GB RAM, 1.8TB SAS 15k hard drives) or a
>> few huge machines with petabytes of RAM, terabytes of SSD and multiple ES
>> processes?
>>
>> Any feedback on a similar situation is much appreciated.
>>
>> Have a nice day,
>> Fred
>>
>
