Hi,

I'm testing and planning an implementation for 16 TB of log data (one 
month's worth, in daily indexes of about 530 GB/day). Indexes are deleted 
after one month (the TTL is one month).

Document sizes vary from a few bytes to 1 MB (average ~3 KB).

We have two data centers, and the requirement is to keep the dataset 
accessible when one of them is down.

My current implementation looks like this:

  cluster.routing.allocation.awareness.attributes: datacenter

  cluster.routing.allocation.awareness.force.datacenter.values: datacenterA,datacenterB

So the indexes are located on nodes in both datacenterA and datacenterB. 
There is one replica for each index, so primaries and replicas are balanced 
between the two locations.
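For completeness, the awareness settings only take effect if each node also 
declares its own datacenter attribute. A minimal per-node sketch (the node 
name is hypothetical):

```yaml
# elasticsearch.yml on a node in datacenterA (node name is illustrative)
node.name: node1-datacenterA
node.datacenter: datacenterA  # custom attribute the awareness settings refer to

# the same awareness settings on every node in the cluster
cluster.routing.allocation.awareness.attributes: datacenter
cluster.routing.allocation.awareness.force.datacenter.values: datacenterA,datacenterB
```

With forced awareness, ES will not allocate both copies of a shard to the 
same datacenter, so losing one site still leaves a full copy on the other.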

The problem A: SAN storage

I have been offered SAN storage space that could be attached to any of the 
ES node machines. In the index/replica scenario, I need 2 × 16 TB = 32 TB 
of disk storage. If that is on RAID 1, it comes to 64 TB of "real world" 
disk storage.

Providing "independent, high quality" storage might (if ES allowed it) 
reduce that to the required 16 TB. I say "if ES allowed it", because as far 
as I know, nodes cannot "share" a dataset. If many nodes run on common 
storage, each creates its own unique data path. Is that correct?

Could I run an ES cluster where indexes have no replicas, but a nodeX 
failure still would not affect the accessibility of nodeX's dataset to the 
cluster?

In my current idea of the no-replica scenario, powering off (or failure of) 
"NodeXDatacenterA" would make datasetX unavailable for reading in the 
cluster, at least until I started "NodeXDatacenterB", which would have 
access to datasetX (the same path configuration). Of course, 
NodeXDatacenterA and NodeXDatacenterB could never both run at the same time.
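To make the cold-standby idea concrete, this is roughly what the standby 
node's configuration would look like (the mount point and node names are 
hypothetical, and this is only a sketch of my idea, not something I know ES 
supports safely):

```yaml
# elasticsearch.yml on the standby twin in datacenterB
# WARNING: only ONE of the two twins may run at any time --
# ES does not arbitrate access to the shared path, and running
# both nodes against the same data risks index corruption.
node.name: nodeX-datacenterB
path.data: /mnt/san/nodeX       # same SAN volume that nodeX-datacenterA uses
index.number_of_replicas: 0     # no replicas; the SAN itself is the redundancy
```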

I suspect that the workaround suggested above goes against the ES 
philosophy of shared-nothing storage and self-balancing. It would make 
upgrading a single node problematic, be less fault-tolerant, etc.

What makes me consider this solution is that I have some "24-core, 64 GB 
RAM, limited disk storage" machines available, plus 16 TB of SAN storage 
that I could mount on those machines.

Do you have any suggestions on SAN storage usage? Is it a good idea at all?

The problem B: Design

My current idea for building the environment is to order N (6-8? or more) 
machines with big HDDs and run a "normal ES cluster" with shards and 
replicas stored locally.

The question is: how many of them would be enough? :)

At 24 cores, 64 GB RAM, and 4 TB each, that makes 4 machines for a minimal 
cluster in a single datacenter, and 8 machines total for both datacenters. 
What do you think about the likely performance?

Actually, to be safe on storage, I would go for 6-8 TB of disk per machine. 
That would allow operating on fewer than 4 nodes while running in a single 
datacenter.

I wonder if 64 GB of RAM would be enough.

The whole process of acquiring new servers takes time - is there a "good 
practice" guide for determining the minimum number of servers in the 
cluster?

How many shards would you suggest?
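To make the shard question concrete, here is the arithmetic I am working 
from, with illustrative numbers only: at 530 GB/day and 1 replica, each 
daily index holds 1060 GB in total. With, say, 12 primary shards, that is 
roughly 530 / 12 ≈ 44 GB per primary shard. Expressed as per-index defaults 
(the values are placeholders, not a recommendation):

```yaml
# elasticsearch.yml -- default settings for newly created daily indexes
index.number_of_shards: 12    # ~44 GB per primary at 530 GB/day (illustrative)
index.number_of_replicas: 1   # one full copy per datacenter with forced awareness
```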

Question C:

I have seen performance advice recommending that "client" ES nodes be 
machines without data storage, so they do not suffer from I/O issues. If I 
had two of them, how would you scale that?

Do you think it's worth having 2 client-only machines, or would 2 more 
"complete" nodes with data storage be better as extra nodes in the ES 
cluster (so 10 nodes instead of 8)?
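For reference, a dedicated client node in this setup would be configured 
roughly like this (a sketch; such a node holds no data and is not 
master-eligible, acting only as a coordinating front end for requests):

```yaml
# elasticsearch.yml for a client-only (coordinating) node
node.data: false     # holds no shards, so no local I/O load
node.master: false   # never elected master
```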



-- 
You received this message because you are subscribed to the Google Groups 
"elasticsearch" group.