Hi,
I'm testing and planning an implementation for 16 TB of log data (1 month retention, daily indexes of about 530 GB/day). Indexes are deleted after 1 month (TTL is 1 month). Document sizes vary from a few bytes to 1 MB (average ~3 KB). We have two data centers, and the requirement is to keep the dataset accessible when one of them is down. My current implementation looks like this:

cluster.routing.allocation.awareness.attributes: datacenter
cluster.routing.allocation.awareness.force.datacenter.values: datacenterA,datacenterB

So the indexes are located on nodes in datacenterA and datacenterB. Each index has 1 replica, so primaries and replicas are balanced between the two locations.

The problem A:

I have been offered SAN storage space that could be attached to any of the ES node machines. In the current index/replica scenario I need 2 * 16 TB = 32 TB of disk storage; in RAID 1 that means 64 TB of "real world" disk. Providing "independent, high quality" storage could (if ES allowed it) reduce that to the required 16 TB. I say "if ES allowed it" because, to my current knowledge, nodes cannot share a dataset: if many nodes run on common storage, each creates its own unique path. Is that correct?

Could I run an ES cluster where indexes have no replica, but a nodeX failure still does not affect the accessibility of nodeX's dataset to the cluster? In my current idea of the no-replica scenario, powering off (or a failure of) NodeXDatacenterA would make datasetX unavailable for reads in the cluster, at least until I start NodeXDatacenterB, which would have access to datasetX (the same path configuration). Of course NodeXDatacenterA and NodeXDatacenterB could not both run at the same time. I guess the workaround suggested above is not "in the ES philosophy" of storage handling and self-balancing: it would make upgrading a single node problematic, be less fault-tolerant, etc.
A fact that makes me consider this solution is that I have some "24-core, 64 GB RAM, limited disk storage" machines available, plus the 16 TB SAN storage I could mount on them. Do you have any suggestions for SAN storage usage? Is it a good idea at all?

The problem B: design

My current idea for building the environment is to order N (6-8? or more) machines with big HDDs and run a "normal ES cluster" with shards and replicas stored locally. The question is: how many of them would be enough? :) With 24 cores, 64 GB RAM and 4 TB of disk each, 4 machines would make a minimal cluster in a single datacenter, so 8 machines total for both datacenters. What do you think about the likely performance? Actually, to be storage-safe I would go for 6-8 TB of disk per machine; that would allow running on fewer than 4 nodes while operating in a single datacenter. I also wonder if 64 GB RAM would be enough. The whole process of acquiring new servers takes time - is there a "good practice" guide to determine the minimum number of servers in a cluster? How many shards would you suggest?

Question C:

I have seen performance advice to run "client" ES nodes on machines without data storage, so they do not suffer from I/O issues. If I had 2 of them, how would you scale that? Do you think it's worth having 2 client-only machines, or would 2 more "complete" nodes with data storage be better as extra nodes in the ES cluster (so 10 instead of 8 nodes)?
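To sanity-check the sizing in problem B, here is a back-of-the-envelope calculation using the figures from this thread (530 GB/day, 30-day retention, 1 replica). The 85% usable-disk factor and the ~40 GB target shard size are my own assumptions, not ES rules, added for illustration:

```python
# Back-of-the-envelope cluster sizing from the figures in this thread.
# Assumptions (mine, not ES rules): 85% of raw disk is usable after
# leaving headroom for merges/recovery, and ~40 GB per shard as a
# commonly cited comfortable target.

DAILY_GB = 530          # daily index size (primary data)
RETENTION_DAYS = 30     # TTL of 1 month
REPLICAS = 1            # one replica per index

total_primary_gb = DAILY_GB * RETENTION_DAYS               # ~15.9 TB
total_with_replica_gb = total_primary_gb * (1 + REPLICAS)  # ~31.8 TB

NODE_DISK_GB = 4000     # the proposed 4 TB machines
USABLE_FRACTION = 0.85  # headroom assumption

usable_per_node_gb = NODE_DISK_GB * USABLE_FRACTION
nodes_needed = -(-total_with_replica_gb // usable_per_node_gb)  # ceil

TARGET_SHARD_GB = 40
shards_per_daily_index = -(-DAILY_GB // TARGET_SHARD_GB)        # ceil

print(f"total primary data:  {total_primary_gb / 1000:.1f} TB")
print(f"with 1 replica:      {total_with_replica_gb / 1000:.1f} TB")
print(f"nodes at 4 TB each:  {int(nodes_needed)}")
print(f"primary shards/day:  {int(shards_per_daily_index)}")
```

Under these assumptions the 8-node, 4 TB-per-node plan comes out tight once the replica copy and disk headroom are counted, which supports the idea of going to 6-8 TB per machine (or more nodes).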
