On 24 Aug 2012, at 17:05, Mark Nelson <[email protected]> wrote:

> On 08/24/2012 09:17 AM, Stephen Perkins wrote:
>> Morning Wido (and all),
>>
>>>> I'd like to see a "best" hardware config as well... however, I'm
>>>> interested in a SAS switching fabric where the nodes do not have any
>>>> storage (except possibly onboard boot drive/USB as listed below).
>>>> Each node would have a SAS HBA that allows it to access a LARGE jbod
>>>> provided by a HA set of SAS Switches
>>>> (http://www.lsi.com/solutions/Pages/SwitchedSAS.aspx). The drives are LUN
>> masked for each host.
>>>>
>>>> The thought here is that you can add compute nodes, storage shelves,
>>>> and disks all independently.  With proper masking, you could provide
>> redundancy
>>>> to cover drive, node, and shelf failures.    You could also add disks
>>>> "horizontally" if you have spare slots in a shelf, and you could add
>>>> shelves "vertically" and increase the disk count available to existing
>> nodes.
>>>>
>>>
>>> What would the benefit be from building such a complex SAS environment?
>>> You'd be spending a lot of money on SAS switch, JBODs and cabling.
>>
>> Density.
>>
>
> Trying to balance between dense solutions with more failure points vs cheap 
> low density solutions is always tough.  Though not the densest solution out 
> there, we are starting to investigate performance on an SC847a chassis with 
> 36 hotswap drives in 4U (along with internal drives for the system).  Our 
> setup doesn't use SAS expanders, which is a nice bonus, though it does require a 
> lot of controllers.
>
>>> Your SPOF would still be your whole SAS setup.
>>
>> Well... I'm not sure I would consider it a single point of failure...  a
>> pair of cross-connected switches and 3-5 disk shelves.  Shelves can be
>> purchased with fully redundant internals (dual data paths etc to SAS
>> drives).  That is not even that important. If each shelf is just looked at
>> as JBOD, then you can group disks from different shelves into btrfs or
>> hardware RAID groups.  Or... you can look at each disk as its own storage
>> with its own OSD.
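
Side note: the one-OSD-per-disk variant is mostly a matter of giving
every disk its own mount point and its own [osd.N] section; a rough
ceph.conf sketch (hostnames and paths made up):

    [osd.0]
        host = shelf-node-a
        osd data = /srv/osd.0            ; /dev/sdb mounted here
        osd journal = /srv/osd.0/journal
    [osd.1]
        host = shelf-node-a
        osd data = /srv/osd.1            ; /dev/sdc mounted here
        osd journal = /srv/osd.1/journal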
>>
>> A SAS switch going offline would have no impact since everything is cross
>> connected.
>>
>> A whole shelf can go offline and it would only appear as a single drive
>> failure in a RAID group (if disk groups are distributed properly).
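
In Ceph terms that "distributed properly" is exactly what CRUSH can
enforce if the shelf is made a failure domain. A sketch, assuming a
custom "shelf" bucket type is added to the crushmap's type list and
the hosts/OSDs are grouped into shelf buckets (names made up):

    rule span_shelves {
            ruleset 1
            type replicated
            min_size 1
            max_size 10
            step take default
            step chooseleaf firstn 0 type shelf
            step emit
    }

With that rule no two replicas of a placement group land on the same
shelf, so losing a whole shelf costs at most one copy of any object.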
>>
>> You can then get compute nodes fairly densely packed by purchasing
>> SuperMicro 2uTwin enclosures:
>>   http://www.supermicro.com/products/nfo/2UTwin2.cfm
>>
>> You can get 3-4 of those compute enclosures with dual SAS connectors (each
>> enclosure not necessarily fully populated initially). The beauty is that the
>> SAS interconnect is fast.   Much faster than Ethernet.
>>
>> Please bear in mind that I am looking to create a highly available and
>> scalable storage system that will fit in as small an area as possible and
>> draw as little power as possible.  The reasoning is that we co-locate all
>> our equipment at remote data centers.  Each rack (along with its associated
>> power and any needed cross connects) represents a significant ongoing
>> operational expense.  Therefore, for me, density and incremental scalability
>> are important.
>
> There are some pretty interesting solutions on the horizon from various 
> vendors that achieve a pretty decent amount of density.  Should be 
> interesting times ahead. :)

LSI/NetApp have a nice 60x NL-SAS-drives-in-4U solution with a SAS
backplane, but it is always a balance between price on one side and
performance and elasticity on the other: low/mid-priced hardware
versus midrange/enterprise solutions.

I think Ceph was created to be the cheaper solution: to give us a
chance to use commodity storage servers, without an expensive SAN
infrastructure behind them, just fast 10Gb Ethernet. That gives more
scalability and the ability to scale out rather than scale up.
Software like Ceph does the job that those hardware solutions do.

>
>>
>>> And what is the benefit of having Ceph run on top of that? If you have all
>> the disks available to all the nodes, why not run ZFS?
>>> ZFS would give you better performance since what you are building would
>> actually be a local filesystem.
>>
>> There is no high availability here.  Yes... You can try to do old school
>> magic with SAN file systems, complicated clustering, and synchronous
>> replication, but a RAIN approach appeals to me.  That is what I see in Ceph.
>> Don't get me wrong... I love ZFS... but am trying to figure out a scalable
>> HA solution that looks like RAIN. (Am I missing a feature of ZFS)?
>>
>>> For risk spreading you should not interconnect all the nodes.
>>
>> I do understand this.  However, our operational setup will not allow
>> multiple racks at the beginning.  So... given the constraints of 1 rack
>> (with dual power and dual WAN links), I do not see that a pair of cross
>> connected SAS switches is any less reliable than a pair of cross connected
>> ethernet switches...
>>
>> As storage scales and we outgrow the single rack at a location, we can
>> overflow into a second rack etc.
>>
>>> The more complexity you add to the whole setup, the more likely it is to go
>> down completely at some point in time.
>>>
>>> I'm just trying to understand why you would want to run a distributed
>> filesystem on top of a bunch of direct attached disks.
>>
>> I guess I don't consider a SAN a bunch of direct attached disks.  The SAS
>> infrastructure is a SAN with SAS interconnects (versus Fibre Channel, iSCSI or
>> InfiniBand)...  The disks are accessed via JBOD if desired... or you can put
>> RAID on top of a group of them.  The multiple shelves of drives are a way to
>> attempt to reduce the dependence on a single piece of hardware (i.e. it
>> becomes RAIN).
>>
>>> Again, if all the disks are attached locally you'd be better off using
>> ZFS.
>>
>> This is not highly available, and AFAICT, the compute load would not scale
>> with the storage.
>>
>>>> My goal is to be able to scale without having to draw the enormous
>>>> power of lots of 1U devices or buy lots of disks and shelves each time
>>>> I want to add a little capacity.
>>>>
>>>
>>> You can do that, scale by adding a 1U node with 2, 3 or 4 disks at a
>> time; depending on your crushmap you might need to add 3 machines at once.
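
For context, the "3 machines" follows from the defaults: three
replicas with the host as the failure domain means CRUSH needs three
separate hosts to place them on. Roughly, in ceph.conf (option names
as in current releases; check your version):

    [global]
        osd pool default size = 3        ; three replicas per object
        osd crush chooseleaf type = 1    ; 1 = host: each replica on a different host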
>>
>> Adding three machines at once is what I was trying to avoid (I believe that
>> I need 3 replicas to make things reasonably redundant).  From first glance,
>> it does not seem like a very dense solution to try to add a bunch of 1U
>> servers with a few disks.  The associated cost of a bunch of 1U Servers over
>> JBOD, plus (and more importantly) the rack space and power draw, can cause
>> OPEX problems.  I can purchase multiple enclosures, but not fully populate
>> them with disks/cpus.  This gives me a redundant array of nodes (RAIN).
>> Then, as needed, I can add drives or compute cards to the existing
>> enclosures for little incremental cost.
>>
>> In your three-1U-server case above, I can add 12 disks to the 4 existing enclosures
>> (in groups of three) instead of three 1U servers with 4 disks each.  I can
>> then either run more OSDs on existing compute nodes or I can add one more
>> compute node and it can handle the new drives with one or more OSDs.  If I
>> run out of space in enclosures, I can add one more shelf (just one) and
>> start adding drives.  I can then "include" the new drives into existing OSDs
>> such that each existing OSD has a little more storage it needs to worry
>> about.  (The specifics of growing an existing OSD by adding a disk are still
>> a little fuzzy to me).
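
For what it's worth, you normally don't grow an existing OSD when a
disk arrives; you either grow the btrfs/RAID volume underneath it or,
more commonly, add one new OSD per new disk and let CRUSH rebalance.
The manual add-an-OSD sequence is roughly the following (hostnames and
devices made up, exact commands vary by release):

    ceph osd create                        # allocates a new id, say 12
    mkfs.xfs /dev/sdx
    mount /dev/sdx /var/lib/ceph/osd/ceph-12
    ceph-osd -i 12 --mkfs --mkkey
    ceph auth add osd.12 osd 'allow *' mon 'allow rwx' \
        -i /var/lib/ceph/osd/ceph-12/keyring
    ceph osd crush add osd.12 1.0 host=node4   # weight roughly = size in TB
    service ceph start osd.12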
>>
>>>> Anybody looked at atom processors?
>>>>
>>>
>>> Yes, I have..
>>>
>>> I'm running Atom D525 (SuperMicro X7SPA-HF) nodes with 4GB of RAM and 4 2TB
>> disks and an 80GB SSD (old X25-M) for journaling.
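
For anyone copying that layout: the journals usually just point at
per-OSD partitions on the shared SSD; a rough ceph.conf fragment
(device names made up):

    [osd.0]
        host = atom-node-1
        osd journal = /dev/sda5        ; partition on the X25-M
        osd journal size = 0           ; 0 = use the whole partition
    [osd.1]
        host = atom-node-1
        osd journal = /dev/sda6
        osd journal size = 0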
>>>
>>> That works, but what I notice is that under heavy recovery the Atoms can't
>> cope with it.
>>>
>>> I'm thinking about building a couple of nodes with the AMD Brazos
>> mainboard, something like an Asus E35M1-I.
>>>
>>> That is not a server board, but it would just be a reference to see what it
>> does.
>>>
>>> One of the problems with the Atoms is the 4GB memory limitation; with the
>> AMD Brazos you can use 8GB.
>>>
>>> I'm trying to figure out a way to have a really large number of small nodes
>> for a low price to have
>> a massive cluster where the impact of losing one node is very small.
>>
>> Given that "massive" is a relative term, I am as well... but I'm also trying
>> to reduce the footprint (power and space) of that "massive" cluster.  I also
>> want to start small (1/2 rack) and scale as needed.
>
> If you do end up testing Brazos processors, please post your results!  I think 
> it really depends on what kind of performance you are aiming for.  Our stock 
> 2U test boxes have 6-core opterons, and our SC847a has dual 6-core low power 
> Xeon E5s.  At 10GbE+ these are probably going to be pushed pretty hard, 
> especially during recovery.

Today I got 500MB/s in a cluster with 10Gb Ethernet during recovery.
With 12 Xeon E5600 cores in each machine, that drove the system load
up to 50!
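
If that load becomes a problem, recovery can be throttled per OSD; a
sketch of the knobs I would look at (option availability and defaults
vary by release, so check the documentation for your version):

    [osd]
        osd recovery max active = 2    ; fewer parallel recovery ops per OSD
        osd max backfills = 1          ; newer releases: limit concurrent backfills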

>
>>
>> - Steve
>>
>>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to [email protected]
More majordomo info at  http://vger.kernel.org/majordomo-info.html
