Re: [gpfsug-discuss] gpfs native raid

2016-09-29 Thread Yuri L Volobuev

The issue of "GNR as software" is a pretty convoluted mixture of technical,
business, and resource constraints issues.  While some of the technical
issues can be discussed here, obviously the other considerations cannot be
discussed in a public forum.  So you won't be able to get a complete
understanding of the situation by discussing it here.

> I understand the support concerns, but I naively thought that assuming
> the hardware meets a basic set of requirements (e.g. redundant sas
> paths, x type of drives) it would be fairly supportable with GNR. The
> DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla
> I thought.

Setting business issues aside, this is more complicated on the technical
level than one may think.  At present, GNR requires a set of twin-tailed
external disk enclosures.  This is not a particularly exotic kind of
hardware, but it turns out that this corner of the storage world is quite
insular.  GNR has a very close relationship with physical disk devices,
much more so than regular GPFS.  In an ideal world, SCSI and SES standards
are supposed to provide a framework which would allow software like GNR to
operate on an arbitrary disk enclosure.  In the real world, the actual SES
implementations on various enclosures that we've been dealing with are,
well, peculiar.  Apparently SES is one of those standards where vendors
feel a lot of freedom in "re-interpreting" the standard, and since
typically enclosures talk to a small set of RAID controllers, there aren't
bad enough consequences to force vendors to be religious about SES standard
compliance.  Furthermore, the SAS fabric topology in configurations with
external disk enclosures is surprisingly complex, and that complexity
predictably leads to complex failures which don't exist in simpler
configurations.  Thus far, every single one of the five enclosures we've
had a chance to run GNR on required some adjustments, workarounds, hacks,
etc.  And the consequences of a misbehaving SAS fabric can be quite dire.
There are various approaches to dealing with those complications, from
running a massive 3rd party hardware qualification program to basically
declaring any complications from an unknown enclosure to be someone else's
problem (how would ZFS deal with a SCSI reset storm due to a bad SAS
expander?), but there's much debate on what is the right path to take.
Customer input/feedback is obviously very valuable in tilting such
discussions in the right direction.
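
For anyone curious what that enclosure-level plumbing looks like from the
host side, here is a rough sketch using sg3_utils (device names made up);
these are the SES pages whose vendor-specific quirks drive the per-enclosure
adjustments mentioned above:

# Hypothetical probing of a SAS enclosure's SES pages:
lsscsi -g                      # find the enclosure's SCSI generic node, e.g. /dev/sg24
sg_ses --page=1 /dev/sg24      # configuration page: element types, slot counts
sg_ses --page=2 /dev/sg24      # enclosure status page: slot/fan/PSU state, fault LEDs
# It is exactly these pages that vendors "re-interpret", which is where the
# per-enclosure adjustments and workarounds come from.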

yuri



From:   Aaron Knister 
To: gpfsug-discuss@spectrumscale.org,
Date:   09/28/2016 06:44 PM
Subject:Re: [gpfsug-discuss] gpfs native raid
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Thanks Everyone for your replies! (Quick disclaimer, these opinions are
my own, and not those of my employer or NASA).

Not knowing what's coming at the NDA session, it seems to boil down to
"it ain't gonna happen" because of:

- Perceived difficulty in supporting whatever creative hardware
solutions customers may throw at it.

I understand the support concerns, but I naively thought that assuming
the hardware meets a basic set of requirements (e.g. redundant sas
paths, x type of drives) it would be fairly supportable with GNR. The
DS3700 shelves are re-branded NetApp E-series shelves and pretty vanilla
I thought.

- IBM would like to monetize the product and compete with the likes of
DDN/Seagate

This is admittedly a little disappointing. GPFS, as long as I've known it,
has been largely hardware-vendor agnostic. To see even a slight shift
towards vendor lock-in, with certain features only supported and available
on IBM hardware, is concerning. It's not like the software itself is free.
Perhaps GNR could be a paid add-on license for non-IBM hardware? Just
thinking out loud.

The big things I was looking to GNR for are:

- end-to-end checksums
- implementing a software RAID layer on (in my case enterprise class) JBODs

I can find a way to do the second thing, but the former I cannot.
Requiring IBM hardware to get end-to-end checksums is a huge red flag
for me.  That's something Lustre will do today with ZFS on any hardware
ZFS will run on (and for free, I might add). I would think GNR being
openly available to customers would be important for GPFS to compete
with Lustre. Furthermore, I had opened an RFE (#84523) a while back to
implement checksumming of data for non-GNR environments. The RFE was
declined because essentially it would be too hard and it already exists
for GNR. Well, considering I don't have a GNR environment, and hardware
vendor lock-in is something many sites are not interested in, that's
somewhat of a problem.
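
For reference, the ZFS route looks roughly like this on a plain JBOD
(device and pool names are made up; this is a sketch, not a tuned Lustre
OST recipe):

# Software RAID (raidz2, roughly 8+2P) plus end-to-end checksums on commodity disks:
zpool create -o ashift=12 ost0 raidz2 sdb sdc sdd sde sdf sdg sdh sdi sdj sdk
zfs set checksum=sha256 ost0   # fletcher4 is the default; sha256 is stronger
zpool scrub ost0               # periodic scrubs re-verify every block's checksum
zpool status -v ost0           # reports any checksum errors found and repaired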

I really hope IBM reconsiders their stance on opening up GNR. The
current direction, while somewhat understandable, leaves a really bad
taste in my mouth and is one of the (very few, in my opinion) features
Lustre has over GPFS.

-Aaron


On 9/1/16 9:59 AM, Marc A Kaplan wrote:
> I've been 

Re: [gpfsug-discuss] Fwd: Blocksize

2016-09-29 Thread Olaf Weiser
Hi - let me try to explain.

> [...] I.e. 256 / 32 = 8K, so am I reading / writing *2* inodes (assuming
> 4K inode size) minimum? [...]

Answer: your inodes are written in a separate (hidden) file. With an MD
blocksize of 256K you can access 64 inodes with one IO to the file system,
so e.g. a policy run needs to initiate 1 IO to read 64 inodes. If your MD
blocksize were 1MB, you could access 256 inodes with one IO to the file
system metadata (policy runs).

If you write a new regular file, an inode gets created for it and gets
written into your inode file. Forget about the MD blocksize here - it gets
written directly, so you will see an IO of 8 segments (512-byte segment
size) to the MD (in case your inode size is 4K).

In addition, other metadata is stored in the system pool, like directory
blocks or indirect blocks. These blocks are 32K, so if you were to choose an
MD blocksize > 1 MB you would waste some space, because of the rule that
1/32 of the blocksize is the smallest allocatable space. In one line, my
advice: select a 1MB blocksize for MD. [..]

Disk layout: keep in mind that your #IOPS is most likely limited by your
storage backend; with spinning drives you can estimate around 100 IOPS per
drive. Even though the metadata is stored in a hidden file, inodes are
accessed directly from/to disk during normal operation. Your backend should
be able to cache these IOs accordingly, but you won't be able to avoid that
inodes have to be flushed to disk and - the other way round - read from disk
without accessing a full stripe of your RAID. So depending on the backend,
an N-way replication is more efficient here than a RAID6 or 8+2P. In
addition, keep in mind: if you divide 1 MB (the file system blocksize)
across a RAID6 or 8+2P stripe, the data transfer size to each physical disk
is rather small and will hurt your performance.

Last but not least, in terms of IO you can save a lot of IOPS to the
physical disk layer if you go with n-way replication rather than RAID6,
because every physical disk these days can satisfy a 1MB IO request. So if
you initiate 1 IO of 1MB from GPFS, it can be answered with exactly 1 IO
from a physical disk (compared to RAID6, where your storage backend would
have to satisfy that single 1MB IO with at least 4 or 8 IOs). MD is rather
small, so the trade-off (waste of space) can be ignored - so go with RAID 1
or n-way replication for MD.

Hope this helps.
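
To put numbers on the advice above (inode size assumed to be 4K, as in
Kevin's question; the file system and stanza names below are made up):

# Worked numbers:
echo $(( 256 * 1024 / 32 ))      # 8192   -> 8 KiB subblock at a 256 KiB MD block size
echo $(( 256 * 1024 / 4096 ))    # 64     -> inodes covered by one full-block metadata IO
echo $(( 1024 * 1024 / 4096 ))   # 256    -> inodes per full-block IO at a 1 MiB MD block size
echo $(( 1024 * 1024 / 8 ))      # 131072 -> bytes per data drive when 1 MiB is split across 8+2P

# Hypothetical creation command using a separate metadata block size
# (assumes the system pool holds metadataOnly NSDs):
mmcrfs fs1 -F nsd.stanza -B 4M --metadata-block-size 1M -i 4096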
Mit freundlichen Grüßen / Kind regards

Olaf Weiser
EMEA Storage Competence Center Mainz, IBM Systems, Storage Platform
Phone: +49-170-579-44-66    E-Mail: olaf.wei...@de.ibm.com



From:   "Buterbaugh, Kevin L"
To:     gpfsug main discussion list
Date:   09/29/2016 05:03 PM
Subject:        [gpfsug-discuss] Fwd:  Blocksize
Sent by:        gpfsug-discuss-boun...@spectrumscale.org

Resending from the right e-mail address...

From: "Kevin L. Buterbaugh"
Subject: Re: [gpfsug-discuss] Blocksize
Date: September 29, 2016 at 10:00:29 AM CDT
To: gpfsug main discussion list

Hi Marc and others,

I understand … I guess I did a poor job of wording my question, so I’ll try
again.  The IBM recommendation for metadata block size seems to be somewhere
between 256K - 1 MB, depending on who responds to the question.  If I were
to hypothetically use a 256K metadata block size, does the “1/32nd of a
block” come into play like it does for “not metadata”?  I.e. 256 / 32 = 8K,
so am I reading / writing *2* inodes (assuming 4K inode size) minimum?

Re: [gpfsug-discuss] Fwd: Blocksize

2016-09-29 Thread Marc A Kaplan
Frankly, I just don't "get" what it is you seem not to be "getting"  - 
perhaps someone else who does "get" it can rephrase:  FORGET about 
Subblocks when thinking about inodes being packed into the file of all 
inodes. 

Additional facts that may address some of the other concerns:

I started working on GPFS at version 3.1 or so.  AFAIK GPFS always had and 
has one file of inodes, "packed", with no wasted space between inodes. 
Period. Full Stop.

RAID!  Now we come to a mistake that I've seen made by more than a handful 
of customers!

It is generally a mistake to use RAID with parity (such as classic RAID5) 
to store metadata.

Why?  Because metadata is often updated with "small writes"  - for example 
suppose we have to update some fields in an inode, or an indirect block, 
or append a log record...
For RAID with parity and large stripe sizes -- this means that updating 
just one disk sector can cost a full stripe read + writing the changed 
data and parity sectors.

SO, if you want protection against storage failures for your metadata, use 
either RAID mirroring/replication and/or GPFS metadata replication.  (belt 
and/or suspenders)
(Arguments against relying solely on RAID mirroring:  single enclosure/box 
failure (fire!), single hardware design (bugs or defects), single 
firmware/microcode(bugs.))
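
To make that concrete, here is a minimal sketch of the belt-and-suspenders
layout; the NSD, device, and server names are made up:

# Metadata on two mirrored (RAID-1) LUNs in different failure groups, so GPFS
# replication can place the second copy on separate hardware:
%nsd: nsd=md_nsd1 device=/dev/mapper/md_mirror1 servers=nsd01,nsd02 usage=metadataOnly failureGroup=1 pool=system
%nsd: nsd=md_nsd2 device=/dev/mapper/md_mirror2 servers=nsd02,nsd01 usage=metadataOnly failureGroup=2 pool=system

mmcrnsd -F metadata.stanza
mmcrfs gpfs1 -F metadata.stanza -m 2 -M 2    # two metadata replicas, one per failure group
                                             # (data NSDs omitted from this sketch)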

Yes, GPFS is part of "the cyber."  We're making it stronger every day. But 
it already is great. 

--marc



From:   "Buterbaugh, Kevin L" 
To: gpfsug main discussion list 
Date:   09/29/2016 11:03 AM
Subject:[gpfsug-discuss] Fwd:  Blocksize
Sent by:gpfsug-discuss-boun...@spectrumscale.org



Resending from the right e-mail address...

Begin forwarded message:

From: gpfsug-discuss-ow...@spectrumscale.org
Subject: Re: [gpfsug-discuss] Blocksize
Date: September 29, 2016 at 10:00:36 AM CDT
To: k...@accre.vanderbilt.edu

You are not allowed to post to this mailing list, and your message has
been automatically rejected.  If you think that your messages are
being rejected in error, contact the mailing list owner at
gpfsug-discuss-ow...@spectrumscale.org.


From: "Kevin L. Buterbaugh" 
Subject: Re: [gpfsug-discuss] Blocksize
Date: September 29, 2016 at 10:00:29 AM CDT
To: gpfsug main discussion list 


Hi Marc and others, 

I understand … I guess I did a poor job of wording my question, so I’ll 
try again.  The IBM recommendation for metadata block size seems to be 
somewhere between 256K - 1 MB, depending on who responds to the question. 
If I were to hypothetically use a 256K metadata block size, does the 
“1/32nd of a block” come into play like it does for “not metadata”?  I.e. 
256 / 32 = 8K, so am I reading / writing *2* inodes (assuming 4K inode 
size) minimum?

And here’s a really off the wall question … yesterday we were discussing 
the fact that there is now a single inode file.  Historically, we have 
always used RAID 1 mirrors (first with spinning disk, as of last fall now 
on SSD) for metadata and then use GPFS replication on top of that.  But 
given that there is a single inode file is that “old way” of doing things 
still the right way?  In other words, could we potentially be better off 
by using a couple of 8+2P RAID 6 LUNs?

One potential downside of that would be that we would then only have two 
NSD servers serving up metadata, so we discussed the idea of taking each 
RAID 6 LUN and splitting it up into multiple logical volumes (all that 
done on the storage array, of course) and then presenting those to GPFS as 
NSDs???

Or have I gone from merely asking stupid questions to Trump-level 
craziness  ;-)

Kevin

On Sep 28, 2016, at 10:23 AM, Marc A Kaplan  wrote:

OKAY, I'll say it again.  inodes are PACKED into a single inode file.  So 
a 4KB inode takes 4KB, REGARDLESS of metadata blocksize.  There is no 
wasted space.

(Of course if you have metadata replication = 2, then yes, double that. 
And yes, there is overhead for indirect blocks (indices), allocation maps, 
etc., etc.)

And your choice is not just 512 or 4096.  Maybe 1KB or 2KB is a good 
choice for your data distribution, to optimize packing of data and/or 
directories into inodes...

Hmmm... I don't know why the doc leaves out 2048, perhaps a typo...

mmcrfs x2K -i 2048

[root@n2 charts]# mmlsfs x2K -i
flag                value                    description
------------------- ------------------------ -----------------------------------
 -i                 2048                     Inode size in bytes

Works for me!
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss



___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss




[gpfsug-discuss] Fwd: Blocksize

2016-09-29 Thread Buterbaugh, Kevin L
Resending from the right e-mail address...

Begin forwarded message:

From: 
gpfsug-discuss-ow...@spectrumscale.org
Subject: Re: [gpfsug-discuss] Blocksize
Date: September 29, 2016 at 10:00:36 AM CDT
To: k...@accre.vanderbilt.edu

You are not allowed to post to this mailing list, and your message has
been automatically rejected.  If you think that your messages are
being rejected in error, contact the mailing list owner at
gpfsug-discuss-ow...@spectrumscale.org.


From: "Kevin L. Buterbaugh" 
>
Subject: Re: [gpfsug-discuss] Blocksize
Date: September 29, 2016 at 10:00:29 AM CDT
To: gpfsug main discussion list


Hi Marc and others,

I understand … I guess I did a poor job of wording my question, so I’ll try 
again.  The IBM recommendation for metadata block size seems to be somewhere 
between 256K - 1 MB, depending on who responds to the question.  If I were to 
hypothetically use a 256K metadata block size, does the “1/32nd of a block” 
come into play like it does for “not metadata”?  I.e. 256 / 32 = 8K, so am I 
reading / writing *2* inodes (assuming 4K inode size) minimum?

And here’s a really off the wall question … yesterday we were discussing the 
fact that there is now a single inode file.  Historically, we have always used 
RAID 1 mirrors (first with spinning disk, as of last fall now on SSD) for 
metadata and then use GPFS replication on top of that.  But given that there is 
a single inode file is that “old way” of doing things still the right way?  In 
other words, could we potentially be better off by using a couple of 8+2P RAID 
6 LUNs?

One potential downside of that would be that we would then only have two NSD 
servers serving up metadata, so we discussed the idea of taking each RAID 6 LUN 
and splitting it up into multiple logical volumes (all that done on the storage 
array, of course) and then presenting those to GPFS as NSDs???

Or have I gone from merely asking stupid questions to Trump-level craziness 
 ;-)

Kevin

On Sep 28, 2016, at 10:23 AM, Marc A Kaplan wrote:

OKAY, I'll say it again.  inodes are PACKED into a single inode file.  So a 4KB 
inode takes 4KB, REGARDLESS of metadata blocksize.  There is no wasted space.

(Of course if you have metadata replication = 2, then yes, double that.  And 
yes, there is overhead for indirect blocks (indices), allocation maps, etc., etc.)

And your choice is not just 512 or 4096.  Maybe 1KB or 2KB is a good choice for 
your data distribution, to optimize packing of data and/or directories into 
inodes...

Hmmm... I don't know why the doc leaves out 2048, perhaps a typo...

mmcrfs x2K -i 2048

[root@n2 charts]# mmlsfs x2K -i
flag                value                    description
------------------- ------------------------ -----------------------------------
 -i                 2048                     Inode size in bytes

Works for me!
___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss




___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


[gpfsug-discuss] AFM cacheset mounting from the same GPFS cluster ?

2016-09-29 Thread Daniel Kidger
 
I was talking to a former colleague yesterday.
He was asking whether people ever use AFM to mount a cached fileset not from a different cluster but from your own GPFS cluster.
This would be a Flash/SSD based fileset caching from NL-SAS based Storage Pools.
 
The motivation was to enable at least some of the 4 features of Cray's DataWarp accelerator:
 
1. Burst buffer for HPC writes.
This is motivated by the shift from, say, 5 years ago, when you bought HPC /scratch for volume and got the bandwidth for free; now, if you buy for bandwidth, you end up buying say 3x more space than you really need, which gets expensive.
 
2. Staging for read-only files. A batch job prefetches files into the cached fileset so parallel execution starts quicker.
 
3. A local '/tmp' for MPI processes which might be faster or bigger than what each compute node has.
 
4. A shared /tmp where some MPI processes write files that others then read.
Compare this with a simple file placement rule that puts these files on an SSD Storage Pool (see the sketch below).
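 
To make the comparison concrete, a very rough sketch of both options. All
file system, fileset, and pool names are made up, and whether AFM will even
accept a same-cluster GPFS target is exactly the open question here:

# Hypothetical: 'fs_ssd' is a flash-based file system acting as the cache,
# 'fs_slow' is the NL-SAS based file system in the *same* cluster acting as home.
mmcrfileset fs_ssd burstcache --inode-space new \
    -p afmTarget=gpfs:///gpfs/fs_slow/projects -p afmMode=single-writer
mmlinkfileset fs_ssd burstcache -J /gpfs/fs_ssd/burst

# The simpler alternative from point 4: a placement policy that pins a shared
# /tmp fileset to an SSD storage pool, no AFM involved.
# Policy file (e.g. placement.pol), installed with: mmchpolicy fs_slow placement.pol
RULE 'tmp-on-ssd' SET POOL 'ssd' FOR FILESET ('shared-tmp')
RULE 'default'    SET POOL 'nlsas'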
 
 
So does anyone actually do this?
And if so, what was your experience?
Daniel

 

Dr Daniel Kidger
IBM Technical Sales Specialist, Software Defined Solution Sales
+44-(0)7818 522 266
daniel.kid...@uk.ibm.com


___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss


Re: [gpfsug-discuss] gpfs native raid

2016-09-29 Thread Daniel Kidger
I wholeheartedly agree with Sven here. The Request for Enhancement (RFE) is a
powerful feedback mechanism to help steer the product's direction.

On the topic of GNR, the (free) alternative to compete with GNR is ZFS.
Although not yet used in any production GPFS system, it is widely used under
Lustre. It would however take some work to optimise ZFS for GPFS, as has
already been done for Lustre. Hence a good reason to lobby IBM to maintain a
GPFS software RAID in the face of possible future competition from ZFS.

Daniel
IBM Spectrum Storage Software
+44 (0)7818 522266
Sent from my iPad using IBM Verse

On 29 Sep 2016, 04:28:33, oeh...@us.ibm.com wrote:

From:   oeh...@us.ibm.com
To:     gpfsug-discuss@spectrumscale.org
Date:   29 Sep 2016 04:28:33
Subject:        Re: [gpfsug-discuss] gpfs native raid

Hi Aaron,

the best way to express this 'need' is to vote and leave comments in the
RFEs.  This is an RFE for GNR as SW:
http://www.ibm.com/developerworks/rfe/execute?use_case=viewRfe_ID=95090

Everybody who wants this to happen should vote for it and leave comments on
what they expect.

Sven

Aaron Knister ---09/28/2016 06:44:25 PM---Thanks Everyone for your replies!
(Quick disclaimer, these opinions are my own, and not those of my

From:   Aaron Knister
To:     gpfsug-discuss@spectrumscale.org
Date:   09/28/2016 06:44 PM
Subject:        Re: [gpfsug-discuss] gpfs native raid
Sent by:        gpfsug-discuss-boun...@spectrumscale.org

[Aaron's full reply is quoted earlier in this digest.]

On 9/1/16 9:59 AM, Marc A Kaplan wrote:
> I've been told that it is a big leap to go from supporting GSS and ESS
> to allowing and supporting native raid for customers who may throw
> together "any" combination of hardware they might choose.
>
> In particular the GNR "disk hospital" functions...
> https://www.ibm.com/support/knowledgecenter/SSFKCN_3.5.0/com.ibm.cluster.gpfs.v3r5.gpfs200.doc/bl1adv_introdiskhospital.htm
> will be tricky to support on umpteen different vendor boxes -- and keep
> in mind, those will be from IBM competitors!
>
> That said, ESS and GSS show that IBM has some good tech in this area and
> IBM has shown with the Spectrum Scale product (sans GNR) it can support
> just about any semi-reasonable hardware configuration and a good slew of
> OS versions and architectures... Heck I have a demo/test version of GPFS
> running on a 5 year old Thinkpad laptop. And we have some GSSs in the
> lab... Not to mention Power hardware and mainframe System Z (think 360,
> 370, 390, Z)

___
gpfsug-discuss mailing list
gpfsug-discuss at spectrumscale.org
http://gpfsug.org/mailman/listinfo/gpfsug-discuss