Re: Software raid over iSCSI

Andrew McGill Tue, 23 Dec 2008 00:14:16 -0800

From another list: the mail below is a proposal to change the default 
geometry used by partitioning tools so that partitions are aligned on 
4096-byte rather than 512-byte boundaries.  I think that once 
implemented (in a few years' / days' time), it will make some of the alignment 
problems caused by the default MS-DOS partition table layout go away.  The complete 
thread has some references to how the SCSI code determines the "geometry" -- 
http://news.gmane.org/group/gmane.linux.utilities.util-linux-ng/last=

Subject: Changing the default CHS used by Linux partition editors
From:  "Theodore Ts'o" <ty...@mit.edu>
  To: util-linux...@vger.kernel.org, Eric Sandeen <sand...@redhat.com>, Ric 
Wheeler <rwhee...@redhat.com>, James Bottomley 
<james.bottom...@hansenpartnership.com>, Jeff Garzik <jgar...@redhat.com>, 
Curtis Gedak <ged...@gmail.com>
  Date: 2008-12-12 00:30

I attended the IDEMA (International Disk Drive Equipment and Materials
Association) conference today to give a talk about Linux, and during one
of the breaks I got buttonholed by someone who asked me if I could help
make sure Linux would be able to deal with the upcoming HDD sector size
move from 512 to 4096.  Just coincidentally, I ran across the following
article from Slashdot, "Which Operating System Is Best For solid-state
disks":

http://www.computerworld.com/action/article.do?command=viewArticleBasic&taxonomyName=Storage&articleId=9123140&taxonomyId=19&pageNumber=1

Quoting from that article, Justin Sykes from Micron Technology stated:

        "NAND [flash memory] fundamentally has native 4K block
        sizes. Anything that's not aligned to a 4K block creates extra
        challenges," Sykes said. "There ends up being background
        operations to garbage-collect that empty space [in larger file
        blocks] that isn't fully utilized. And, so that activity is
        chewing up your bandwidth in the background, and it adds extra
        wear to the NAND [flash memory]."

I fully expect that someone from SanDisk or Intel will pop up
and say that "this is just because Micron's SSD's suck; *our* SSD's won't have
this problem".  Perhaps; but HDD's won't be going away any time soon[1],
and they will be moving to a 4k block size in the next few years.

So what's the problem?   The main problem seems to be that by default,
we are using partition tables that cause partitions not to be
aligned on 4k boundaries, because of the default hdd geometry used by our
partition tools and returned by the HDIO_GETGEO ioctl:

Disk /dev/sda: 255 heads, 63 sectors, 38913 cylinders

Nr AF  Hd Sec  Cyl  Hd Sec  Cyl     Start      Size ID
 1 80   1   1    0 254  63  121         63    1959867 83
 2 00   0   1  122 254  63  619    1959930    8000370 82
 3 00   0   1  620 254  63 1023    9960300  615177045 05
 4 00   0   0    0   0   0    0          0          0 00
 5 00   1   1  620 254  63 1023         63  615176982 8e
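
As a quick check of the table above, here is a minimal Python sketch (not
part of the original mail) that tests whether each primary partition's start
sector falls on a 4096-byte boundary, assuming 512-byte logical sectors.  The
helper name is_4k_aligned is made up for illustration.  Partition 5 is left
out because its Start value appears to be relative to the extended partition
rather than absolute.

```python
SECTOR_BYTES = 512   # assumed logical sector size
ALIGN_BYTES = 4096   # 4k physical sector / flash block size


def is_4k_aligned(start_lba, sector_bytes=SECTOR_BYTES, align_bytes=ALIGN_BYTES):
    """Return True if the byte offset of start_lba is a multiple of align_bytes."""
    return (start_lba * sector_bytes) % align_bytes == 0


# Start sectors of the three primary entries, copied from the table above.
primary_starts = {1: 63, 2: 1959930, 3: 9960300}

# Collect the partition numbers whose start is not on a 4k boundary.
misaligned = [nr for nr, lba in primary_starts.items() if not is_4k_aligned(lba)]
print(misaligned)  # [1, 2, 3]
```

None of the three start sectors is a multiple of 8, so every primary
partition in this layout begins in the middle of a 4k block.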

For pretty much all modern systems --- certainly any drive using the
SATA interface --- the boot loader no longer needs to use the original CHS
INT13 interface, so what we pick for the CHS geometry doesn't matter as
far as bootloaders are concerned.  Linux only uses LBA's, so the bottom
line is that aside from controlling the alignment of partitions, CHS's
don't really matter.

For SSD's and HDD's that use a 4k internal sector size, being 4k aligned
makes a big difference because it avoids read-modify-write cycles.  We
can achieve this easily if we simply use a CHS geometry of 56
sectors/track instead of 63 sectors.  So, I would propose that we change
the default geometry used by the partitioning tools in util-linux-ng,
gparted, etc. so that the default sectors/track is 56; furthermore, to catch those
partitioning tools that use the HDIO_GETGEO ioctl, that we change the
fantasy geometry generated in drivers/scsi/scsicam.c:scsicam_bios_param()
and drivers/ata/libata-scsi.c to also use a 255/56 head/sector geometry.
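
To illustrate why 56 sectors/track helps, the sketch below (an illustration
under stated assumptions, not a patch) applies the conventional CHS-to-LBA
formula, LBA = (C * heads + H) * sectors_per_track + (S - 1), to the 255/63
and 255/56 geometries.  It assumes 512-byte sectors, so a 4k boundary is any
LBA divisible by 8.  The names chs_to_lba and alignment_report are
hypothetical.

```python
SECTORS_PER_4K = 4096 // 512  # 8 sectors of 512 bytes per 4k block (assumed)


def chs_to_lba(cyl, head, sector, heads, sectors_per_track):
    """Conventional CHS -> LBA mapping; sector numbers are 1-based."""
    return (cyl * heads + head) * sectors_per_track + (sector - 1)


def alignment_report(heads, sectors_per_track):
    """Return (first_partition_aligned, cylinder_boundaries_aligned).

    The first partition conventionally starts at cylinder 0, head 1, sector 1;
    later partitions start on cylinder boundaries.
    """
    first_lba = chs_to_lba(0, 1, 1, heads, sectors_per_track)
    sectors_per_cylinder = heads * sectors_per_track
    return (first_lba % SECTORS_PER_4K == 0,
            sectors_per_cylinder % SECTORS_PER_4K == 0)


print(alignment_report(255, 63))  # (False, False)
print(alignment_report(255, 56))  # (True, True)
```

With 255/56 a cylinder is 14280 sectors, which is 1785 * 8, so any partition
placed on a cylinder boundary is 4k aligned; with 255/63 a cylinder is 16065
sectors, an odd number, so cylinder boundaries usually miss.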

Does this make sense?  Am I missing some fatal flaw?  Should I send
patches?

                                                - Ted

[1] There was an absolutely brilliant presentation at the IDEMA
conference from Steve Hetzler, an IBM Fellow from Almaden Research Lab,
that used an economic argument based on the capital cost of the Fab's and
what would happen if one were to move *all* of the world's Silicon Fabs
to generating flash for SSD's --- this would only satisfy 18% of the HDD
market --- and the total size of the HDD market by revenue is $35
billion, and the value of the output of the Si Fab's today is $280
billion --- so are we going to give up $280 billion worth of
revenue from the current products of today's available Fabs in order to
displace 18% of the HDD $35 billion market?

What about building new Fabs?  Well, building new fabs sufficient to
create enough flash to replace all of the HDD market would cost
approximately one trillion dollars.  A single 45nm Fab is $3-4
billion; and a 22nm Fab will probably cost $7-8 billion.  (This is
just the cost to *build* the Fab; it ignores the materials and operating
costs, which would be on top of this.)  Intel brings on line maybe a fab or two
a year --- and Moore's law doesn't help that much, because each
shrink quadruples the amount of Flash that can be created on each wafer,
but it also doubles the cost of the Fab; and the size of the HDD market is
still increasing at 40% a year.  Anyway, I'm not doing Dr. Hetzler's
talk justice, but bottom line, Aryan's claims that SSD's will completely
displace HDD's within five years may very well be a
little.... over-optimistic.

In other words, the flash production may be doubling every year, but
that was starting from a relatively small base compared to the HDD
market --- and to catch up and overtake the HDD market, it needs to do
far more than that --- and the model of using older fabs that had been
used for the previous generation of CPU's isn't going to be enough to
meet the demand, so *if* SSD's were to become as popular as some of the
SSD cheerleaders have stated, the current NAND oversupply could very
easily become an undersupply.

--
To unsubscribe from this list: send the line "unsubscribe util-linux-ng" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html

On Monday 22 December 2008 15:11:26 Eric wrote:
> Ulrich,
>
> According to what I read, whilst not directly affecting the physical
> disk, heads and cylinders do matter. It has something to do with the
> IO scheduler where it causes misalignment. It's a long story but it
> comes down to roughly twice the disk activity that is actually
> necessary. It seems the actual value of the number of heads and
> cylinders doesn't really matter, as long as it is divisible by 2.
>
> Besides that, I don't think I totally agree with your statement. Letting
> the IO scheduler take the actual disk geometry into account should
> give more performance. However, the fdisk default values are probably
> standard for disks these days.
>
> Nonetheless, a virtual server, unaware of the underlying hardware,
> does need these values to be set properly.
>
> On Dec 22, 1:18 pm, "Ulrich Windl" <ulrich.wi...@rz.uni-regensburg.de>
>
> wrote:
> > On 22 Dec 2008 at 2:19, Eric wrote:
> > > Hello,
> > >
> > > I'm going to set up a software raid over iSCSI. While I probably should
> > > ask my question to the raid people, my guess was someone here might
> > > have experience with this.
> > >
> > > I'm going to use the following topology:
> > >
> > > There will be 2 storage servers exporting targets. Other physical
> > > machines will initiate a target from each storage server to create a
> > > software raid (1). Virtual machines will use these raids as their
> > > disks.
> > >
> > > Apart from the standard optimizations, there's one I haven't been able
> > > to find any information on. It is suggested that initiators, using the
> > > disks (in this case the virtualized machines) should use a head and
> > > cylinder count that is divisible by 2. Is this suggestion correct;
> > > does it (still) apply to iSCSI?
> >
> > Hi!
> >
> > I think since ZBR (Zone Bit Recording) the number of sectors per cylinder
> > is variable. Thus it makes no sense for any higher-level disk software to
> > try to deal with heads or cylinders. Since ATA (about 1990) only the
> > controller on the disk knows the tracks, heads, and cylinders. The rest
> > is just logic. Therefore SCSI (and now LBA) just uses logical block
> > numbers.
> >
> > > Now for the actual question: I don't know exactly how linux software
> > > raid works internally, but it sounds logical to me that the
> > > software raid should be aware of the head and cylinder count as well.
> > >
> > > So, when creating raid partitions on the targets, I should also modify
> > > the head and cylinder count on these. Is this a correct assumption and
> > > will software raid use the values advertised in the partition table?
> > >
> > > Any answers or suggestions are appreciated. Thanks in advance.
> >
> > Only for MS-DOS compatibility you need C/H/S addressing. The rest doesn't
> > care AFAIK.
> >
> > Regards,
> > Ulrich
> >
> > > Kind regards,
> > >
> > > Eric
>
> 

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups 
"open-iscsi" group.
To post to this group, send email to open-iscsi@googlegroups.com
To unsubscribe from this group, send email to 
open-iscsi+unsubscr...@googlegroups.com
For more options, visit this group at http://groups.google.com/group/open-iscsi
-~----------~----~----~----~------~----~------~--~---
