> On Oct 30, 2014, at 7:35 PM, Dave Warren <[email protected]> wrote:
> 
> On 2014-10-30 17:15, Jim Thompson wrote:
>>> On Oct 30, 2014, at 3:39 PM, Dave Warren <[email protected]> wrote:
>>> Buy quality instead of junk?
> <...>
>>> Even a cheapo 30GB/60GB/whatever SSD is more than enough for pfSense and 
>>> makes a far more reliable solution than external flash.
>> I strongly disagree.    SSDs have to be part of a system, especially in an 
>> embedded environment.   The debacle with the “cheap 30GB” m-sata drive from 
>> PC Engines earlier in the year (they had to take them all back) should amply 
>> demonstrate why thinking such as what you express here is deeply flawed.
> 
> Sorry if I wasn't clear, I meant a cheapo SSD because it's small -- I'm 
> suggesting you don't need to invest in a large or fast SSD for pfSense, but 
> rather, cheap out on size, while getting a quality device built for lifespan 
> and reliability.

Understood, but even here your suggestion is out of date with respect to the 
current state of the art.    Assuming a decent wear-leveling implementation, 
larger drives will last longer for a given amount of data written.  In the same 
way that, when flying an airplane, you can trade altitude for glide, with 
modern SSDs you can trade capacity for endurance.

(It also matters *how* you write the data.)

In the below, I’m quoting JEDEC JESD219-compliant numbers/stats.

Here’s an equation you might want to think about.

Total writes to the device <= (Max endurance cycles) * (total partition 
capacity) / (WAF)

Where Maximum Endurance Cycles = the total number of program-erase cycles each 
block in the NAND flash can withstand. For the current generation of MLC flash 
this is 3,000 program-erase cycles.

Write Amplification Factor (WAF) is a result of wear-leveling activity to some 
degree and of the nature of writes to the flash. The actual NAND flash is 
written in units of pages. For the current generation of flash, this page size 
is typically 16K bytes. If the writes are sequential within the 16K page, then 
the WAF should be low. However, if the write data is not contiguous, or is 
interrupted by another write stream, then the partial page will be programmed 
to the NAND flash. In general, random writes will contribute to a higher WAF.
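To make the WAF idea concrete, here is a toy model of my own (an illustration, not anything from the JEDEC spec): assume every host write, however small, programs at least one full 16K page, with no coalescing. Small scattered writes then inflate NAND writes relative to host writes:

```python
PAGE = 16 * 1024  # 16K-byte NAND page (current-generation flash, per the text)

def waf(host_writes):
    """Toy WAF: NAND bytes programmed / host bytes written.

    Each host write burns whole pages, rounded up -- a deliberately
    simplified model with no write coalescing.
    """
    host = sum(host_writes)
    nand = sum(-(-n // PAGE) * PAGE for n in host_writes)  # round up to pages
    return nand / host

# one contiguous 1MB stream vs. 1,024 scattered 1KB log appends
print(waf([1024 * 1024]))  # 1.0  -- sequential, page-aligned writes
print(waf([1024] * 1024))  # 16.0 -- every 1KB append programs a full 16K page
```

Same number of host bytes in both cases; the scattered log appends do 16X the NAND wear in this model.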

Ideally we would want WAF to be 1. However, this is the real world, and we have 
seen this go as high as 20 in some applications with non-ideal writing 
behavior (very poorly behaved, always non-contiguous or interrupted write 
streams, e.g. logging or SQL databases).

Example:
Application that writes 100 MB of data to the device per day. 
100 MB / day * 365 days / year = 36.5 GB / year

Let’s assume a standard mode 4GB CF card/USB/… with perfect wear-leveling 
(LOL!):

Best case:
For WAF = 1, standard mode 4GB part:
Total Writes = (3,000) * (4GB) / 1 = 12 TB
 With the above data this yields: 12,000 GB / 36.5 GB/year ≈ 329 years

Worst case:
WAF = 20, standard mode 4GB part:
Total Writes to reach endurance = (3,000) * (4GB) / 20 = 600 GB of data written 
will exceed endurance
With the above data this yields: 600 GB / 36.5 GB/year ≈ 16.4 years
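The arithmetic above is easy to script. A back-of-envelope sketch in Python (the 3,000-cycle rating, the 4GB capacity, the WAF bounds, and the 100 MB/day workload are the assumptions already stated above):

```python
def endurance_tb(cycles, capacity_gb, waf):
    """Total host writes (TB) before exceeding the rated P/E cycles.

    Total writes <= (max endurance cycles) * (capacity) / WAF
    """
    return cycles * capacity_gb / waf / 1000.0  # GB -> TB

def years_at(tb_written, gb_per_day=0.1):
    """Years of life at a given daily write volume (default 100 MB/day)."""
    return tb_written * 1000.0 / (gb_per_day * 365.0)

best = endurance_tb(3000, 4, waf=1)    # 12.0 TB
worst = endurance_tb(3000, 4, waf=20)  # 0.6 TB
print(years_at(best))   # ~329 years at 100 MB/day
print(years_at(worst))  # ~16.4 years at 100 MB/day
```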

This is how a “commodity” flash/SSD vendor (or a shill^W “technology 
journalist”) will talk to you:  “It will take more than 16.4 years to wear out 
the disk!”

The reality is that with the 3000 program-erase cycles rating of today's 
underlying MLC cells, the 4GB part can support a "worst case" 600GB of data
writes assuming very poorly behaved, always non-contiguous or interrupted write 
streams.  "Best case" assuming purely contiguous writes would be 12TB. 

Actual worst case without effective wear-leveling (as was the case with CF 
cards and a lot of the early SSDs) would be 3,000 writes to a single 16K page.  
(Thus the “don’t swap to an SSD!” advice so often heard.)  Do this, and “Boom!” 
the sector is dead (or will be quite soon.)  If this was in a file that you 
needed (or worse, a filesystem metadata block), *poof* goes your data.  Bummer, 
dude.   This is *also* why SLC flash is often recommended for applications that 
require high write-endurance.  SLC flash can endure approximately 10X the 
program-erase cycles of MLC flash in a given lithography.

The direct result is that today you see a lot of people attempting to quote 
“TBW” (terabytes written) when talking about SSD / flash endurance, but even 
then they don’t talk about WAF very often.

Once you start thinking about it, it’s not very difficult to figure out that it 
doesn’t take long to write 600GB on a very busy system that does a lot of 
short writes due to logging, etc.

Now go run the numbers yourself for a larger SSD (and you can assume 
wear-leveling).   Double the size of the device, and you’ll double the TBW 
figure, assuming everything else stays the same.  Larger density devices will 
yield correspondingly higher total write endurance since (QED!) they have more 
blocks of NAND in them.

Here is the kicker: the eMMCs we’re using on the coming Netgate hardware (that 
yes, will be available at the pfSense store) have an “enhanced” or “pseudo-SLC” 
mode. This mode reduces device capacity by 50% (a 4GB device becomes a 2GB 
device), but increases the maximum endurance cycles 10X to 30,000, instead of 
3,000, so the 0.6TBW worst case becomes 3TBW (10X the cycles on half the 
capacity). With a bit of work to the system (such as some of what happens to 
reduce writes in the nano images), we can probably get the WAF into the 5-10 
range, yielding an additional 2x-4x increase in TBW (so 6TB - 12TB) on a 4GB 
part that is in a “2GB” mode.

Note that this gain is really “software at work”.   Better/more sophisticated 
wear-leveling, combined with on-die RAM in the flash controller and a whole 
suite of other enhancements, is why the eMMC in your tablet or phone is way, 
way better than sticking an SD card in the side of it.

Never mind that the steeper volume curve drives down the cost of a 
fully-deployed solution more quickly.

(Want to take a guess why nearly nobody offers tablets or phones with SD 
sockets in them now?)

Want to guess what the reliability difference is compared to loading the 
internal “bootable” USB2 socket on the Soekris 6801 
(http://soekris.com/products/net6801.html), or the SD card socket on the APU, 
or a CF card on the Lanner boxes / Alix?

Yeah, it’s bad.   Go ahead and run the nano/embedded image on these if you have 
one, because it’s the best solution available today.

But if you’re still reading, now run all the math again for a 32GB part:

32GB eMMC endurance in MLC mode:   4.8TB worst case, 96TB best case
32GB eMMC endurance in pseudo-SLC mode (capacity reduced to 16GB):  24TB worst 
case, 480TB best case  <—  480TB / 36.5GB/year = 13,150 years.
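Those 32GB figures fall straight out of the same back-of-envelope formula (3,000 P/E cycles for MLC, 30,000 for pseudo-SLC at half capacity, WAF between 1 and 20, as assumed above):

```python
CYCLES_MLC = 3000
CYCLES_PSLC = 30000  # pseudo-SLC mode: 10X the cycles, half the capacity

def endurance_tb(cycles, capacity_gb, waf):
    """Total host writes (TB) before exceeding the rated P/E cycles."""
    return cycles * capacity_gb / waf / 1000.0  # GB -> TB

# 32GB part in MLC mode
print(endurance_tb(CYCLES_MLC, 32, waf=20))   # 4.8 TB worst case
print(endurance_tb(CYCLES_MLC, 32, waf=1))    # 96.0 TB best case

# same part in pseudo-SLC mode: 16GB usable
print(endurance_tb(CYCLES_PSLC, 16, waf=20))  # 24.0 TB worst case
best = endurance_tb(CYCLES_PSLC, 16, waf=1)   # 480.0 TB best case
print(best * 1000.0 / 36.5)                   # ~13,150 years at 36.5 GB/year
```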

oh wow.  

The “pseudo-SLC” mode isn’t the kind of thing you’re going to find on 
“consumer” parts.  It takes a system vendor to design this kind of thing in and 
enable it.

Now go back and look at the ZFS results that I posted over the summer, where I 
made the filesystem ZFS, set “copies=2”, and turned on lz4 compression.   
Performance (especially read performance) goes up, you have better redundancy, 
and software updates actually update both copies (something that *does not 
happen* with the nano-based images today).   You have a full filesystem 
underneath you, and the whole system behaves a whole lot better.   Set 
“copies=1” (or just never set it to anything), and you still have compression 
reducing the amount of data actually written to the device by ~2X, and ZFS is 
a far more resilient filesystem than UFS (because of ZFS’s copy-on-write 
semantics), and one that treats the SSD/flash/.. far better than UFS 
ever would.

I could go on for 2-3 more pages about this, explaining that as the feature 
size of the utilized lithography goes down, the endurance also goes down 
(because the voltages held in adjacent cells are more likely to influence the 
read), and that as temperatures go up, read reliability goes down (thermal 
‘noise’), but all you really need to know, frankly, is that the world has 
changed since 2004, that while “nanoBSD” is a good solution, it is no longer a 
great solution, and that your “buy cheap small SSDs” advice “sounds right”, but 
is, quite likely, wrong.

Again (because people love to take me out of context), as before, you’re more 
than welcome to go buy whatever on eBay or Amazon (…); the project is open 
source and freely available, but understand that “buying cheap” has potentially 
disastrous consequences, especially as things scale when people attempt to 
become vendors.

The above should also somewhat explain “why” we ship the C2758 with an 80GB 
Intel SSD (yes, one that has a power-fail cap) and the full install of our 
enhanced version of pfSense.

… and we’ve never had one back for disk failure.



— The Lizard Has Spoken




