On Jul 1, 2019, at 10:10 AM, Valeri Galtsev <galt...@kicp.uchicago.edu> wrote:
> 
> On 2019-07-01 10:01, Warren Young wrote:
>> On Jul 1, 2019, at 8:26 AM, Valeri Galtsev <galt...@kicp.uchicago.edu> wrote:
>>> 
>>> RAID function, which boils down to simple, short, easy to debug well 
>>> program.
> 
> I didn't intend to start software vs hardware RAID flame war

Where is this flame war you speak of?  I’m over here having a reasonable 
discussion.  I’ll continue being reasonable, if that’s all right with you. :)

> Now, commenting with all due respect to famous person who Warren Young 
> definitely is.

Since when?  I’m not even Internet Famous.

>> RAID firmware will be harder to debug than Linux software RAID, if only 
>> because of easier-to-use tools.
> 
> I myself debug neither firmware (or "microcode", speaking the language as it 
> was some 30 years ago)

There is a big distinction between those two terms; they are not the same thing 
referred to by different names at different points in history.  I had a big 
digression explaining the difference, but I’ve cut it as entirely off-topic.

It suffices to say that with hardware RAID, you’re almost certainly talking 
about firmware, not microcode, not just today, but also 30 years ago.  
Microcode is a much lower level thing than what happens at the user-facing 
product level of RAID controllers.

> In both cases it is someone else who does the debugging.

If it takes three times as much developer time to debug a RAID card’s firmware as 
it does to debug Linux MD RAID, and the latter has to be debugged only once 
instead of again and again as the hardware RAID firmware is reinvented by each 
vendor, which one do you suppose ends up with more bugs?
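
As a toy illustration, with made-up numbers for both the effort multiplier and 
the vendor count, the arithmetic of that argument looks like this:

    /* Toy model of the point above: the shared MD code is debugged once
     * for everyone, while each vendor's firmware pays its own, larger,
     * debugging bill.  Every number here is hypothetical. */
    #include <stdio.h>

    int main(void)
    {
        const double md_effort = 1.0;   /* one unit of effort, spent once, shared */
        const double fw_effort = 3.0;   /* "three times as much", per firmware */
        const int    vendors   = 5;     /* hypothetical number of RAID vendors */

        printf("Linux MD RAID: %.0f unit of debugging, shared by every user\n",
               md_effort);
        printf("Hardware RAID: %.0f units of debugging, split across %d code bases\n",
               fw_effort * vendors, vendors);
        return 0;
    }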

> You are speaking as the person who routinely debugs Linux components.

I have enough work fixing my own bugs that I rarely find time to fix others’ 
bugs.  But yes, it does happen once in a while.

> 1. Linux kernel itself, which is huge;

…under which your hardware RAID card’s driver runs, making the kernel even 
larger than it was before that driver was added.

You can’t zero out the Linux kernel code base size when talking about hardware 
RAID.  It’s not like the card sits there and runs in a purely isolated 
environment.

It is a testament to how well-debugged the Linux kernel is that your hardware 
RAID card runs so well!

> All of the above can potentially panic kernel (as they all run in kernel 
> context), so they all affect reliability of software RAID, not only the chunk 
> of software doing software RAID function.

When the kernel panics, what do you suppose happens to the hardware RAID card?  
Does it keep doing useful work, and if so, for how long?

What’s more likely these days: a kernel panic or an unwanted hardware restart?  
And when that happens, which is more likely to fail, a hardware RAID without 
BBU/NV storage or a software RAID designed to be always-consistent?
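
To make “designed to be always-consistent” concrete, here is a bare-bones sketch 
of the write-intent-bitmap idea.  It is not Linux MD’s actual code, just the 
shape of the technique:

    /* Mark a region dirty before touching it on any member disk, and
     * clear the mark once every member is updated.  After a crash, only
     * the regions still marked dirty need a parity resync.  (The real
     * thing persists the bitmap to stable storage; this sketch keeps it
     * in memory only.) */
    #include <stdbool.h>
    #include <stddef.h>

    #define REGIONS 1024

    static bool dirty[REGIONS];

    void raid_write(size_t region)
    {
        dirty[region] = true;    /* 1. record "this region may be inconsistent" */
        /* 2. write data and parity to every member of the array */
        dirty[region] = false;   /* 3. record "this region is consistent again" */
    }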

I’m stripping away your hardware RAID’s advantage in NV storage to keep things 
equal in cost: my on-board SATA ports for your stripped-down hardware RAID 
card.  You probably still paid more, but I’ll give you that, since you’re using 
non-commodity hardware.

Now that they’re on even footing, which one is more reliable?

> hardware RAID "firmware" program being small and logically simple

You’ve made an unwarranted assumption.

I just did a blind web search and found this page:

   
   https://www.broadcom.com/products/storage/raid-controllers/megaraid-sas-9361-8i#downloads

…on which we find that the RAID firmware for the card is 4.1 MB, compressed.

Now, that’s considered a small file these days, but realize that there are no 
1024×1024 px icon files in there, no massive XML libraries, no language 
internationalization files, no high-level language runtimes… It’s just millions 
of low-level, highly optimized CPU instructions.
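
Back-of-the-envelope, with assumed values for the compression ratio and the 
average instruction width (the download page gives only the compressed size):

    /* Rough estimate of how 4.1 MB of compressed firmware becomes
     * "millions of instructions".  The 2:1 compression ratio and the
     * 4-byte average instruction width are guesses, not measurements of
     * this particular card's image. */
    #include <stdio.h>

    int main(void)
    {
        const double compressed_bytes = 4.1 * 1024 * 1024;
        const double compression      = 2.0;  /* assumed ~2:1 for machine code */
        const double bytes_per_insn   = 4.0;  /* assumed fixed-width RISC encoding */

        double insns = compressed_bytes * compression / bytes_per_insn;
        printf("~%.1f million instructions\n", insns / 1e6);  /* prints ~2.1 */
        return 0;
    }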

From experience, I’d expect it to take something like 5-10 person-years to 
reproduce that much code.

That’s far from being “small and logically simple.”

> it usually runs on RISC architecture CPU, and introduce bugs programming for 
> RISC architecture IMHO is more difficult that when programming for i386 and 
> amd64 architectures.

I don’t think I’ve seen any such study, and if I did, I’d expect it to only be 
talking about assembly language programming.

Above that level, you’re talking about high-level language compilers, and I 
don’t think the underlying CPU architecture has anything to do with the error 
rates in programs written in high-level languages.

I’d expect RAID firmware to be written in C, not assembly language, which means 
the CPU has little or nothing to do with programmer error rates.
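
As a small illustration of that point, the C a firmware author actually writes 
looks identical whether the compiler behind it targets a RISC or a CISC part; 
the architecture only shows up in the generated machine code.  This is a generic 
RAID-5-style parity loop, not code from any real controller:

    #include <stddef.h>
    #include <stdint.h>

    /* XOR two data buffers into a parity buffer -- the same source
     * builds unchanged for ARM, PowerPC, x86, or anything else with a
     * C compiler. */
    void xor_parity(uint8_t *parity, const uint8_t *a,
                    const uint8_t *b, size_t len)
    {
        for (size_t i = 0; i < len; i++)
            parity[i] = a[i] ^ b[i];
    }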

Thought experiment: does Linux have fewer bugs on ARM than on x86_64?

I even doubt that you can dig up a study showing that assembly language 
programming on CISC is significantly more error-prone than RISC programming in 
the first place.  My experience says that error rates in programs are largely a 
function of the number of lines of code, and that puts RISC at a severe 
disadvantage.  For ARM vs x86, the instruction ratio is roughly 3:1 for 
equivalent user-facing functionality.
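
If you take that 3:1 ratio at face value and assume, purely for illustration, 
that expected defects scale linearly with lines written, the arithmetic comes 
out like this (the per-KLoC rate is a placeholder, not a measurement):

    /* Toy defect arithmetic: the same feature, hand-written in assembly,
     * takes ~3x as many lines on the RISC side per the ratio above. */
    #include <stdio.h>

    int main(void)
    {
        const double defects_per_kloc = 10.0;            /* placeholder rate */
        const double x86_kloc         = 10.0;            /* hypothetical feature size */
        const double arm_kloc         = 3.0 * x86_kloc;  /* ~3:1 instruction ratio */

        printf("x86 assembly: ~%.0f expected defects\n", x86_kloc * defects_per_kloc);
        printf("ARM assembly: ~%.0f expected defects\n", arm_kloc * defects_per_kloc);
        return 0;
    }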

There are many good reasons why the error rate in programs should be so 
strongly governed by lines of code:

1. More LoC is more chances for typos and logic errors.

2. More LoC means a smaller proportion of the solution fits on the screen at 
once, hiding information from the programmer.  Out of sight, out of mind.

3. More LoC takes more time to compose and type, so a programmer writing fewer 
LoC has more time to debug and test, all else being equal.

This is also why almost no one writes in assembly any more, and those who do 
rarely write *just* in assembly.