Re: Fwd: [sqlite] presentation about ordering and atomicity of filesystems
On 09/12/2014 01:59 PM, Paul Gilmartin wrote:
> On Fri, 12 Sep 2014 09:16:54 -0700, Anne Lynn Wheeler wrote:
>> re: http://www.garlic.com/~lynn/2014k.html#7 [sqlite] presentation about ordering and atomicity of filesystems
>>
>> part of the issue was that incomplete write ... with propagated zeros ... would also (then) rewrite the error correcting codes for the record (with propagated zeros) ... so there wouldn't even be an error indication that the write was performed incorrectly (installation wouldn't even know to perform restore because of write error).
>
> It's almost as if they concealed the error on purpose. Well, not quite; it depends on where in the data path the ECC was generated -- it should have been done farther upstream.
>
>> later fba disks ... especially in conjunction with raid ... had requirement that single block write would complete correctly once started. ...
>
> With what probability, and subject to what assumptions? If the data lead to the write head fails mechanically at a critical time, a bad block will be written. Negligibly improbable? Yes. Physically impossible? No. Detectable by ECC? Probably.
>
> -- gil

If the hardware knows it has incomplete information to write an entire block because of some abnormal hardware condition, then something should be done to guarantee that any later attempt to read that block will produce an error indication. If that is not the case, this would appear to be a violation of one of the major tenets of mainframe design: that any data errors resulting from hardware issues should be at least detectable, if not correctable. Writing a valid block with trailing zeros in such a case sounds like a bad design decision.

--
Joel C. Ewing, Bentonville, AR jcew...@acm.org

--
For IBM-MAIN subscribe / signoff / archive access instructions, send email to lists...@listserv.ua.edu with the message: INFO IBM-MAIN
Re: Fwd: [sqlite] presentation about ordering and atomicity of filesystems
jcew...@acm.org (Joel C. Ewing) writes:
> If the hardware knows it has incomplete information to write an entire block because of some abnormal hardware condition, then something should be done to guarantee that any later attempt to read that block will produce an error indication. If that is not the case, this would appear to be a violation of one of the major tenets of mainframe design: that any data errors resulting from hardware issues should be at least detectable, if not correctable. Writing a valid block with trailing zeros in such a case sounds a bad design decision.

re:
http://www.garlic.com/~lynn/2014k.html#7 Fwd: [sqlite] presentation about ordering and atomicity of filesystems
http://www.garlic.com/~lynn/2014k.html#8 Fwd: [sqlite] presentation about ordering and atomicity of filesystems

*and* generating a valid error correcting code for the propagated zeros

at one point i was asked to audit some of the early raid5 vendors ... and there were some cases where i had to give presentations on what no-single-point-of-failure means (having found single points of failure).

nearly a decade earlier, i was involved in working with NSF on interconnecting NSF supercomputer centers (later evolves into the NSFNET backbone, precursor to the modern internet) ... some old email
http://www.garlic.com/~lynn/lhwemail.html#nsfnet

in part because we had an internal (HSDT) project with T1 (1.5mbit/sec) and faster links ... some past posts
http://www.garlic.com/~lynn/subnetwork.html#hsdt

one of the people working on the effort had been a graduate student of Reed at jpl/caltech and did a lot of the original work on reed-solomon (error correcting code). Also got to work with cyclotomics up in berkeley (one of the founders was berlekamp) ... cyclotomics did a lot of the reed-solomon stuff that shows up in the cdrom standard ... during this period, they were bought by kodak.

a couple recent posts
http://www.garlic.com/~lynn/2014g.html#75 non-IBM: SONY new tape storage - 185 Terabytes on a tape
http://www.garlic.com/~lynn/2014j.html#68 No Internet. No Microsoft Windows. No iPods. This Is What Tech Was Like In 1984

reed-solomon
http://en.wikipedia.org/wiki/Reed%E2%80%93Solomon_error_correction

as previously mentioned ... one of the justifications for the industry moving from fba-512 to fba-4096 was reducing space taken up by error correcting code:
http://en.wikipedia.org/wiki/Advanced_Format

past posts mentioning fba, ckd, multi-track search, etc
http://www.garlic.com/~lynn/submain.html#dasd

--
virtualization experience starting Jan1968, online at home since Mar1970
Fwd: [sqlite] presentation about ordering and atomicity of filesystems
This is not about the z, per se, but is interesting. I don't think that any of the IBM systems have this type of filesystem. Hum, perhaps the i?

-- Forwarded message --
From: Kees Nuyt k.n...@zonnet.nl
Date: Thu, Sep 11, 2014 at 4:49 PM
Subject: [sqlite] presentation about ordering and atomicity of filesystems
To: sqlite-us...@sqlite.org

Hi all,

Today I bumped into a presentation about ordering and atomicity of filesystems that might interest you.

https://www.youtube.com/watch?v=YvchhB1-Aws

The Application/Storage Interface: After All These Years, We're Still Doing It Wrong
Remzi Arpaci-Dusseau, University of Wisconsin-Madison
Talk at usenix 2014
Published on Sep 4, 2014 by USENIX Association Videos

Somewhat related to the article drh recently wrote about using sqlite as an application data store.

--
Regards,
Kees Nuyt

___
sqlite-users mailing list
sqlite-us...@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha!
John McKown
Re: Fwd: [sqlite] presentation about ordering and atomicity of filesystems
On Fri, 12 Sep 2014 06:28:47 -0500, John McKown wrote:
> This is not about the z, per se, but is interesting. I don't think that any of the IBM systems have this type of filesystem. Hum, perhaps the i?

John, you gotta stop posting this stuff just before midnight on a Friday night !!!. I got partway into it, but it'd be almost breakfast by the time it finished - maybe later I'll get back to it :0)

Shane ...
Re: Fwd: [sqlite] presentation about ordering and atomicity of filesystems
On Fri, Sep 12, 2014 at 9:08 AM, Shane Ginnane ibm-m...@tpg.com.au wrote:
> On Fri, 12 Sep 2014 06:28:47 -0500, John McKown wrote:
>> This is not about the z, per se, but is interesting. I don't think that any of the IBM systems have this type of filesystem. Hum, perhaps the i?
>
> John, you gotta stop posting this stuff just before midnight on a Friday night !!!. I got partway into it, but it'd be almost breakfast by the time it finished - maybe later I'll get back to it :0)
>
> Shane ...

It was only after midnight because you Australians don't set your clocks correctly. I posted that in the middle of the day. At least according to GMT time. Which is the only true time. Right? [grin/]

--
There is nothing more pleasant than traveling and meeting new people!
Genghis Khan

Maranatha!
John McKown
Re: Fwd: [sqlite] presentation about ordering and atomicity of filesystems
john.archie.mck...@gmail.com (John McKown) writes:
> This is not about the z, per se, but is interesting. I don't think that any of the IBM systems have this type of filesystem. Hum, perhaps the i?

original CMS filesystem from mid-60s ... was somewhat brought over from CTSS ... would simulate fixed-block on CKD dasd (somewhat inverse of the current situation where there hasn't been any CKD DASD manufactured for decades and CKD is simulated on industry standard fixed-block). The default was to not replace/update existing record ... but write to newly allocated location ... then periodically update allocation map, file directory (aka VTOC) ... also to new location and then rewrite the MFD record (in-place, single write that would flip between the old set of records and the new set of records).

however, ibm CKD dasd had a peculiar power failure mode ... that might occur in the middle of a write operation ... there would be sufficient power to complete a write in progress ... but not sufficient power to continue transmitting the data from processor memory over the channel ... so the controller completed the write operation with all zeros (and no indication of a read/write failure). As far as i know, none of the other mainframe systems made any software provisions to handle this particular failure mode of ibm ckd dasd.

As a result, in the mid-70s, the CMS extended file system had a fix ... which changed to a pair of MFD records and would alternately write to the pair of records. On initial startup ... it would check both records to see if both records had been written correctly (no zeros propagated at the end of the record) and choose the most recent valid record.

UNIX filesystem has been notorious for writing records in arbitrary order ... especially the filesystem control information (metadata) and after a shutdown/failure w/o clean shutdown (all records cleanly written to disk) ... a start up after non-clean shutdown would have to reread all records looking for inconsistencies ... which might take many tens of minutes.

Circa 1990, aixv3 for rs/6000 enhanced the unix filesystem with logging changes to the file directory information (metadata) ... a side-effect was aix could almost immediately recover/startup ... by rerunning logged information (it doesn't do anything for consistency of file data ... but does fix the unix filesystem integrity problem).

AIX JFS filesystem
http://en.wikipedia.org/wiki/JFS_%28file_system%29
http://www.linuxjournal.com/article/6268

the original implementation relied on special hardware in 801/risc where the unix filesystem control information (metadata) was placed in memory area that was specially identified to catch all changes. then all changes to filesystem were captured and journaled ... w/o having to change all the unix code to explicitly call the journaling/logging facility. The original claim was that the hardware implementation was also faster than putting in explicit logging/journaling calls.

However, when the ibm paloalto group was porting JFS to generic hardware (w/o the 801/risc features), they had to put in explicit logging/journaling calls for changes. When they back ported that implementation to rs/6000, it turns out the explicit calls ran faster than the original implementation.

as an aside, we relied on JFS for faster restart when we did ibm's ha/cmp (high availability, cluster multiprocessor) ... some past posts
http://www.garlic.com/~lynn/subtopic.html#hacmp

past posts mentioning 801/risc
http://www.garlic.com/~lynn/subtopic.html#801

recent references to Jim Gray credited with formalizing transaction semantics and ACID properties
http://www.garlic.com/~lynn/2014f.html#69 Is end of mainframe near ?
http://www.garlic.com/~lynn/2014g.html#2 Is end of mainframe near ?
http://www.garlic.com/~lynn/2014g.html#14 Is end of mainframe near ?
http://www.garlic.com/~lynn/2014g.html#15 Is it time for a revolution to replace TLS?
http://www.garlic.com/~lynn/2014g.html#38 Fifty Years of BASIC, the Programming Language That Made Computers Personal
http://www.garlic.com/~lynn/2014k.html#2 Flat (VSAM or other) files still in use?

--
virtualization experience starting Jan1968, online at home since Mar1970
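The alternating pair of MFD records described in this message can be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual CMS on-disk format: the record layout (a sequence number repeated as header and trailer), `RECORD_SIZE`, and the names `make_record`, `valid_seq` and `DualRecordStore` are all hypothetical. The point it tries to show is that a write completed with zeros leaves a zero trailer, so startup can reject that copy and fall back to the other one.

```python
import struct
from typing import Optional

RECORD_SIZE = 64            # hypothetical fixed record length in bytes
SEQ = struct.Struct(">Q")   # 8-byte sequence number, used as header and trailer
BODY_SIZE = RECORD_SIZE - 2 * SEQ.size


def make_record(seq: int, payload: bytes) -> bytes:
    """Lay out a record as: sequence number, zero-padded payload, sequence number."""
    if seq < 1:
        raise ValueError("sequence numbers start at 1 so a valid trailer is never zero")
    if len(payload) > BODY_SIZE:
        raise ValueError("payload does not fit in one record")
    return SEQ.pack(seq) + payload.ljust(BODY_SIZE, b"\x00") + SEQ.pack(seq)


def valid_seq(record: bytes) -> Optional[int]:
    """Return the sequence number if header and trailer match and are nonzero."""
    if len(record) != RECORD_SIZE:
        return None
    (head,) = SEQ.unpack_from(record, 0)
    (tail,) = SEQ.unpack_from(record, RECORD_SIZE - SEQ.size)
    return head if head == tail and head != 0 else None


class DualRecordStore:
    """Two record slots; each write goes to the slot NOT holding the newest valid copy."""

    def __init__(self) -> None:
        self.slots = [bytes(RECORD_SIZE), bytes(RECORD_SIZE)]

    def _newest_valid(self) -> Optional[int]:
        """Index of the slot holding the newest valid record, or None."""
        seqs = [valid_seq(r) for r in self.slots]
        good = [i for i, s in enumerate(seqs) if s is not None]
        return max(good, key=lambda i: seqs[i]) if good else None

    def write(self, payload: bytes, bytes_transferred: Optional[int] = None) -> None:
        newest = self._newest_valid()
        seq = 1 if newest is None else valid_seq(self.slots[newest]) + 1
        target = 0 if newest is None else 1 - newest
        record = make_record(seq, payload)
        if bytes_transferred is not None:
            # Simulated failure: the rest of the record is completed with zeros.
            record = record[:bytes_transferred].ljust(RECORD_SIZE, b"\x00")
        self.slots[target] = record

    def recover(self) -> Optional[bytes]:
        """Startup check: return the payload of the newest valid record, if any."""
        newest = self._newest_valid()
        if newest is None:
            return None
        # Note: trailing zero bytes of a payload are indistinguishable from padding.
        return self.slots[newest][SEQ.size:SEQ.size + BODY_SIZE].rstrip(b"\x00")
```

Passing `bytes_transferred` to `write` simulates the zero-fill failure; a following `recover()` should then return the previous good payload rather than the damaged one, because the damaged write never touched the slot holding the newest valid copy.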
Re: [sqlite] presentation about ordering and atomicity of filesystems
Isn't z/OS Unix HFS/ZFS that type of file system, on top of a VSAM linear dataset?

I haven't the time now to listen to the whole 90 minutes of video, but the first 13 minutes were enlightening.

Peter

-Original Message-
From: IBM Mainframe Discussion List [mailto:IBM-MAIN@LISTSERV.UA.EDU] On Behalf Of John McKown
Sent: Friday, September 12, 2014 7:29 AM
To: IBM-MAIN@LISTSERV.UA.EDU
Subject: Fwd: [sqlite] presentation about ordering and atomicity of filesystems

This is not about the z, per se, but is interesting. I don't think that any of the IBM systems have this type of filesystem. Hum, perhaps the i?
Re: Fwd: [sqlite] presentation about ordering and atomicity of filesystems
On 12 September 2014 10:38, Anne Lynn Wheeler l...@garlic.com wrote:
> however, ibm CKD dasd had a peculiar power failure mode ... that might occur in the middle of a write operation ... there would be sufficient power to complete a write in progress ... but not sufficient power to continue transmitting the data from processor memory over the channel ... so the controller completed the write operation with all zeros (and no indication of a read/write failure). As far as i know, none of the other mainframe systems made any software provisions to handle this particular failure mode of ibm ckd dasd.

That's a not unreasonable implementation of an architected behaviour on the BT/OEMI channel to CU interface, independent of power failure. If an I/O reset is received by the control unit while a write is in progress, it completes the write with zeros. What would be a more reasonable behaviour on a disk with little or no buffering?

So in theory it's possible to corrupt data on disk just by hitting System Reset (or Load) during a disk write. If you look at the probability it's pretty unlikely, but I worked at one place that had a strict rule about hitting stop and waiting a few seconds before doing the reset or load.

Tony H.
Re: Fwd: [sqlite] presentation about ordering and atomicity of filesystems
t...@harminc.net (Tony Harminc) writes:
> That's a not unreasonable implementation of an architected behaviour on the BT/OEMI channel to CU interface, independent of power failure. If an I/O reset is received by the control unit while a write is in progress, it completes the write with zeros. What would be a more reasonable behaviour on a disk with little or no buffering?
>
> So in theory it's possible to corrupt data on disk just by hitting System Reset (or Load) during a disk write. If you look at the probability it's pretty unlikely, but I worked at one place that had a strict rule about hitting stop and waiting a few seconds before doing the reset or load.

re:
http://www.garlic.com/~lynn/2014k.html#7 [sqlite] presentation about ordering and atomicity of filesystems

part of the issue was that incomplete write ... with propagated zeros ... would also (then) rewrite the error correcting codes for the record (with propagated zeros) ... so there wouldn't even be an error indication that the write was performed incorrectly (installation wouldn't even know to perform restore because of write error).

later fba disks ... especially in conjunction with raid ... had requirement that single block write would complete correctly once started.

before raid and with fba-512 blocks and 4k-byte logical blocks ... the hardware guarantee only applied to the physical 512byte block ... which could result in an inconsistent 4k-byte logical record (8 physical 512byte blocks) with no error condition. As a result, there had to be special software provisions by filesystems with 4k-byte logical records mapped to fba-512.

this particular issue has been eliminated with the recent move from fba-512 to fba-4096 ... so 4k-byte logical block filesystems now match the physical block size.

part of the move from fba-512 to fba-4096 is that rather than eight error correcting codes per 4k-bytes ... there is only a single error correcting code ... increasing the effective data space on disk
http://en.wikipedia.org/wiki/Advanced_Format

past posts mentioning fba, ckd, multi-track search, etc
http://www.garlic.com/~lynn/submain.html#dasd

--
virtualization experience starting Jan1968, online at home since Mar1970
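The "special software provisions" for 4k-byte logical records on fba-512 are not spelled out in the message. One common approach, assumed here purely for illustration and not attributed to any particular filesystem, is to keep a checksum over the whole logical block so that a block whose eight sectors come from two different writes is detected on read. The names `seal`, `is_intact` and `write_block` are hypothetical, and CRC-32 from Python's `zlib` stands in for whatever check a real filesystem would use.

```python
import zlib

SECTOR = 512                          # physical block size (fba-512)
SECTORS_PER_BLOCK = 8
BLOCK = SECTOR * SECTORS_PER_BLOCK    # 4096-byte logical block
PAYLOAD = BLOCK - 4                   # last 4 bytes hold a CRC-32 of the payload


def seal(payload: bytes) -> bytes:
    """Append a CRC-32 so the logical block can later be checked as a whole."""
    if len(payload) != PAYLOAD:
        raise ValueError("payload must be exactly %d bytes" % PAYLOAD)
    return payload + zlib.crc32(payload).to_bytes(4, "big")


def is_intact(block: bytes) -> bool:
    """True if the stored CRC-32 matches the payload actually on disk."""
    if len(block) != BLOCK:
        return False
    stored = int.from_bytes(block[PAYLOAD:], "big")
    return zlib.crc32(block[:PAYLOAD]) == stored


def write_block(disk: bytearray, new_block: bytes,
                sectors_completed: int = SECTORS_PER_BLOCK) -> None:
    """Write sector by sector; each 512-byte sector is atomic, the block is not.

    `sectors_completed` simulates a failure after that many sectors reached
    the disk; the remaining sectors keep their old contents.
    """
    for i in range(sectors_completed):
        start = i * SECTOR
        disk[start:start + SECTOR] = new_block[start:start + SECTOR]
```

In this model every individual sector is written correctly, so per-sector error correcting codes would report nothing; only the block-level check notices that the sectors no longer belong together.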
Re: Fwd: [sqlite] presentation about ordering and atomicity of filesystems
On Fri, 12 Sep 2014 09:16:54 -0700, Anne Lynn Wheeler wrote:
> re: http://www.garlic.com/~lynn/2014k.html#7 [sqlite] presentation about ordering and atomicity of filesystems
>
> part of the issue was that incomplete write ... with propagated zeros ... would also (then) rewrite the error correcting codes for the record (with propagated zeros) ... so there wouldn't even be an error indication that the write was performed incorrectly (installation wouldn't even know to perform restore because of write error).

It's almost as if they concealed the error on purpose. Well, not quite; it depends on where in the data path the ECC was generated -- it should have been done farther upstream.

> later fba disks ... especially in conjunction with raid ... had requirement that single block write would complete correctly once started. ...

With what probability, and subject to what assumptions? If the data lead to the write head fails mechanically at a critical time, a bad block will be written. Negligibly improbable? Yes. Physically impossible? No. Detectable by ECC? Probably.

-- gil
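The point about where in the data path the check code is generated can be illustrated with a small model. It is a simplification under several assumptions: CRC-32 from Python's `zlib` stands in for the disk's error correcting code (it only detects, it does not correct), the upstream code is assumed to reach the medium intact even when the data does not, and the function names are hypothetical.

```python
import zlib
from typing import Tuple


def zero_fill(data: bytes, received: int) -> bytes:
    """What lands on the medium if only `received` bytes arrive and the
    device completes the record with zeros."""
    return data[:received] + bytes(len(data) - received)


def write_with_downstream_check(data: bytes, received: int) -> Tuple[bytes, int]:
    """Check code computed at the device, over whatever bytes were written."""
    stored = zero_fill(data, received)
    return stored, zlib.crc32(stored)


def write_with_upstream_check(data: bytes, received: int) -> Tuple[bytes, int]:
    """Check code computed by the sender over the intended data, before transfer."""
    code = zlib.crc32(data)
    stored = zero_fill(data, received)
    return stored, code


def verifies(stored: bytes, code: int) -> bool:
    """Read-back check: does the stored data match its check code?"""
    return zlib.crc32(stored) == code
```

With the downstream variant the zero-filled record always verifies, since the code was computed over the zeros themselves; with the upstream variant the mismatch between intended and stored data shows up on read, which is the behaviour argued for in the message above.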