Ok, so I have had more success. I thought I would recount some of this for those who might be suffering from similar issues and tie it all together in a cleaner package.
First off, the config: the setup included 6 WD Green 1.5TB HDs in a RAIDZ2 config under b134 with dedup enabled. This configuration had several instabilities, as mentioned in earlier posts: system freezes, commands locking up and becoming unkillable, weird file anomalies, all of it culminating in an almost total corruption of my zpool, named megapool. Within megapool I had created a filesystem called "data" and one called "users". The actions that ultimately resulted in the failure of the pool all occurred in the data filesystem. At this time it is unclear whether the corruption was pool-wide or localized to the data filesystem; unfortunately it is difficult to pin down because the '-R' switch for zpool import seems to simply fall through in b134 (I have not tested other builds, such as b111 "2009.06"). There is evidence, however, that it might be largely localized to that filesystem, although not completely confined to it, since the dedup tables are pool-wide.

Moments before the failure: just before the failure occurred I had been deleting files over SMB using ZFS's built-in CIFS, not Samba. After I finished, the pool was still responsive; in fact I watched a few streaming videos from the NAS before going to bed. So we know the pool and filesystems were operational while they were in their mounted state. When I got up the next day the server would respond to ping but not to SMB, VNC or SSH. I couldn't check the console because of the server's location at the time and the absence of a monitor, so I power cycled the box. It started to respond to ping but never responded to any other services. At that point I relocated the server so I could look at the console, and saw it was hanging while probing and initializing ZFS's filesystems, stuck on 1/16.

Steps taken to recover: I have managed to finally recover the entire contents of the users filesystem. This is a HUGE win, as these files are, for the most part, not duplicated elsewhere and represent original content. To do this I first booted from a live CD of b134; the server would not boot otherwise, hanging at the zpool probing stage. Booting from the live CD let me bypass the ZFS filesystem initialization, since the CD didn't know about the zpools ahead of time. I could then do a "zpool import -f megapool". This command would never complete, but if I opened another terminal I could do a "zpool list" or "zpool status" and see megapool, and "zfs list" would show megapool's filesystems. My belief is that the import command never finished because it was attempting to mount all of megapool's filesystems, including the almost certainly corrupted data filesystem. I then ran "iostat -xn 1" to monitor the hard drive IO and confirm operations were indeed taking place, seeing as everything was under suspicion at this point. Next I set up a new pool, which I called newhope (yes, I'm a Star Wars fan, but I thought it was fitting), and did a "zfs send megapool/users@snapshot | zfs receive newhope/users_backup". I watched the IO and confirmed that data was indeed being read from megapool and written to newhope. I let this run for as long as I had IO confirmation that something was happening; it turned out the command actually terminated when the IO stopped flowing. This was a good sign, and I could indeed see the snapshot on newhope and it was browsable.
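For anyone facing something similar, the rough shape of the recovery sequence was as follows. This is a sketch reconstructed from memory: the snapshot name is a placeholder, the devices for the scratch pool are whatever spare disks you have, and your pool and filesystem names will obviously differ.

    # booted from the b134 live CD so the damaged pool is not touched at boot
    zpool import -f megapool          # never returns, but the pool shows up

    # in a second terminal:
    zpool list
    zpool status megapool
    zfs list -t snapshot              # find a snapshot of the healthy filesystem
    iostat -xn 1                      # confirm the disks are actually doing work

    # scratch pool on spare disks, then pull the users filesystem across:
    zpool create newhope <spare devices>
    zfs send megapool/users@snapshot | zfs receive newhope/users_backup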
However, there was a problem. Of the 51GB, only 41GB was actually visible, and indeed there were many missing files. I tried several more copies from older snapshots and got the exact same results, which isn't too surprising considering ZFS's pointer architecture. The interesting thing is that the 41GB represented the original files I placed onto the NAS when I built it back in April, and didn't include any of the files that had been placed on it in the interim. I'm not sure what to make of that, but I think it's worth noting. I then decided I had all the user data I was likely to get and gave up on further send/receive attempts. However, there was one snapshot for the data filesystem, and I decided it wouldn't hurt to try to get some of that data back as well. So I did a "zfs send megapool/data@snapshot | zfs receive newhope/data_backup" and watched the IO. This took a long time, as I expected for a 450GB snapshot; eventually the IO stopped flowing, but the command did not terminate and the snapshot never appeared in newhope, though it was reported by "zfs list -t snapshot". This reinforced my opinion that the corruption was centered on the data filesystem.

It was at this time I decided to set up my new server. I didn't simply want to re-implement on top of b134 minus dedup, so I decided to go with the OpenSolaris 2009.06 release, b111. I figured it would likely be more stable, as it's a main release that's been out for a while. So I installed the OS and created a new pool out of my 6 HDs, which I decided to call "tank" (kind of how a parent might name his child hoping he'll live up to it). However, I was not able to import the data from newhope directly, because that pool was created under b134. So I booted with the b134 live CD again, did a "zpool import -f tank", then a "zfs send newhope/users_backup@snapshot | zfs receive tank/users/backup", and again watched iostat to confirm data was moving; things completed as hoped. I then rebooted into b111; however, the snapshots could not be mounted, so it was back to b134 to cp the data from the snapshots into the filesystem proper. I then booted into b111 again, confirmed the data was indeed there, and did a "zfs destroy -f tank/users/backup" to clean up the snapshot.

I discovered something odd at this point (not that odd wasn't already the word of the day), but it was a good odd this time. After cleaning up my permissions, I was browsing the recovered data, and all the old directories that had been present in b134 were still present and accounted for, but now there were also these new directories with a lock and a red minus on them. I couldn't browse them; it said I didn't have permission. I tried looking at them as root, and I could get into the directories, but no files were in them. So my momentary ray of hope was crushed again. I continued on with my server configuration. I decided it was time to set up SMB, and I decided I didn't want to use the built-in CIFS this time. The reason is that ZFS's CIFS appears not to be compatible with Juniper Networks' SSL VPN, the IVE, more specifically its web file browsing. Actually, I know from my time at Juniper that CIFS is a pain in general, but I also know I never really had a problem with Samba, so I decided to go that route. So I configured Samba through SWAT.
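Since I actually drove the setup through SWAT, for reference here is roughly what a minimal smb.conf share for the recovered filesystem would look like. The share name, path and user below are illustrative, not my exact settings:

    [global]
        workgroup = WORKGROUP
        security = user

    [users]
        comment = Recovered user data
        path = /tank/users
        read only = no
        valid users = someuser

If I remember right, the Samba service on 2009.06 is then enabled through SMF with svcadm, but the guide linked below walks through the whole setup.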
Here is a nice Samba guide in case you need one: http://wikis.sun.com/display/OpenSolarisInfo200906/How+to+Set+Up+Samba+in+the+OpenSolaris+2009.06+Release

After I got my shares up and going, I discovered while browsing through them that the weird directories with the locks and red minus icons appeared normal! I opened one and discovered all my files were there! So I took the opportunity to copy all the files over to my Mac and confirm their integrity. These files represented the missing 10GB of data that had never been accounted for. I honestly have no idea what happened here, but apparently some part of the b134 system was still tracking these files and did get them copied, even though they weren't registering in GNOME or to ls. It's odd that I wasn't able to see them on b111 with root privileges from GNOME or from the CLI either, yet through Samba I could. Anyway, after confirming I had all the files and they were in good shape, I deleted the backup from tank/users and copied the files back over from the Mac. Now all of the files are visible from both GNOME and the CLI, with none of those locks or minuses.

Lessons learned:
* keep backups (obvious)
* state of the art is not without its price (should be obvious)
* send/receive is your friend
* Samba is nice and may be better than the built-in CIFS
* lots of filesystems are a good thing!

Actually, let's look at that last one for a moment. The way I am implementing things now is by having a "data" FS and a "users" FS, like I had before, but I am now placing torrents in their own FS. The reason is that torrent downloads involve a lot of IO and data structure changes, which places a high level of stress on the storage subsystem. I know it's not as bad as a large, heavily used database, but it gets bad enough; it's probably the most stressful thing a single user can do as far as data structure mechanics are concerned. This means the most likely place to have a failure is the FS managing the torrents, so I placed them in their own FS to isolate that stress from the data in "data" and "users". I think if I had taken this approach before, I might have been able to recover more than just my users' data. To sum up: anything you think is particularly important or particularly stressful should be placed in its own FS, and with ZFS that's so easy to do there is no reason not to (a rough sketch of the layout I'm using follows below).

Observations regarding b134: though I am a little upset with b134 right now, there are some observations I would like to make. For starters, this was the first time I had worked with Solaris in a LONG time. When I last worked with Solaris, ZFS wasn't supported on the root drive and was a relatively new development anyway. I did a lot of research and discovered that ZFS likes memory; in fact it will use as much as you give it. So when I implemented b134 and discovered that my total system usage was between 350MB and 400MB, before VMs, I was surprised. Now that I am running b111 I can contrast a few things. The memory usage is more like I expect, and the responsiveness of ZFS in general is much better, both in terms of IO and in terms of command response. I'm pretty sure the command responsiveness is just because b111 is more mature code. However, I'm not sure how much the IO responsiveness is linked to the memory usage versus code maturity.
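For reference, the layout and properties I've been describing boil down to just a few commands. This is a sketch, not a copy of my exact history; in particular I'm assuming the standard checksum and dedup dataset properties are what's involved here:

    # new pool layout on b111 (tank), with the torrent churn isolated:
    zfs create tank/data
    zfs create tank/users
    zfs create tank/torrents

    # sha256 hashing as I have it enabled now; set on the pool's root
    # dataset so the child filesystems inherit it:
    zfs set checksum=sha256 tank

    # (on the old b134 pool, dedup had additionally been enabled, i.e.
    #  something like: zfs set dedup=on megapool)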
One might point out that I was running dedup on b134, but I don't think the IO has much to do with dedup, because in my testing (and I did a lot before implementing) I didn't see a noticeable difference with dedup on versus off for non-duplicated data, and for duplicated data writes were MUCH faster, as expected, since they weren't IO bound at that point. Now I have enabled sha256 hashing on b111, and I can tell you that sha256 hashing has a FAR smaller CPU cycle penalty in b134. I am running an AMD Phenom II X4 at 2.5GHz. In b134 I was seeing about 8% ~ 10% CPU usage across all cores during storage access; in b111, since enabling sha256, I see about 25% ~ 30% usage across all cores during file access, spiking to 80% ~ 90% when the write group is performed. The spike only lasts a second, but it is there, and I observed no such spike in b134. This is expected, though, as the sha256 message digest was greatly improved for the encryption work and that code branch was incorporated in b131. I just wanted to share this so people have a first-hand report of the improvement. There were a lot of little things, but those are the main observations. I think if the code matures and they get around to releasing another version of OpenSolaris, it will be quite nice.

Thoughts in general: in general I want to say I am VERY impressed with ZFS. Dedup is not a small feature; it interacts with almost every element of the ZFS subsystem, so a bug in it can cause some catastrophic effects, as we have seen here. However, even though I experienced an almost complete failure, I was able to restore my most valuable data from another FS in the same pool without the aid of outside utilities, which is good because I don't know that any exist. I would say that is no small achievement under the circumstances, and it speaks highly of ZFS's natural fault tolerance. Though this experience has soured me greatly on developer releases, I have become a true believer in the ZFS way. I hope Oracle will continue to invest in and mature this technology, as there really is nothing else like it out there.