Ok, so I have had more success.  I thought I would recount some of this for 
those who might be suffering from similar issues and tie it all together in a 
cleaner package.  


First off, the config:

The setup included 6 WD Green 1.5TB HDs in a RAIDz2 config under b134 with
dedup enabled.  This configuration had several instabilities, as mentioned in
earlier posts.  These included system freezes, commands locking up and not
being killable, and weird file anomalies, all of it culminating in an almost
total corruption of my zpool, named megapool.  Within megapool I had created
two filesystems, one called "data" and one called "users".  The actions that
ultimately resulted in the failure of the pool all occurred in the data
filesystem.  At this time it is unclear whether the corruption was pool wide
or localized to the data filesystem, and unfortunately it is difficult to pin
down because the '-R' switch for zpool import seems to simply fall through in
b134 (I have not tested other builds, such as b111 "2009.06").  However, there
is evidence that it might be largely localized to that filesystem, although
not completely confined, since the dedup tables are pool wide.
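
For reference, a layout like mine boils down to something like this (a sketch
only; the c*t*d* device names are placeholders, not my actual disks):

  # a six-disk raidz2 pool with dedup on, and the two filesystems inside it
  zpool create megapool raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
  zfs set dedup=on megapool
  zfs create megapool/data
  zfs create megapool/users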


Moments before the failure:

Just before the failure occurred I had been deleting files over SMB, using
ZFS's built-in CIFS and not SAMBA.  After I finished, the pool was still
responsive; in fact, I watched a few streaming videos from the NAS before
going to bed.  So we know the pool and filesystems were operational while they
were in their mounted state.  When I got up the next day, the server would
respond to ping but not to SMB, VNC or SSH.  I couldn't check the console
because of the server's location at the time and the absence of a monitor, so
I power cycled the box.  It came back responding to ping but never responded
to any other services.  At that point I relocated the server so I could look
at the console and saw it was hanging while probing and initializing the ZFS
filesystems, stuck at 1/16.


Steps taken to recover:

I have managed to finally recover the entire contents of the users filesystem.
This is a HUGE win, as these files are, for the most part, not duplicated
elsewhere and represent original content.

To do this, I first booted from a live CD of b134.  I did this because the
server would not boot otherwise; it would hang at the zpool probing stage.
Booting from the live CD let me bypass the ZFS filesystem initialization, as
the CD didn't know about the zpools ahead of time.  I could then do a
"zpool import -f megapool".  This command would never complete, but if I
opened another terminal I could do a "zpool list" or "zpool status" and see
megapool, and "zfs list" would show megapool's filesystems.  It is my belief
that the import command never finished because it was attempting to mount all
of megapool's filesystems, including the almost certainly corrupted data
filesystem.
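
In case it helps someone in the same spot, the sequence was essentially this
(a sketch; the second terminal is just for watching, not required):

  # terminal 1: force the import (for me this never returned)
  zpool import -f megapool

  # terminal 2: the pool and its filesystems were still visible
  zpool list
  zpool status megapool
  zfs list -r megapool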

I then ran "iostat -xn 1" to monitor the hard drive IO and confirm operations
were indeed taking place, seeing as everything was under suspicion at this
point.  I then set up a new pool, which I called newhope (yes, I'm a Star Wars
fan, but I thought it was fitting), and did a "zfs send megapool/users/snapshot
| zfs receive newhope/users_backup".  I watched the IO and confirmed that the
data was indeed being taken from megapool and placed onto newhope.  I let this
run for as long as I had IO confirmation that something was happening.  It
turned out that the command actually terminated when the IO stopped flowing.
This was a good sign, and I could indeed see the snapshot on newhope, and it
was browsable.
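
Roughly, that recovery step boiled down to this (a sketch: the snapshot name
and the disk backing newhope are placeholders, and I'm writing the snapshot in
the usual pool/fs@snapshot form):

  # a scratch pool to receive into (disk name is an example only)
  zpool create newhope c2t0d0

  # in another terminal, watch the disks to confirm data is really moving
  iostat -xn 1

  # copy the users snapshot off the damaged pool
  zfs send megapool/users@snap | zfs receive newhope/users_backup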

However, there was a problem.  Of the 51GB, only 41GB was actually visible,
and indeed many files were missing.  I tried several more copies from older
snapshots, and the exact same results occurred, which isn't too surprising
considering ZFS's pointer architecture.  The interesting thing is that the
41GB represented the original files I placed onto the NAS when I built it back
in April and didn't include any of the files that had been placed on it in the
interim.  I'm not sure what to make of that, but I think it's worth noting.
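
Trying the older snapshots was just a matter of listing what was there and
repeating the send; something like this (the snapshot names are made up):

  # see which snapshots of users survived
  zfs list -t snapshot -r megapool/users

  # try an older one into a separate target
  zfs send megapool/users@april | zfs receive newhope/users_backup_april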

I then decided I had all the user data I was likely to get and gave up on
further send/receive attempts.  However, there was one snapshot for the data
filesystem, and I decided it wouldn't hurt to try to get some of that data
back as well.  So I did the "zfs send megapool/data/snapshot | zfs receive
newhope/data_backup" and watched the IO.  This took a long time, as I expected
for a 450GB snapshot.  Eventually the IO stopped flowing, but the command did
not terminate and the snapshot never appeared in newhope, though it was
reported by "zfs list -t snapshot".  This helped reinforce my opinion that the
corruption was centered on the data filesystem.

It was at this time I decided to set up my new server.  I didn't simply want
to re-implement on top of b134 minus dedup, so I decided to go with the
OpenSolaris 2009.06 release (b111).  I thought this would likely be more
stable, as it's a main release that's been out for a while.  So I installed
the OS, created a new pool out of my 6 HDs, and decided to call it "tank"
(kind of how a parent might name his child hoping he'll live up to it).
However, I was not able to import the data from newhope under b111 because
that pool was created under b134.  So I booted with the b134 live CD again,
did a "zpool import -f tank", and then did a "zfs send newhope/users_backup |
zfs receive tank/users/backup", and again I watched iostat to confirm data was
moving; things completed as hoped.  I then rebooted into b111; however, the
snapshots were not able to mount.  So it was back to b134 to cp the data from
the snapshots into the FS proper.  I then booted into b111 again, confirmed
the data was indeed there, and did a "zfs destroy -f tank/users/backup" to
clean up the snapshot.
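
For completeness, the migration looked roughly like this (a sketch: the disk
names, the @xfer snapshot, and the mountpoint paths are placeholders, and zfs
send needs a snapshot even though I abbreviated that above):

  # on b111: build the new pool and the users filesystem
  zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
  zfs create tank/users

  # from the b134 live CD: pull tank in and replicate the backup across
  zpool import -f tank
  zfs snapshot newhope/users_backup@xfer
  zfs send newhope/users_backup@xfer | zfs receive tank/users/backup

  # still on b134 (the received filesystem wouldn't mount under b111):
  # copy the data out into the filesystem proper, assuming default mountpoints
  cp -rp /tank/users/backup/. /tank/users/

  # finally, back on b111, clean up the intermediate backup
  zfs destroy -f tank/users/backup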

I discovered something odd at this point (not that odd wasn't the word of the
day anyway, but it was a good odd this time).  After cleaning up my
permissions, I was browsing the recovered data and all the old directories
that had been present in b134 were still present and accounted for, but now
there were these new directories with a lock and a red minus icon on them as
well.  I couldn't browse them; it said I didn't have permission.  I tried
looking at them as root, and I could get into the directories, but there were
no files in them.  So my momentary ray of hope was crushed again.

I continued on with my server configuration.  I decided it was time to set up
SMB, and this time I didn't want to use the built-in CIFS.  The reason is that
ZFS's CIFS appears not to be compatible with Juniper Networks' SSL VPN, the
IVE, more specifically its web file browsing.  Actually, I know from my time
at Juniper that CIFS is really a pain in general, but I also know I never
really had a problem with SAMBA, so I decided to go that route.  So I
configured SAMBA through SWAT.

Here is a nice SAMBA guide in case you need one:

http://wikis.sun.com/display/OpenSolarisInfo200906/How+to+Set+Up+Samba+in+the+OpenSolaris+2009.06+Release
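
If you're doing the same on 2009.06, the gist from that guide is installing
the Samba packages, enabling the SMF services, and then using SWAT (on port
901) to build the shares.  Roughly (package and service names here are from
memory, so defer to the guide above if they differ):

  # install the Samba server/client packages (names as I recall them)
  pfexec pkg install SUNWsmbar SUNWsmbau

  # enable the Samba daemon and SWAT
  pfexec svcadm enable samba
  pfexec svcadm enable swat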

After I got my shares up and going, I discovered while browsing through them
that the weird directories with locks and red minus icons appeared normal!  I
then opened one and discovered all my files were there!  So I took this
opportunity to copy all the files over to my Mac and confirm their integrity.
These files represented the missing 10GB of data that was never accounted for.
I honestly have no idea what happened here, but apparently some part of the
b134 system was still tracking these files and did get them copied, even
though they weren't registering in GNOME or to ls.  It's odd that I wasn't
able to see them on b111 with root privileges from GNOME or from the CLI
either, but through SAMBA I could.

Anyway, after confirming I had all the files and they were in good shape, I
deleted the backup from tank/users and copied the files back over from the
Mac.  Now all of the files are visible from both GNOME and the CLI with none
of those locks or minuses.


Lessons learned:

* keep backups (obvious)
* state of the art is not without its price (should be obvious)
* send/receive is your friend
* SAMBA is nice and may be better than the built in CIFS
* Lots of filesystems are a good thing!  

Actually, let's look at that for a moment.  The way I am presently
implementing things is by having a "data" FS and a "users" FS, like I had, but
I am now placing torrents in their own FS.  The reason I'm doing this is that
torrents involve a lot of IO and data structure changes, which places a high
level of stress on the storage subsystem.  I know it's not as bad as a large,
heavily used database, but it gets bad enough; it's probably the most
stressful thing a single user can do as far as data structure mechanics are
concerned.  This means that the most likely place to have a failure would be
in the FS managing the torrents.  So I placed them in their own FS to isolate
that from the data in "data" and "users".  I think if I had taken this
approach before, I may have been able to recover more than just my users'
data.  So to sum up: anything you think is particularly important or
particularly stressful should be placed in its own FS, and with ZFS that's so
easy to do there is no reason not to.
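
Concretely, that isolation is a one-liner per dataset, and the riskier dataset
can then carry its own properties and snapshot schedule without touching the
rest; for example (the property shown is purely illustrative):

  zfs create tank/data
  zfs create tank/users
  zfs create tank/torrents

  # the torrent FS can be tuned independently of data and users,
  # e.g. turning off atime updates to cut down on metadata churn
  zfs set atime=off tank/torrents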


Observations regarding b134:

Though I am a little upset with b134 right now, there are some observations I
would like to make.  For starters, this was the first time I had worked with
Solaris in a LONG time.  When I last worked with Solaris, ZFS wasn't supported
on the root drive and was a relatively new development anyway.  I did a lot of
research and discovered that ZFS likes memory; in fact, it will use as much as
you give it.  So when I implemented b134 and discovered that my total system
usage was between 350MB and 400MB, before VMs, I was surprised.

Now that I am running b111 I can contrast a few things.  The memory usage is
more like what I expect, and the responsiveness of ZFS in general is much
better, both in terms of IO and in terms of command response.  I'm pretty sure
the command responsiveness is just because b111 is more mature code; however,
I'm not sure how much the IO responsiveness is linked to the memory usage
versus code maturity.  One might point out that I was running dedup on b134,
but I don't think the IO has much to do with dedup, because in my testing (and
I did a lot before implementing) I didn't see a noticeable difference with
dedup on versus off for non-duplicated data, and for duplicated data writes
were MUCH faster, as expected, since they weren't IO bound at that point.
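
For what it's worth, dedup in b134 is a per-dataset property, so on/off
comparisons like that can be run against a single dataset; roughly (the test
dataset name is made up):

  # flip dedup for one test dataset, leaving the rest of the pool alone
  zfs set dedup=on megapool/dduptest
  zfs set dedup=off megapool/dduptest

  # the pool-wide dedup ratio shows up in the DEDUP column
  zpool list megapool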

Now I have enabled sha256 hashing on b111, and I can tell you that sha256
hashing has a FAR smaller CPU cycle penalty in b134.  I am running an AMD
Phenom II X4 at 2.5GHz.  In b134 I was seeing about 8% ~ 10% CPU usage across
all cores during storage access; in b111, since enabling sha256, I see about
25% ~ 30% usage across all cores with file access, spiking to 80% ~ 90% when
the write group is performed.  The spike lasts just a second, but it is there,
and I observed no such spike in b134.  This is expected, as the sha256 message
digest was greatly improved for encryption and that code branch was
incorporated in b131.  I just wanted to share this so people had a first hand
report of the improvement.
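
Enabling sha256 is likewise just a per-dataset property; for example
(tank/users is only the example target):

  # use sha256 checksums for blocks written from now on
  # (existing blocks keep the checksum they were written with)
  zfs set checksum=sha256 tank/users
  zfs get checksum tank/users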

There were a lot of little things but those are the main observations. I think 
if the code ever matures and they get around to releasing another version of 
OpenSolaris it will be quite nice.


Thoughts in general:

In general, I want to say I am VERY impressed with ZFS.  Dedup is not a small
feature; it interacts with almost every element of the ZFS subsystem, so a bug
in it can have some catastrophic effects, as we have seen here.  However, even
though I experienced an almost complete failure, I was able to restore my most
valuable data from another FS in the same pool without the aid of outside
utilities, which is good because I don't know that any exist.  I would say
this is no small achievement under the circumstances, and it speaks highly of
ZFS's natural fault tolerance.  Though this experience has soured me greatly
on developer releases, I have become a true believer in the ZFS way.  I hope
Oracle will continue to invest in and mature this technology, as there really
is nothing else like it out there.