Ok, so I have had more success.  I thought I would recount some of this for 
those who might be suffering from similar issues and tie it all together in a 
cleaner package.  


First off, the config:

The setup included 6 WD Green 1.5TB HDs in a RAIDz2 config under b134 with
dedup enabled.  This configuration had several instabilities, as mentioned in
earlier posts.  These included system freezes, commands locking up and not
being killable, and weird file anomalies, all of it culminating in an almost
total corruption of my zpool, named megapool.  Within megapool I had created
two filesystems, one called "data" and one called "users".  The actions that
ultimately resulted in the failure of the pool all occurred in the data
filesystem.  At this time it is unclear whether the corruption was pool wide
or localized to the data filesystem, and unfortunately it is difficult to pin
down because the '-R' switch for zpool import seems to simply fall through in
b134 (I have not tested other builds, such as b111 "2009.06").  However, there
is evidence that it might be largely localized to that filesystem, although
not completely confined, since the dedup tables are pool wide.
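
For reference, a layout like mine boils down to something like this (a sketch
only; the c*t*d* device names are placeholders, not my actual disks):

  # a six-disk raidz2 pool with dedup on, and the two filesystems inside it
  zpool create megapool raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
  zfs set dedup=on megapool
  zfs create megapool/data
  zfs create megapool/users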


Moments before the failure:

Just before the failure occurred I had been deleting files over SMB, using
ZFS's built-in CIFS and not SAMBA.  After I finished, the pool was still
responsive; in fact, I watched a few streaming videos from the NAS before
going to bed.  So we know the pool and filesystems were operational while they
were in their mounted state.  When I got up the next day, the server would
respond to ping but not to SMB, VNC or SSH.  I couldn't check the console
because of the server's location at the time and the absence of a monitor, so
I power cycled the box.  It came back responding to ping but never responded
to any other services.  At that point I relocated the server so I could look
at the console and saw it was hanging while probing and initializing the ZFS
filesystems, stuck at 1/16.


Steps taken to recover:

I have managed to finally recover the entire contents of the users filesystem.
This is a HUGE win, as these files are, for the most part, not duplicated
elsewhere and represent original content.

To do this, I first booted from a live CD of b134.  I did this because the
server would not boot otherwise; it would hang at the zpool probing stage.
Booting from the live CD let me bypass the ZFS filesystem initialization, as
the CD didn't know about the zpools ahead of time.  I could then do a
"zpool import -f megapool".  This command would never complete, but if I
opened another terminal I could do a "zpool list" or "zpool status" and see
megapool, and "zfs list" would show megapool's filesystems.  It is my belief
that the import command never finished because it was attempting to mount all
of megapool's filesystems, including the almost certainly corrupted data
filesystem.
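
In case it helps someone in the same spot, the sequence was essentially this
(a sketch; the second terminal is just for watching, not required):

  # terminal 1: force the import (for me this never returned)
  zpool import -f megapool

  # terminal 2: the pool and its filesystems were still visible
  zpool list
  zpool status megapool
  zfs list -r megapool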

I then ran "iostat -xn 1" to monitor the hard drive IO and confirm operations
were indeed taking place, seeing as everything was under suspicion at this
point.  I then set up a new pool, which I called newhope (yes, I'm a Star Wars
fan, but I thought it was fitting), and did a "zfs send megapool/users/snapshot
| zfs receive newhope/users_backup".  I watched the IO and confirmed that the
data was indeed being taken from megapool and placed onto newhope.  I let this
run for as long as I had IO confirmation that something was happening.  It
turned out that the command actually terminated when the IO stopped flowing.
This was a good sign, and I could indeed see the snapshot on newhope, and it
was browsable.
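
Roughly, that recovery step boiled down to this (a sketch: the snapshot name
and the disk backing newhope are placeholders, and I'm writing the snapshot in
the usual pool/fs@snapshot form):

  # a scratch pool to receive into (disk name is an example only)
  zpool create newhope c2t0d0

  # in another terminal, watch the disks to confirm data is really moving
  iostat -xn 1

  # copy the users snapshot off the damaged pool
  zfs send megapool/users@snap | zfs receive newhope/users_backup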

However, there was a problem.  Of the 51GB, only 41GB was actually visible,
and indeed many files were missing.  I tried several more copies from older
snapshots, and the exact same results occurred, which isn't too surprising
considering ZFS's pointer architecture.  The interesting thing is that the
41GB represented the original files I placed onto the NAS when I built it back
in April and didn't include any of the files that had been placed on it in the
interim.  I'm not sure what to make of that, but I think it's worth noting.
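
Trying the older snapshots was just a matter of listing what was there and
repeating the send; something like this (the snapshot names are made up):

  # see which snapshots of users survived
  zfs list -t snapshot -r megapool/users

  # try an older one into a separate target
  zfs send megapool/users@april | zfs receive newhope/users_backup_april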

I then decided I had all the user data I was likely to get and gave up on
further send/receive attempts.  However, there was one snapshot for the data
filesystem, and I decided it wouldn't hurt to try to get some of that data
back as well.  So I did the "zfs send megapool/data/snapshot | zfs receive
newhope/data_backup" and watched the IO.  This took a long time, as I expected
for a 450GB snapshot.  Eventually the IO stopped flowing, but the command did
not terminate and the snapshot never appeared in newhope, though it was
reported by "zfs list -t snapshot".  This helped reinforce my opinion that the
corruption was centered on the data filesystem.

It was at this time I decided to set up my new server.  I didn't simply want
to re-implement on top of b134 minus dedup, so I decided to go with the
OpenSolaris 2009.06 release (b111).  I thought this would likely be more
stable, as it's a main release that's been out for a while.  So I installed
the OS, created a new pool out of my 6 HDs, and decided to call it "tank"
(kind of how a parent might name his child hoping he'll live up to it).
However, I was not able to import the data from newhope under b111 because
that pool was created under b134.  So I booted with the b134 live CD again,
did a "zpool import -f tank", and then did a "zfs send newhope/users_backup |
zfs receive tank/users/backup", and again I watched iostat to confirm data was
moving; things completed as hoped.  I then rebooted into b111; however, the
snapshots were not able to mount.  So it was back to b134 to cp the data from
the snapshots into the FS proper.  I then booted into b111 again, confirmed
the data was indeed there, and did a "zfs destroy -f tank/users/backup" to
clean up the snapshot.
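
For completeness, the migration looked roughly like this (a sketch: the disk
names, the @xfer snapshot, and the mountpoint paths are placeholders, and zfs
send needs a snapshot even though I abbreviated that above):

  # on b111: build the new pool and the users filesystem
  zpool create tank raidz2 c1t0d0 c1t1d0 c1t2d0 c1t3d0 c1t4d0 c1t5d0
  zfs create tank/users

  # from the b134 live CD: pull tank in and replicate the backup across
  zpool import -f tank
  zfs snapshot newhope/users_backup@xfer
  zfs send newhope/users_backup@xfer | zfs receive tank/users/backup

  # still on b134 (the received filesystem wouldn't mount under b111):
  # copy the data out into the filesystem proper, assuming default mountpoints
  cp -rp /tank/users/backup/. /tank/users/

  # finally, back on b111, clean up the intermediate backup
  zfs destroy -f tank/users/backup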

I discovered something odd at this point (not that odd wasn't the word of the
day anyway, but it was a good odd this time).  After cleaning up my
permissions, I was browsing the recovered data and all the old directories
that had been present in b134 were still present and accounted for, but now
there were these new directories with a lock and a red minus icon on them as
well.  I couldn't browse them; it said I didn't have permission.  I tried
looking at them as root, and I could get into the directories, but there were
no files in them.  So my momentary ray of hope was crushed again.

I continued on with my server configuration.  I decided it was time to set up
SMB, and this time I didn't want to use the built-in CIFS.  The reason is that
ZFS's CIFS appears not to be compatible with Juniper Networks' SSL VPN, the
IVE, more specifically its web file browsing.  Actually, I know from my time
at Juniper that CIFS is really a pain in general, but I also know I never
really had a problem with SAMBA, so I decided to go that route.  So I
configured SAMBA through SWAT.

Here is a nice SAMBA guide in case you need one:

http://wikis.sun.com/display/OpenSolarisInfo200906/How+to+Set+Up+Samba+in+the+OpenSolaris+2009.06+Release
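
If you're doing the same on 2009.06, the gist from that guide is installing
the Samba packages, enabling the SMF services, and then using SWAT (on port
901) to build the shares.  Roughly (package and service names here are from
memory, so defer to the guide above if they differ):

  # install the Samba server/client packages (names as I recall them)
  pfexec pkg install SUNWsmbar SUNWsmbau

  # enable the Samba daemon and SWAT
  pfexec svcadm enable samba
  pfexec svcadm enable swat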

After I got my shares up and going, I discovered while browsing through them
that the weird directories with locks and red minus icons appeared normal!  I
then opened one and discovered all my files were there!  So I took this
opportunity to copy all the files over to my Mac and confirm their integrity.
These files represented the missing 10GB of data that was never accounted for.
I honestly have no idea what happened here, but apparently some part of the
b134 system was still tracking these files and did get them copied, even
though they weren't registering in GNOME or to ls.  It's odd that I wasn't
able to see them on b111 with root privileges from GNOME or from the CLI
either, but through SAMBA I could.

Anyway, after confirming I had all the files and they were in good shape, I
deleted the backup from tank/users and copied the files back over from the
Mac.  Now all of the files are visible from both GNOME and the CLI with none
of those locks or minuses.


Lessons learned:

* keep backups (obvious)
* state of the art is not without its price (should be obvious)
* send/receive is your friend
* SAMBA is nice and may be better than the built in CIFS
* Lots of filesystems are a good thing!  

Actually, let's look at that for a moment.  The way I am presently
implementing things is by having a "data" FS and a "users" FS, like I had, but
I am now placing torrents in their own FS.  The reason I'm doing this is that
torrents involve a lot of IO and data structure changes, which places a high
level of stress on the storage subsystem.  I know it's not as bad as a large,
heavily used database, but it gets bad enough; it's probably the most
stressful thing a single user can do as far as data structure mechanics are
concerned.  This means that the most likely place to have a failure would be
in the FS managing the torrents.  So I placed them in their own FS to isolate
that from the data in "data" and "users".  I think if I had taken this
approach before, I may have been able to recover more than just my users'
data.  So to sum up: anything you think is particularly important or
particularly stressful should be placed in its own FS, and with ZFS that's so
easy to do there is no reason not to.
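
Concretely, that isolation is a one-liner per dataset, and the riskier dataset
can then carry its own properties and snapshot schedule without touching the
rest; for example (the property shown is purely illustrative):

  zfs create tank/data
  zfs create tank/users
  zfs create tank/torrents

  # the torrent FS can be tuned independently of data and users,
  # e.g. turning off atime updates to cut down on metadata churn
  zfs set atime=off tank/torrents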


Observations regarding b134:

Though I am a little upset with b134 right now, there are some observations I
would like to make.  For starters, this was the first time I had worked with
Solaris in a LONG time.  When I last worked with Solaris, ZFS wasn't supported
on the root drive and was a relatively new development anyway.  I did a lot of
research and discovered that ZFS likes memory; in fact, it will use as much as
you give it.  So when I implemented b134 and discovered that my total system
usage was between 350MB and 400MB, before VMs, I was surprised.

Now that I am running b111 I can contrast a few things.  The memory usage is
more like what I expect, and the responsiveness of ZFS in general is much
better, both in terms of IO and in terms of command response.  I'm pretty sure
the command responsiveness is just because b111 is more mature code; however,
I'm not sure how much the IO responsiveness is linked to the memory usage
versus code maturity.  One might point out that I was running dedup on b134,
but I don't think the IO has much to do with dedup, because in my testing (and
I did a lot before implementing) I didn't see a noticeable difference with
dedup on versus off for non-duplicated data, and for duplicated data writes
were MUCH faster, as expected, since they weren't IO bound at that point.
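
For what it's worth, dedup in b134 is a per-dataset property, so on/off
comparisons like that can be run against a single dataset; roughly (the test
dataset name is made up):

  # flip dedup for one test dataset, leaving the rest of the pool alone
  zfs set dedup=on megapool/dduptest
  zfs set dedup=off megapool/dduptest

  # the pool-wide dedup ratio shows up in the DEDUP column
  zpool list megapool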

Now I have enabled sha256 hashing on b111, and I can tell you that sha256
hashing has a FAR smaller CPU cycle penalty in b134.  I am running an AMD
Phenom II X4 at 2.5GHz.  In b134 I was seeing about 8% ~ 10% CPU usage across
all cores during storage access; in b111, since enabling sha256, I see about
25% ~ 30% usage across all cores with file access, spiking to 80% ~ 90% when
the write group is performed.  The spike lasts just a second, but it is there,
and I observed no such spike in b134.  This is expected, as the sha256 message
digest was greatly improved for encryption and that code branch was
incorporated in b131.  I just wanted to share this so people had a first hand
report of the improvement.
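
Enabling sha256 is likewise just a per-dataset property; for example
(tank/users is only the example target):

  # use sha256 checksums for blocks written from now on
  # (existing blocks keep the checksum they were written with)
  zfs set checksum=sha256 tank/users
  zfs get checksum tank/users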

There were a lot of little things but those are the main observations. I think 
if the code ever matures and they get around to releasing another version of 
OpenSolaris it will be quite nice.


Thoughts in general:

In general, I want to say I am VERY impressed with ZFS.  Dedup is not a small
feature; it interacts with almost every element of the ZFS subsystem, so a bug
in it can have some catastrophic effects, as we have seen here.  However, even
though I experienced an almost complete failure, I was able to restore my most
valuable data from another FS in the same pool without the aid of outside
utilities, which is good because I don't know that any exist.  I would say
this is no small achievement under the circumstances, and it speaks highly of
ZFS's natural fault tolerance.  Though this experience has soured me greatly
on developer releases, I have become a true believer in the ZFS way.  I hope
Oracle will continue to invest in and mature this technology, as there really
is nothing else like it out there.