Hi all,

I'm testing an application which makes use of a large mmap file, roughly 
2x the size of physical ram.  I'm seeing the system stall for long 
periods of time, 60+ seconds, and then resume.  The file lives on an SSD 
(Intel x25-e) and I'm using zfs's lzjb compression to make more 
efficient use of the ~30G of space provided by that SSD.

The general flow of things is, start application and ask it to use a 50G 
mmap file.  That file is created in a sparse manner at the location 
designated, then mmap is called on the entire file.  All fine up to this 
point.
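
For reference, the setup step amounts to roughly the following.  This is 
a minimal Python 3 sketch, not the application's actual code: the 
function name, the temporary path, and the small 16M size (a stand-in 
for the 50G file) are all placeholders.

```python
import mmap
import os
import tempfile


def create_sparse_mapping(path, size):
    """Create a sparse file of `size` bytes and map all of it read/write."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Extending the file with ftruncate writes no data, so a filesystem
        # that supports holes allocates blocks only as pages are written back.
        os.ftruncate(fd, size)
        # MAP_SHARED so that dirty pages are written back to the file.
        mapping = mmap.mmap(fd, size, flags=mmap.MAP_SHARED,
                            prot=mmap.PROT_READ | mmap.PROT_WRITE)
    finally:
        # mmap duplicates the descriptor, so ours can be closed here.
        os.close(fd)
    return mapping


if __name__ == "__main__":
    size = 16 * 1024 * 1024  # placeholder; the real file is 50G
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "data.mmap")  # hypothetical file name
        m = create_sparse_mapping(path, size)
        m[0:5] = b"hello"   # dirties one page of the mapping
        m.flush()           # asks the OS to write dirty pages back
        st = os.stat(path)
        print("apparent size:", st.st_size)
        print("allocated bytes:", st.st_blocks * 512)
        m.close()
```

On a filesystem with hole support the allocated size should stay well 
below the apparent size until more of the mapping has been written.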

I then start loading data into the application, and it starts pushing 
data to the file as you'd expect.  When the application's resident size 
reaches about 80% of the physical ram on the system, the system starts 
paging and things are still working relatively well, though slower as 
expected.

Soon after, at about 40G of data, IO to the SSD stalls: iostat shows no 
activity at all on that drive.  When I started looking into possible 
causes, such as IO timeouts, I ran dmesg and it hung after printing a 
timestamp.  I can ctrl-c dmesg, but subsequent runs do no better.  I see 
no new messages in /var/adm/messages, as I'd expect.

Eventually the system recovers.  In the latest case that took over 10 
minutes after I killed the application mentioned above.  Once it has 
recovered, I do see disk timeouts in dmesg.

So, I can only assume that there's either a driver bug in the SATA/SAS 
controller I'm using and it's throwing timeouts, or the SSD is having 
issues.  Looking at the zpool configuration, I see that failmode=wait, 
and since that SSD is the only member of the zpool I would expect IO to 
hang.
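
The property can be read back with "zpool get failmode <pool>"; the 
valid values are wait, continue, and panic.  Below is a small Python 3 
sketch that wraps the command and parses its tabular output.  The helper 
names and the pool name "ssdpool" are made up for illustration, and the 
sample text is only what I'd expect the output to look like.

```python
import subprocess


def parse_zpool_get(output):
    """Parse `zpool get` table output into {(pool, property): value}."""
    props = {}
    for line in output.splitlines()[1:]:  # first line is the column header
        fields = line.split()
        if len(fields) >= 3:
            props[(fields[0], fields[1])] = fields[2]
    return props


def get_failmode(pool):
    """Run `zpool get failmode <pool>` and return the current value."""
    result = subprocess.run(["zpool", "get", "failmode", pool],
                            check=True, capture_output=True, text=True)
    return parse_zpool_get(result.stdout)[(pool, "failmode")]


if __name__ == "__main__":
    # Assumed sample output; "ssdpool" is a hypothetical pool name.
    sample = (
        "NAME     PROPERTY  VALUE     SOURCE\n"
        "ssdpool  failmode  wait      default\n"
    )
    print(parse_zpool_get(sample))
```

Switching modes would then be "zpool set failmode=continue <pool>".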

But, does that mean that dmesg should hang also?  Does that mean that 
the kernel has at least one thread stuck?  Would failmode=continue be 
more desired, or resilient?

During the hang, load-avg is artificially high, fmd being the one 
process that sticks out in prstat output.  But fmdump -v doesn't show 
anything relevant.

Anyone have ideas on how to diagnose what's going on there?
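
One thing that might help pin down when the stalls start and how long 
they last, independent of dmesg, is a small user-space probe that does a 
timed write+fsync to a file on the affected pool and reports any call 
that takes too long.  A rough Python 3 sketch follows; the function 
name, the default interval and threshold, and the probe path are all 
assumptions.  Output should go to a terminal or to a file on a different 
pool.

```python
import os
import time


def probe(path, interval=1.0, threshold=5.0, iterations=None):
    """Append one byte to `path` and fsync it every `interval` seconds.

    Returns a list of (timestamp, seconds) for each write+fsync that took
    longer than `threshold` seconds.  Runs forever if iterations is None.
    """
    stalls = []
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_APPEND, 0o600)
    try:
        count = 0
        while iterations is None or count < iterations:
            start = time.monotonic()
            os.write(fd, b".")
            os.fsync(fd)  # does not return until the data is on stable storage
            elapsed = time.monotonic() - start
            if elapsed > threshold:
                stamp = time.strftime("%Y-%m-%d %H:%M:%S")
                stalls.append((stamp, elapsed))
                print(f"{stamp} write+fsync took {elapsed:.1f}s", flush=True)
            count += 1
            time.sleep(interval)
    finally:
        os.close(fd)
    return stalls


# Example (hypothetical path on the SSD-backed pool):
#     probe("/ssdpool/probe.dat")
```

The reported timestamps can then be lined up against iostat output and 
the disk timeouts that show up in dmesg after recovery.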

Thanks,
Ethan

System: Sun x4240 dual-amd2347, 32G of ram
SAS/SATA Controller: LSI3081E
OS: osol snv_98
SSD: Intel x25-e

_______________________________________________
opensolaris-discuss mailing list