Hi all, I'm testing an application which makes use of a large mmap file, roughly 2x the size of physical RAM. I'm seeing the system stall for long periods of time, 60+ seconds, and then resume. The file lives on an SSD (Intel x25-e), and I'm using ZFS's lzjb compression to make more efficient use of the ~30G of space provided by that SSD.
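For anyone wanting to reproduce the setup, a minimal Python sketch of the mapping pattern (a sparse backing file, then one mmap over the whole thing) might look like the following. This is an illustration only, not the application's actual code; the function name `map_sparse_file`, the file name, and the 64 MiB size are made up for the example.

```python
import mmap
import os
import tempfile


def map_sparse_file(path, size):
    """Create a file of `size` bytes at `path` and map all of it read/write.

    Hypothetical helper for illustration; not taken from the application.
    """
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Extending the file with ftruncate normally allocates no data
        # blocks on filesystems that support sparse files.
        os.ftruncate(fd, size)
        # Shared mapping: pages dirtied through the map are written back
        # to the backing file.
        return mmap.mmap(
            fd,
            size,
            flags=mmap.MAP_SHARED,
            prot=mmap.PROT_READ | mmap.PROT_WRITE,
        )
    finally:
        # The mapping stays valid after the descriptor is closed.
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        path = os.path.join(tmpdir, "backing.dat")
        mapping = map_sparse_file(path, 64 * 1024 * 1024)
        mapping[0:5] = b"hello"  # touching a page dirties it
        mapping.flush()          # push dirty pages to the file
        st = os.stat(path)
        # Apparent size vs. space actually allocated on disk.
        print("size:", st.st_size, "allocated:", st.st_blocks * 512)
        mapping.close()
```

Comparing the apparent size with the allocated size is a quick way to confirm the file really is sparse before loading data into it.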
The general flow is: start the application and ask it to use a 50G mmap file. That file is created sparse at the designated location, then mmap is called on the entire file. All fine up to this point. I then start loading data into the application, and it starts pushing data to the file as you'd expect. When the application's resident size reaches about 80% of the physical RAM on the system, the system starts paging and things still work relatively well, though slower, as expected. Soon after, when reaching about 40G of data, I get stalls accessing the SSD: according to iostat, there is no IO to that drive.

When I started looking into what could be causing it, such as IO timeouts, I ran dmesg and it hung after printing a timestamp. I can ctrl-c dmesg, but subsequent runs give no better results. I see no new messages in /var/adm/messages, as I'd expect. Eventually the system recovers; the latest case took over 10 minutes to recover, after killing the application mentioned above, and I do see disk timeouts in dmesg.

So, I can only assume that either there's a driver bug in the SATA/SAS controller I'm using and it's throwing timeouts, or the SSD is having issues.

Looking at the zpool configuration, I see that failmode=wait, and since that SSD is the only member of the zpool I would expect IO to hang. But does that mean that dmesg should hang also? Does that mean that the kernel has at least one thread stuck? Would failmode=continue be more desirable, or more resilient?

During the hang, load-avg is artificially high, with fmd being the one process that sticks out in prstat output. But fmdump -v doesn't show anything relevant.

Anyone have ideas on how to diagnose what's going on there?

Thanks,
Ethan

System: Sun x4240, dual-amd2347, 32G of RAM
SAS/SATA Controller: LSI3081E
OS: osol snv_98
SSD: Intel x25-e

_______________________________________________
opensolaris-discuss mailing list
[email protected]
