Hi Frank,

I have updated RFE #6588256 with the results of my tests on x86. The tests
include Filebench, tar, find/cpio and dd, along with some DTrace metrics
showing a reduction in the number of physical I/Os and larger chunk sizes.
The same tests are now running on SPARC. I have also started the FsFuzzer
test after suitable modification and it is happily chugging along - it
should complete in a few days.

I have updated all the info in the description field so that it will be
visible publicly via:
http://bugs.opensolaris.org/view_bug.do?bug_id=6588256

I have also put up the binaries for AMD64 and SPARC on Genunix, should
anyone want to play with them:
http://www.genunix.org/distributions/belenix_site//binfiles/hsfs.amd64
http://www.genunix.org/distributions/belenix_site//binfiles/hsfs.sparcv9

Regards,
Moinak.

Moinak Ghosh wrote:
> Frank.Hofmann at Sun.COM wrote:
>>
>> Hi Moinak,
>>
>> final webrev ready?
>
> Not yet. Been testing in all sorts of ways.
> I have figured out a simple way to avoid any performance
> degradation for pure random reads, by changing when
> mutex_exit is called for the non-coalescing case to allow
> greater concurrency.
>
> However, after making that change I was seeing an odd
> hang, but only when booting a BeleniX 0.6.1 LiveCD image
> from a DVD (not a CD) on a Ferrari 3400 (not on any
> other laptop or desktop) or on Mac OS X Parallels!
>
> It was caused by a starvation problem due to the
> following loop, which I resolved just now:
>
>     while (sema_tryp(&fio_done[count]) == 0) {
>         hsched_invoke_strategy(fsp->hqueue);
>     }
>
> The starvation works like this:
> Suppose there are two threads that have enqueued their
> bufs. Each will call hsched_invoke_strategy repeatedly
> until its bufs are processed.
> Sometimes the threads end up processing each other's
> bufs. So there can be a situation where there are two
> bufs in the queue and thread A picks up thread B's buf,
> while thread B picks up thread A's buf.
> Thread B issues thread A's buf first and gets it back.
> Then thread A issues thread B's buf and waits for it
> to come back. The list is now empty, so thread B keeps
> calling hsched_invoke_strategy, which returns
> immediately due to the empty list.
>
> This spin can starve other threads and the I/O may never
> complete. I fixed this by adding a return value to
> hsched_invoke_strategy so that it returns 1 if the queue
> is empty, in which case the caller simply blocks on a
> sema_p:
>
>     while (sema_tryp(&fio_done[count]) == 0) {
>         if (hsched_invoke_strategy(fsp->hqueue)) {
>             sema_p(&fio_done[count]);
>             break;
>         }
>     }
>
> This had me puzzled for a while. Thanks to Pramod and
> Sanjeev for their help.
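To spell out the new contract: hsched_invoke_strategy now reports an empty
queue back to its caller, roughly along the lines of the sketch below. Only
the return-value behaviour is taken from the change described above; the
hsched_q type and its hq_mutex/hq_head members are placeholder names, not
the actual hsfs structures.

#include <sys/types.h>
#include <sys/buf.h>
#include <sys/mutex.h>

/*
 * Sketch only.  The point is the return value: 1 means "queue is empty,
 * caller should block in sema_p()", 0 means "a buf was dequeued and issued".
 * hsched_q_t and its members are illustrative, not the real layout.
 */
typedef struct hsched_q {
        kmutex_t        hq_mutex;       /* protects hq_head */
        struct buf      *hq_head;       /* bufs linked via av_forw */
} hsched_q_t;

int
hsched_invoke_strategy(hsched_q_t *hqueue)
{
        struct buf *bp;

        mutex_enter(&hqueue->hq_mutex);
        if (hqueue->hq_head == NULL) {
                mutex_exit(&hqueue->hq_mutex);
                return (1);     /* nothing queued: caller stops spinning */
        }
        bp = hqueue->hq_head;           /* dequeue the next buf */
        hqueue->hq_head = bp->av_forw;
        mutex_exit(&hqueue->hq_mutex);

        /* ... issue bp to the device and handle completion as before ... */

        return (0);
}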
>
> I will now add in your suggestions and send out the
> updated webrevs. I will also run the fs fuzzer test.
> See inline ...
>
>> [...]
>> SDT probe ?
>> (no need to change the code just to get that in, just a thought to
>> keep in mind if making such changes)
>
> Okay, good idea, will do this, but see below ...
>
>>> Readahead stats will not be 100% accurate in the face of multiple
>>> readers on a single file.
>>> Readahead is actually called from hsfs_getapage only when a page is
>>> satisfied from the cache. This is to prevent unnecessary readahead
>>> calls if it is not resulting in actual cache hits. In addition,
>>> readahead is essentially additional getpage calls being issued ahead
>>> of time and so should have the same behavior as when multiple threads
>>> are requesting the same page[s] and they are not in cache. Also,
>>> multiple readers on the same file will actually cause the readahead
>>> counters to fall back to 0 quite fast.
>>
>> Ok, was assuming that. If we want reliable statistics, we should hook
>> that stuff into a kstat; or just use SDT probes in the relevant points.
>
> Getting a little better reliability would mean adding a lock,
> and that is bad for performance. The current stuff gives
> a good enough compromise. BTW these stats are internal-use
> only, not exposed to the user. It might make sense to expose the
> number of pages read ahead, the number of pages got from the cache
> and the number of pages got via physical I/O as kstats. That would
> give the ability to check readahead effectiveness.
>
> However, I think adding kstats and SDT probes requires an
> ARC case?
>
> If yes, then I'd rather do that via another RFE, since that would
> cause a big additional delay, and overhead for me to keep
> things in sync as more features go into hsfs down the line.
>
>> [ ... ]
>>>> - same code, should also zero out the trailing parts of the
>>>>   non-coalesced buffers. Or else, kmem_zalloc() 'iodata' to prevent
>>>>   incidental copyout of garbage when I/O was only partially
>>>>   successful. Again, in that sense _only_ the 'error' case block is
>>>>   correct, since you shouldn't make assumptions on data within a buf
>>>>   past b_resid on read (or rather, never copy such data out to the
>>>>   user). I.e. 'nbuf' contents past b_resid should be considered
>>>>   invalid.
>>>
>>> Actually data is not copied for the errored blocks. It just signals
>>> a bioerror. Should the target bufs be zeroed out as well?
>>
>> To be sure there's no "leaking" of data into userspace, it might be a
>> reasonable thing to do. I think setting b_resid = b_count would
>> achieve the same (nothing transferred). Wanted to stress that in the
>> error case, we shouldn't "leak" stuff that's none of the business of
>> the reading app.
>
> Okay.
>
>>> [...]
>>> Readahead should complete fast enough to be useful, ideally before
>>> or a little after the caller comes back to read those pages. So using
>>> a slower general-purpose queue may not give sufficient throughput,
>>> esp. with faster DVD drives. I have seen that in my tests and changed
>>> to a dynamic taskq. Since this is a dynamic taskq, the 8 threads will
>>> be created only if required.
>>> That said, I see your point; it should be enough to just use one
>>> dynamic taskq for all hsfs mounts on the machine with a max of, say,
>>> 16 threads. Since this is dynamic, using a larger max should be okay
>>> and is desirable to support a large number of sequential threads.
>>
>> The advantage of the system taskq would be that it's generically
>> "fair" and doesn't promote HSFS readahead over other system
>> activities. Yes, it's important to have it done.
>>
>> Done a bit of browsing other code ...
>> You're right, we can go by precedence. The LOFI driver does a
>> 4-thread taskq per lofi device, and no one has ever complained about
>> that. So go with the eight threads per fs instance; if we ever
>> receive a report about that causing too high a system load, changing
>> it to an N-thread-for-all would be easy enough.
>
> Okay.
>
>>> BTW are there SunRay servers with 100 CD-ROM drives?
>>
>> Every SunRay appliance could (theoretically) have one attached, and
>> these show up on the SunRay server as USB devices. I'm not sure what
>> customers use in actual deployment, though.
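Putting that together, the per-filesystem readahead taskq boils down to
something like the sketch below: a dynamic taskq created at mount time and
torn down at unmount, where taskq_destroy() waits out anything already
dispatched. Only the taskq_* routines and flags are the real DDI
interfaces; the hsched_* function names, the thread-count define and the
alloc limits are placeholders for illustration.

#include <sys/taskq.h>
#include <sys/disp.h>                   /* minclsyspri */

#define HSFS_RA_NTHREADS        8       /* readahead threads per fs instance */

/* Create the per-mount readahead taskq (sketch, not the actual code). */
static taskq_t *
hsched_ra_taskq_create(void)
{
        /*
         * TASKQ_DYNAMIC: worker threads are created on demand, so the
         * eight threads only exist while readahead work is queued.
         */
        return (taskq_create("hsfs_readahead", HSFS_RA_NTHREADS,
            minclsyspri, 1, 4 * HSFS_RA_NTHREADS, TASKQ_DYNAMIC));
}

/* Dispatch one readahead request; TQ_SLEEP waits for a task entry. */
static void
hsched_ra_dispatch(taskq_t *tq, void (*ra_func)(void *), void *ra_arg)
{
        (void) taskq_dispatch(tq, ra_func, ra_arg, TQ_SLEEP);
}

/* At unmount: taskq_destroy() waits for already-dispatched tasks to finish. */
static void
hsched_ra_taskq_destroy(taskq_t *tq)
{
        taskq_destroy(tq);
}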
>> >>> >>>> - Also, hsched_fini() doesn't make attempts to suspend/wait on the >>>> taskq; >>>> I wonder if the situation could ever arise that we manage to >>>> unmount the >>>> fs while there are still pending/unserviced readahead tasks. If >>>> not, I'd >>>> also like to know what makes sure that situation couldn'r arise. >>>> >>> >>> Won't taskq_destroy wait for the threads to complete ? But then if >>> we use one >>> per-machine taskq then an explicit check needs to be added. >> >> Checked the semantics, you're right taskq_destroy will wait. Thanks >> for the clarification. > > I will add a comment to explain this. > >> >>> >>>> >>>> Generically - I like the code. As said above, I'd like to see some >>>> explanation why the code does [ not ] race in various places. And >>>> the 'partial buf copyout' to be modified not to use b_count but >>>> always b_resid when copying. All else are nits. >>>> >>>> Ah, one final thing 'nits': There's quite a lot of non-cstyle-clean >>>> things in the new code. Pls. cleanup. >>>> >>> >>> Yes, cstyle and lint cleanup needs to be done. >> >> >> :) > > Thanks for the comments. I will update the webrev in a couple > of days; and I will also update the bug description with results > from various tests. > > Regards, > Moinak. > >> >> Best regards, >> FrankH. >> >>> >>> Regards, >>> Moinak. >>> >>>> Thanks, >>>> FrankH. >>>> >>>> >>>>> Regards, >>>>> Moinak. >>>>> >>>> _______________________________________________ >>>> caiman-discuss mailing list >>>> caiman-discuss at opensolaris.org >>>> http://mail.opensolaris.org/mailman/listinfo/caiman-discuss >>>> >>> >>> >> >> ------------------------------------------------------------------------------ >> >> >> No good can come from selling your freedom, not for all the gold in >> the world, >> for the value of this heavenly gift far exceeds that of any fortune >> on earth. >> ------------------------------------------------------------------------------ >> >> > >
