Hi Frank,

I have updated RFE #6588256 with the results of my tests on x86. The tests
include Filebench, tar, find/cpio and dd, along with some DTrace metrics
showing a reduction in the number of physical I/Os and larger chunk sizes.
The same tests are now running on SPARC. I have also started the FsFuzzer
test after suitable modification and it is happily chugging along - it
should complete in a few days.

I have updated all the info in the description field so that it will be
visible publicly via:
http://bugs.opensolaris.org/view_bug.do?bug_id=6588256

I have also put up the binaries for AMD64 and SPARC on Genunix, should
anyone want to play with them:
http://www.genunix.org/distributions/belenix_site//binfiles/hsfs.amd64
http://www.genunix.org/distributions/belenix_site//binfiles/hsfs.sparcv9

Regards,
Moinak.

Moinak Ghosh wrote:
> Frank.Hofmann at Sun.COM wrote:
>>
>> Hi Moinak,
>>
>> final webrev ready?
>
> Not yet. Been testing in all sorts of ways.
> I have figured out a simple way to avoid any performance
> degradation for pure random reads, by changing when
> mutex_exit is called for the non-coalescing case to allow
> greater concurrency.
>
> However, after making that change I was seeing an odd
> hang, but only when booting a BeleniX 0.6.1 LiveCD image
> from a DVD (not a CD) on a Ferrari 3400 (not on any
> other laptop or desktop) or on Mac OS X Parallels!
>
> It was caused by a starvation problem due to the
> following loop, which I resolved just now:
>
>     while (sema_tryp(&fio_done[count]) == 0) {
>         hsched_invoke_strategy(fsp->hqueue);
>     }
>
> The starvation works like this:
> Suppose there are two threads that have enqueued their
> bufs. Each will call hsched_invoke_strategy repeatedly
> until its bufs are processed.
> Sometimes the threads end up processing each other's
> bufs. So there can be a situation where there are two
> bufs in the queue and thread A picks up thread B's buf,
> while thread B picks up thread A's buf.
> Thread B issues thread A's buf first and gets it back.
> Then thread A issues thread B's buf and waits for it
> to come back. The list is now empty, so thread B keeps
> calling hsched_invoke_strategy, which returns
> immediately due to the empty list.
>
> This spin can starve other threads and the I/O may never
> complete. I fixed this by adding a return value to
> hsched_invoke_strategy so that it returns 1 if the queue
> is empty, in which case the caller simply blocks on a
> sema_p:
>
>     while (sema_tryp(&fio_done[count]) == 0) {
>         if (hsched_invoke_strategy(fsp->hqueue)) {
>             sema_p(&fio_done[count]);
>             break;
>         }
>     }
>
> This had me puzzled for a while. Thanks to Pramod and
> Sanjeev for their help.
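To spell out the new contract: hsched_invoke_strategy now reports an empty
queue back to its caller, roughly along the lines of the sketch below. Only
the return-value behaviour is taken from the change described above; the
hsched_q type and its hq_mutex/hq_head members are placeholder names, not
the actual hsfs structures.

#include <sys/types.h>
#include <sys/buf.h>
#include <sys/mutex.h>

/*
 * Sketch only.  The point is the return value: 1 means "queue is empty,
 * caller should block in sema_p()", 0 means "a buf was dequeued and issued".
 * hsched_q_t and its members are illustrative, not the real layout.
 */
typedef struct hsched_q {
        kmutex_t        hq_mutex;       /* protects hq_head */
        struct buf      *hq_head;       /* bufs linked via av_forw */
} hsched_q_t;

int
hsched_invoke_strategy(hsched_q_t *hqueue)
{
        struct buf *bp;

        mutex_enter(&hqueue->hq_mutex);
        if (hqueue->hq_head == NULL) {
                mutex_exit(&hqueue->hq_mutex);
                return (1);     /* nothing queued: caller stops spinning */
        }
        bp = hqueue->hq_head;           /* dequeue the next buf */
        hqueue->hq_head = bp->av_forw;
        mutex_exit(&hqueue->hq_mutex);

        /* ... issue bp to the device and handle completion as before ... */

        return (0);
}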
>
> I will now add in your suggestions and send out the
> updated webrevs. I will also run the fs fuzzer test.
> See inline ...
>
>> [...]
>> SDT probe ?
>> (no need to change the code just to get that in, just a thought to
>> keep in mind if making such changes)
>
> Okay, good idea, will do this, but see below ...
>
>>> Readahead stats will not be 100% accurate in the face of multiple
>>> readers on a single file.
>>> Readahead is actually called from hsfs_getapage only when a page is
>>> satisfied from the cache. This is to prevent unnecessary readahead
>>> calls if it is not resulting in actual cache hits. In addition,
>>> readahead is essentially additional getpage calls being issued ahead
>>> of time and so should have the same behavior as when multiple threads
>>> are requesting the same page[s] and they are not in cache. Also,
>>> multiple readers on the same file will actually cause the readahead
>>> counters to fall back to 0 quite fast.
>>
>> Ok, was assuming that. If we want reliable statistics, we should hook
>> that stuff into a kstat; or just use SDT probes in the relevant points.
>
> Getting a little better reliability would mean adding a lock,
> and that is bad for performance. The current stuff gives
> a good enough compromise. BTW these stats are internal-use
> only, not exposed to the user. It might make sense to expose the
> number of pages read ahead, the number of pages got from the cache
> and the number of pages got via physical I/O as kstats. That would
> give the ability to check readahead effectiveness.
>
> However, I think adding kstats and SDT probes requires an
> ARC case?
>
> If yes, then I'd rather do that via another RFE, since that would
> cause a big additional delay, and overhead for me to keep
> things in sync as more features go into hsfs down the line.
>
>> [ ... ]
>>>> - same code, should also zero out the trailing parts of the
>>>>   non-coalesced buffers. Or else, kmem_zalloc() 'iodata' to prevent
>>>>   incidental copyout of garbage when I/O was only partially
>>>>   successful. Again, in that sense _only_ the 'error' case block is
>>>>   correct, since you shouldn't make assumptions on data within a buf
>>>>   past b_resid on read (or rather, never copy such data out to the
>>>>   user). I.e. 'nbuf' contents past b_resid should be considered
>>>>   invalid.
>>>
>>> Actually data is not copied for the errored blocks. It just signals
>>> a bioerror. Should the target bufs be zeroed out as well?
>>
>> To be sure there's no "leaking" of data into userspace, it might be a
>> reasonable thing to do. I think setting b_resid = b_count would
>> achieve the same (nothing transferred). Wanted to stress that in the
>> error case, we shouldn't "leak" stuff that's none of the business of
>> the reading app.
>
> Okay.
>
>>> [...]
>>> Readahead should complete fast enough to be useful, ideally before
>>> or a little after the caller comes back to read those pages. So using
>>> a slower general-purpose queue may not give sufficient throughput,
>>> esp. with faster DVD drives. I have seen that in my tests and changed
>>> to a dynamic taskq. Since this is a dynamic taskq, the 8 threads will
>>> be created only if required.
>>> That said, I see your point; it should be enough to just use one
>>> dynamic taskq for all hsfs mounts on the machine with a max of, say,
>>> 16 threads. Since this is dynamic, using a larger max should be okay
>>> and is desirable to support a large number of sequential threads.
>>
>> The advantage of the system taskq would be that it's generically
>> "fair" and doesn't promote HSFS readahead over other system
>> activities. Yes, it's important to have it done.
>>
>> Done a bit of browsing other code ...
>> You're right, we can go by precedence. The LOFI driver does a
>> 4-thread taskq per lofi device, and no one has ever complained about
>> that. So go with the eight threads per fs instance; if we ever
>> receive a report about that causing too high a system load, changing
>> it to an N-thread-for-all would be easy enough.
>
> Okay.
>
>>> BTW are there SunRay servers with 100 CD-ROM drives?
>>
>> Every SunRay appliance could (theoretically) have one attached, and
>> these show up on the SunRay server as USB devices. I'm not sure what
>> customers use in actual deployment, though.
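Putting that together, the per-filesystem readahead taskq boils down to
something like the sketch below: a dynamic taskq created at mount time and
torn down at unmount, where taskq_destroy() waits out anything already
dispatched. Only the taskq_* routines and flags are the real DDI
interfaces; the hsched_* function names, the thread-count define and the
alloc limits are placeholders for illustration.

#include <sys/taskq.h>
#include <sys/disp.h>                   /* minclsyspri */

#define HSFS_RA_NTHREADS        8       /* readahead threads per fs instance */

/* Create the per-mount readahead taskq (sketch, not the actual code). */
static taskq_t *
hsched_ra_taskq_create(void)
{
        /*
         * TASKQ_DYNAMIC: worker threads are created on demand, so the
         * eight threads only exist while readahead work is queued.
         */
        return (taskq_create("hsfs_readahead", HSFS_RA_NTHREADS,
            minclsyspri, 1, 4 * HSFS_RA_NTHREADS, TASKQ_DYNAMIC));
}

/* Dispatch one readahead request; TQ_SLEEP waits for a task entry. */
static void
hsched_ra_dispatch(taskq_t *tq, void (*ra_func)(void *), void *ra_arg)
{
        (void) taskq_dispatch(tq, ra_func, ra_arg, TQ_SLEEP);
}

/* At unmount: taskq_destroy() waits for already-dispatched tasks to finish. */
static void
hsched_ra_taskq_destroy(taskq_t *tq)
{
        taskq_destroy(tq);
}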
>> >>> >>>> - Also, hsched_fini() doesn't make attempts to suspend/wait on the >>>> taskq; >>>> I wonder if the situation could ever arise that we manage to >>>> unmount the >>>> fs while there are still pending/unserviced readahead tasks. If >>>> not, I'd >>>> also like to know what makes sure that situation couldn'r arise. >>>> >>> >>> Won't taskq_destroy wait for the threads to complete ? But then if >>> we use one >>> per-machine taskq then an explicit check needs to be added. >> >> Checked the semantics, you're right taskq_destroy will wait. Thanks >> for the clarification. > > I will add a comment to explain this. > >> >>> >>>> >>>> Generically - I like the code. As said above, I'd like to see some >>>> explanation why the code does [ not ] race in various places. And >>>> the 'partial buf copyout' to be modified not to use b_count but >>>> always b_resid when copying. All else are nits. >>>> >>>> Ah, one final thing 'nits': There's quite a lot of non-cstyle-clean >>>> things in the new code. Pls. cleanup. >>>> >>> >>> Yes, cstyle and lint cleanup needs to be done. >> >> >> :) > > Thanks for the comments. I will update the webrev in a couple > of days; and I will also update the bug description with results > from various tests. > > Regards, > Moinak. > >> >> Best regards, >> FrankH. >> >>> >>> Regards, >>> Moinak. >>> >>>> Thanks, >>>> FrankH. >>>> >>>> >>>>> Regards, >>>>> Moinak. >>>>> >>>> _______________________________________________ >>>> caiman-discuss mailing list >>>> caiman-discuss at opensolaris.org >>>> http://mail.opensolaris.org/mailman/listinfo/caiman-discuss >>>> >>> >>> >> >> ------------------------------------------------------------------------------ >> >> >> No good can come from selling your freedom, not for all the gold in >> the world, >> for the value of this heavenly gift far exceeds that of any fortune >> on earth. >> ------------------------------------------------------------------------------ >> >> > >
