Re: [Qemu-block] RFC: some problems with bdrv_get_block_status

Denis V. Lunev Fri, 28 Apr 2017 17:05:03 -0700

On 04/28/2017 09:31 PM, Eric Blake wrote:
> On 04/28/2017 11:17 AM, Denis V. Lunev wrote:
>> Hello, All!
>>
>> Recently we have experienced problems with very slow
>> bdrv_get_block_status call, which is massive called f.e.
>> from mirror_run().
>>
>> The problem was narrowed down to slow lseek64() system
>> call, which can take 1-2-3 seconds.
> I'm guessing you meant one-to-three (the range), not one-two-three
> (three separate digits), and just had an unfortunate abbreviation of
> 'to' turning into the phonetically-similar '2'.
>
from 1 to 3, you are correct



>> root@s158 ~]# strace -f -p 77048 -T  -e lseek
>> Process 77048 attached with 14 threads
>> [pid 77053] lseek(23, 6359613440, SEEK_DATA) = 6360276992 <1.250323>
> That sounds like a bug in your choice of filesystem.  It's been
> mentioned before that lseek has some pathetically poor behavior (I think
> tmpfs was one of the culprits), but I maintain that it is better to
> hammer on the kernel folks to fix the poor behavior than it is to have
> to implement user-space workarounds in every single affected program.
>
> That said, a workaround of being able to request the avoidance of lseek
> has been brought up on this list before.
>
>
>> The problem comes from this branch of the code
>>
>> bdrv_co_get_block_status
>>     .......
>>     if (bs->file &&
>>         (ret & BDRV_BLOCK_DATA) && !(ret & BDRV_BLOCK_ZERO) &&
>>         (ret & BDRV_BLOCK_OFFSET_VALID)) {
>>         int file_pnum;
>>
>>         ret2 = bdrv_co_get_block_status(bs->file, ret >> BDRV_SECTOR_BITS,
>>                                         *pnum, &file_pnum);
>>         if (ret2 >= 0) {
>>             /* Ignore errors.  This is just providing extra information, it
>>              * is useful but not necessary.
>>              */
>>             if (!file_pnum) {
>>                 /* !file_pnum indicates an offset at or beyond the EOF;
>> it is
>>                  * perfectly valid for the format block driver to point
>> to such
>>                  * offsets, so catch it and mark everything as zero */
>>                 ret |= BDRV_BLOCK_ZERO;
>>             } else {
>>                 /* Limit request to the range reported by the protocol
>> driver */
>>                 *pnum = file_pnum;
>>                 ret |= (ret2 & BDRV_BLOCK_ZERO);
>>             }
>>         }
>>     }
> I don't see anything wrong with this code. It's nice to know that we
> have data (qcow2 says we have allocated bytes due to this layer, so
> don't refer to the backing files), but when the underlying file can ALSO
> tell us that the underlying protocol is sparse and we are on a hole,
> then we know that we have BDRV_BLOCK_ZERO condition which can in turn
> enable other optimizations in quite a bit of the stack.  It IS valid to
> return BDRV_BLOCK_DATA | BDRV_BLOCK_ZERO (we don't have to limit
> ourselves to BDRV_BLOCK_DATA), and we should probably do so any time
> that such computation is cheap.
>
> I agree with your analysis that on a poor filesystem where lseek()
> behavior sucks that it is no longer cheap to determine where the holes are.
>
> But the proper workaround for those filesystems should NOT be made by
> undoing this part of bdrv_co_get_block_status(), but should rather be
> fixed in file-posix.c by the addition of a user-controllable knob on
> whether to skip lseek().  In other words, if we're going to work around
> the poor filesystem performance, the workaround should live in
> file-posix.c, not in the generic block/io.c.  Once the workaround is
> enabled, the 'ret2 = bdrv_co_get_block_status()' nested call should be
> lightning fast (because file-posix.c would no longer be using lseek when
> you've requested the workaround).
>
>> Frankly speaking, this optimization should not give much.
>> If upper layer format (QCOW2) says that we have data
>> here, then nowadays in 99.9% we do have the data.
> You are correct that we have BDRV_BLOCK_DATA; but it IS useful to ALSO
> know if that data reads back as all zeros (so we can skip reading it).
> Just because qcow2 reports something as data does NOT rule out whether
> the data still reads as zeros.
>
>> Meanwhile this branch can cause problems. We would
>> need block cleared entirely to get the benefit for most
>> cases in mirror and backup operations.
>>
>> At my opinion it worth to drop this at all.
>>
>> Guys, do you have any opinion?
> Again, my opinion is to not change this part of block/io.c.  Rather,
> work should be expended on individual protocol drivers to quit wasting
> unnecessary time on reporting BDRV_BLOCK_ZERO if the time spent to
> determine that is not worth the effort (as it is when using lseek() on
> inefficient filesystems).  It is always safe for a protocol driver to
> report BDRV_BLOCK_DATA instead of BDRV_BLOCK_ZERO, if asking the
> question is too expensive (treating holes as data may pessimize other
> operations, but that's okay if the penalty for asking is worse than what
> optimizations are able to regain).
>
>> Den
>>
>> P.S. The kernel is one based on RedHat 3.10.0-514. The same
>>       problem was observed in 3.10.0-327 too.
> Red Hat is always two words (I'm allowed to point that out, as they
> employ me ;).  But if it really is a Red Hat kernel problem, be sure to
> use your support contract to point out the kernel developers that they
> really need to fix lseek() - and you'll need to give details on which
> filesystem you're using (as not all filesystems have that abysmal
> performance).
>

Re: [Qemu-block] RFC: some problems with bdrv_get_block_status

Reply via email to