On Mon, Dec 11 2006, Ming Zhang wrote:
> On Mon, 2006-12-11 at 10:50 +0100, Jens Axboe wrote:
> > On Sun, Dec 10 2006, Ming Zhang wrote:
> > > Today I used blktrace to observe a strange (at least to me) behavior
> > > at the block layer. I wonder if anybody can shed some light? Thanks.
> > > 
> > > Here are the details.
> > > 
> > > ... previous requests are ok.
> > > 
> > >   8,16   0      782     7.025277381  4915  Q   W 6768 + 32 [istiod1]
> > >   8,16   0      783     7.025283850  4915  G   W 6768 + 32 [istiod1]
> > >   8,16   0      784     7.025286799  4915  P   R [istiod1]
> > >   8,16   0      785     7.025287794  4915  I   W 6768 + 32 [istiod1]
> > > 
> > > A write request to LBA 6768 was inserted into the queue.
> > > 
> > >   8,16   0      786     7.026059876  4915  Q   R 6768 + 32 [istiod1]
> > >   8,16   0      787     7.026064451  4915  G   R 6768 + 32 [istiod1]
> > >   8,16   0      788     7.026066369  4915  I   R 6768 + 32 [istiod1]
> > > 
> > > A read request to the same LBA was inserted into the queue as well.
> > > Though it cannot be merged, I thought it could be satisfied directly
> > > by the previous write request. It seems the merge function does not
> > > consider this.
> > 
> > That is the job of the upper layers, typically the page cache. For this
> > scenario to take place, you must be using raw or O_DIRECT. And in that
> > case, it is the job of the application to ensure proper ordering of
> > requests.
> 
> I see. I had assumed the block I/O layer would take responsibility for
> this as well, so I was wrong.
> 
> > 
> > >   8,16   0      789     7.034883766     0 UT   R [swapper] 2
> > >   8,16   0      790     7.034904284     9  U   R [kblockd/0] 2
> > > 
> > > Unplug because of a read.
> > > 
> > >   8,16   0      791     7.045272094     9  D   R 6768 + 32 [kblockd/0]
> > >   8,16   0      792     7.045654039     9  C   R 6768 + 32 [0]
> > > 
> > > Strangely, the read request was sent to the device before the write
> > > request, and thus returned wrong data.
> > 
> > Linux doesn't guarantee any request ordering for O_DIRECT io.
> 
> So this means it can be inserted at the front or at the back, with no
> fixed order?

It'll be sort inserted like any other request. That might be the front,
it might be the back, or it might be somewhere in the middle.
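
If you need the read to see the write, you must wait for the write to
complete yourself before issuing the read. A minimal sketch with the
2.6.18-era bio API (write_done, wbio and rbio are made-up names here,
and I'm assuming the bios are already built):

        /* completion callback: fires when the write bio is fully done */
        static int write_done(struct bio *bio, unsigned int bytes_done,
                              int err)
        {
                if (bio->bi_size)
                        return 1;       /* partial completion, keep going */
                complete((struct completion *) bio->bi_private);
                return 0;
        }

        ...

        struct completion done;

        init_completion(&done);
        wbio->bi_private = &done;
        wbio->bi_end_io = write_done;

        submit_bio(WRITE, wbio);
        wait_for_completion(&done);     /* write completed at the device */
        submit_bio(READ, rbio);         /* the read can no longer pass it */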

> > >   8,16   0      793     7.045669809     9  D   W 6768 + 32 [kblockd/0]
> > >   8,16   0      794     7.049840970     0  C   W 6768 + 32 [0]
> > > 
> > > Write finished.
> > > 
> > > So the read got wrong data back to the application. One thing I am
> > > not sure about is where (front/back) the requests are inserted into
> > > the queue, and what messed up the order here.
> > 
> > There is no mess up, you are making assumptions that aren't valid.
> > 
> > > Is it possible, for the I event, to expose the extra flag, so we
> > > know where it is inserted?
> > 
> > That would be too expensive, as we would have to peek inside the io
> > scheduler queue. So no.
> 
> see http://lxr.linux.no/source/block/elevator.c?v=2.6.18#L341 - here we
> generate the insert event and already know the 'where' value, so
> exporting that flag would not be expensive.

Maybe we are not talking about the same thing - which flag do you mean?
Do you mean the 'where' position? It'll be ELEVATOR_INSERT_SORT for
basically any request, unless the issuer specifically asked for BACK or
FRONT insertion. Those are only used in the kernel, or for non-fs
requests like SG_IO-generated ones. So I don't think the flag would add
much information that isn't already given.
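
To be concrete, this is roughly what elv_insert() does with 'where' in
2.6.18 (abbreviated and paraphrased, not the verbatim code from the file
you linked):

        void elv_insert(request_queue_t *q, struct request *rq, int where)
        {
                switch (where) {
                case ELEVATOR_INSERT_FRONT:
                        /* head of the dispatch queue */
                        list_add(&rq->queuelist, &q->queue_head);
                        break;
                case ELEVATOR_INSERT_BACK:
                        /* tail of the dispatch queue */
                        list_add_tail(&rq->queuelist, &q->queue_head);
                        break;
                case ELEVATOR_INSERT_SORT:
                        /* the normal case: hand the request to the io
                         * scheduler, which sorts it by sector */
                        q->elevator->ops->elevator_add_req_fn(q, rq);
                        break;
                }
        }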

> > > ---- is the code to generate this io -----. The disk is a regular
> > > disk and the current scheduler is CFQ.
> > 
> > Ah ok, so you are doing this inside the kernel. If you want to ensure
> > write ordering, then you need to mark the request as a barrier.
> > 
> >         submit_bio(rw | (1 << BIO_RW_BARRIER), bio);
> 
> We tried that. If we mark a write request as a barrier, we lose half
> the performance. If we mark it as BIO_RW_SYNC, there is almost no
> change. Though I still need to figure out the reason for that 50%
> performance loss compared with BIO_RW_SYNC.

You lose a lot of performance for writes, as Linux will then also ensure
ordering at the drive level. It does so because ordering only inside the
kernel makes little sense if you allow the drive to reorder at will
anyway. It is possible to control the two independently, but not from
the bio level. If you mark the bio BIO_RW_BARRIER, the request will get
marked as both a SOFT and a HARD barrier in the io scheduler. A soft
barrier has its ordering ensured inside the kernel; a hard barrier is
ordered in the kernel and at the drive side as well.

BIO_RW_SYNC doesn't imply any ordering constraints, it just tells the
kernel not to stall the request by plugging the queue.
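
To make the difference concrete (assuming the bio is already set up):

        /* barrier write: soft + hard barrier, ordered in the io
         * scheduler and at the drive - this is where the write
         * performance goes */
        submit_bio(WRITE | (1 << BIO_RW_BARRIER), bio);

        /* sync write: no ordering at all, it just gets the queue
         * unplugged right away instead of waiting for the plug timer */
        submit_bio(WRITE | (1 << BIO_RW_SYNC), bio);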

> > I won't comment on your design, but it seems somewhat strange - why
> > are you doing this in the kernel? What is the segment switching doing?
> 
> We are writing an iSCSI target at the kernel level.

One already exists :-)

> Which segment switching did you mean?

The set_fs() stuff around submit_bio() and friends.
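
In case it isn't clear what I mean, the usual idiom looks something like
this (a generic 2.6-era sketch, not your actual code):

        mm_segment_t old_fs = get_fs();

        set_fs(KERNEL_DS);      /* let code that checks user pointers
                                 * accept kernel-space buffers */
        /* ... calls that normally expect __user pointers ... */
        set_fs(old_fs);         /* restore the original address limit */

Note that submit_bio() itself takes a struct bio full of pages, not user
pointers, so it shouldn't need this.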

-- 
Jens Axboe
