On Mon, 2006-12-11 at 15:50 +0100, Jens Axboe wrote:
> On Mon, Dec 11 2006, Ming Zhang wrote:
> > On Mon, 2006-12-11 at 15:32 +0100, Jens Axboe wrote:
> > > On Mon, Dec 11 2006, Ming Zhang wrote:
> > > > On Mon, 2006-12-11 at 10:50 +0100, Jens Axboe wrote:
> > > > > On Sun, Dec 10 2006, Ming Zhang wrote:
> > > > > > Today I used blktrace to observe a strange (at least to me) behavior
> > > > > > at the block layer. Can anybody shed some light? Thanks.
> > > > > > 
> > > > > > Here are the details.
> > > > > > 
> > > > > > ... previous requests are ok.
> > > > > > 
> > > > > >   8,16   0      782     7.025277381  4915  Q   W 6768 + 32 [istiod1]
> > > > > >   8,16   0      783     7.025283850  4915  G   W 6768 + 32 [istiod1]
> > > > > >   8,16   0      784     7.025286799  4915  P   R [istiod1]
> > > > > >   8,16   0      785     7.025287794  4915  I   W 6768 + 32 [istiod1]
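> > > > > > 
> > > > > > (For reference, the blktrace action codes in these traces: Q=queued,
> > > > > > G=get request, P=plug, I=inserted, U=unplug, UT=unplug due to timer,
> > > > > > D=dispatched to the driver, C=completed.)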
> > > > > > 
> > > > > > The write request to lba 6768 was inserted into the queue.
> > > > > > 
> > > > > >   8,16   0      786     7.026059876  4915  Q   R 6768 + 32 [istiod1]
> > > > > >   8,16   0      787     7.026064451  4915  G   R 6768 + 32 [istiod1]
> > > > > >   8,16   0      788     7.026066369  4915  I   R 6768 + 32 [istiod1]
> > > > > > 
> > > > > > A read request to the same lba was inserted into the queue as well.
> > > > > > Though it cannot be merged, I thought it could be satisfied directly
> > > > > > by the previous write request. It seems the merge function does not
> > > > > > consider this.
> > > > > 
> > > > > That is the job of the upper layers, typically the page cache. For this
> > > > > scenario to take place, you must be using raw or O_DIRECT. And in that
> > > > > case, it is the job of the application to ensure proper ordering of
> > > > > requests.
> > > > 
> > > > I see. I assumed the block layer should take responsibility for this as
> > > > well, so I was wrong.
> > > > 
> > > > > 
> > > > > >   8,16   0      789     7.034883766     0 UT   R [swapper] 2
> > > > > >   8,16   0      790     7.034904284     9  U   R [kblockd/0] 2
> > > > > > 
> > > > > > Unplug because of a read.
> > > > > > 
> > > > > >   8,16   0      791     7.045272094     9  D   R 6768 + 32 [kblockd/0]
> > > > > >   8,16   0      792     7.045654039     9  C   R 6768 + 32 [0]
> > > > > > 
> > > > > > Strangely, the read request was sent to the device before the write
> > > > > > request and thus returned wrong data.
> > > > > 
> > > > > Linux doesn't guarantee any request ordering for O_DIRECT io.
> > > > 
> > > > So this means it can be inserted at the front or the back, with no
> > > > fixed order?
> > > 
> > > It'll be sort inserted like any other request. That might be front, it
> > > might be back, or it might be somewhere in the middle.
> > 
> > I see, so no special treatment here.
> 
> Nope. In fact the block layer and io scheduler do not know that this is
> an O_DIRECT request; the bio originates from the same path as any other
> regular fs request.
> 
> > > > > >   8,16   0      793     7.045669809     9  D   W 6768 + 32 [kblockd/0]
> > > > > >   8,16   0      794     7.049840970     0  C   W 6768 + 32 [0]
> > > > > > 
> > > > > > Write finished.
> > > > > > 
> > > > > > So the read got wrong data back to the application. One thing I am
> > > > > > not sure about is where (front/back) the requests were inserted into
> > > > > > the queue and what messed up the order here.
> > > > > 
> > > > > There is no mess up, you are making assumptions that aren't valid.
> > > > > 
> > > > > > Is it possible, for the I event, to expose the extra flag, so we
> > > > > > know where it is inserted?
> > > > > 
> > > > > That would be too expensive, as we have to peek inside the io scheduler
> > > > > queue. So no.
> > > > 
> > > > See http://lxr.linux.no/source/block/elevator.c?v=2.6.18#L341: here we
> > > > generate the insert event and already know 'where', so exporting that
> > > > flag would not be expensive.
> > > 
> > > Maybe we are not talking about the same thing - which flag do you mean?
> > > Do you mean the 'where' position? It'll be ELEVATOR_INSERT_FRONT for
> > > basically any request, unless the issuer specifically asked for BACK or
> > > FRONT. Those are only used in the kernel, or for non-fs requests like
> > > SG_IO generated ones. So I don't think the flag will add very much
> > > information that isn't already given.
> > 
> > I see. That spawns another question: why almost always
> > ELEVATOR_INSERT_FRONT here? Why not a FIFO queue? Or does a later unplug
> > drop from the end? I forgot the details.
> 
> Typo, it was supposed to say ELEVATOR_INSERT_SORT!

Makes sense. So requests are allowed to SORT, and then the specific
scheduler does the work.
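
For reference, the insert path looks roughly like this; a simplified
sketch of elv_insert() from block/elevator.c in 2.6.18 (see the lxr
link above), not the verbatim source:

        /* simplified sketch of elv_insert(), details trimmed */
        void elv_insert(request_queue_t *q, struct request *rq, int where)
        {
                switch (where) {
                case ELEVATOR_INSERT_FRONT:
                        /* explicit front insert, e.g. requeued requests */
                        rq->flags |= REQ_SOFTBARRIER;
                        list_add(&rq->queuelist, &q->queue_head);
                        break;
                case ELEVATOR_INSERT_BACK:
                        /* explicit back insert, e.g. barrier or SG_IO
                         * requests (the real code also drains the io
                         * scheduler here) */
                        rq->flags |= REQ_SOFTBARRIER;
                        list_add_tail(&rq->queuelist, &q->queue_head);
                        break;
                case ELEVATOR_INSERT_SORT:
                        /* the normal fs-request path: hand the request
                         * to the io scheduler (CFQ here), which
                         * sort-inserts it into its own queue */
                        rq->flags |= REQ_SORTED;
                        q->elevator->ops->elevator_add_req_fn(q, rq);
                        break;
                }
        }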


> 
> > > > > > ---- is the code to generate this io ----. The disk is a regular
> > > > > > disk and the current scheduler is CFQ.
> > > > > 
> > > > > Ah ok, so you are doing this inside the kernel. If you want to ensure
> > > > > write ordering, then you need to mark the request as a barrier.
> > > > > 
> > > > >         submit_bio(rw | (1 << BIO_RW_BARRIER), bio);
> > > > 
> > > > We tried that: if we mark a write request as a barrier, we lose half
> > > > the performance. If we mark it as BIO_RW_SYNC, there is almost no
> > > > change. Though I still need to figure out the reason for that halved
> > > > performance compared with BIO_RW_SYNC.
> > > 
> > > You lose a lot of performance for writes, as Linux will then also ensure
> > > ordering at the drive level. It does so since just ordering in the
> > > kernel makes little sense, if you allow the drive to reorder at will
> > > anyway. It is possible to control the two parameters, but not from the
> > > bio level. If you mark the bio BIO_RW_BARRIER, then that will get marked
> > > SOFT and HARD barrier in the io scheduler. A soft barrier has ordering
> > > ensured inside the kernel, a hard barrier has ordering in the kernel and
> > > at the drive side as well.
> > > 
> > > BIO_RW_SYNC doesn't imply any ordering constraints, it just tells the
> > > kernel to make sure that we don't stall plugging the queue.
> > 
> > I see, thanks for the explanation. So BIO_RW_SYNC just unplugs the queue
> > while BIO_RW_BARRIER ensures the order. Then in the worst case,
> > BIO_RW_SYNC can lead to data inconsistency if two overlapping writes
> > come in and their order is reversed.
> 
> Again, the consistency is the issuer's responsibility. For regular file
> system io, the page cache will give you this consistency. If you are
> issuing bio's directly, you have to take care of this yourself.

Yes, we submit bios directly, so we need to worry about this.
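
Something like the following would take care of it on our side; a rough
sketch against 2.6.18, where the names overlap_end_io and
write_then_read are made up for illustration:

        #include <linux/bio.h>
        #include <linux/completion.h>
        #include <linux/fs.h>

        /* Issuer-side ordering for two overlapping bios: wait for the
         * write to complete before queueing the read.  The alternative
         * is submitting the write with BIO_RW_BARRIER and taking the
         * performance hit discussed above. */
        static int overlap_end_io(struct bio *bio, unsigned int bytes_done,
                                  int error)
        {
                if (bio->bi_size)
                        return 1;       /* partial completion, not done yet */
                complete((struct completion *)bio->bi_private);
                return 0;
        }

        static void write_then_read(struct bio *wbio, struct bio *rbio)
        {
                struct completion done;

                init_completion(&done);
                wbio->bi_end_io = overlap_end_io;
                wbio->bi_private = &done;

                /* BIO_RW_SYNC only unplugs the queue so the write is
                 * not stalled; it implies no ordering */
                submit_bio(WRITE | (1 << BIO_RW_SYNC), wbio);
                wait_for_completion(&done);

                /* the write has completed at the block layer; the drive
                 * cache may still reorder internally, which is what a
                 * hard barrier addresses */
                submit_bio(READ, rbio);
        }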


> 
> > > > > I won't comment on your design, but it seems somewhat strange - why are
> > > > > you doing this in the kernel? What is the segment switching doing?
> > > > 
> > > > we are writing an iscsi target in kernel level.
> > > 
> > > One already exists :-)
> > 
> > Eh? Which one? You mean IET or STGT? We are checking whether we can add
> > another io mode to IET. Some people have fast storage and prefer to
> > bypass the page cache completely.
> 
> I haven't tracked which projects exist; there was just a scsi target
> merged the other day. And I know that there is at least one iscsi

stgt

> implementation that is catered for by some good and experienced Linux
> kernel people.
> 

IET. We do file io in the kernel as well, which is also a bad practice.

