On Mon, 2006-12-11 at 15:50 +0100, Jens Axboe wrote:
> On Mon, Dec 11 2006, Ming Zhang wrote:
> > On Mon, 2006-12-11 at 15:32 +0100, Jens Axboe wrote:
> > > On Mon, Dec 11 2006, Ming Zhang wrote:
> > > > On Mon, 2006-12-11 at 10:50 +0100, Jens Axboe wrote:
> > > > > On Sun, Dec 10 2006, Ming Zhang wrote:
> > > > > > Today I used blktrace to observe a strange (at least to me)
> > > > > > behavior at the block layer. Wonder if anybody can shed some
> > > > > > light? Thanks.
> > > > > >
> > > > > > Here is the detail.
> > > > > >
> > > > > > ... previous requests are ok.
> > > > > >
> > > > > > 8,16 0 782 7.025277381 4915 Q W 6768 + 32 [istiod1]
> > > > > > 8,16 0 783 7.025283850 4915 G W 6768 + 32 [istiod1]
> > > > > > 8,16 0 784 7.025286799 4915 P R [istiod1]
> > > > > > 8,16 0 785 7.025287794 4915 I W 6768 + 32 [istiod1]
> > > > > >
> > > > > > The write request to LBA 6768 was inserted into the queue.
> > > > > >
> > > > > > 8,16 0 786 7.026059876 4915 Q R 6768 + 32 [istiod1]
> > > > > > 8,16 0 787 7.026064451 4915 G R 6768 + 32 [istiod1]
> > > > > > 8,16 0 788 7.026066369 4915 I R 6768 + 32 [istiod1]
> > > > > >
> > > > > > A read request to the same LBA was inserted into the queue as
> > > > > > well. Though it cannot be merged, I thought it could be satisfied
> > > > > > directly by the previous write request. It seems the merge
> > > > > > function does not consider this.
> > > > >
> > > > > That is the job of the upper layers, typically the page cache. For
> > > > > this scenario to take place, you must be using raw or O_DIRECT. And
> > > > > in that case, it is the job of the application to ensure proper
> > > > > ordering of requests.
> > > >
> > > > I see. I assumed blkio should take responsibility for this as well,
> > > > so I was wrong.
> > > >
> > > > > > 8,16 0 789 7.034883766 0 UT R [swapper] 2
> > > > > > 8,16 0 790 7.034904284 9 U R [kblockd/0] 2
> > > > > >
> > > > > > Unplug because of a read.
> > > > > >
> > > > > > 8,16 0 791 7.045272094 9 D R 6768 + 32 [kblockd/0]
> > > > > > 8,16 0 792 7.045654039 9 C R 6768 + 32 [0]
> > > > > >
> > > > > > Strangely, the read request was sent to the device before the
> > > > > > write request, and thus returned wrong data.
> > > > >
> > > > > Linux doesn't guarantee any request ordering for O_DIRECT io.
> > > >
> > > > So this means it can be inserted front and back, and no fixed order?
> > >
> > > It'll be sort inserted like any other request. That might be front, it
> > > might be back, or it might be somewhere in the middle.
> >
> > I see, so no special treatment here.
>
> Nope. In fact the block layer and io scheduler do not know that this is
> an O_DIRECT request; the bio originates from the same path as any other
> regular fs request.
>
> > > > > > 8,16 0 793 7.045669809 9 D W 6768 + 32 [kblockd/0]
> > > > > > 8,16 0 794 7.049840970 0 C W 6768 + 32 [0]
> > > > > >
> > > > > > Write finished.
> > > > > >
> > > > > > So the read got wrong data back to the application. One thing I
> > > > > > am not sure of is where (front/back) the requests are inserted
> > > > > > into the queue and who messed up the order here.
> > > > >
> > > > > There is no mess up, you are making assumptions that aren't valid.
> > > > >
> > > > > > Is it possible, for the I event, to export the extra flag, so we
> > > > > > know where it is inserted?
> > > > >
> > > > > That would be too expensive, as we have to peek inside the io
> > > > > scheduler queue. So no.
> > > >
> > > > See http://lxr.linux.no/source/block/elevator.c?v=2.6.18#L341, here
> > > > we generate the insert event and we know the position already, so
> > > > exporting that flag is not expensive.
> > >
> > > Maybe we are not talking about the same thing - which flag do you mean?
> > > Do you mean the 'where' position? It'll be ELEVATOR_INSERT_FRONT for
> > > basically any request, unless the issuer specifically asked for BACK or
> > > FRONT.
> > > Those are only used in the kernel, or for non-fs requests like SG_IO
> > > generated ones. So I don't think the flag will add very much
> > > information that isn't already given.
> >
> > I see. This spawns another question: why almost always
> > ELEVATOR_INSERT_FRONT here? Why not a FIFO queue? Or does a later
> > unplug drop from the end? I forgot the detail.
>
> Typo, it was supposed to say ELEVATOR_INSERT_SORT!
Makes sense. So requests are allowed to SORT, and then the specific
scheduler will do the work.

> > > > > > ---- is the code to generate this io -----. The disk is a
> > > > > > regular disk and the current scheduler is CFQ.
> > > > >
> > > > > Ah ok, so you are doing this inside the kernel. If you want to
> > > > > ensure write ordering, then you need to mark the request as a
> > > > > barrier.
> > > > >
> > > > > submit_bio(rw | (1 << BIO_RW_BARRIER), bio);
> > > >
> > > > We tried that; if we mark a write request as a barrier, we lose half
> > > > the performance. If we mark it as BIO_RW_SYNC, there is almost no
> > > > change. Though I still need to figure out the reason for that half
> > > > performance loss compared with BIO_RW_SYNC.
> > >
> > > You lose a lot of performance for writes, as Linux will then also
> > > ensure ordering at the drive level. It does so since just ordering in
> > > the kernel makes little sense, if you allow the drive to reorder at
> > > will anyway. It is possible to control the two parameters, but not
> > > from the bio level. If you mark the bio BIO_RW_BARRIER, then that will
> > > get marked SOFT and HARD barrier in the io scheduler. A soft barrier
> > > has ordering ensured inside the kernel; a hard barrier has ordering in
> > > the kernel and at the drive side as well.
> > >
> > > BIO_RW_SYNC doesn't imply any ordering constraints, it just tells the
> > > kernel to make sure that we don't stall plugging the queue.
> >
> > I see, thanks for the explanation. So BIO_RW_SYNC just unplugs the
> > queue while BIO_RW_BARRIER will ensure the order. Then in the worst
> > case, BIO_RW_SYNC will lead to data inconsistency if two overlapping
> > writes come and the order is reversed.
>
> Again, the consistency is in the care of the issuer. For regular file
> system io, the page cache will give you this consistency. If you are
> issuing bio's directly, you have to take care of this yourself.

Yes, we do bio directly, so we need to worry about this.
> > > > > I won't comment on your design, but it seems somewhat strange -
> > > > > why are you doing this in the kernel? What is the segment
> > > > > switching doing?
> > > >
> > > > We are writing an iSCSI target at kernel level.
> > >
> > > One already exists :-)
> >
> > En? Which one? Do you mean IET or STGT? We are checking if we can add
> > another iomode to IET. Some people have fast storage and prefer to
> > bypass the page cache completely.
>
> I haven't tracked which projects exist, there just was a scsi target
> merged the other day. And I know that there is at least one iscsi stgt
> implementation that is catered by some good and experienced Linux
> kernel people.

IET. We do file io in the kernel as well, also a bad practice.

-
To unsubscribe from this list: send the line "unsubscribe linux-btrace" in
the body of a message to [EMAIL PROTECTED]
More majordomo info at http://vger.kernel.org/majordomo-info.html
