On Sun, Dec 10 2006, Ming Zhang wrote:
> Today I used blktrace and observed a strange (at least to me) behavior at
> the block layer. I wonder if anybody can shed some light? Thanks.
>
> Here is the detail.
>
> ... previous requests are ok.
>
> 8,16 0 782 7.025277381 4915 Q W 6768 + 32 [istiod1]
> 8,16 0 783 7.025283850 4915 G W 6768 + 32 [istiod1]
> 8,16 0 784 7.025286799 4915 P R [istiod1]
> 8,16 0 785 7.025287794 4915 I W 6768 + 32 [istiod1]
>
> Write request to lba 6768 was inserted to the queue.
>
> 8,16 0 786 7.026059876 4915 Q R 6768 + 32 [istiod1]
> 8,16 0 787 7.026064451 4915 G R 6768 + 32 [istiod1]
> 8,16 0 788 7.026066369 4915 I R 6768 + 32 [istiod1]
>
> A read request to the same lba was inserted into the queue as well. Though
> it cannot be merged, I thought it could be satisfied by the previous write
> request directly. It seems the merge function does not consider this.
That is the job of the upper layers, typically the page cache. For this
scenario to take place, you must be using raw or O_DIRECT. And in that
case, it is the job of the application to ensure proper ordering of
requests.
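
To make the application-side ordering concrete, here is a minimal user-space
sketch. It is not taken from the original poster's code: the function name
write_then_read, the 4096-byte alignment, and the fallback to buffered I/O
when O_DIRECT is unavailable are all assumptions for illustration. The point
is only that the read is issued after the write call has returned.

```c
#define _GNU_SOURCE           /* for O_DIRECT on glibc */
#include <assert.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

#ifndef O_DIRECT
#define O_DIRECT 0            /* not available here: degrade to buffered I/O */
#endif

#define BLK 4096              /* assumed alignment for buffer, offset and length */

/*
 * Hypothetical helper: write one block at 'off', wait for the write to
 * return, then read the same block back.
 * Returns 0 if the data read matches the data written, -1 otherwise.
 */
int write_then_read(const char *path, off_t off, unsigned char fill)
{
    void *wbuf = NULL, *rbuf = NULL;
    int fd, ret = -1;

    assert(off % BLK == 0);   /* O_DIRECT wants aligned offsets */

    fd = open(path, O_RDWR | O_CREAT | O_DIRECT, 0600);
    if (fd < 0)               /* e.g. a filesystem that rejects O_DIRECT */
        fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return -1;

    if (posix_memalign(&wbuf, BLK, BLK)) { wbuf = NULL; goto out; }
    if (posix_memalign(&rbuf, BLK, BLK)) { rbuf = NULL; goto out; }
    memset(wbuf, fill, BLK);
    memset(rbuf, 0, BLK);

    /* A synchronous pwrite() returns only after the transfer is done ... */
    if (pwrite(fd, wbuf, BLK, off) != BLK)
        goto out;
    /* ... so this read cannot be dispatched ahead of the write. */
    if (pread(fd, rbuf, BLK, off) != BLK)
        goto out;

    ret = memcmp(wbuf, rbuf, BLK) == 0 ? 0 : -1;
out:
    free(wbuf);
    free(rbuf);
    close(fd);
    return ret;
}
```

With asynchronous submission the same rule applies: wait for the write's
completion before submitting a read that depends on it.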
> 8,16 0 789 7.034883766 0 UT R [swapper] 2
> 8,16 0 790 7.034904284 9 U R [kblockd/0] 2
>
> Unplug because of a read.
>
> 8,16 0 791 7.045272094 9 D R 6768 + 32 [kblockd/0]
> 8,16 0 792 7.045654039 9 C R 6768 + 32 [0]
>
> Strangely, the read request was sent to the device before the write request
> and thus returned wrong data.
Linux doesn't guarantee any request ordering for O_DIRECT io.
> 8,16 0 793 7.045669809 9 D W 6768 + 32 [kblockd/0]
> 8,16 0 794 7.049840970 0 C W 6768 + 32 [0]
>
> Write finished.
>
> So the read returns wrong data to the application. One thing I am not sure
> about is where (front/back) the requests are inserted into the queue and
> who messes up the order here.
There is no mess-up; you are making assumptions that aren't valid.
> Is it possible for the I event to carry an extra flag, so we know where
> the request is inserted?
That would be too expensive, as we would have to peek inside the io
scheduler queue. So no.
> ---- is the code to generate this io -----. disk is a regular disk and
> current scheduler is CFQ.
Ah ok, so you are doing this inside the kernel. If you want to ensure
write ordering, then you need to mark the request as a barrier:

    submit_bio(rw | (1 << BIO_RW_BARRIER), bio);
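
Note that the barrier bit has to be combined with a bitwise OR. A logical OR
would collapse the whole expression to 0 or 1 and the barrier bit would be
lost. The stand-alone sketch below illustrates the difference; the DEMO_
constants are hypothetical stand-ins that assume the 2.6-era values (WRITE
is 1, BIO_RW_BARRIER is bit 2), so check them against your own tree.

```c
/* Hypothetical stand-ins for the kernel constants (assumed values). */
#define DEMO_WRITE           1UL
#define DEMO_BIO_RW_BARRIER  2    /* a bit number, not a mask */

/* Bitwise OR: keeps the direction bit and sets the barrier bit. */
unsigned long barrier_rw(unsigned long rw)
{
    return rw | (1UL << DEMO_BIO_RW_BARRIER);
}

/* Logical OR: yields only 0 or 1, so the barrier bit is never set. */
unsigned long broken_rw(unsigned long rw)
{
    return rw || (1UL << DEMO_BIO_RW_BARRIER);
}

/* Returns 1 if the barrier bit is set in rw, 0 otherwise. */
int is_barrier(unsigned long rw)
{
    return (int)((rw >> DEMO_BIO_RW_BARRIER) & 1UL);
}
```

Under these assumed values, barrier_rw(DEMO_WRITE) keeps the write bit and
sets the barrier bit, while broken_rw(DEMO_WRITE) is a plain write.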

I won't comment on your design, but it seems somewhat strange - why are
you doing this in the kernel? What is the segment switching doing?
BTW, this mail really isn't about blktrace, it probably should have been
sent to the linux-kernel list. You wouldn't send a vmstat observed
problem to the vmstat list, would you? :-)
--
Jens Axboe