On Wed, Aug 19, 2020 at 4:47 PM Damien Le Moal <[email protected]> wrote:
>
> On 2020/08/19 19:32, Kanchan Joshi wrote:
> > On Wed, Aug 19, 2020 at 3:08 PM Damien Le Moal <[email protected]> wrote:
> >>
> >> On 2020/08/19 18:27, Kanchan Joshi wrote:
> >>> On Tue, Aug 18, 2020 at 12:46 PM Christoph Hellwig <[email protected]> wrote:
> >>>>
> >>>> On Tue, Aug 18, 2020 at 10:59:35AM +0530, Kanchan Joshi wrote:
> >>>>> Set elevator feature ELEVATOR_F_ZBD_SEQ_WRITE required for ZNS.
> >>>>
> >>>> No, it is not.
> >>>
> >>> Are you saying MQ-Deadline (write-lock) is not needed for writes on ZNS?
> >>> I see that null-block zoned and SCSI-ZBC both set this requirement. I
> >>> wonder how it became different for NVMe.
> >>
> >> It is not required for an NVMe ZNS drive that has zone append native
> >> support. zonefs and upcoming btrfs do not use regular writes, removing
> >> the requirement for zone write locking.
> >
> > I understand that if a particular user (zonefs, btrfs etc) is not
> > sending regular-write and sending append instead, write-lock is not
> > required.
> > But if that particular user or some other user (say F2FS) sends
> > regular write(s), write-lock is needed.
>
> And that can be trivially enabled by setting the drive elevator to
> mq-deadline.
>
> > Above block-layer, both the opcodes REQ_OP_WRITE and
> > REQ_OP_ZONE_APPEND are available to be used by users. And I thought
> > write-lock is taken or not is a per-opcode thing and not per-user (FS,
> > MD/DM, user-space etc.), is not that correct? And MQ-deadline can
> > cater to both the opcodes, while other schedulers cannot serve
> > REQ_OP_WRITE well for zoned-device.
>
> mq-deadline ignores zone append commands. No zone lock is taken for these. In
> scsi, the emulation takes the zone lock before transforming the zone append
> into a regular write. That locking is consistent with the mq-scheduler level
> locking since the same lock bitmap is used. So if the user only issues zone
> append writes, mq-deadline is not needed and there is no reasons to force its
> use by setting ELEVATOR_F_ZBD_SEQ_WRITE. E.g. the user may want to use kyber...

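To restate the quoted point as a toy model: which layer takes the zone write lock depends on the opcode and on whether the drive has native zone append. This is a simplified Python sketch, not kernel code; `Op`, `zone_lock_taken_by` and `mq_deadline_needed` are hypothetical names made up for illustration, and the model assumes mq-deadline is the only scheduler that does zone write locking.

```python
from enum import Enum


class Op(Enum):
    # Values mirror the block-layer opcodes named in the thread.
    WRITE = "REQ_OP_WRITE"
    ZONE_APPEND = "REQ_OP_ZONE_APPEND"


def zone_lock_taken_by(op: Op, native_append: bool) -> str:
    """Return which layer takes the zone write lock in this toy model."""
    if op is Op.WRITE:
        # Regular writes depend on the scheduler (mq-deadline, when it is
        # selected) to serialize writes per zone.
        return "scheduler"
    if native_append:
        # Native zone append: no zone lock is taken for the command.
        return "none"
    # Emulated zone append (as in scsi): the emulation locks the zone itself
    # before turning the append into a regular write.
    return "driver"


def mq_deadline_needed(ops_issued: set) -> bool:
    """Scheduler-level locking only matters if regular writes are issued.

    Assumption: the writes are issued at a queue depth above one; the
    model does not track queue depth.
    """
    return Op.WRITE in ops_issued


if __name__ == "__main__":
    for op in Op:
        for native in (True, False):
            print(op.value, "native_append =", native, "->",
                  zone_lock_taken_by(op, native))
```

In this model a workload that only issues `REQ_OP_ZONE_APPEND` never depends on the scheduler for locking, which is the case where another elevator such as kyber could be chosen.
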
Right, got your point.

> >> In the context of your patch series, ELEVATOR_F_ZBD_SEQ_WRITE should be
> >> set only and only if the drive does not have native zone append support.
> >
> > Sure I can keep it that way, once I get it right. If it is really not
> > required for native-append drive, it should not be here at the place
> > where I added.
> >
> >> And even in that
> >> case, since for an emulated zone append the zone write lock is taken and
> >> released by the emulation driver itself, ELEVATOR_F_ZBD_SEQ_WRITE is
> >> required only if the user will also be issuing regular writes at high QD.
> >> And that is trivially controllable by the user by simply setting the drive
> >> elevator to mq-deadline. Conclusion: setting ELEVATOR_F_ZBD_SEQ_WRITE is
> >> not needed.
> >
> > Are we saying applications should switch schedulers based on the write
> > QD (use any-scheduler for QD1 and mq-deadline for QD-N).
> > Even if it does that, it does not know what other applications would
> > be doing. That seems hard-to-get-right and possible only in a
> > tightly-controlled environment.
>
> Even for SMR, the user is free to set the elevator to none, which disables
> zone write locking. Issuing writes correctly then becomes the responsibility
> of the application. This can be useful for settings that for instance use NCQ
> I/O priorities, which give better results when "none" is used.

Wasn't it a problem that, even if the application sends writes
correctly, the scheduler may not preserve the order?
And even when none is used, a re-queue can happen, which may lead
to a different ordering.

> As far as I know, zoned drives are always used in tightly controlled
> environments. Problems like "does not know what other applications would be
> doing" are non-existent. Setting up the drive correctly for the use case at
> hand is a sysadmin/server setup problem, based on *the* application (singular)
> requirements.

Fine.
But what about the null-block-zone, which sets MQ-deadline but does not
actually use the write-lock to avoid races among multiple appends on a
zone? Does that deserve a fix?

--
Kanchan

