-----Original Message-----
From: Theodore Ts'o [mailto:[email protected]]
Sent: Thursday, January 5, 2017 5:12 PM
To: Slava Dubeyko <[email protected]>
Cc: Damien Le Moal <[email protected]>; Matias Bjørling <[email protected]>;
Viacheslav Dubeyko <[email protected]>; [email protected];
Linux FS Devel <[email protected]>; [email protected];
[email protected]
Subject: Re: [LSF/MM TOPIC][LSF/MM ATTEND] OCSSDs - SMR, Hierarchical
Interface, and Vector I/Os
<skipped>

> I think you've been thinking about a model where *either* the host has
> complete control over all aspects of the flash management, or the FTL has
> complete control --- and it may be that there are more clever ways that the
> work could be split between flash device and the host OS.

Yes, I totally agree that the better way is to split the responsibilities
between the flash device and the host (the file system, for example). I would
like to consider an SSD device as a set of FTL primitives. Let's imagine the
SSD as an automaton that is able to execute FTL primitives, while the file
system issues the commands that orchestrate the SSD's activity. I believe it
makes sense to think about the SSD as a data processing accelerator engine.
It means that we need a good interface that can be the basis for offloading
data processing operations. And I clearly see many cases when a file system
would like to say: "Hey, SSD. Please, execute this primitive for me right
now".

Let's consider the operation of moving zones (or erase blocks) with a high
BER. If we have a completely passive SSD, then it sounds to me that the whole
operation will look like: (1) read the data on the host side; (2) "reset" the
zone; (3) write the data back into the SSD. But if a zone (erase block(s))
with a high BER is full of valid data, then why does the host need to execute
the whole operation in such a stupid "read-write" way? I mean that it makes
no sense at all to spend the host's resources on such an operation. The
responsibility of the host is simply to initiate the operation at the proper
time, and the responsibility of the SSD is to execute it internally (offload
of the operation). So, here we could have an FTL primitive for moving zones
(erase blocks) to overcome read disturbance.
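
To make the difference concrete, here is a minimal Python sketch of the two
ways to refresh a zone. Everything in it is an assumption made for
illustration: ToySSD is a toy model, move_zone() is a hypothetical primitive
(no existing command is implied), and the only thing measured is how many
blocks cross the host interface.

```python
class ToySSD:
    """Toy device model; counts blocks that cross the host interface."""

    def __init__(self):
        self.zones = {}       # zone id -> list of logical blocks
        self.bus_blocks = 0   # blocks transferred between host and device

    def read_zone(self, zone):
        self.bus_blocks += len(self.zones[zone])
        return list(self.zones[zone])

    def write_zone(self, zone, blocks):
        self.bus_blocks += len(blocks)
        self.zones[zone] = list(blocks)

    def reset_zone(self, zone):
        self.zones[zone] = []

    def move_zone(self, src, dst):
        """Hypothetical FTL primitive: relocate data inside the device."""
        self.zones[dst] = self.zones[src]
        self.zones[src] = []


def refresh_passive(ssd, zone):
    """Passive SSD: read on the host, reset the zone, write the data back."""
    data = ssd.read_zone(zone)
    ssd.reset_zone(zone)
    ssd.write_zone(zone, data)


def refresh_offloaded(ssd, src, dst):
    """The host only initiates the move; the device does the copying."""
    ssd.move_zone(src, dst)
```

In this model a zone of N valid blocks costs 2 * N block transfers with
refresh_passive() and none with refresh_offloaded().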

Let's consider GC operations. Right now, we have a GC subsystem on the SSD
side (the device-managed and host-aware cases) and a GC subsystem on the host
side (LFS file systems in the host-aware case). So, it's clear that an SSD
device is able to provide some primitives for GC operations. It's also
completely unreasonable to have a GC subsystem on both the SSD side and the
host side. If we have the GC subsystem on the host only, then we need to
follow the stupid "read-modify-write" paradigm and spend the host's resources
on GC operations. Otherwise, if the GC subsystem is on the SSD side, then GC
suffers from a lack of knowledge about the location of valid data (the file
system keeps this knowledge), and such a solution opens up a wide range of
cases of unexpected performance degradation. So, we need a much smarter
solution. What could it be?

Again, the file system (host) has to initiate the GC operation at the proper
time, but the SSD should execute the requested operation (offload of the
operation). So, we will have the GC subsystem on the file system side, but
the real GC operation on a zone (erase block(s)) will be executed by the SSD
device. The key points here are that: (1) the file system chooses a good time
for the GC operation; (2) the file system is able to select a zone (erase
block(s)) that is cost-efficient to clean, judging by the amount of valid
data in the aged zone; (3) the file system shares information about the valid
pages in the zone (erase block(s)); (4) the SSD executes the GC operation on
the zone internally.
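
Point (2) can be sketched as a victim-selection helper. This is only one
possible policy, a greedy one that I am assuming here for illustration: the
zone with the fewest valid blocks is the cheapest to clean. The function name
and the dictionary of per-zone counters are hypothetical.

```python
def select_victim(valid_blocks_per_zone):
    """Return the id of the zone with the fewest valid blocks.

    Greedy policy (an assumption): fewer valid blocks means less data
    to relocate, so the zone is cheaper to clean. Returns None when
    there are no candidate zones.
    """
    if not valid_blocks_per_zone:
        return None
    return min(valid_blocks_per_zone, key=valid_blocks_per_zone.get)
```

A real file system would probably also weigh the age of the zone and the
current I/O load when picking both the victim and the moment to start.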

We need to take into account three possible cases: (1) the zone is completely
invalid; (2) the zone is partially invalid; (3) the zone contains valid data
only. If the file system's GC selects a zone that doesn't contain valid data
(the "invalid" zone case), then GC simply needs to request a "reset" of the
zone or send a TRIM command. The rest is the responsibility of the SSD
device. If the zone is completely filled with valid data, then the file
system's GC needs to request a move operation on the SSD side. If we use
virtual zones, then such a move operation on the SSD side changes nothing for
the file system (logical block numbers stay the same). So, the file system
doesn't need to change its internal mapping table for such an operation.
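
The three cases map naturally onto a small dispatch function on the file
system side. The sketch below is an illustration only: the command names
RESET, MOVE and GC_WITH_BITMAP are hypothetical labels, not commands of any
existing interface.

```python
def gc_command(valid_blocks, blocks_per_zone):
    """Pick the request the file system's GC would send for a zone."""
    if not 0 <= valid_blocks <= blocks_per_zone:
        raise ValueError("valid block count is out of range for the zone")
    if valid_blocks == 0:
        # Completely invalid zone: a "reset" (or TRIM) is enough.
        return "RESET"
    if valid_blocks == blocks_per_zone:
        # Only valid data: ask the device to move the whole zone.
        return "MOVE"
    # Partially invalid zone: the device needs to know which blocks are valid.
    return "GC_WITH_BITMAP"
```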

The case of a partially invalid zone (one that contains some amount of valid
data) is more tricky. But let's consider the situation. If the file system
knows the positions of the valid logical blocks or pages inside a zone, then
it is able to share the zone's bitmap with the SSD device. With a 4 KB
logical block and a 256 MB zone, the zone holds 65536 logical blocks, so we
need an 8 KB bitmap (one bit per block) to represent the positions of the
valid logical blocks inside the zone. So, the file system is able to send
such a bitmap of valid pages together with the command that initiates the GC
operation for some zone. The responsibility of the SSD side will be: (1)
"reset" the zone; (2) move the valid logical blocks from the aged zone into a
new one using a compaction scheme.
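
The bitmap arithmetic can be checked with a few lines of Python. The helper
names are hypothetical, and the bit order (block 0 is the lowest bit of
byte 0) is an assumption; a real interface would have to define it.

```python
def bitmap_size_bytes(zone_bytes, block_bytes):
    """One bit per logical block, rounded up to whole bytes."""
    blocks = zone_bytes // block_bytes
    return (blocks + 7) // 8


def build_bitmap(valid_indices, blocks_in_zone):
    """Set bit i for every valid block i (assumed order: LSB of byte 0 first)."""
    bitmap = bytearray((blocks_in_zone + 7) // 8)
    for i in valid_indices:
        if not 0 <= i < blocks_in_zone:
            raise ValueError("block index is outside of the zone")
        bitmap[i // 8] |= 1 << (i % 8)
    return bytes(bitmap)
```

For a 256 MB zone (256 * 1024 * 1024 bytes) and 4096-byte blocks,
bitmap_size_bytes() gives 8192 bytes, which is the 8 KB mentioned above.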

I mean that all valid pages should be written contiguously into the newly
allocated zone (erase blocks). It means that the SSD device can reposition
logical blocks inside the zone without changing the initial order of the
logical pages (the compaction scheme). Such a compaction scheme can be easily
implemented on the SSD side. And if we do not change the order of the logical
blocks, then we have a deterministic case that can be easily processed on the
file system side. If the file system has the initial bitmap, then it can
easily recalculate the positions of the valid logical blocks after
compaction. For example, F2FS can easily do such a recalculation. Finally,
the new positions of the valid logical blocks should be stored in the file
system's mapping table. NILFS2 is a slightly more complex case because it
describes the logical blocks inside a log by means of a special btree in the
log's header. But, again, the compaction scheme is a deterministic case that
makes it possible to recalculate the logical blocks' positions before the
real GC operation. It means that NILFS2 is able to prepare both the bitmap of
valid logical blocks and the log's header before the GC operation and to
share all of this with the SSD device.
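
Because the compaction keeps the original order, the new position of a valid
block is simply the number of valid blocks that precede it. A minimal sketch
of this recalculation, assuming the bitmap is given as a sequence of 0/1
flags (the function name is hypothetical, and this is not code from F2FS or
NILFS2):

```python
def compacted_positions(valid_flags):
    """Map old in-zone index -> new in-zone index after compaction.

    The order of the valid blocks is preserved, so each valid block
    lands at the position equal to the count of valid blocks before it.
    """
    mapping = {}
    next_free = 0
    for old_index, is_valid in enumerate(valid_flags):
        if is_valid:
            mapping[old_index] = next_free
            next_free += 1
    return mapping
```

For the flags [1, 0, 1, 1, 0, 0, 1, 0] the result is {0: 0, 2: 1, 3: 2,
6: 3}. Both sides can compute this independently from the same bitmap, so
the device does not need to report the new positions back to the host.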

However, every GC operation on a partially invalid zone results in a zone
that is only partially filled with valid data (the rest of the zone is
completely free). What should be done in such a case? I can see four possible
approaches:

(1) Re-use the partially filled zone. If the file system tracks the state of
every zone (in the mapping table, for example), or if it is possible to
extract the state of a zone, then the aged zone changes its state after the
GC operation. So, the partially filled zone can be used as the current zone
for writing new data.

(2) Add the valid data of the aged zone to the tail of the current zone.
Let's imagine that the file system is using some zone as the current zone for
adding new data. If we know that an aged zone contains some number of valid
pages, then it's possible to reserve space in the tail of the current zone.
Finally, it is possible to combine the flush operation (writing data from the
page cache of the current zone) with the GC operation on the aged zone on the
SSD side.

(3) Re-use the aged zone as the current zone. Let's imagine that we have some
aged zone with a small number of valid pages. It means that we can select
this zone as the current zone for new data. First of all, we need to: (a)
"reset" the zone; (b) initiate the GC operation on the SSD device side. We
know how many valid pages we will have at the beginning of the current zone.
So, we simply need to add new logical blocks to the page cache of the current
zone after the area reserved for the data from the aged zone. So, our GC
operation will run in the background while new data is being prepared in the
page cache of the current zone. And, finally, the whole zone will be full of
data after the flush operation.

(4) Merge several aged zones into a new one.
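
As an illustration of approach (3), the layout of the re-used zone follows
directly from the number of valid blocks in the aged zone. The sketch below
only does this bookkeeping; the function name and the returned keys are
hypothetical.

```python
def plan_reused_zone(valid_from_aged, blocks_per_zone):
    """Lay out a zone that is re-used as the current zone.

    Compacted valid blocks from the aged zone occupy the beginning of
    the zone; new data is appended right after this reserved area.
    """
    if not 0 <= valid_from_aged <= blocks_per_zone:
        raise ValueError("valid block count is out of range for the zone")
    return {
        "reserved_blocks": valid_from_aged,
        "first_new_block": valid_from_aged,
        "room_for_new_blocks": blocks_per_zone - valid_from_aged,
    }
```

For example, with 100 valid blocks in a zone of 65536 blocks, new data would
start at in-zone block 100 and there would be room for 65436 new blocks.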

> It's much better to use an abstraction such as Zones, and then have an
> abstraction layer that hides the low-level details of the hardware from
> the OS. The trick is picking an abstraction that exposes the _right_ set
> of details so that the division of labor between the Host OS and the
> storage device is at a better place. Hence my suggestion of perhaps
> providing a virtual mapping layer between "Zone number" and the low-level
> physical erase block.

I like the idea of some abstraction that hides the low-level details. But it
sounds like we will still have two mapping tables: one on the SSD side and
one on the file system side. Again, we need to distribute the
responsibilities between the file system and the SSD device. If the file
system manages GC activity but the real GC operation is delegated to the SSD
side (at the proper time), then it sounds like all maintenance operations
will be done by the SSD itself. It means that the SSD device is able to
manage a single mapping table and the file system simply needs to have an
up-to-date copy of that mapping table.

Or, conversely, the file system can manage the single mapping table and share
its current state with the SSD device. But a single mapping table looks like
a really complicated technique. From another point of view, a virtual zone
can always have the same ID. So, the responsibility of the SSD device will be
to map the virtual zone ID to physical erase block IDs. Such a mapping table
(virtual zone ID <-> erase block(s)) can be more compact than the mapping
table (LBA <-> physical page). The responsibility of the file system (host)
will be the mapping inside the virtual zone (LBA <-> logical block inside the
virtual zone). If the virtual zone ID is always the same, then such a mapping
table could be smaller. But I don't see how such a mapping table can be
smaller for the current implementations of F2FS or NILFS2. However, let's
imagine that a log is equal to the whole zone; then the header of the log can
include a similar mapping table for the log/zone.
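
A rough calculation shows why the zone-level table on the device side is so
much more compact. The numbers are assumptions for illustration only: a
1 TiB device, 4 KB pages, 256 MB zones, and one table entry per mapped unit.

```python
def mapping_entries(capacity_bytes, unit_bytes):
    """Number of table entries when every mapped unit gets one entry."""
    return capacity_bytes // unit_bytes


CAPACITY = 1 << 40             # assumed device capacity: 1 TiB
PAGE = 4 * 1024                # 4 KB page, as in the bitmap example
ZONE = 256 * 1024 * 1024       # 256 MB zone, as in the bitmap example

page_level = mapping_entries(CAPACITY, PAGE)   # LBA <-> physical page
zone_level = mapping_entries(CAPACITY, ZONE)   # virtual zone <-> erase block(s)
```

Under these assumptions page_level is 268435456 entries and zone_level is
4096 entries, a factor of 65536. The per-zone mapping that stays on the file
system side is not counted here.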
Thanks,
Vyacheslav Dubeyko.