[
https://issues.apache.org/jira/browse/HDFS-9806?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15295776#comment-15295776
]
Chris Douglas commented on HDFS-9806:
-------------------------------------
Great questions.
bq. the way we interpreted the document, the external (provided) storage is the
source of truth so any changes there should be updated in HDFS and any
inconsistencies that arise would favour the external store
This is accurate, particularly for the first implementation. Once writes to
provided storage are supported, HDFS may be declared the source of truth when
it conflicts with the external store, but solving this generally for generic
external stores seems untenable. Similarly, rules for merging conflicts are
rarely general; e.g., concurrent updates to Hive can't be resolved by the
filesystem, but might be reconcilable by that system. The design assumes that
cycles are rare, and that most tiered clusters will be either producers or
consumers for a subtree/region.
bq. If the Namenode is accessing the PROVIDED storage to update its mapping
shouldn’t it also update the nonce data at the same time and instruct the
datanode to refresh too? Or is the intention for the Namenode to only update
the directory information and not the actual nonce data for the files? (If so,
how could the Namenode apply heuristics to detect “promoting output to a parent
directory”?).
Yes, the NN is also responsible for maintaining the nonce. The DN will refuse
to fetch the block from the external store when it doesn't match the recorded
metadata. In general, the NN can only periodically scan the external namespace
(if it has one). It can't move between fixed, meaningful snapshots that remain
valid while it updates its view. The scan may capture inconsistent states
(e.g., as a job is promoting its data, the scan may find some of its output
data promoted, some transient, and the sentinel file declaring it fully
promoted). The heuristics are intended to avoid treating renames as
delete/create of the same data, but they're not sufficient to guarantee the
view is meaningful.
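To make the DN-side check concrete, a minimal sketch; all names here are
invented for illustration and are not taken from the prototype:
{code:java}
import java.io.IOException;
import java.io.InputStream;

// Illustrative only; none of these names come from the HDFS-9806 prototype.
interface ExternalStore {
  String getNonce(String path) throws IOException;  // e.g., ETag or (length, mtime)
  InputStream open(String path, long offset, long length) throws IOException;
}

class ProvidedReplica {
  final String externalPath;
  final String nonce;   // recorded by the NN when it scanned the external namespace
  final long offset;
  final long length;

  ProvidedReplica(String externalPath, String nonce, long offset, long length) {
    this.externalPath = externalPath;
    this.nonce = nonce;
    this.offset = offset;
    this.length = length;
  }
}

class ProvidedReplicaReader {
  private final ExternalStore store;

  ProvidedReplicaReader(ExternalStore store) {
    this.store = store;
  }

  InputStream open(ProvidedReplica r) throws IOException {
    // Cheap metadata check (HEAD or equivalent) before reading any data.
    String actual = store.getNonce(r.externalPath);
    if (!r.nonce.equals(actual)) {
      // The alias is stale: don't serve data that no longer matches the
      // metadata the NN recorded; let the DN report the stale alias instead.
      throw new IOException("Stale provided replica: " + r.externalPath);
    }
    return store.open(r.externalPath, r.offset, r.length);
  }
}
{code}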
The design should not prevent a tighter integration, and it's flawed if it
does. For example, if the NN can move between snapshots and coordinate with the
external store, then the references only become stale when the NN moves past
them (an unusually dramatic, but legal sequence for the NN without provided
storage). It's also interesting to consider what refresh looks like when the
remote store doesn't have a namespace. A colleague used the prototype to fetch
objects from an archive object store, using the NN as its (only) namespace.
Here, the NN has no real "refresh" work to do; it only implements the mapping.
bq. How should this work in the face of Storage Policies? For example, if we
have a StoragePolicy of {SSD, DISK, PROVIDED}
The design depends on storage policies and the mechanisms in the NN for
managing heterogeneous storage. In the prototype, when the namespace is
scanned, each file defaults to replication 1 and the storage type
{PROVIDED, DISK}. When a DN reports that storage as accessible, all blocks in
that store are reachable. Files with replication > 1 are replicated into local
storage at startup, but all blocks are immediately accessible.
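Roughly, the defaults applied by the scan look like this; the names are
invented, and the prototype wires this into the NN's existing structures:
{code:java}
// Illustrative only: every file found by the scan defaults to replication 1
// with storage types {PROVIDED, DISK}.
enum StorageType { DISK, SSD, ARCHIVE, PROVIDED }

interface Namespace {
  void addFile(String path, long length, short replication,
               StorageType[] storageTypes, String nonce);
}

class ScannedFile {
  final String path;
  final long length;
  final String nonce;

  ScannedFile(String path, long length, String nonce) {
    this.path = path;
    this.length = length;
    this.nonce = nonce;
  }
}

class ProvidedNamespaceScanner {
  static final short DEFAULT_REPLICATION = 1;
  static final StorageType[] DEFAULT_TYPES = { StorageType.PROVIDED, StorageType.DISK };

  void record(ScannedFile f, Namespace ns) {
    // Every scanned file is immediately readable through the provided storage;
    // raising the replication later triggers copies onto local DISK.
    ns.addFile(f.path, f.length, DEFAULT_REPLICATION, DEFAULT_TYPES, f.nonce);
  }
}
{code}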
On the write path, this could be used to implement write-through and write-back
semantics. If the first replica is provided, then updates are durable in the
external store before the write returns to the client. The block manager needs
to record the mapping before the block is written by the DN, so the DN knows
where to put the data in the external store.
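A hedged sketch of that ordering for the write-through case, with invented
names; none of this has been prototyped:
{code:java}
import java.io.IOException;

// Illustrative only; this ordering has not been prototyped.
interface AliasMap {
  // Record where the provided replica of a block will live, before any data
  // is written, so the DN knows the target in the external store.
  String allocate(long blockId) throws IOException;
}

interface ExternalWriter {
  void write(String externalPath, byte[] data) throws IOException;
}

interface LocalWriter {
  void write(long blockId, byte[] data) throws IOException;
}

class WriteThroughBlockWriter {
  private final AliasMap aliasMap;
  private final ExternalWriter external;
  private final LocalWriter local;

  WriteThroughBlockWriter(AliasMap aliasMap, ExternalWriter external, LocalWriter local) {
    this.aliasMap = aliasMap;
    this.external = external;
    this.local = local;
  }

  void writeBlock(long blockId, byte[] data) throws IOException {
    // 1. The mapping is recorded before the block is written.
    String externalPath = aliasMap.allocate(blockId);
    // 2. Write-through: the provided replica is durable in the external store
    //    before the write is acknowledged to the client.
    external.write(externalPath, data);
    // 3. Additional local replicas can follow (or be deferred, for write-back).
    local.write(blockId, data);
  }
}
{code}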
bq. When you say “Periodically and/or when a particular directory or file is
accessed on the Namenode” do you mean this is something to be configured, or
just that it hasn’t been decided if both are required. We think periodically is
required since this is the only way to clean up directory listings with files
that have been removed from the PROVIDED storage. On access, it makes sense to
always make a HEAD request (or equivalent) to make sure it isn’t stale.
Agreed, the NN needs to stay in sync. It's not strictly required to verify
every operation at the NN, since that optimistic assumption is (and must be)
checked at the DN anyway. In addition to loading new data, maintaining local
replicas also requires the NN to do periodic scans, since datasets removed from
the external store should eventually become inaccessible in the clusters
caching them.
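A sketch of such a periodic refresh, again with invented names; the point is
only that the NN both loads new data and marks aliases whose data was removed
from the external store:
{code:java}
import java.io.IOException;
import java.util.List;

// Illustrative only; not the prototype's refresh mechanism.
interface ExternalNamespace {
  List<String> listPaths() throws IOException;
  boolean exists(String path) throws IOException;
  String getNonce(String path) throws IOException;
}

interface ProvidedMapping {
  Iterable<String> mappedPaths();
  void addOrUpdate(String path, String nonce);
  void markStale(String path);   // eventually makes cached replicas inaccessible
}

class ProvidedRefreshTask implements Runnable {
  private final ExternalNamespace store;
  private final ProvidedMapping mapping;

  ProvidedRefreshTask(ExternalNamespace store, ProvidedMapping mapping) {
    this.store = store;
    this.mapping = mapping;
  }

  @Override
  public void run() {
    try {
      // Load new data and refresh nonces.
      for (String path : store.listPaths()) {
        mapping.addOrUpdate(path, store.getNonce(path));
      }
      // Datasets removed from the external store should eventually become
      // inaccessible in clusters caching them, even if local replicas remain.
      for (String path : mapping.mappedPaths()) {
        if (!store.exists(path)) {
          mapping.markStale(path);
        }
      }
    } catch (IOException e) {
      // Leave the current mapping in place; retry on the next period.
    }
  }
}
{code}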
bq. Finally, do you anticipate changes to the wire protocol between the
Namenode and Datanode?
Yes. To your earlier point, if the DN discovers an inconsistency, some
component is responsible for fixing that up. We haven't prototyped this, but
intuitively: if the NN is the only component updating the table, then it's
easier to reason about the synchronization of its map to the external store.
Since the NN may only periodically become aware of inconsistency, the DN should
report stale aliases. Reporting these as "corrupt" is not accurate, since the
staleness may apply to _all_ replicas, including those on local media if the
block was deleted or is no longer accessible. That said, if we can avoid
protocol changes
without muddying the semantics, we will.
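For example, one way a DN could keep "stale" distinct from "corrupt" when
reporting; purely illustrative, not a proposed wire format:
{code:java}
// Illustrative only; this does not reflect the actual DN-NN wire protocol.
enum ReplicaProblem {
  CORRUPT,      // e.g., checksum failure on a local replica
  STALE_ALIAS   // provided replica no longer matches the recorded nonce
}

class ReplicaProblemReport {
  final long blockId;
  final ReplicaProblem problem;

  ReplicaProblemReport(long blockId, ReplicaProblem problem) {
    this.blockId = blockId;
    this.problem = problem;
  }
}
{code}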
> Allow HDFS block replicas to be provided by an external storage system
> ----------------------------------------------------------------------
>
> Key: HDFS-9806
> URL: https://issues.apache.org/jira/browse/HDFS-9806
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Chris Douglas
> Attachments: HDFS-9806-design.001.pdf
>
>
> In addition to heterogeneous media, many applications work with heterogeneous
> storage systems. The guarantees and semantics provided by these systems are
> often similar, but not identical to those of
> [HDFS|https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/filesystem/index.html].
> Any client accessing multiple storage systems is responsible for reasoning
> about each system independently, and must propagate and renew credentials for
> each store.
> Remote stores could be mounted under HDFS. Block locations could be mapped to
> immutable file regions, opaque IDs, or other tokens that represent a
> consistent view of the data. While correctness for arbitrary operations
> requires careful coordination between stores, in practice we can provide
> workable semantics with weaker guarantees.