[
https://issues.apache.org/jira/browse/HDFS-7240?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574099#comment-14574099
]
Jitendra Nath Pandey commented on HDFS-7240:
--------------------------------------------
The call started with a high-level description of object stores, the
motivations, and the design approach as covered in the architecture document.
The following points were discussed in detail:
# 3-level namespace with storage volumes, buckets and keys vs. 2-level
namespace with buckets and keys
#* Storage volumes are created by admins and provide admin controls such
as quotas. Buckets are created and managed by users.
Since HDFS doesn't have a separate notion of user accounts as in S3 or
Azure, the storage volume gives admins a place to attach policies.
#* The argument in favor of the 2-level scheme was that organizations
typically have very few buckets and users organize their data within the
buckets; the admin controls can then be set at the bucket level.
# Is it exactly the S3 API? It would be good to be able to migrate easily
from S3 to Ozone.
#* The storage volume concept does not exist in S3. In Azure, the account
is part of the URL; Ozone URLs look similar to Azure's, with the storage
volume in place of the account name (see the addressing sketch after this
list).
#* We will publish a more detailed spec including headers, authorization
semantics, etc. We will try to follow S3 closely.
# HTTP/2
#* There is already a jira in Hadoop for HTTP/2. We should evaluate
supporting HTTP/2 as well.
# OzoneFileSystem: a Hadoop FileSystem implementation on top of Ozone,
similar to S3FileSystem (a usage sketch follows this list).
#* It will not support rename.
#* This was only briefly mentioned.
# Storage Container Implementation
#* Storage container replication must be efficient; replication by
key-object enumeration would be too slow. RocksDB is a promising choice, as
it provides features for live replication, i.e. replicating a container
while it is still being written to. The architecture document talked about
leveldbjni; RocksDB is similar, and provides additional features as well as
a Java binding.
#* If a datanode dies and some of its containers lag in generation stamp,
those containers will be discarded. Since containers are much larger than
typical HDFS blocks, this is far more wasteful than discarding a block
replica, so an important optimization is to let stale containers catch up
to the current state instead.
#* To support a large range of object sizes, a hybrid model may be
needed: store small objects in RocksDB, but large objects as files with
their file paths in RocksDB (see the sketch after this list).
#* Colin suggested Linux sparse files.
#* We are working on a prototype.
# Ordered listing with read-after-write semantics might be an important
requirement. Under a hash-partitioning scheme that would require consistent
secondary indexes; otherwise a range-partitioning scheme should be used.
This needs to be investigated (a toy illustration follows this list).
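To make the volume/bucket/key addressing in points 1 and 2 concrete, here is
a minimal sketch. The scheme, host, and path layout are illustrative
assumptions, not a committed format; the real URL format will be pinned down
in the detailed spec.
{code:java}
import java.net.URI;
import java.net.URISyntaxException;

public class OzoneAddressing {
  public static void main(String[] args) throws URISyntaxException {
    // Azure layout:  https://<account>.blob.core.windows.net/<container>/<blob>
    // Assumed Ozone layout, with a storage volume where Azure has the
    // account name:  http://<host>/<volume>/<bucket>/<key>
    URI key = new URI("http", "ozone.example.com",
        "/salesVolume/reportsBucket/2015/q2/summary.csv", null);
    System.out.println(key);  // volume=salesVolume, bucket=reportsBucket
  }
}
{code}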
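OzoneFileSystem itself was only touched on, so the following is a
hypothetical usage sketch through the standard Hadoop FileSystem API. The
ozone:// scheme, the fs.ozone.impl key, and the implementation class named
in the comment are assumptions; only the FileSystem calls are existing
Hadoop API.
{code:java}
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class OzoneFsSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Assumed wiring; the real binding would register an implementation
    // class for the scheme, e.g.
    //   conf.set("fs.ozone.impl", "org.apache.hadoop.fs.ozone.OzoneFileSystem");
    FileSystem fs = FileSystem.get(new URI("ozone://volume.bucket/"), conf);

    try (FSDataOutputStream out = fs.create(new Path("/logs/app.log"))) {
      out.writeBytes("hello ozone\n");
    }

    // Unlike HDFS, rename is not supported on the object store; callers
    // should expect this to fail instead of being a cheap metadata op.
    fs.rename(new Path("/logs/app.log"), new Path("/logs/app-renamed.log"));
  }
}
{code}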
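A minimal sketch of the hybrid container store from point 5, using the
RocksDB Java binding: values up to a size threshold are stored inline in
RocksDB, while larger values are spilled to a file and only the file path is
kept in RocksDB. The 1 MB threshold, the one-byte type tag, and the random
file naming are assumptions for illustration.
{code:java}
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.UUID;
import org.rocksdb.Options;
import org.rocksdb.RocksDB;
import org.rocksdb.RocksDBException;

public class HybridContainerStore {
  private static final int INLINE_LIMIT = 1 << 20; // 1 MB; an assumption

  private final RocksDB db;
  private final Path largeObjectDir;

  public HybridContainerStore(String dbPath, Path largeObjectDir)
      throws RocksDBException {
    RocksDB.loadLibrary();
    this.db = RocksDB.open(new Options().setCreateIfMissing(true), dbPath);
    this.largeObjectDir = largeObjectDir;
  }

  public void put(String key, byte[] value)
      throws RocksDBException, IOException {
    byte[] k = key.getBytes(StandardCharsets.UTF_8);
    if (value.length <= INLINE_LIMIT) {
      db.put(k, tag((byte) 'I', value));                // small object: inline
    } else {
      Path f = largeObjectDir.resolve(UUID.randomUUID().toString());
      Files.write(f, value);                            // large object: a file
      db.put(k, tag((byte) 'F',                         // only the path in the DB
          f.toString().getBytes(StandardCharsets.UTF_8)));
    }
  }

  public byte[] get(String key) throws RocksDBException, IOException {
    byte[] v = db.get(key.getBytes(StandardCharsets.UTF_8));
    if (v == null) {
      return null;
    }
    byte[] payload = Arrays.copyOfRange(v, 1, v.length);
    return v[0] == 'F'
        ? Files.readAllBytes(Paths.get(new String(payload, StandardCharsets.UTF_8)))
        : payload;
  }

  // One-byte type tag so inline data and file references can't be confused.
  private static byte[] tag(byte type, byte[] data) {
    byte[] out = new byte[data.length + 1];
    out[0] = type;
    System.arraycopy(data, 0, out, 1, data.length);
    return out;
  }
}
{code}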
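To illustrate point 6: with range partitioning keys stay globally sorted, so
listing a bucket is a contiguous prefix scan and a read immediately sees a
prior write. Under hash partitioning the same keys scatter across
partitions, so an ordered listing has to merge across all partitions or
consult a consistent secondary index. A toy, JDK-only illustration, not
Ozone code:
{code:java}
import java.util.NavigableMap;
import java.util.TreeMap;

public class ListingSketch {
  public static void main(String[] args) {
    // Range partitioning: the key space is sorted, so "list bucket1" is a
    // prefix scan that also sees a key the moment it is written.
    NavigableMap<String, String> sortedKeys = new TreeMap<>();
    sortedKeys.put("bucket1/a", "v1");
    sortedKeys.put("bucket1/b", "v2");
    sortedKeys.put("bucket2/a", "v3");
    System.out.println("ordered listing of bucket1: "
        + sortedKeys.subMap("bucket1/", "bucket1/\uffff").keySet());

    // Hash partitioning: the same keys land on hash(key) % numPartitions,
    // so no single partition can serve an ordered listing by itself.
    int numPartitions = 4;
    for (String k : sortedKeys.keySet()) {
      System.out.println(k + " -> partition "
          + Math.floorMod(k.hashCode(), numPartitions));
    }
  }
}
{code}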
I will follow up on these points and update the design doc.
It was a great discussion with many valuable points raised. Thanks to everyone
who attended.
> Object store in HDFS
> --------------------
>
> Key: HDFS-7240
> URL: https://issues.apache.org/jira/browse/HDFS-7240
> Project: Hadoop HDFS
> Issue Type: New Feature
> Reporter: Jitendra Nath Pandey
> Assignee: Jitendra Nath Pandey
> Attachments: Ozone-architecture-v1.pdf
>
>
> This jira proposes to add object store capabilities into HDFS.
> As part of the federation work (HDFS-1052) we separated block storage as a
> generic storage layer. Using the Block Pool abstraction, new kinds of
> namespaces can be built on top of the storage layer i.e. datanodes.
> In this jira I will explore building an object store using the datanode
> storage, but independent of namespace metadata.
> I will soon update with a detailed design document.